mersenneforum.org > Prime Search Projects > No Prime Left Behind
Old 2010-08-21, 16:29   #12
vaughan

Jan 2005
Sydney, Australia

5·67 Posts

I had a new SSD die within three weeks of purchasing it. Fortunately I had backups of all my important files (personal and business) on multiple HDDs on other networked PCs and also online (I use a paid version of Carbonite).

I learnt in my early days of computing, way back on an IBM System/36 minicomputer: back up early and back up often.
Old 2010-08-21, 16:55   #13
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
Originally Posted by Flatlander


I know you guys are the experts but please tell me there are multiple backups of all the NPLB and CRUS results and sieve files.

I've had about 4 HD failures here in 10 years.

All my important data here is on 4 HDs on 3 PCs, then it is automatically backed up online. 678GB so far.
The sieve files and master results files (i.e., anything that's been processed from the server format into sorted manual format for Gary) are backed up to an external hard drive of Gary's on a somewhat regular basis. What's not backed up is the stuff on the server: the stats DB and all the server results files on the noprimeleftbehind.net web site. Right now we have a lot of unprocessed results, so I suppose we would lose quite a bit if that drive went kapooey.

Gary, Dave and I are currently discussing backup options via email--stay tuned.
Old 2010-08-21, 17:24   #14
Flatlander
I quite division it

"Chris"
Feb 2005
England

31·67 Posts

Quote:
Originally Posted by vaughan
I had a new SSD die within three weeks of purchasing it. Fortunately I had backups of all my important files (personal and business) on multiple HDDs on other networked PCs and also online (I use a paid version of Carbonite).

I learnt in my early days of computing, way back on an IBM System/36 minicomputer: back up early and back up often.
I found Carbonite very good but changed to another service because Carbonite throttles your bandwidth down to 1 GB per day as you upload more and more.
So it's not "Unlimited Backups", despite their claims.
(Now with SquirrelSave(!), UK-based and with no bandwidth limitations.)
Old 2010-08-21, 18:33   #15
Oddball

May 2010

499 Posts

Quote:
Originally Posted by Flatlander
I've had about 4 HD failures here in 10 years.
Quote:
I had a new SSD drive die within 3 weeks of purchasing it.
How many GB does it take to store all of the LLR residues? Just wondering.

BTW, I've only had one hard drive failure in the past 10 years.
Old 2010-08-21, 19:29   #16
kar_bon

Mar 2006
Germany

5530₈ Posts

I've got all result files from every NPLB server since the beginning.
Together with all the processed/checked data, they come to about 4 GB.
Older results are backed up on 2 different HDs; the newer ones are on a stick and stored on another HD, too.

My work folders contain about 6 GB of data: NPLB, aliquot, docs, code, progs and other stuff related to (prime) numbers.
Old 2010-08-22, 00:20   #17
kar_bon

Mar 2006
Germany

2904₁₀ Posts

So what went wrong during, or rather after, the failure yesterday?

All 660 pairs reserved for my 12 offline cores (24 hours of work!) were rejected by the port 3000 NPLB server!

Does this mean the joblist.txt for that server was damaged or completely deleted?

That's the only explanation for why the server handed the pairs out again to other users (the pruning wasn't done because my results weren't in the server yet, but those pairs were still in joblist.txt), why their results were then submitted to the server, and why, after all that, my own submission was rejected as sent 'again' even though I had reserved those pairs first!

Please try to figure out why!
Old 2010-08-22, 02:59   #18
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

1100001101001₂ Posts

Quote:
Originally Posted by kar_bon
So what went wrong during, or rather after, the failure yesterday?

All 660 pairs reserved for my 12 offline cores (24 hours of work!) were rejected by the port 3000 NPLB server!

Does this mean the joblist.txt for that server was damaged or completely deleted?

That's the only explanation for why the server handed the pairs out again to other users (the pruning wasn't done because my results weren't in the server yet, but those pairs were still in joblist.txt), why their results were then submitted to the server, and why, after all that, my own submission was rejected as sent 'again' even though I had reserved those pairs first!

Please try to figure out why!
From what we can tell, the hard drive crash yesterday occurred right in the middle of your daily dump. Thus, not only did the server "forget" the reservation of 660 pairs that you submitted just now, it also "forgot" that it had accepted much of your last dump. This morning we had a huge number of duplicated pairs rejected by the DB, because the results files included both your original results and results from Gary and Vaughan for the same pairs, which had been erroneously reassigned to them.

That reminds me, I was going to send you a PM earlier today but didn't get the chance: you can expect quite a load of duplicated pairs in the 8/21 results file when you process it, and possibly in 8/20 as well. Unfortunately, there's not much we can do about this kind of thing except let it work itself out of the system, which it should have done completely by now.

Any additional such large batches of work you reserve from the server should be OK.
Old 2010-08-22, 06:02   #19
gd_barnes

May 2007
Kansas; USA

10100010011011₂ Posts

Sorry that you lost a day's worth of processed pairs, Karsten. It looks like all 3 of us ended up losing many processed pairs. For your pairs yesterday, the server did not record in joblist that you had sent in the results the first time around (even though the results were actually accepted by the DB), and so it handed the pairs back out to Vaughan and me after they were "expired" from its perspective. We then processed them, the server accepted them, and then they came back from AMDave's process as duplicates. Since your results came in first, you got credit for them. For your pairs today, the reverse happened: since the server never recorded your reservations in joblist, the pairs were handed to Vaughan and me. We processed them before you did, and yours ended up being rejected.

Sorry about all of the problems, guys. The "semi-crash" had much more far-reaching effects than I expected. I hope the above is the last of it. I also just now saw in the CRUS forum that Max's and my personal PRPnet servers, as well as CRUS PRPnet port 1300, were toasted. Fortunately Max quickly got them reconstructed.

Karsten, this is something I think I've brought up before. It can be hard on the servers to cache and then receive back many hundreds of pairs at once, especially when those pairs all come back over a day later. My suggestion would be to reserve an n=500 or n=1000 manual range in the drive for your own manual processing. It takes a little more personal effort to post/send the results, but it would help prevent large-scale problems like this one.

Note that I will now be doing a full backup of Jeepford twice a month. I know Dave already backs up some of the DB stuff, but that backup is by no means comprehensive of the entire machine.


Gary

Last fiddled with by gd_barnes on 2010-08-22 at 06:03
Old 2010-08-22, 07:46   #20
kar_bon

Mar 2006
Germany

5530₈ Posts

I have to look at the llrserver process; perhaps the pruning of the joblist, or the pruning time/cycle, could be changed to avoid such a loss.

The option "prunePeriod" in "llr-serverconfig.txt" gives the timeframe for pruning the pairs.
The internal list of pairs handed out to clients is written to file only after the "pruningPeriod" is over.

This pruning can only happen when an event occurs on the server, and that event is a client connecting to the server.
So if there are many reservations at once, those reservations are first stored in the server's internal list (in memory); only after the pruningPeriod is over is that list written/updated to the file.
So, assuming there are 100 reservations at once and the next client connects to the server 24 hours later (which is how I run the k=300-400 n=1M-2M server), all those reservations are held only in memory, not in a file!

I don't know what options are set on all the servers; we should have a closer look at them.

Suggestions:
- Trigger the pruning not only by timeframe (pruningPeriod, in seconds) but also by amount (pruningAmount, the number of reservations across all clients); whichever condition fires first starts the pruning.
- Add a special client call, 'llrserver -s', which forces the server to simplify the joblist and knpairs files.

Both seem like small changes to the client and/or server, without changing the whole communication protocol (old clients not affected!).

Thoughts?
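A minimal sketch of the dual-trigger idea above (time window OR reservation count, whichever fires first); the class and option names here are illustrative, not actual LLRnet settings:

```python
import time

class PruneScheduler:
    """Sketch of a dual-trigger prune: fire when EITHER the configured
    time window elapses OR enough reservations pile up. Names like
    prune_period_s and prune_amount are hypothetical, not real
    llr-serverconfig.txt options."""

    def __init__(self, prune_period_s=3600, prune_amount=50):
        self.prune_period_s = prune_period_s
        self.prune_amount = prune_amount
        self.pending = 0               # reservations since last prune
        self.last_prune = time.time()

    def record_reservation(self, n=1):
        self.pending += n

    def should_prune(self, now=None):
        now = time.time() if now is None else now
        # whichever condition fires first triggers the prune
        return (self.pending >= self.prune_amount
                or now - self.last_prune >= self.prune_period_s)

    def mark_pruned(self, now=None):
        self.pending = 0
        self.last_prune = time.time() if now is None else now
```

With such a scheduler, a burst of reservations would hit the amount trigger and flush to disk long before the 24-hour quiet period ends.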
Old 2010-08-22, 14:12   #21
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

1100001101001₂ Posts

Quote:
Originally Posted by kar_bon
I have to look at the llrserver process; perhaps the pruning of the joblist, or the pruning time/cycle, could be changed to avoid such a loss.

The option "prunePeriod" in "llr-serverconfig.txt" gives the timeframe for pruning the pairs.
The internal list of pairs handed out to clients is written to file only after the "pruningPeriod" is over.

This pruning can only happen when an event occurs on the server, and that event is a client connecting to the server.
So if there are many reservations at once, those reservations are first stored in the server's internal list (in memory); only after the pruningPeriod is over is that list written/updated to the file.
So, assuming there are 100 reservations at once and the next client connects to the server 24 hours later (which is how I run the k=300-400 n=1M-2M server), all those reservations are held only in memory, not in a file!

I don't know what options are set on all the servers; we should have a closer look at them.

Suggestions:
- Trigger the pruning not only by timeframe (pruningPeriod, in seconds) but also by amount (pruningAmount, the number of reservations across all clients); whichever condition fires first starts the pruning.
- Add a special client call, 'llrserver -s', which forces the server to simplify the joblist and knpairs files.

Both seem like small changes to the client and/or server, without changing the whole communication protocol (old clients not affected!).

Thoughts?
One thing to note: even though pruning can only be triggered by an event on the server, the data since the last prune is not just stored in memory; it's also stored in joblist.txt as "appended" data. The prune just removes any unnecessary "appended" data (for instance, cancelling out corresponding entries for a reservation and its result) and cleans out knpairs.txt.

Normally, that would be enough; in the past, if, say, a power outage occurred, we didn't lose anything as long as nobody was in the middle of talking to the server at the time of the outage. In your case, though, your client was in the middle of talking to the server, so the entire communication was lost--which in your particular case was unfortunately quite a lot of pairs.
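The append-then-prune model described above can be sketched roughly like this; the entry format is illustrative, not LLRnet's actual joblist.txt layout:

```python
def prune_joblist(entries):
    """Sketch of the compaction step: the joblist is treated as an
    append-only log of ("reserved", pair) and ("result", pair) entries;
    pruning cancels each reservation that already has a matching result,
    leaving only the still-outstanding reservations. The tuple format
    is a hypothetical stand-in for LLRnet's real file format."""
    completed = {pair for kind, pair in entries if kind == "result"}
    return [(kind, pair) for kind, pair in entries
            if kind == "reserved" and pair not in completed]
```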

What PRPnet does to address this is write each reservation to the database as it happens--for instance, in a batch of 100, it hands out #1, writes #1 to the DB, hands out #2, writes #2 to the DB, and so on. By contrast, LLRnet hands out #1, hands out #2, etc., then writes them all in one big batch at the end. If LLRnet could be modified to behave more like PRPnet in this regard, it would solve the problem: in a case like yours, you theoretically wouldn't have lost any of your cached pairs.
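The per-reservation persistence idea can be sketched as follows; the file format and names are assumptions for illustration, not PRPnet's or LLRnet's actual implementation:

```python
import json
import os

class ReservationLog:
    """Sketch of PRPnet-style behaviour: persist each reservation to
    disk the moment it is handed out, instead of batching all writes at
    the end (the failure mode described above). The JSON-lines layout
    is hypothetical."""

    def __init__(self, path):
        self.path = path

    def hand_out(self, pair):
        # Append and flush immediately, so a crash right after this
        # call loses nothing that was already handed out.
        with open(self.path, "a") as f:
            f.write(json.dumps({"reserved": pair}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        return pair

    def outstanding(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line)["reserved"] for line in f]
```

The trade-off is one disk sync per reservation instead of one per batch; for a few hundred pairs a day, that cost is negligible next to the cost of re-running a day of LLR tests.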
Old 2010-08-22, 14:17   #22
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

1869₁₆ Posts

BTW @all: we now have a formal external backup process set up for the server machine (jeepford). First, once a day the databases (both stats and PRPnet), web pages, and server files will be backed up to a location on jeepford itself; the last 5 days of backups will be retained. Then, the last 3 days will be copied over the network to another of Gary's machines (humpford, which happens to be our previous server machine). Lastly, Gary will back up the 3 days' worth from humpford to an external USB hard drive twice a month.
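As an illustration of the retention step described above (keep only the newest few daily backups), here is a minimal sketch; the date-named file layout is an assumption, not the actual setup on jeepford:

```python
import os

def rotate_backups(backup_dir, keep=5):
    """Sketch of a retention policy: keep only the newest `keep` daily
    backups. Assumes backups are named by date (e.g. 2010-08-22.tar.gz),
    so lexicographic order equals chronological order; that naming
    scheme is hypothetical."""
    backups = sorted(os.listdir(backup_dir))   # oldest first
    for old in backups[:-keep]:
        os.remove(os.path.join(backup_dir, old))
    return sorted(os.listdir(backup_dir))      # what survived
```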

That should cover us pretty well--if we ever have an even more serious hard drive issue than this one, we should be able to restore everything exactly as it was without losing more than a day of processing. Even if jeepford itself is completely fried, we should be able to transplant the backups onto another of Gary's machines relatively easily and get everything rolling again within a couple of days.