mersenneforum.org  


Old 2010-08-22, 15:18   #23
kar_bon
 
 
Mar 2006
Germany


Quote:
Originally Posted by mdettweiler View Post
One thing to note: even though the pruning only can be triggered by an event on the server, all data since the last prune is not just stored in memory; it's also stored in joblist.txt as "appended" data. The prune just removes any unnecessary "appended" data (for instance, cancelling out corresponding entries for reservation and result) and cleans out knpairs.txt.
Yes, and that's what I can't understand:

I got those 660 pairs before the issue (see Drive Progress from the 20th, 19:00-20:00, so these reservations should be stored in the joblist.txt file) and returned all pairs after the issue. The issue occurred in the hour after the reservation.
So what happened to joblist.txt between those two events?
The knpairs.txt contains all pairs until their results are returned.

The joblist.txt is the only place where the server can check if a pair is reserved.

The rejected-file says all 660 pairs were returned between 18:00 and 19:00, but there was no issue with the connection, so the failure was before this.

What I think:

During the issue, something went wrong with the joblist.txt: it was not saved correctly or was even lost entirely. Because knpairs.txt still contained those pairs, the server offered them to the clients again (from the server's point of view for the first time, because joblist.txt was empty/corrupted/lost).

It seems the joblist.txt was lost for this drive only, and the server was then started with an empty joblist!

I only want to understand the whole thing here; otherwise I don't know what to improve to avoid this next time.
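
To make the suspected failure mode concrete, here is a minimal Python sketch. It assumes (this is not the actual LLRnet Lua code) that the server offers every pair that is listed in knpairs.txt and has no reservation entry in joblist.txt. The function name and the pair values are hypothetical.

```python
def offerable_pairs(knpairs, reserved):
    """Return the pairs the server would hand out: in knpairs but not reserved."""
    reserved_set = set(reserved)
    return [pair for pair in knpairs if pair not in reserved_set]


# Hypothetical (k, n) pairs; the real file formats are not shown in this thread.
knpairs = [(1005, 500001), (1007, 500002), (1009, 500003)]
joblist = [(1005, 500001), (1007, 500002)]  # reservations recorded before the issue

# Normal case: reserved pairs are held back.
print(offerable_pairs(knpairs, joblist))  # -> [(1009, 500003)]

# joblist.txt lost or empty: every pair looks unreserved and is offered again.
print(offerable_pairs(knpairs, []))
```

With an empty joblist the second call returns all three pairs, which would match the 660 pairs being handed out a second time and the later results being rejected.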
Old 2010-08-22, 20:32   #24
mdettweiler
A Sunny Moo
 
 
Aug 2007
USA (GMT-5)


Quote:
Originally Posted by kar_bon View Post
Yes, and that's what I can't understand:

I got those 660 pairs before the issue (see Drive Progress from the 20th, 19:00-20:00, so these reservations should be stored in the joblist.txt file) and returned all pairs after the issue. The issue occurred in the hour after the reservation.
So what happened to joblist.txt between those two events?
The knpairs.txt contains all pairs until their results are returned.

The joblist.txt is the only place where the server can check if a pair is reserved.

The rejected-file says all 660 pairs were returned between 18:00 and 19:00, but there was no issue with the connection, so the failure was before this.

What I think:

During the issue, something went wrong with the joblist.txt: it was not saved correctly or was even lost entirely. Because knpairs.txt still contained those pairs, the server offered them to the clients again (from the server's point of view for the first time, because joblist.txt was empty/corrupted/lost).

It seems the joblist.txt was lost for this drive only, and the server was then started with an empty joblist!

I only want to understand the whole thing here; otherwise I don't know what to improve to avoid this next time.
Yeah, that does sound about right--considering that entire MySQL databases got fried over on the PRPnet side of things, it's not inconceivable that joblist.txt went bye-bye as well. So that probably is what happened.
Old 2010-08-22, 20:50   #25
kar_bon
 
 
Mar 2006
Germany


Quote:
Originally Posted by mdettweiler View Post
Yeah, that does sound about right--considering that entire MySQL databases got fried over on the PRPnet side of things, it's not inconceivable that joblist.txt went bye-bye as well. So that probably is what happened.
So what can we do so that this won't happen again?

Every server has its own files for storing results in. The LLRnet server is very simple here:
3 files: joblist.txt, knpairs.txt, results.txt, that's all.

We could save copies of these files every hour of the day, with names like 'joblist_xx.txt' where xx is the hour of the day, and store them in a different folder. That would leave enough time (24 hours before old copies are overwritten), and in case of an issue (current server files lost/damaged) only 1 or 2 hours of work would be lost!
Another point: if, for example, knpairs.txt were damaged, it would be time-consuming to create a new one, taking all completed results into account.
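
The hourly copy could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function name, the directory arguments, and the choice to skip missing files are hypothetical and not part of any existing NPLB script. The three file names and the 'joblist_xx.txt' naming are taken from the post.

```python
import shutil
import time
from pathlib import Path

# The three LLRnet server files named in this thread.
SERVER_FILES = ("joblist.txt", "knpairs.txt", "results.txt")


def backup_hourly(server_dir, backup_dir, hour=None):
    """Copy each server file to backup_dir as <name>_HH.txt, HH = hour of day.

    A copy made at the same hour on the next day overwrites the old one,
    so at most 24 generations of each file are kept.
    """
    if hour is None:
        hour = time.localtime().tm_hour
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in SERVER_FILES:
        src = Path(server_dir) / name
        if not src.is_file():
            continue  # assumption: silently skip a file that does not exist
        dst = backup_dir / f"{src.stem}_{hour:02d}{src.suffix}"
        shutil.copy2(src, dst)  # copy2 also preserves the modification time
        copied.append(dst)
    return copied
```

Calling `backup_hourly("/path/to/server", "/path/to/backup")` once an hour from any scheduler would give the 24-hour window described above; both paths are placeholders.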

Also, I don't know how the servers take care of their results files.
The results file of the LLRnet server for Drive #11 would be large because of the number of pairs done. If an issue occurs while writing to this file, it could be damaged massively!

Other thoughts here?
Old 2010-08-22, 23:23   #26
mdettweiler
A Sunny Moo
 
 
Aug 2007
USA (GMT-5)


Quote:
Originally Posted by kar_bon View Post
So what can we do so that this won't happen again?

Every server has its own files for storing results in. The LLRnet server is very simple here:
3 files: joblist.txt, knpairs.txt, results.txt, that's all.

We could save copies of these files every hour of the day, with names like 'joblist_xx.txt' where xx is the hour of the day, and store them in a different folder. That would leave enough time (24 hours before old copies are overwritten), and in case of an issue (current server files lost/damaged) only 1 or 2 hours of work would be lost!
Another point: if, for example, knpairs.txt were damaged, it would be time-consuming to create a new one, taking all completed results into account.
Hmm...perhaps something a little simpler. What if each time the LLRnet server updated its joblist file, it first copied it to joblist.bak, then edited joblist.txt itself? That way, if it crashes and takes the joblist file down with it, we have a next-newest good copy on hand. A similar thing could be done for knpairs.txt as well.
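
A rough Python sketch of that idea, for illustration only (the real LLRnet server is written in Lua, and the function name here is hypothetical). The copy to joblist.bak is the step described above. Writing the new contents to a temporary file and renaming it over joblist.txt is an extra precaution added in this sketch, not something the post proposes.

```python
import os
import shutil


def update_with_backup(path, new_text):
    """Keep the previous version as <name>.bak, then write the new contents."""
    backup = os.path.splitext(path)[0] + ".bak"
    if os.path.exists(path):
        shutil.copy2(path, backup)  # the next-newest good copy

    # Extra precaution (an assumption of this sketch): never edit the live
    # file in place; write a temporary file and swap it in with one rename.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(new_text)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to write the data to disk first
    os.replace(tmp, path)  # replaces the old file in a single rename step
```

If the server crashes in the middle of an update, either the old joblist.txt is still intact or joblist.bak holds the state from just before the update.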

Quote:
Also, I don't know how the servers take care of their results files.
The results file of the LLRnet server for Drive #11 would be large because of the number of pairs done. If an issue occurs while writing to this file, it could be damaged massively!
We have two things happening to the results files (both LLRnet and PRPnet) on a regular basis:

-Every 15 minutes, the server's results.txt file is copied to /todayresults_portnum.txt on the web site. This is used for hourly updates by the DB and grows as the day goes on.
-Every day at 12:00 midnight, the results.txt file is moved to /llrnet/results/results_date_time_server_nplb_port.txt on the web site. This empties the local copy and, by extension, /todayresults_portnum.txt on the web site.

So if the local results.txt file is zapped, it can only take up to a day's worth of results with it.
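
The two steps can be sketched in Python like this. It is a simplified illustration, not the scripts actually running on the servers: the function names, the `web_root`, `stamp` and `server` parameters, and the choice to leave an empty results.txt behind after the move are all assumptions. The target file names follow the patterns given above.

```python
import shutil
from pathlib import Path


def publish_today(results, web_root, port):
    """15-minute step: copy results.txt to todayresults_<port>.txt."""
    dst = Path(web_root) / f"todayresults_{port}.txt"
    shutil.copyfile(results, dst)
    return dst


def archive_daily(results, web_root, port, stamp, server="server"):
    """Midnight step: move results.txt into llrnet/results/ on the web site."""
    archive_dir = Path(web_root) / "llrnet" / "results"
    archive_dir.mkdir(parents=True, exist_ok=True)
    dst = archive_dir / f"results_{stamp}_{server}_nplb_{port}.txt"
    shutil.move(str(results), str(dst))
    # Assumption: recreate an empty results.txt so the next 15-minute copy
    # also empties todayresults_<port>.txt.
    Path(results).touch()
    return dst
```

Here `stamp` stands for the date_time part of the archive name, for example `"2010-08-22_0000"`; the exact format used on the servers is not given in the thread.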
Old 2010-08-23, 08:47   #27
kar_bon
 
 
Mar 2006
Germany


Quote:
Originally Posted by mdettweiler View Post
Hmm...perhaps something a little simpler. What if each time the LLRnet server updated its joblist file, it first copied it to joblist.bak, then edited joblist.txt itself? That way, if it crashes and takes the joblist file down with it, we have a next-newest good copy on hand. A similar thing could be done for knpairs.txt as well.
But to get this working, the LLRnet server's Lua code has to be changed first and then tested to be sure it runs on both UNIX and Windows!

The easiest way I can think of would be a cron job (if I'm right, all NPLB servers are running under UNIX).

Last fiddled with by kar_bon on 2010-08-23 at 17:21
Old 2010-08-23, 16:33   #28
mdettweiler
A Sunny Moo
 
 
Aug 2007
USA (GMT-5)


Quote:
Originally Posted by kar_bon View Post
But to get this working, the LLRnet server's Lua code has to be changed first and then tested to be sure it runs on both UNIX and Windows!

The easiest way I can think of would be a cron job (if I'm right, all NPLB servers are running under UNIX).
Yeah, you're right, a cron job would be easier. Okay, I'll see what I can put together.
Old 2010-08-30, 00:04   #29
vaughan
 
 
Jan 2005
Sydney, Australia


Is the LLRnet server Port=3000 having any problems? My clients are running out of work and the GUI says "sleeping".
Old 2010-08-30, 03:26   #30
mdettweiler
A Sunny Moo
 
 
Aug 2007
USA (GMT-5)


Quote:
Originally Posted by vaughan View Post
Is the LLRnet server Port=3000 having any problems? My clients are running out of work and the GUI says "sleeping".
Looks like the server ran out of work. I'll fill it up again shortly.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.