Thanks Gary, I'm still working out the numbers to take (total done for the day) - (the previous hours total), so I can post the real total amount of knpairs returned each hour so you don't have to do the math. This way you can find out if you have boxen not working :wink:
If that doesn't work out, then I'll just leave it showing the total knpairs returned for each user and let you folks do the math, until I can figure something else out. Starting the web page stats 2 minutes after midnight will simplify the math. :wink: If no knpairs are returned within the 1st 2 minutes of moving results.txt, then those users won't show up until the next run at 0100 hrs. Make sense? The web page will update at 2 minutes after midnight. Results.txt will be moved out of there and processed, with .csv sent to the import directory at midnight. I'll have to go over to that Server and edit crontab so it processes the .csv sometime after midnight as well. edit: crontab done |
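The subtraction described above (running total for the day minus the previous hour's running total) can be sketched in a few lines of Python. This is only an illustration, not the script actually behind the stats page; the function name and the sample totals are made up.

```python
def hourly_returns(cumulative_totals):
    """Turn running daily totals (one per hourly run) into per-hour counts."""
    hourly = []
    previous = 0  # the running total restarts once results.txt is moved at midnight
    for total in cumulative_totals:
        hourly.append(total - previous)
        previous = total
    return hourly


# Hypothetical running totals for one user at the 0100, 0200, 0300 and 0400 runs.
totals = [40, 85, 85, 130]
print(hourly_returns(totals))  # [40, 45, 0, 45]
```

An hour that comes out as 0 (the third run in the sample) is the kind of entry that would point to boxen not working.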
[quote=IronBits;150457]Thanks Gary, I'm still working out the numbers to take (total done for the day) - (the previous hours total), so I can post the real total amount of knpairs returned each hour so you don't have to do the math. This way you can find out if you have boxen not working :wink:
If that doesn't work out, then I'll just leave it showing the total knpairs returned for each user and let you folks do the math, until I can figure something else out. Starting the web page stats 2 minutes after midnight will simplify the math. :wink: If no knpairs are returned within the 1st 2 minutes of moving results.txt, then those users won't show up until the next run at 0100 hrs. Make sense? The web page will update at 2 minutes after midnight. Results.txt will be moved out of there and processed, with .csv sent to the import directory at midnight. I'll have to go over to that Server and edit crontab so it processes the .csv sometime after midnight as well. edit: crontab done[/quote] Hum...I'm having the same problem again...with the date showing on 3 lines as well as multiple days showing. It's strange that it was correct for a while and then went back to the old way. |
I can not make any of the columns (in FF 3+ or IE7+) resize by shrinking the size of the browser window.
Fine, I'll remove the DATE column then. :wink: [url]http://nplb.ironbits.net/progress_400.html[/url] This is only temporary until I can figure out how to get it into a database, and/or get php to parse a .csv properly... |
Would it create a problem if I renumbered the servers here? Port 5000 should be near the bottom; perhaps #4; and the others moved up.
Gary |
[quote=gd_barnes;150567]Would it create a problem if I renumbered the servers here? Port 5000 should be near the bottom; perhaps #4; and the others moved up.
Gary[/quote] No, no problem--as far as I know, there isn't anything that depends on the numbers of the servers in this thread. :smile: |
C443 has left ~9200 pairs although the work available is behind G4000 one (with 43k pairs left) I advise people to move a few cores to the latter. Thank you.
Carlos |
[quote=em99010pepe;150721]C443 has left ~9200 pairs although the work available is behind G4000 one (with 43k pairs left) I advise people to move a few cores to the latter. Thank you.
Carlos[/quote] I'm not sure what you mean. Are there ranges in C443 that are both lower and higher than G4000? I just now checked and C443 is currently handing out n=569344 so it has quite a bit to go to get to n=570K. Gary |
[quote=gd_barnes;150723]I'm not sure what you mean. Are there ranges in C443 that are both lower and higher than G4000?
I just now checked and C443 is currently handing out n=569344 so it has quite a bit to go to get to n=570K. Gary[/quote] All ranges are lower, I just thought we could move a few cores to help G4000. Not too much, it's only 9 days on my Q6600. |
[quote=em99010pepe;150724]All ranges are lower, I just thought we could move a few cores to help G4000.
Not too much, it's only 9 days on my Q6600.[/quote] Oh I see. IB400 has a huge amount of work in it right now and is still well below the other 2 servers so my machines are staying there for a while. You know I don't like gaps. :smile: |
And I wonder whose fault it is that IB400 has such a huge pile of knpairs to chew through? :razz:
:rant: :lol: |
[quote=IronBits;150757]And I wonder whose fault it is that IB400 has such a huge pile of knpairs to chew through? :razz:
:rant: :lol:[/quote] lol and I sent that less than 3 mins. before it went down assuming that Lennart was going to stay the entire rally on it. Not a problem...we'll chew through it quickly enough. It's near n=560K now, which means there is n=7K to go. We frequently reserve n=5K ranges for the servers anyway on this drive so there is only a somewhat larger range left than we'd normally start a new loaded range with. Hey...one thing I just now noticed on your latest hourly stats page: You show the hours up through 23 and then start over at hour 00 for the new day without actually showing the total for the entire day. Just thought I'd mention it. Gary |
i am just guessing but i reckon mdettweiler probably wants his stats combining with Anonymous:smile:
|
Good guess :wink:
Just something else that needs to be done to the database records. Along with removing the TeamName from the UserNames. Speaking of which, someone needs to get a hold of BlisteringSheep, his TeamName is back in the UserName. He must have a few stray boxen without the updated name change. |
[quote=henryzz;150832]i am just guessing but i reckon mdettweiler probably wants his stats combining with Anonymous:smile:[/quote]
[quote=IronBits;150848]Good guess :wink: Just something else that needs to be done to the database records. Along with removing the TeamName from the UserNames. Speaking of which, someone needs to get a hold of BlisteringSheep, his TeamName is back in the UserName. He must have a few stray boxen without the updated name change.[/quote] Yep, you guys guessed right. :smile: About BlisteringSheep's stray client(s): I think he mentioned that he has LLRnet running on his mother's computer and that he only has access to it sporadically, and that for the time being it would be stuck with the team name in the username. Presumably that's what's causing this? |
Well, hopefully he will find it and take care of it tomorrow then :wink:
It just started back up a few days ago... |
David,
I don't seem to have access to the most recent results file. Gary |
fixed
|
[QUOTE=IronBits;150848]Speaking of which, someone needs to get a hold of BlisteringSheep, his TeamName is back in the UserName.
He must have a few stray boxen without the updated name change.[/QUOTE] Sorry about that. :redface: I had changed the dedicated box, but forgot to change my octocore when I restarted some. I just fixed all of my llr-clientconfig.txt files on every machine, even ones not running llrnet right now, so it shouldn't be an issue again. I've been so busy with work & life that I don't get to the/any forums very often. If you do need to get in touch with me, a PM here will send me an e-mail, or you should be able to e-mail directly from [URL="http://www.mersenneforum.org/member.php?u=2760"]my user page[/URL]. |
Excellent! I knew you would get a round toit one of these days :smile:
|
Max, GB4000 may have had a slight problem.
Earlier today I posted finding a confirmed prime 861*2^572599-1. It has not posted yet on your stats page or on the Primes found page. I did find it on yesterday's log just before the daily cleanup. user=MyDogBuster [2008-11-29 06:43:14] 615*2^572587-1 is not prime. Res64: B6F5C537D8023AC6 Time : 449.0 sec. user=MyDogBuster [2008-11-29 06:43:18] 619*2^572587-1 is not prime. Res64: D60DBED8F4928B9D Time : 449.0 sec. user=MyDogBuster [2008-11-29 06:45:47] 861*2^572559-1 is prime! Time : 6796.0 sec. user=MyDogBuster [2008-11-29 06:45:49] 895*2^572559-1 is not prime. Res64: 6C90AA8402FC1737 Time : 6798.0 sec. user=MyDogBuster [2008-11-29 06:45:56] 713*2^572546-1 is not prime. Res64: 58FD9F95322C534D Time : 10028.0 sec. |
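For anyone who wants to double-check a results log for primes by hand, here is a rough Python sketch that pulls the "is prime!" entries out of text in the format pasted above. The regular expression is an assumption based only on those sample lines, not on how the status page script really does it.

```python
import re

# Pattern assumed from the log lines quoted in the post above.
PRIME_LINE = re.compile(
    r"user=(?P<user>\S+) \[(?P<when>[^\]]+)\] (?P<candidate>\d+\*2\^\d+-1) is prime!"
)


def find_primes(log_text):
    """Return (user, timestamp, candidate) for every 'is prime!' entry."""
    return [
        (m["user"], m["when"], m["candidate"])
        for m in PRIME_LINE.finditer(log_text)
    ]


sample = (
    "user=MyDogBuster [2008-11-29 06:43:18] 619*2^572587-1 is not prime. "
    "Res64: D60DBED8F4928B9D Time : 449.0 sec. "
    "user=MyDogBuster [2008-11-29 06:45:47] 861*2^572559-1 is prime! "
    "Time : 6796.0 sec."
)
print(find_primes(sample))
```

Entries reading "is not prime" are skipped because the pattern requires the literal text "is prime!" right after the candidate.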
[quote=MyDogBuster;151284]Max, GB4000 may have had a slight problem.
Earlier today I posted finding a confirmed prime 861*2^572599-1. It has not posted yet on your stats page or on the Primes found page. I did find it on yesterday's log just before the daily cleanup. user=MyDogBuster [2008-11-29 06:43:14] 615*2^572587-1 is not prime. Res64: B6F5C537D8023AC6 Time : 449.0 sec. user=MyDogBuster [2008-11-29 06:43:18] 619*2^572587-1 is not prime. Res64: D60DBED8F4928B9D Time : 449.0 sec. user=MyDogBuster [2008-11-29 06:45:47] 861*2^572559-1 is prime! Time : 6796.0 sec. user=MyDogBuster [2008-11-29 06:45:49] 895*2^572559-1 is not prime. Res64: 6C90AA8402FC1737 Time : 6798.0 sec. user=MyDogBuster [2008-11-29 06:45:56] 713*2^572546-1 is not prime. Res64: 58FD9F95322C534D Time : 10028.0 sec.[/quote] Ah-ha! I think I know what caused that problem. The copy-off script runs at 6:59 AM CST daily, and the status page script (which handles keeping track of primes) runs every 15 minutes. However, the way it's set up, when it does its run for 7:00 AM (actually, it runs at 7:01 AM), it runs *after* the copy-off script has essentially blanked out the files it works with. Hence, it will miss any prime that just happens to be found between 6:45 AM and 6:59 AM. I'll go and fix this shortly so that the copy-off script doesn't run until *after* the status page script has had a chance to run. :smile: I'll also retroactively add the missed prime into the recurring log of primes. |
[quote=mdettweiler;151286]Ah-ha! I think I know what caused that problem. The copy-off script runs at 6:59 AM CST daily, and the status page script (which handles keeping track of primes) runs every 15 minutes. However, the way it's set up, when it does its run for 7:00 AM (actually, it runs at 7:01 AM), it runs *after* the copy-off script has essentially blanked out the files it works with. Hence, it will miss any prime that just happens to be found between 6:45 AM and 6:59 AM.
I'll go and fix this shortly so that the copy-off script doesn't run until *after* the status page script has had a chance to run. :smile: I'll also retroactively add the missed prime into the recurring log of primes.[/quote] Okay, I've fixed the crontab so that the status page script runs at every :00, :15, and :45 of every hour, and the copy-off script runs at 7:01 AM. That should close the gap. I'll fix the primes list shortly... |
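To see why the order of the two cron jobs matters, here is a small Python model of the timing. It assumes that a result is only picked up if the status page script runs after the result is logged and before the copy-off script blanks the files. The old run minutes (:01, :16, :31, :46) are a guess based on "every 15 minutes" and "it runs at 7:01 AM"; the new ones are the minutes stated in the post.

```python
def unseen_window(status_minutes, copy_off):
    """Return (start, end), in minutes after midnight, of the span between the
    last status run before the copy-off and the copy-off itself.

    status_minutes: minutes past each hour at which the status script runs.
    copy_off: minute of the day at which the copy-off script blanks the files.
    """
    runs = [hour * 60 + minute for hour in range(24) for minute in status_minutes]
    last_run = max(t for t in runs if t < copy_off)
    return last_run, copy_off


def hhmm(minutes):
    """Format minutes after midnight as HH:MM."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


before = unseen_window([1, 16, 31, 46], 6 * 60 + 59)  # assumed old schedule
after = unseen_window([0, 15, 45], 7 * 60 + 1)        # schedule as posted

print("old:", hhmm(before[0]), "to", hhmm(before[1]))  # old: 06:46 to 06:59
print("new:", hhmm(after[0]), "to", hhmm(after[1]))    # new: 07:00 to 07:01
```

Under this simple model the unseen span shrinks from about 13 minutes to the single minute between the 7:00 status run and the 7:01 copy-off.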
[QUOTE]Okay, I've fixed the crontab so that the status page script runs at every :00, :15, and :45 of every hour, and the copy-off script runs at 7:01 AM. That should close the gap. I'll fix the primes list shortly...
[/QUOTE] Nice job. I figured it had something to do with the turnover. I just didn't want anyone to miss a prime. Aren't 'puters fun? |
C443 currently processing at n= [COLOR=#0000ff]~577.21K[/COLOR]
|
Max,
Please load some more pairs in port 4000 within a day or so. I think we had calculated that Ian/you were processing 4200 pairs/day there so loading ~30000 pairs or an n=3K range, which would take 7-8 days, would be a good amount for now. We should probably also send a file to David for port 400 within 1-2 days too. At about 8000 pairs/day, we could send about an n=5K file on it for the time being. Gary |
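The estimate above is just pairs loaded divided by pairs processed per day. A quick Python check follows; the 50000-pair figure for an n=5K file is an assumption that scales the "~30000 pairs per n=3K" density, and is not a number from the post.

```python
def days_of_work(pairs_loaded, pairs_per_day):
    """Days needed to process a loaded range at a steady daily rate."""
    return pairs_loaded / pairs_per_day


# Port 4000: ~30000 pairs (an n=3K range) at ~4200 pairs/day.
print(f"{days_of_work(30000, 4200):.2f} days")  # 7.14 days

# Port 400: an n=5K file at ~8000 pairs/day, assuming ~10000 pairs per n=1K.
print(f"{days_of_work(50000, 8000):.2f} days")  # 6.25 days
```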
[quote=gd_barnes;151761]Max,
Please load some more pairs in port 4000 within a day or so. I think we had calculated that Ian/you were processing 4200 pairs/day there so loading ~30000 pairs or an n=3K range, which would take 7-8 days, would be a good amount for now. We should probably also send a file to David for port 400 within 1-2 days too. At about 8000 pairs/day, we could send about an n=5K file on it for the time being. Gary[/quote] Okay, yep. I'll do that within the next few hours. :smile: |
Will C443 get a stats page as there is one for IB400/G4000?
|
I've given him lots of code, but haven't heard back from Carlos and, I'm not sure he has a web presence.
I can work something up over here for him, it will just be delayed by at least a day's worth of work. |
Max,
I noticed that [URL]http://nplb-gb1.no-ip.org/llrnet/[/URL] is down at the moment so not only can we not view the status of port 4000, I can't access any of my machines remotely. I hope the machine that the server is on is not down. I checked my k/n pairs per hour on port 400 and it appears that there might have been a drop of one quad a few hours ago...Or it could be just some processing glitch in something. I'm not sure. (BTW, I added 2 more cores to port 4000 on Thurs. afternoon after my Riesel base 256 effort finished early Thurs. morning.) I think I have the temps regulated pretty well on my machines now but unfortunately "crunchford", the machine that runs the server, is still one of the warmer running ones. (~70-71 C I think.) You might remember me mentioning that it was not one of the best choices for the server. When you get a chance this morning, can you check things and make sure that you can get back on the above link and that port 4000 is working OK? I likely will be on next around 1 PM CST. (7 PM GMT) Ian, you might also check and make sure you're still processing work on port 4000. Thanks, Gary |
G4000 has been down for the last 5 hours.
|
[quote=em99010pepe;152050]G4000 has been down for the last 5 hours.[/quote]
Damn. Of all the luck. I'm sure the machine is down then. And what's worse is that I'm sure that it is the ONLY one of my 10 machines that is down. It is one of the 8 machines plus 3 more cores that were running port 400 at the point that it must have gone down and I see a drop of ~10-12% in k/n pairs processed on that port around 4-5 hours ago, which would equate to 4 cores out of 35 and coincide with Carlos saying that it has been down 5 hours. Max, it looks like we're screwed on port 4000 until late next Tuesday when I get back and see what is wrong with it unless you can somehow make the "master machine" another one of my machines remotely and then somehow set up port 4000 on it instead. (That sounds like a huge headache to me.) I think most of the others are running <= 68 C but if you do set it up on one of the others, please check the temps first. Like I said, Crunchford was definitely the warmest of my AMD machines. Everyone, if you were on port 4000, please move to port 400. It can easily handle the load. Sorry. This shouldn't delay us in total by more than 1/2-day to 1 day on finishing this drive if people move their machines within a day. Port 400 will just do more of the work and later on, I may need to move a few of my machines to port 4000 to clear it out more quickly once we get it going again. Gary |
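The "4 cores out of 35" reasoning above is easy to check: losing 4 of 35 cores should show up as a drop of a little over 11%, which sits inside the observed ~10-12%. A minimal Python check (the function name is just for illustration):

```python
def share_of_cores(cores_lost, total_cores):
    """Percentage of total throughput lost, assuming all cores are equally fast."""
    return 100.0 * cores_lost / total_cores


print(f"{share_of_cores(4, 35):.1f}%")  # 11.4%
```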
C443 is also available with lots of work to process.
|
[QUOTE]Everyone, if you were on port 4000, please move to port 400. It can easily handle the load. Sorry.
[/QUOTE] Just woke up. I'll start moving stuff shortly. Looks like it was down about 1AM EST last night. I have got to find an easy way of changing servers. Switching 25 cores will not be fun. |
Halfway into switching the ports, my service provider went down. It hasn't been down since August. This just ain't my day. Time to go back to bed.
Okay it's back up and the switch is finished. |
what happened to hour 24 yesterday
[URL]http://nplb.ironbits.net/progress_400.html[/URL] edit could all the new stats pages be added to the first post of this thread |
[quote=gd_barnes;152049]Max,
I noticed that [URL]http://nplb-gb1.no-ip.org/llrnet/[/URL] is down at the moment so not only can we not view the status of port 4000, I can't access any of my machines remotely. I hope the machine that the server is on is not down. I checked my k/n pairs per hour on port 400 and it appears that there might have been a drop of one quad a few hours ago...Or it could be just some processing glitch in something. I'm not sure. (BTW, I added 2 more cores to port 4000 on Thurs. afternoon after my Riesel base 256 effort finished early Thurs. morning.) I think I have the temps regulated pretty well on my machines now but unfortunately "crunchford", the machine that runs the server, is still one of the warmer running ones. (~70-71 C I think.) You might remember me mentioning that it was not one of the best choices for the server. When you get a chance this morning, can you check things and make sure that you can get back on the above link and that port 4000 is working OK? I likely will be on next around 1 PM CST. (7 PM GMT) Ian, you might also check and make sure you're still processing work on port 4000. Thanks, Gary[/quote] Ouch. However, I have some good news: based on what I'm seeing in the IB400 results file for today, most likely some or all of the rest of your machines are still up and crunching. :smile: I doubt that thermal problems are the issue here; even if it went all the way up to 80 C, it should still run, though it would crunch somewhat slower. (A while ago I had my dualcore hovering at 83 C for about a month or two and I still used it as my primary machine.) I'm thinking something more along the lines of a power flicker (which can, depending on the duration, take out some machines and not others). Hmm...if only I knew crunchford's MAC address I could try feeding it a Wake on LAN signal through port 4000 (since that port is already open on your router). Though even if I could do that, it's a tossup as to whether that would actually make the machine start up. 
(I can't get it to work on my machines, either.) Anyway, though, when you get it restarted I'll see about getting the MAC addresses of all your machines written down (I'll be able to obtain that information once I can get SSH access) in case we ever need to try a Wake on LAN in the future. Max :mellow: |
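For reference, a Wake on LAN "magic packet" is simply 6 bytes of 0xFF followed by the target's MAC address repeated 16 times, normally sent over UDP. Below is a minimal Python sketch of building and sending one. The MAC address, host and port in the usage comment are placeholders, and whether a router will pass such a packet on to a powered-off machine depends on the router, so this is not a recipe for crunchford specifically.

```python
import socket


def build_magic_packet(mac):
    """Build a Wake on LAN magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("a MAC address must be exactly 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac, host, port):
    """Send the magic packet over UDP to the given host and port."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (host, port))


# Placeholder MAC address, not a real machine's.
packet = build_magic_packet("00:11:22:33:44:55")
print(len(packet))  # 102

# Hypothetical usage on a local network (broadcast address and port 9 are placeholders):
# send_magic_packet("00:11:22:33:44:55", "192.168.1.255", 9)
```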
[quote=mdettweiler;152085]Ouch. However, I have some good news: based on what I'm seeing in the IB400 results file for today, most likely some or all of the rest of your machines are still up and crunching. :smile:
I doubt that thermal problems are the issue here; even if it went all the way up to 80 C, it should still run, though it would crunch somewhat slower. (A while ago I had my dualcore hovering at 83 C for about a month or two and I still used it as my primary machine.) I'm thinking something more along the lines of a power flicker (which can, depending on the duration, take out some machines and not others). Hmm...if only I knew crunchford's MAC address I could try feeding it a Wake on LAN signal through port 4000 (since that port is already open on your router). Though even if I could do that, it's a tossup as to whether that would actually make the machine start up. (I can't get it to work on my machines, either.) Anyway, though, when you get it restarted I'll see about getting the MAC addresses of all your machines written down (I'll be able to obtain that information once I can get SSH access) in case we ever need to try a Wake on LAN in the future. Max :mellow:[/quote] English please. lol You say you have good news? Didn't I just state that all of my machines were likely up except Crunchford in the first 2 paras. of my post and provide stats from port 400 to prove it? Are you skimming my posts again? (lmao) On my AMD's, for the ones that previously ran consistently above about 74-75 C, the motherboard eventually shot craps so I'm just speculating on this one. Hopefully it was just a power flicker. It's oddly coincidental that it happened to the warmest and most important machine of the group. Regardless, is it possible that you can switch the 'master machine' over to another one of my machines so at least I can remotely view the other machines before next Tuesday? For CRUS, I have Sierp base 256 and Sierp base 16 running on a couple of them. Thanks, Gary |
[quote=nuggetprime;151795]Will C443 get a stats page as there is one for IB400/G4000?[/quote]
[quote=IronBits;151878]I've given him lots of code, but haven't heard back from Carlos and, I'm not sure he has a web presence. I can work something up over here for him, it will just be delayed by at least a day's worth of work.[/quote] Too busy with real life! I have to see that again with IB. Carlos |
[quote=gd_barnes;152114]English please. lol
You say you have good news? Didn't I just state that all of my machines were likely up except Crunchford in the first 2 paras. of my post and provide stats from port 400 to prove it? Are you skimming my posts again? (lmao)[/quote] LOL--yes, I was skimming your post, I must admit. :rolleyes: [quote]On my AMD's, for the ones that previously ran consistently above about 74-75 C, the motherboard eventually shot craps so I'm just speculating on this one. Hopefully it was just a power flicker. It's oddly coincidental that it happened to the warmest and most important machine of the group. Regardless, is it possible that you can switch the 'master machine' over to another one of my machines so at least I can remotely view the other machines before next Tuesday? For CRUS, I have Sierp base 256 and Sierp base 16 running on a couple of them.[/quote] Unfortunately, I can't do anything until crunchford is back online again--all remote access into your network is through that machine. If it was just a power flicker, then all it needs is a reboot and I can get back in and get everything running again; however, if crunchford *did* blow its motherboard, then we can't recover all the LLRnet server stuff until you get it fixed. (If that does turn out to be the case, I'd recommend switching the hard drive into a machine with a good motherboard, so that we can at least get it online long enough for me to grab the LLRnet files and switch the "master machine" over to another box.) After you get back and I can get in again, I'll see about setting up a "secondary master" so that if the master ever goes down again, we can still get in through an alternate port to a different machine. In the meantime, maybe you could have your ex-wife stop by and reboot crunchford like you used to do before we got the remote desktop thing set up? :smile: Then, assuming it still works, I could get in and re-start the server stuff (and back it up, and set up a secondary master while I'm at it). Max :smile: |
[quote=mdettweiler;152129]LOL--yes, I was skimming your post, I must admit. :rolleyes:
Unfortunately, I can't do anything until crunchford is back online again--all remote access into your network is through that machine. If it was just a power flicker, then all it needs is a reboot and I can get back in and get everything running again; however, if crunchford *did* blow its motherboard, then we can't recover all the LLRnet server stuff until you get it fixed. (If that does turn out to be the case, I'd recommend switching the hard drive into a machine with a good motherboard, so that we can at least get it online long enough for me to grab the LLRnet files and switch the "master machine" over to another box.) After you get back and I can get in again, I'll see about setting up a "secondary master" so that if the master ever goes down again, we can still get in through an alternate port to a different machine. In the meantime, maybe you could have your ex-wife stop by and reboot crunchford like you used to do before we got the remote desktop thing set up? :smile: Then, assuming it still works, I could get in and re-start the server stuff (and back it up, and set up a secondary master while I'm at it). Max :smile:[/quote] The danger about having Sherri go by and turn it on is that is how I fried a motherboard before myself. That is...the fact that it shut itself down was a 'warning' sign that something was amiss. I turned it back on, started crunching again and a few days later it went off again. I did it again and it went off again in about a day. That was it...it had fried itself at that point. I'm not going to turn it on and start crunching on it until I verify temps and stuff. Well, I suppose I could have her turn it on but not start crunching on it (assuming it will even come on; which I suspect there is < 50% chance of). Since the server actually does no crunching, it shouldn't heat up the machine. I'll see if she can do it. I hate to burden her with messing with stuff again though. 
She's already been by my house twice to make sure everything is OK and I told her that should be enough. Oh well, I'll see what I can do. If the machine won't turn on, yes, I will swap hard drives with another machine after I get back to the coolest running machine so that we can make sure the server is on likely the most stable machine that I have. Actually, I've done that twice already based on the priority of stuff that was running on a machine that went down, even after you got the remote access set up. You just didn't know it. lol Stupid machines! Gary |
Lennart,
You have cores doing duplicated work on C443. Please check them. Meanwhile I moved 4 cores to IB400 to help to clean the lower ranges, 3 cores are still on C443. Carlos |
[quote=gd_barnes;152201]The danger about having Sherri go by and turn it on is that is how I fried a motherboard before myself. That is...the fact that it shut itself down was a 'warning' sign that something was amiss. I turned it back on, started crunching again and a few days later it went off again. I did it again and it went off again in about a day. That was it...it had fried itself at that point.
I'm not going to turn it on and start crunching on it until I verify temps and stuff. Well, I suppose I could have her turn it on but not start crunching on it (assuming it will even come on; which I suspect there is < 50% chance of). Since the server actually does no crunching, it shouldn't heat up the machine. I'll see if she can do it. I hate to burden her with messing with stuff again though. She's already been by my house twice to make sure everything is OK and I told her that should be enough. Oh well, I'll see what I can do. If the machine won't turn on, yes, I will swap hard drives with another machine after I get back to the coolest running machine so that we can make sure the server is on likely the most stable machine that I have. Actually, I've done that twice already based on the priority of stuff that was running on a machine that went down, even after you got the remote access set up. You just didn't know it. lol Stupid machines! Gary[/quote] Ah, I see...in that case, don't worry about getting it restarted just yet. Since all the clients working on G4000 have been moved to other servers by now, it shouldn't hurt if it's down just a few more days--rather than taking the risk of frying yet another motherboard. The main significant thing that is waiting on the server coming back online is about 500 results that I've got sitting on my computer that I had pulled down from G4000 and crunched with manual LLR, but that can wait. :smile: |
Looks like Bliss is back. Hey Flatlander, only 10 tests per hour? Those 6 cores run too slow...lol I can't see either Henryzz or Max on the stats...
|
[B]NPLB LLRnet server #2 (updated 2008-12-02 08:00 GMT):[/B]
maintained by em99010pepe Short identification: C443 server = "nplb.dynip.telepac.pt" port = 443 k-range: 401 <= k <= 1001 n-range: 577K-578K, 587.4K-588K, 598K-600K currently processing at n= [COLOR=#0000ff]~598.23K[/COLOR] |
[QUOTE] The main significant thing that is waiting on the server coming back online is about 500 results that I've got sitting on my computer that I had pulled down from G4000 and crunched with manual LLR, but that can wait. :smile:
[/QUOTE] I also have about 300 results that were in my caches and complete. I saved the tosend files before I switched to IB400. Will their reservations expire when you bring GB4000 back up? |
[quote=em99010pepe;152213]Looks like Bliss is back. Hey Flatlander, only 10 tests per hour? Those 6 cores run too slow...lol I can't see either Henryzz or Max on the stats...[/quote]
I was finishing a GNFS factorization this morning in safe mode so I had access to more memory, which meant I didn't have access to the internet. Out of interest I started the postprocessing using ggnfs, which worked straight away once I had gone into safe mode to free up memory. Once I finally got to starting the linear algebra, its estimate was about 10.5 hours with one core; if I had done that it would still be going. I thought that was way too long to bother with, as msieve is faster. I ran msieve and it wouldn't solve the matrix because the weight was way too low, about 15 per cycle. I fiddled for ages and eventually got it to work by removing some of the relations so it was less oversieved. Doing the linear algebra with msieve took 1 hour and 40 mins with four cores instead of one, somewhat of an improvement over ggnfs, although not being able to ridiculously oversieve with msieve drives me crazy sometimes. |
We set a record yesterday with 10,583 knpairs completed. :smile:
|
[quote=MyDogBuster;152226]I also have about 300 results that were in my caches and complete. I saved the tosend files before I switched to IB400. Will their reservations expire when you bring GB4000 back up?[/quote]
Yes, technically, they would expire. BUT...no, as long as they are not handed out to someone else, you should be OK. What I'll do is tell you exactly when I expect to bring it back up and coordinate it with you being online. When I bring it back up, you can then connect to the server and it will immediately send those results and not hand them back out to someone else or even "funnier"; try to hand them back out to YOU again. (lol) Max, is that your understanding about what will happen if Ian connects to port 4000 after I bring it up and no one else is connected to it? Gary |
[QUOTE]
When I bring it back up, you can then connect to the server and it will immediately send those results and not hand them back out to someone else or even "funnier"; try to hand them back out to YOU again. (lol) [/QUOTE] What I'll do is gather all of them up into 1 tosend file so that I only have to connect once when it comes back up. That way it should go lots faster and not have to wait till I get to all the cores. |
[quote=MyDogBuster;152226]I also have about 300 results that were in my caches and complete. I saved the tosend files before I switched to IB400. Will their reservations expire when you bring GB4000 back up?[/quote]
[quote=gd_barnes;152283]Yes, technically, they would expire. BUT...no, as long as they are not handed out to someone else, you should be OK. What I'll do is tell you exactly when I expect to bring it back up and coordinate it with you being online. When I bring it back up, you can then connect to the server and it will immediately send those results and not hand them back out to someone else or even "funnier"; try to hand them back out to YOU again. (lol) Max, is that your understanding about what will happen if Ian connects to port 4000 after I bring it up and no one else is connected to it? Gary[/quote] [quote=MyDogBuster;152285]What I'll do is gather all of them up into 1 tosend file so that I only have to connect once when it comes back up. That way it should go lots faster and not have to wait till I get to all the cores.[/quote] As Gary said--yes, they probably would be expired by the time I get the server started up, but I'm planning to, before restarting the server, set jobMaxTime to 20 days or so to give everyone a chance to return their old results. Then, after Ian and I can both confirm that we've tied up any loose ends, I can change it back to the normal settings of 5 days. :smile: |
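The effect of temporarily raising jobMaxTime can be pictured with a small Python sketch. This is not LLRnet's own code: the function and the dates are hypothetical, and the only assumption is that a reservation counts as expired once more than the allowed number of days has passed since the pair was handed out.

```python
from datetime import datetime, timedelta


def is_expired(handed_out, now, max_days):
    """True if a reservation handed out at `handed_out` is older than max_days."""
    return now - handed_out > timedelta(days=max_days)


# Hypothetical dates: a pair handed out just before the outage, checked 6 days later.
handed_out = datetime(2008, 11, 29, 1, 0)
now = datetime(2008, 12, 5, 1, 0)

print(is_expired(handed_out, now, 5))   # True: expired under the normal 5 days
print(is_expired(handed_out, now, 20))  # False: still valid with the limit at 20 days
```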
[quote=mdettweiler;152292]As Gary said--yes, they probably would be expired by the time I get the server started up, but I'm planning to, before restarting the server, set jobMaxTime to 20 days or so to give everyone a chance to return their old results. Then, after Ian and I can both confirm that we've tied up any loose ends, I can change it back to the normal settings of 5 days. :smile:[/quote]
How about setting it to 3 days like port 400 after we get Ian's (and others') already processed results returned to the server? |
[quote=gd_barnes;152340]How about setting it to 3 days like port 400 after we get Ian's (and others') already processed results returned to the server?[/quote]
Well, I usually like to have it set to 5 days so that I can manually cache k/n pairs from the server and crunch it manually. So far I haven't had any problems with bottlenecked k/n pairs; though, yes, I will keep an eye on things and be sure to decrease the time a bit if it seems necessary. |
[quote=mdettweiler;152362]Well, I usually like to have it set to 5 days so that I can manually cache k/n pairs from the server and crunch it manually. So far I haven't had any problems with bottlenecked k/n pairs; though, yes, I will keep an eye on things and be sure to decrease the time a bit if it seems necessary.[/quote]
OK, but you better be nice to me or I'll set it to 1 day at random times just for my own entertainment! lol |
[quote=gd_barnes;152406]OK, but you better be nice to me or I'll set it to 1 day at random times just for my own entertainment! lol[/quote]
LOL :wink: Though of course if you started doing that then I could just start running the server from the max username instead of the gary username that it's being run from now, so you wouldn't be able to change anything...hey, wait a minute, why am I telling you this stuff? :missingteeth: |
Sherri went over to my place tonight. The computer that has been hosting the server is down. She turned it on and the green light came on but nothing else. She confirmed that the other 9 machines are still running as well as my slower Windows desktop and very slow borrowed laptop upstairs that I have running various low-intensity CRUS efforts. Murphy's law rules as usual: The server machine is the only one that went down. 11 total are running just fine.
As usual, it looks like another bad mobo. My 5th now. Once this one is replaced, assuming that is the issue, it should put all of my machines under 70 C after I reapply the thermal goo so I HOPE that will be the last problem I have with them. Max, when I get back on Tuesday night, I'll swap hard drives with another good machine and I'll order whatever part has gone down on the bad machine. I'll let you know what machine it is so you can get the server running again. To all: Assuming that Max can do this within a day, I would anticipate that port 4000 will be running again by Weds. night. I'll likely move most of my machines over to it to process the lower n-range at that time. Gary |
[quote=gd_barnes;152462]Sherri went over to my place tonight. The computer that has been hosting the server is down. She turned it on and the green light came on but nothing else. She confirmed that the other 9 machines are still running as well as my slower Windows desktop and very slow borrowed laptop upstairs that I have running various low-intensity CRUS efforts. Murphy's law rules as usual: The server machine is the only one that went down. 11 total are running just fine.
As usual, it looks like another bad mobo. My 5th now. Once this one is replaced, assuming that is the issue, it should put all of my machines under 70 C after I reapply the thermal goo so I HOPE that will be the last problem I have with them. Max, when I get back on Tuesday night, I'll swap hard drives with another good machine and I'll order whatever part has gone down on the bad machine. I'll let you know what machine it is so you can get the server running again. To all: Assuming that Max can do this within a day, I would anticipate that port 4000 will be running again by Weds. night. I'll likely move most of my machines over to it to process the lower n-range at that time. Gary[/quote] Okay, cool. As for notifying me which machine you swap the hard drive with, that won't be necessary; that's because the "heart and soul" of crunchford, so to speak, are in its hard drive; and thus, when you swap the hard drive, the new machine essentially becomes crunchford. It will have an IP address of 192.168.2.100 and should automatically fit right in to crunchford's previous role as gateway machine. :smile: As for how fast I can get port 4000 up and running again: no problem, that shouldn't take long at all. I can probably have it going within 5 minutes of receiving word that you've got the machine back online. :smile: Max :smile: |
[quote=mdettweiler;152508]Okay, cool. As for notifying me which machine you swap the hard drive with, that won't be necessary; that's because the "heart and soul" of crunchford, so to speak, are in its hard drive; and thus, when you swap the hard drive, the new machine essentially becomes crunchford. It will have an IP address of 192.168.2.100 and should automatically fit right in to crunchford's previous role as gateway machine. :smile:
As for how fast I can get port 4000 up and running again: no problem, that shouldn't take long at all. I can probably have it going within 5 minutes of receiving word that you've got the machine back online. :smile: Max :smile:[/quote] Well, duh. I've swapped a couple of hard drives already and knew that. lol Brain fart again. I'm thinking I'll be home by 8 PM my time Tuesday so should have the hard drive swapped out by 10 PM after unpacking and stuff. I'll let you know. In the meantime, please quickly send another n=5K file to port 400. We have less than 2 days' work in it right now. I'll pull all of my machines off to dry port 4000 when it comes back up and after that, arrange them as necessary to dry any remaining ranges as needed in the various servers. Gary |
Back from my business trip now...
Max, it'll be an hour or so before I get the hard drive swapped out of the bad machine. |
[quote=gd_barnes;152596]I'll pull all of my machines off to dry port 4000 when it comes back up and after that, arrange them as necessary to dry any remaining ranges as needed in the various servers.
Gary[/quote] Per my subsequent comments in another thread, I'll leave all my machines on port 400 and Ian will move his back to port 4000 since he was running that one originally. I'll post in this thread when port 4000 is up and running again. Gary |
I'm going to assume you are having problems getting G4000 up again.
I'll stay on IB400 till I get the word. Going to take a nap. |
[quote=MyDogBuster;152712]I'm going to assume you are having problems getting G4000 up again.
I'll stay on IB400 till I get the word. Going to take a nap.[/quote] No, no problems. I didn't get home from my trip until close to 10 CST. I'm just now swapping out the hard drive between the bad machine and a good one. Max said he was going to bed an hour or so ago. So it will likely be the morning or early afternoon on Weds. My bad machine, once again, is a bad mobo. I see the popped tops. On the last 2 mobos that went bad, there was no physical problem that I could see with them so I had to test several things. At least I don't have to do a bunch of testing to know where the problem lies this time. Gary |
Okay, I'm off to bed then. Sorry to hear about the mobo. At least quad boards aren't as expensive as they used to be.
|
If you get G4000 up I'll move 7 cores to it (~1600 candidates per day).
|
I have now swapped the hard drives so that the server is on one of my coolest and most stable machines. Max will get it running sometime later today (Weds).
Gary |
[quote=em99010pepe;152716]If you get G4000 up I'll move 7 cores to it (~1600 candidates per day).[/quote]
Hold off until after Ian has stated that he has connected to it and returned all of his processed pairs. I think he said he had 300+ of them. We don't want a bunch of double-work done on them. After that, fire away with your cores if you want. |
[quote=gd_barnes;152722]Hold off until after Ian has stated that he has connected to it and returned all of his processed pairs. I think he said he had 300+ of them. We don't want a bunch of double-work done on them.
After that, fire away with your cores if you want.[/quote] Actually, no need to worry about that. I'll be setting jobMaxTime to 20 days until everyone's given me the OK on their queued results. :smile: Going to go work on it right now... |
Okay, G4000 is now officially back online! :banana:
I've submitted all my queued k/n pairs; Ian, let me know when you're all set so I can then change the deadline back to 5 days. (As mentioned in my last post, it's currently set to 20 days.) Carlos and others: feel free to connect as soon as you wish. With the deadline set to 20 days there's no danger of anybody's results expiring for the time being. :smile: |
Moved 3 cores; 4 more later when I get home.
|
[quote] Ian, let me know when you're all set so I can then change the deadline back to 5 days.
[/quote] Max, I've submitted them (some of them twice) and they don't show up anywhere. I just tried one again about 5 minutes ago, just before the 15 minute deadline and nothing. ???????. The one I submitted twice doesn't even show up as a rejected pair. |
Okay, right out of an episode of the Twilight Zone, da da da da,
I tried again and this time they were accepted. So I guess I'm all caught up. I haven't got a clue as to why they worked the second time and not the first. |
I'm full power on G4000.
Ian, good to know the results were accepted. |
[QUOTE]Ian, good to know the results were accepted
[/QUOTE] Thanks, I just wish I knew what I did right the second time. 260 results were just too much to throw away. |
Hey Lennart, a little push on G4000?
|
We aren't going to add any more work to G4000 with this drive. Having Carlos and Ian on it should be sufficient. If it looks like IB400 is going to finish before G4000, I'll start moving machines. Regardless of that, we still have the final n=1.8K range that we'll load into IB400 before I would move any of my machines.
As long as you can calculate that port G4000 will finish by Dec. 23rd-24th with its current work and cores, then we won't need anyone more on it. If later, I can always move a few machines. Edit: What do people think about a rally at this point? I'm thinking now that one won't be necessary. It might even be overly risky. With port G4000 back up, we should have stable ports until the 1st drive is complete (estimated at Dec. 23rd-24th somewhere ~1 week ago) and we have plenty of resources to get there now. The estimate had assumed that Sheep wouldn't be doing any processing and I see now that he is doing ~1500 pairs/day. Edit 2: Perhaps a rally to kick off the new n>600K drives and to get them rolling fast would be a good option; perhaps the last weekend of Dec. or first weekend of Jan. Gary |
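The "will it finish by Dec. 23rd-24th" check is just remaining pairs divided by daily throughput. A quick sketch of that arithmetic follows; the ~1600 pairs/day figure is Carlos's number from earlier in the thread, while the remaining-pairs count, the start date, and the year are made-up placeholders, since the actual figures aren't posted here.

```python
import math
from datetime import date, timedelta


def estimated_finish(remaining_pairs, pairs_per_day, today):
    """Estimate the date a server runs dry at a steady processing rate."""
    days_needed = math.ceil(remaining_pairs / pairs_per_day)
    return today + timedelta(days=days_needed)


# Placeholder inputs: 16000 pairs left and a Dec. 13 start are assumptions.
finish = estimated_finish(
    remaining_pairs=16000,
    pairs_per_day=1600,
    today=date(2008, 12, 13),
)
print(finish)
```

If the estimate lands after the target date, that is the cue to move a few more machines over; if it lands before, the current cores are enough.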
[QUOTE]We aren't going to add any more work to G4000 with this drive.
[/QUOTE] Let me know if and when I have to move again. I kinda like it over here though. I see no problem with 2 servers running. It gives everyone another choice of where to put their assets. 2 servers doing different slices of the pie (not the same drive) would be nice. |
[quote=MyDogBuster;152781]Let me know if and when I have to move again. I kinda like it over here though. I see no problem with 2 servers running. It gives everyone another choice of where to put their assets. 2 servers doing different slices of the pie (not the same drive) would be nice.[/quote]
Cool. That works for us. If you like different servers on different drives, you'll like our n>600K effort. We'll have 3 servers; one each on k=400-600, k=600-800, and 800-1001. Likely they'll be at different n-ranges at different times. Initially we'll push k=400-600 more in rallies and at other times since they are lower k's so they will generally be at a higher search range but people will still be free to search any of the k-ranges. Also, we'll have k=1005-2000 for n=50K-100K to start with and n=350K-500K later on as well as our double-check drive for k<=1001 and n=100K-260K. I stated in an RPS thread that I'll take k=27-31 from the double-check drive after our 1st drive is complete, due to the missing primes found for k=31 for n>400K. Completing all k=300-1001 to n=600K is just the beginning! :smile: Gary |
I will stay on G4000 for three, four days, then I have to finish my manual range and help C443.
BTW, Glenn will only join us with more power if we do a rally. |
[QUOTE]If you like different servers on different drives, you'll like our n>600K effort. We'll have 3 servers; one each on k=400-600, k=600-800, and 800-1001. Likely they'll be at different n-ranges at different times. Initially we'll push k=400-600 more in rallies and at other times since they are lower k's so they will generally be at a higher search range but people will still be free to search any of the k-ranges.
Also, we'll have k=1005-2000 for n=50K-100K to start with and n=350K-500K later on as well as our double-check drive for k<=1001 and n=100K-260K. I stated in an RPS thread that I'll take k=27-31 from the double-check drive after our 1st drive is complete, due to the missing primes found for k=31 for n>400K. [/QUOTE] Great ideas. And here I was thinking about resting a bit on our laurels. Looks like enough work to choke a horse. |
[quote=MyDogBuster;152758]Okay, right out of an episode of the Twilight Zone, da da da da,
I tried again and this time they were accepted. So I guess I'm all caught up. I haven't got a clue as to why they worked the second time and not the first.[/quote] Hmm...that's odd. I don't know why it would do that. Anyway, now that you've got your results in, I'll go and change jobMaxTime to 5 days again... |
Hi all,
In case you're all wondering what this new "port 8000/NPLB 7th Drive" thing that has just shown up on the [URL]http://nplb-gb1.no-ip.org/llrnet/[/URL] status page is, that's a new, empty server that we will eventually be loading work for k=800-1001, n>600K into after we're done with the 1st Drive, as Gary, David and I had discussed via PM. I just got all the behind-the-scenes stuff for it all taken care of ahead of time so that when the time comes, all I have to do is pop in the knpairs and let 'er rip. :smile: Anyway, just wanted to let you guys know, because otherwise I'm sure there would have been somebody posting a message wondering what the heck this new thing was. :smile: In the meantime, this has helped me spot a previously-unnoticed bug that causes the status page to display "-1 remaining knpairs" for a brand-new server. :smile: Max :smile: |
Hmm...G4000 apparently crashed at least two times within the past hour or so. Here's what the console output said:
[code]llrnet: net.cxx:138: static void* net_Server_t::connection_thread(void*): Assertion `thd->socket >= 0' failed.
net_signal called with code 6
shutting down listening socket
accept: errorno=22
waiting thread #0 exits
shutdown: errno 9
thread #2 exited[/code]
The console output is specifically different than last time I had problems with the server continually crashing; though it may very well be a corrupted executable again. I'll swap in a fresh binary and see if that fixes anything. David, Carlos--any of you server gurus have any idea what's happening here? |
[quote=mdettweiler;152809]Hmm...G4000 apparently crashed at least two times within the past hour or so. Here's what the console output said:
[code]llrnet: net.cxx:138: static void* net_Server_t::connection_thread(void*): Assertion `thd->socket >= 0' failed.
net_signal called with code 6
shutting down listening socket
accept: errorno=22
waiting thread #0 exits
shutdown: errno 9
thread #2 exited[/code]
The console output is specifically different than last time I had problems with the server continually crashing; though it may very well be a corrupted executable again. I'll swap in a fresh binary and see if that fixes anything. David, Carlos--any of you server gurus have any idea what's happening here?[/quote] Okay, I've got the binary swapped out. Hopefully it won't crash any more... |
[quote=mdettweiler;152810]Okay, I've got the binary swapped out. Hopefully it won't crash any more...[/quote]
Hmm...it's still crashing. This is weird; I have absolutely no idea why it's doing this. The only thing I do know is that it seems to crash whenever it prunes the joblist and knpairs files. Anyone know why this is happening? |
[quote=MyDogBuster;152791]Great ideas. And here I was thinking about resting a bit on our laurels.
Looks like enough work to choke a horse.[/quote] The pressure will definitely be off. I'll likely drop to about 6-7 quads on NPLB, which is what I had on it prior to the last 1-2 months, from my current 8-9 after the 1st drive is done. There are several CRUS efforts that I'd like to put more CPU power into. NPLB and CRUS like most prime-search efforts are infinite projects. (CRUS was finite to start with, although extremely huge, when we were only including bases <= 32 but others and now me have started processing much larger bases because they are fun.) It's all a matter of how big of a piece we choose to process at any one time. We prefer to process large #'s of k's at once because there are great efficiency gains in processing and it's far easier to set goals for entire swaths of k and n-ranges. The challenge is keeping it fun for everyone so we have to vary our efforts enough and have enough different ones going at a time to make things interesting. At this particular point in time, we likely have fewer different efforts going at once than we will have for the foreseeable future. Gary |
[quote=mdettweiler;152812]Hmm...it's still crashing. This is weird; I have absolutely no idea why it's doing this. The only thing I do know is that it seems to crash whenever it prunes the joblist and knpairs files.
Anyone know why this is happening?[/quote] Is the server completely down? |
[quote=gd_barnes;152814]Is the server completely down?[/quote]
No, it's just going down every 15 minutes (when it prunes the joblist and knpairs files), but I'm manually restarting it every time as soon as I see it. (I've got a VNC window open in the background so I can monitor it.) I'm currently working on a workaround to make it automatically restart the server every time it goes down (to serve as sort of a band-aid fix). |
[quote=mdettweiler;152815]No, it's just going down every 15 minutes (when it prunes the joblist and knpairs files), but I'm manually restarting it every time as soon as I see it. (I've got a VNC window open in the background so I can monitor it.) I'm currently working on a workaround to make it automatically restart the server every time it goes down (to serve as sort of a band-aid fix).[/quote]
Okay, I think I've got a temporary workaround (using "while" loops in bash syntax to continually re-start the server every time it exits) that should keep things going without me restarting it manually every 15 minutes. I'll continue to monitor it, of course, to make sure that the workaround functions properly. |
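The workaround described above was a bash while loop. The same idea is sketched below in Python, purely as an illustration: the function name, the pause length, and the command passed in are all assumptions, and the actual script used on the server was not posted.

```python
import subprocess
import time


def keep_running(command, max_restarts=None, pause_seconds=5.0):
    """Run `command` and start it again every time it exits.

    Returns the list of exit codes seen. With max_restarts=None the loop
    never ends, like a bare `while true` loop in a shell script.
    """
    exit_codes = []
    while True:
        # Blocks until the server process exits (cleanly or by crashing).
        exit_codes.append(subprocess.call(command))
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        # Brief pause so a process that dies instantly cannot spin the CPU.
        time.sleep(pause_seconds)


# Hypothetical usage; the real server command line is not given in the thread:
# keep_running(["./llrnet"])
```

This is only a band-aid in the same sense as the original: it hides the crash from clients but does nothing about the failed assertion itself, so the returned exit codes are worth logging.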
I'm typing from the machine right now. Is there anything I can do in the next 10 mins. to monitor it?
This has been, by far, my most stable machine. It's never been turned off and never run above 68 C (67 C right now) since I started running it in early May. Taking the cover off to swap out the hard drive was the first time I even had the cover off of it. |
[quote=gd_barnes;152817]I'm typing from the machine right now. Is there anything I can do in the next 10 mins. to monitor it?
This has been, by far, my most stable machine. It's never been turned off and never run above 68 C since I started running it in early May. Taking the cover off to swap out the hard drive was the first time I even had the cover off of it.[/quote] Okay...the workaround seems to be holding. Should be OK as a band-aid fix until we can figure out the root of the problem. As for what's causing this: I honestly don't know. I doubt it could have anything to do with swapping the hard drive; I'm thinking something more along the lines of a messed-up binary. But then again, I don't know--it could be anything. In the meantime, nobody should need to move any machines off G4000; the workaround should keep it running OK for now. |
Check the perms chmod 664 *.txt should fix it.
Happens when a lot of folks are trying to get in, and it doesn't come up cleanly trying to lock onto the socket. When folks are hitting mine real heavy, and I take it down to give it some more pairs, it might take me 15 tries to get it to come back up. Sometimes I just let it sit for a minute or two then try again too. |
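The "let it sit, then try again" routine can be pictured as a retry loop around grabbing the listening port. The sketch below is generic socket code written for illustration, not anything taken from llrnet; the function name and defaults are assumptions, with 15 attempts and a one-minute wait simply echoing the numbers mentioned above.

```python
import socket
import time


def bind_with_retry(port, attempts=15, wait_seconds=60.0, host="127.0.0.1"):
    """Try to claim a listening port, pausing between failed attempts.

    Returns (socket, attempts_used); re-raises the last OSError if every
    attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(5)
            return sock, attempt
        except OSError as err:
            # Port still busy (or otherwise unavailable); clean up and wait.
            sock.close()
            last_error = err
            if attempt < attempts:
                time.sleep(wait_seconds)
    raise last_error
```

Whether llrnet's startup failures really come from the port still being held is a guess based on the description; the sketch only shows the waiting pattern.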
When I run out of work on the 18th (possibly 19th), assuming the drive is still running and I haven't lined up new files yet, is IB400 probably the most stable and best for me to run on?
|
[quote=IronBits;152820]Check the perms chmod 664 *.txt should fix it.
Happens when a lot of folks are trying to get in, and it doesn't come up cleanly trying to lock onto the socket. When folks are hitting mine real heavy, and I take it down to give it some more pairs, it might take me 15 tries to get it to come back up. Sometimes I just let it sit for a minute or two then try again too.[/quote] Okay, thanks--I shut down the server, ran "chmod 664 *.txt", and restarted it. Hopefully that will do the trick. :smile: As for it needing a few more minutes before it can be restarted: yeah, I've had that happen a few times, especially today with needing to restart the server so much. I've been doing the same thing you suggested--letting it sit for a few minutes and then trying again. :smile: |
[quote=Mini-Geek;152822]When I run out of work on the 18th (possibly 19th), assuming the drive is still running and I haven't lined up new files yet, is IB400 probably the most stable and best for me to run on?[/quote]
Probably. If we time everything right, it will be the last 1st Drive server running at the end. |
[quote=mdettweiler;152823]Okay, thanks--I shut down the server, ran "chmod 664 *.txt", and restarted it. Hopefully that will do the trick. :smile:
As for it needing a few more minutes before it can be restarted: yeah, I've had that happen a few times, especially today with needing to restart the server so much. I've been doing the same thing you suggested--letting it sit for a few minutes and then trying again. :smile:[/quote] Hmm...it crashed again, just as before. I've put it back on the while-loop workaround that I used before. Maybe a "chmod 777 *" would work better? I'll try that momentarily... Edit: Okay, I've tried "chmod 777 *". We'll see if it still crashes after that. :smile: (Just to play it safe, though, I've enabled the while-loop thing once again, to ensure that the server isn't down for long periods of time if I'm not right at the computer when it crashes.) |
[quote=mdettweiler;152827]Hmm...it crashed again, just as before. I've put it back on the while-loop workaround that I used before.
Maybe a "chmod 777 *" would work better? I'll try that momentarily... Edit: Okay, I've tried "chmod 777 *". We'll see if it still crashes after that. :smile: (Just to play it safe, though, I've enabled the while-loop thing once again, to ensure that the server isn't down for long periods of time if I'm not right at the computer when it crashes.)[/quote] Thanks for your attention to detail on this Max. Sounds like a mess. Any luck with stopping the crashing with this last attempt? |
[quote=Mini-Geek;152822]When I run out of work on the 18th (possibly 19th), assuming the drive is still running and I haven't lined up new files yet, is IB400 probably the most stable and best for me to run on?[/quote]
I'll add to what Max said here: Yes, IB400 is the most stable at this point. But the question will be: Where should you connect at that point? Likely it will be IB400 but it could be one of the others if they are not dried before IB400 is. Around the 16th or 17th, keep checking the threads. We'll post what needs to be finished off by that point. I'll attempt to balance my machines such that IB400 is the last remaining server with pairs for the 1st drive but I can't guarantee it. On another note: You know what the most cool thing about this is?: Actually having a general idea of when we will complete something! How many projects are out there that can estimate when they will complete an entire effort that is being run by 10+ people?! This is great! :smile: It allows for excellent forward planning on future efforts. Gary |
[quote=gd_barnes;152837]Thanks for your attention to detail on this Max. Sounds like a mess.
Any luck with stopping the crashing with this last attempt?[/quote] I just checked the server...and it looks like all is working well now! :grin: I'll still leave it on the while-loop thingie so that it will automatically restart if it does go down for whatever reason, but it looks like it should be good now. :smile: Thanks David for your help! :smile: |