mersenneforum.org  

2008-12-05, 11:06   #419
gd_barnes

May 2007
Kansas; USA

3²·13·89 Posts

Quote:
Originally Posted by em99010pepe
G4000 has been down for the last 5 hours.

Damn. Of all the luck. I'm sure the machine is down then. And what's worse is that I'm sure that it is the ONLY one of my 10 machines that is down.

It is one of the 8 machines (plus 3 more cores) that were running port 400 at the point when it must have gone down. I see a drop of ~10-12% in k/n pairs processed on that port around 4-5 hours ago, which would equate to 4 cores out of 35 and coincides with Carlos saying that it has been down 5 hours.
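
As a quick sanity check of that estimate, here is a minimal Python sketch. It assumes every core contributes equally to the k/n pairs processed, which is a simplification, and the function name is purely illustrative:

```python
def expected_drop(cores_lost: int, cores_total: int) -> float:
    """Fraction of throughput lost, assuming all cores are equally fast."""
    if cores_total <= 0:
        raise ValueError("cores_total must be positive")
    return cores_lost / cores_total


# 4 cores (one quad) lost out of the 35 cores mentioned in the post.
drop = expected_drop(4, 35)
print(f"{drop:.1%}")  # prints 11.4%
```

The result falls inside the observed ~10-12% range, which is consistent with a single quad-core machine dropping off the port.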

Max, it looks like we're screwed on port 4000 until late next Tuesday, when I get back and can see what is wrong with it, unless you can somehow remotely make another one of my machines the "master machine" and then set up port 4000 on it instead. (That sounds like a huge headache to me.) I think most of the others are running <= 68 C but if you do set it up on one of the others, please check the temps first. Like I said, Crunchford was definitely the warmest of my AMD machines.

Everyone, if you were on port 4000, please move to port 400. It can easily handle the load. Sorry.

This shouldn't delay us in total by more than 1/2-day to 1 day on finishing this drive if people move their machines within a day. Port 400 will just do more of the work and later on, I may need to move a few of my machines to port 4000 to clear it out more quickly once we get it going again.


Gary

Last fiddled with by gd_barnes on 2008-12-05 at 11:09

2008-12-05, 11:08   #420
em99010pepe

Sep 2004

2830₁₀ Posts

C443 is also available, with lots of work to process.

2008-12-05, 14:31   #421
MyDogBuster

May 2008
Wilmington, DE

2852₁₀ Posts

Quote:
Everyone, if you were on port 4000, please move to port 400. It can easily handle the load. Sorry.

Just woke up. I'll start moving stuff shortly. Looks like it went down at about 1 AM EST last night.

I have got to find an easy way of changing servers. Switching 25 cores will not be fun.
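
One possible way to script such a switch is sketched below in Python. It assumes each client keeps its settings in a plain-text file containing a line of the form `port = 4000`; that layout, the function names, and any file paths passed in are assumptions for illustration, not details taken from this thread:

```python
import re
from pathlib import Path

# Matches a line such as "port = 4000" and captures everything before the number.
PORT_LINE = re.compile(r"^(\s*port\s*=\s*)\d+", flags=re.MULTILINE)


def switch_port(config_text: str, new_port: int) -> str:
    """Return config_text with every `port = <number>` line pointing at new_port."""
    return PORT_LINE.sub(lambda m: f"{m.group(1)}{new_port}", config_text)


def switch_port_in_files(paths, new_port: int) -> None:
    """Rewrite each (hypothetical) client config file in place."""
    for path in map(Path, paths):
        path.write_text(switch_port(path.read_text(), new_port))
```

Each client would still need to be restarted to pick up the new port, and it would be wise to back up the files before rewriting them.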

2008-12-05, 15:34   #422
MyDogBuster

May 2008
Wilmington, DE

2²×23×31 Posts

Halfway into switching the ports, my service provider went down. It hasn't been down since August. This just ain't my day. Time to go back to bed.

Okay it's back up and the switch is finished.

2008-12-05, 16:53   #423
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

7×29² Posts

What happened to hour 24 yesterday?
http://nplb.ironbits.net/progress_400.html

Edit: could all the new stats pages be added to the first post of this thread?

Last fiddled with by henryzz on 2008-12-05 at 16:54

2008-12-05, 17:09   #424
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

14151₈ Posts

Quote:
Originally Posted by gd_barnes
Max,

I noticed that http://nplb-gb1.no-ip.org/llrnet/ is down at the moment so not only can we not view the status of port 4000, I can't access any of my machines remotely. I hope the machine that the server is on is not down. I checked my k/n pairs per hour on port 400 and it appears that there might have been a drop of one quad a few hours ago...Or it could be just some processing glitch in something. I'm not sure. (BTW, I added 2 more cores to port 4000 on Thurs. afternoon after my Riesel base 256 effort finished early Thurs. morning.)

I think I have the temps regulated pretty well on my machines now but unfortunately "crunchford", the machine that runs the server, is still one of the warmer running ones. (~70-71 C I think.) You might remember me mentioning that it was not one of the best choices for the server.

When you get a chance this morning, can you check things and make sure that you can get back on the above link and that port 4000 is working OK? I likely will be on next around 1 PM CST. (7 PM GMT)

Ian, you might also check and make sure you're still processing work on port 4000.


Thanks,
Gary

Ouch. However, I have some good news: based on what I'm seeing in the IB400 results file for today, most likely some or all of the rest of your machines are still up and crunching.

I doubt that thermal problems are the issue here; even if it went all the way up to 80 C, it should still run, though it would crunch somewhat slower. (A while ago I had my dualcore hovering at 83 C for about a month or two and I still used it as my primary machine.) I'm thinking something more along the lines of a power flicker (which can, depending on the duration, take out some machines and not others).

Hmm...if only I knew crunchford's MAC address I could try feeding it a Wake on LAN signal through port 4000 (since that port is already open on your router). Though even if I could do that, it's a tossup as to whether that would actually make the machine start up. (I can't get it to work on my machines, either.) Anyway, though, when you get it restarted I'll see about getting the MAC addresses of all your machines written down (I'll be able to obtain that information once I can get SSH access) in case we ever need to try a Wake on LAN in the future.
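
For reference, a Wake on LAN "magic packet" is 6 bytes of 0xFF followed by the target's 6-byte MAC address repeated 16 times, normally sent over UDP. Below is a minimal Python sketch. The MAC address, host name, and helper names are placeholders, and whether a router would actually forward the packet on port 4000 to a powered-off machine is an assumption, not something confirmed here:

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Build the 102-byte Wake on LAN payload for a MAC like 00:11:22:33:44:55."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits in MAC address, got {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, host: str, port: int) -> None:
    """Send the magic packet to host:port over UDP (broadcast allowed)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (host, port))
```

Nothing is sent until `send_magic_packet` is called, for example `send_magic_packet("00:11:22:33:44:55", "router.example.invalid", 4000)`. The target's network card and BIOS must also have Wake on LAN enabled, which may explain why it is a tossup in practice.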

Max

2008-12-05, 20:12   #425
gd_barnes

May 2007
Kansas; USA

10413₁₀ Posts

Quote:
Originally Posted by mdettweiler
Ouch. However, I have some good news: based on what I'm seeing in the IB400 results file for today, most likely some or all of the rest of your machines are still up and crunching.

I doubt that thermal problems are the issue here; even if it went all the way up to 80 C, it should still run, though it would crunch somewhat slower. (A while ago I had my dualcore hovering at 83 C for about a month or two and I still used it as my primary machine.) I'm thinking something more along the lines of a power flicker (which can, depending on the duration, take out some machines and not others).

Hmm...if only I knew crunchford's MAC address I could try feeding it a Wake on LAN signal through port 4000 (since that port is already open on your router). Though even if I could do that, it's a tossup as to whether that would actually make the machine start up. (I can't get it to work on my machines, either.) Anyway, though, when you get it restarted I'll see about getting the MAC addresses of all your machines written down (I'll be able to obtain that information once I can get SSH access) in case we ever need to try a Wake on LAN in the future.

Max

English please. lol

You say you have good news? Didn't I just state that all of my machines were likely up except Crunchford in the first 2 paras. of my post and provide stats from port 400 to prove it? Are you skimming my posts again? (lmao)

On my AMD's, for the ones that previously ran consistently above about 74-75 C, the motherboard eventually shot craps so I'm just speculating on this one. Hopefully it was just a power flicker. It's oddly coincidental that it happened to the warmest and most important machine of the group.

Regardless, is it possible that you can switch the 'master machine' over to another one of my machines so at least I can remotely view the other machines before next Tuesday? For CRUS, I have Sierp base 256 and Sierp base 16 running on a couple of them.


Thanks,
Gary

Last fiddled with by gd_barnes on 2008-12-05 at 20:22

2008-12-05, 20:19   #426
em99010pepe

Sep 2004

2·5·283 Posts

Quote:
Originally Posted by nuggetprime
Will C443 get a stats page as there is one for IB400/G4000?
Quote:
Originally Posted by IronBits
I've given him lots of code, but haven't heard back from Carlos, and I'm not sure he has a web presence.
I can work something up over here for him, it will just be delayed by at least a day's worth of work.

Too busy with real life! I'll have to look into that again with IB.

Carlos

2008-12-05, 21:04   #427
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
Originally Posted by gd_barnes
English please. lol

You say you have good news? Didn't I just state that all of my machines were likely up except Crunchford in the first 2 paras. of my post and provide stats from port 400 to prove it? Are you skimming my posts again? (lmao)
LOL--yes, I was skimming your post, I must admit.

Quote:
On my AMD's, for the ones that previously ran consistently above about 74-75 C, the motherboard eventually shot craps so I'm just speculating on this one. Hopefully it was just a power flicker. It's oddly coincidental that it happened to the warmest and most important machine of the group.

Regardless, is it possible that you can switch the 'master machine' over to another one of my machines so at least I can remotely view the other machines before next Tuesday? For CRUS, I have Sierp base 256 and Sierp base 16 running on a couple of them.
Unfortunately, I can't do anything until crunchford is back online again--all remote access into your network is through that machine. If it was just a power flicker, then all it needs is a reboot and I can get back in and get everything running again; however, if crunchford *did* blow its motherboard, then we can't recover all the LLRnet server stuff until you get it fixed. (If that does turn out to be the case, I'd recommend switching the hard drive into a machine with a good motherboard, so that we can at least get it online long enough for me to grab the LLRnet files and switch the "master machine" over to another box.)

After you get back and I can get in again, I'll see about setting up a "secondary master" so that if the master ever goes down again, we can still get in through an alternate port to a different machine.

In the meantime, maybe you could have your ex-wife stop by and reboot crunchford like you used to do before we got the remote desktop thing set up? Then, assuming it still works, I could get in and re-start the server stuff (and back it up, and set up a secondary master while I'm at it).

Max

2008-12-06, 10:55   #428
gd_barnes

May 2007
Kansas; USA

3²×13×89 Posts

Quote:
Originally Posted by mdettweiler
LOL--yes, I was skimming your post, I must admit.


Unfortunately, I can't do anything until crunchford is back online again--all remote access into your network is through that machine. If it was just a power flicker, then all it needs is a reboot and I can get back in and get everything running again; however, if crunchford *did* blow its motherboard, then we can't recover all the LLRnet server stuff until you get it fixed. (If that does turn out to be the case, I'd recommend switching the hard drive into a machine with a good motherboard, so that we can at least get it online long enough for me to grab the LLRnet files and switch the "master machine" over to another box.)

After you get back and I can get in again, I'll see about setting up a "secondary master" so that if the master ever goes down again, we can still get in through an alternate port to a different machine.

In the meantime, maybe you could have your ex-wife stop by and reboot crunchford like you used to do before we got the remote desktop thing set up? Then, assuming it still works, I could get in and re-start the server stuff (and back it up, and set up a secondary master while I'm at it).

Max

The danger in having Sherri go by and turn it on is that this is exactly how I fried a motherboard myself before. That is...the fact that it shut itself down was a 'warning' sign that something was amiss. I turned it back on, started crunching again and a few days later it went off again. I did it again and it went off again in about a day. That was it...it had fried itself at that point.

I'm not going to turn it on and start crunching on it until I verify temps and stuff. Well, I suppose I could have her turn it on but not start crunching on it (assuming it will even come on; which I suspect there is < 50% chance of). Since the server actually does no crunching, it shouldn't heat up the machine. I'll see if she can do it. I hate to burden her with messing with stuff again though. She's already been by my house twice to make sure everything is OK and I told her that should be enough. Oh well, I'll see what I can do.

If the machine won't turn on, yes, after I get back I will swap its hard drive into the coolest-running machine so that the server is on what is likely the most stable machine that I have. Actually, I've done that twice already based on the priority of stuff that was running on a machine that went down, even after you got the remote access set up. You just didn't know it. lol

Stupid machines!


Gary

2008-12-06, 12:00   #429
em99010pepe

Sep 2004

2830₁₀ Posts

Lennart,

You have cores doing duplicated work on C443. Please check them.

Meanwhile, I moved 4 cores to IB400 to help clear out the lower ranges; 3 cores are still on C443.

Carlos

Last fiddled with by em99010pepe on 2008-12-06 at 12:04


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.