View Poll Results: Should we buy a backup server?

| Option | Votes | Percentage |
|---|---|---|
| Yes | 12 | 40.00% |
| No | 18 | 60.00% |

Voters: 30.
#12 |
|
Jan 2003
Altitude>12,500 MSL
101 Posts |
There is no need for a backup server. And the crashes had nothing whatsoever to do with client loading.
Here's why: The PrimeNet/Mersenne.org server, hosted at a Level 3 datacenter (now independently of Entropia), can handle well over 50 URL hits/sec, probably much more. It's been through a lot of load testing to verify multi-threaded safety and capacities: Prime95 has special compile flags for this purpose, creating accounts and turning around dummy factoring and LL results instantly to enable dozens of parallel instances to hammer hard at the server.

The current box is a high end dual-CPU Dell series server with redundant power supplies, redundant controllers and network, and a RAID 5 disk array - about $7000 when I bought it new, and literally about 20x stronger than PrimeNet needs for client loading. Even the hourly database reports drive only 60% capacity utilization for a few minutes. Here's a photo of PrimeNet's rack just prior to its Level 3 installation: http://mersenne.org/primenet/primenet.jpg

As far as non-Prime95 clients go, PrimeNet v4 already supports networked Linux and Unix MPrime versions, as well as OS/2 clients. The server's manual testing web forms support an even broader variety of Mac and other clients - manual testing only because the folks working on those clients were unwilling or unable to perform the requisite work to add the necessary minimal network protocol support to talk to PrimeNet, a non-trivial effort. A second major challenge was security, because the v4 server was designed for a trusted-binary client model, not an open-source client model having transaction throttling/braking controls to block rogue or malicious clients - a situation that unfortunately happens too often, and will be addressed anew in the v5 design.

Having 'failover' servers is a small part of the v5 implementation planning. The purpose of a failover server is mainly for geographically distributed disaster risk management, not necessarily fault-tolerance if the 'main' server goes offline.
While failover capability sounds desirable, the added complexity of synchronizing multiple servers for real-time client cutovers will in all likelihood be a big headache that the new v5 team will not immediately want to face. Remember, the client software is quite intentionally designed to reconnect and synchronize automatically if the server is for any reason unavailable -- and to keep busy in the interim period. The fact that the work units in GIMPS are nominally days if not weeks long makes this strategy particularly convenient. PrimeNet has been operating for nearly 7 years; the degree of concern expressed lately about it seems quite out of proportion with circumstance, and the amount of proposed technical infrastructure is far in excess of what GIMPS requires or is likely to require for the foreseeable future. Let's just watch those new server status icons and see ... |
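To illustrate the "keep busy and synchronize later" behaviour described above, here is a minimal sketch. It is not Prime95's actual code: `BufferingClient`, `FakeServer`, and the string results are hypothetical names invented for this example, and a real client would also wait between retries.

```python
from collections import deque
from typing import Callable, Deque, List


class BufferingClient:
    """Hypothetical client that queues results while the server is unreachable."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                   # raises ConnectionError on failure
        self.unsent: Deque[str] = deque()   # results waiting to be reported

    def report(self, result: str) -> bool:
        """Queue a finished result, then try to deliver everything queued."""
        self.unsent.append(result)
        return self.flush()

    def flush(self) -> bool:
        """Send queued results oldest first; stop quietly if the server is down."""
        while self.unsent:
            try:
                self._send(self.unsent[0])
            except ConnectionError:
                return False                # keep the result and retry later
            self.unsent.popleft()           # drop it only after a successful send
        return True


class FakeServer:
    """Stand-in for the real server (assumption, for demonstration only)."""

    def __init__(self) -> None:
        self.up = False
        self.received: List[str] = []

    def accept(self, result: str) -> None:
        if not self.up:
            raise ConnectionError("server unavailable")
        self.received.append(result)
```

Because a result leaves the queue only after a successful send, an outage costs nothing but a delay, which matters little when each work unit already takes days or weeks.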
|
|
|
|
|
#13 |
|
Aug 2002
A Dyson Sphere
3²·7 Posts |
The main reason that I am concerned right now is the possibility that the server will be down at the exact moment that the new prime hits the news. If the server is down at that time, thousands of people may come to the site, see that the server is down, and leave thinking that it is always like that. I agree that a second server is unnecessary, but is there any way to make the server less likely to crash for about one day after the new prime is announced?
|
|
|
|
|
|
#14 |
|
Jan 2003
Altitude>12,500 MSL
101 Posts |
Hey, I said I was rolling up my sleeves with my toolkit... ! I made two fixes to the server Thursday AM. Several of the 'outages' you may have seen Wed-Fri were caused by my debugger breakpoints trapping.
I've been holding off proclaiming success as I don't know the previous MTBF (mean time between failures) and several multiples of that interval are necessary to acquire confidence of a correct fix. Nonetheless, the server has been running the fixed code for a few days without incident. The evidence suggests we don't need to worry about downtime when the M40 press brings new folks in over the next month or so. But I'm waiting for the Monday client traffic to see what happens - not for the load, but rather a broader variety of clients and transaction conditions. Feels like old times, but long ago PrimeNet went through much greater growing pains while membership grew with discoveries of M37, M38 and M39. Having been through this several times before I am quite at ease about GIMPS growing and the v4 system holding up just fine. |
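A rough way to see why several multiples of the old MTBF are needed: if the bug were still present, how likely is a crash-free stretch of a given length by luck alone? The sketch below assumes failures arrive at a constant average rate (an exponential model), which is an assumption of this example and not something measured on PrimeNet. The function name is hypothetical.

```python
import math


def chance_of_quiet_run(elapsed: float, old_mtbf: float) -> float:
    """Probability of seeing no failure for `elapsed` time if nothing was fixed.

    Assumes a constant failure rate of 1 / old_mtbf (exponential model).
    Both arguments must use the same time unit.
    """
    if old_mtbf <= 0:
        raise ValueError("old_mtbf must be positive")
    if elapsed < 0:
        raise ValueError("elapsed must be non-negative")
    return math.exp(-elapsed / old_mtbf)


if __name__ == "__main__":
    # Express the quiet period as a multiple of the (unknown) old MTBF.
    for multiple in (1, 2, 3, 5):
        p = chance_of_quiet_run(multiple, 1.0)
        print(f"{multiple} x MTBF without a crash: {p:.1%} likely by luck alone")
```

Under that model, one quiet MTBF interval would still happen by chance more than a third of the time, while three intervals would happen by chance only about 5% of the time.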
|
|
|
|
|
#15 |
|
Jun 2003
The Computer
2³×7² Posts |
I was thinking we could use the extra server in conjunction with the regular one. Consider the attached file. In the background, there are a lot of servers, but they are racked and all do the same task as opposed to two different servers. In other words, it would be like Prime95 where one computer calculates for everyone else in a circle.
|
|
|
|
|
|
#16 | |
|
Sep 2003
5·11·47 Posts |
Quote:
I have one minor suggestion: currently summary.txt is generated "in place", which means that every hour between N:00 and N:04 the file is usually nonexistent or truncated. How about generating it under a different temporary name, and only when complete rename the new file to summary.txt? |
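The write-then-rename idea can be sketched as follows. This is only an illustration, not the server's report generator: `write_atomically` and the temporary file naming are made up for the example. It relies on `os.replace`, which swaps the new file into place in one step when both paths are on the same filesystem.

```python
import os
import tempfile


def write_atomically(path: str, text: str) -> None:
    """Write `text` to a temporary file, then rename it over `path`.

    Readers see either the old complete file or the new complete file,
    never a missing or truncated one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-", suffix=".txt")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the data to disk before renaming
        os.replace(tmp_path, path)   # swap the finished file into place
    except BaseException:
        # Clean up the partial temporary file; the old report stays untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

A call such as `write_atomically("summary.txt", report_text)` would then replace the hourly in-place write. Note that `mkstemp` creates the file readable only by its owner, so a web-served report would probably need an `os.chmod` before the rename.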
|
|
|
|
|
|
#17 | |
|
Sep 2003
Borg HQ, Delta Quadrant
2×3³×13 Posts |
Quote:
|
|
|
|
|
|
|
#18 |
|
Oct 2002
Lost in the hills of Iowa
2⁶×7 Posts |
One master server, the rest "proxy" servers, is a model that works well for Distributed.Net - the master assigns "subblocks" of work to each subserver, and the individual subservers only assign work out of their specific subblocks.
Given that the PrimeNet server HAS seen outages in the 1-2 week range, and that the *default* in Prime95 is to only keep "30 days" work on hand - which is ONE EXPONENT for some LL work - I'd say that either the Prime95 default needs to go up some (60-90 days) for clients doing LL work, or the server needs to get more redundant. I personally have had cases where one of my machines finished an exponent, started on a second exponent, found a factor in less than a day in the TF stage, and needed a third exponent - all in less than a 24 hour period. If I had left those machines at the 30 day default, they WOULD have run out of work - and one of those cases happened during a PrimeNet outage lasting several days. My view might be skewed somewhat by the fact that all the LL work I do is on 10,000,000+ exponents - I don't know how long the current "leading edge" LL tests would take on any of my machines. |
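The master/proxy split described above can be sketched roughly like this. It is neither Distributed.Net's real protocol nor anything planned for PrimeNet: `Master`, `Proxy`, `refill`, and the integer work items are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Optional


@dataclass
class Proxy:
    """Hypothetical subserver that hands out work only from its own subblock."""

    name: str
    pending: List[int] = field(default_factory=list)

    def assign(self) -> Optional[int]:
        """Give the next work item to a client, or None if the subblock is used up."""
        return self.pending.pop(0) if self.pending else None


class Master:
    """Hypothetical master that carves the global queue into disjoint subblocks."""

    def __init__(self, work_items: Iterable[int], subblock_size: int) -> None:
        if subblock_size <= 0:
            raise ValueError("subblock_size must be positive")
        self._queue = list(work_items)
        self._size = subblock_size

    def refill(self, proxy: Proxy) -> int:
        """Move the next subblock to `proxy`; return how many items it received."""
        block = self._queue[: self._size]
        self._queue = self._queue[self._size :]
        proxy.pending.extend(block)
        return len(block)
```

Since each item leaves the master's queue exactly once, no two proxies can hand out the same work, and a proxy can keep assigning from its subblock while the master is unreachable, but only until that subblock runs dry.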
|
|
|
|
|
#19 |
|
Jan 2003
Altitude>12,500 MSL
101₁₀ Posts |
There will be no more such outages, barring a natural disaster.
The server's design will be revealed in the v5 Server Development forum next month. |
|
|
|
|
|
#20 |
|
Banned
"Luigi"
Aug 2002
Team Italia
3²·5·107 Posts |
WHEEEE!!! That's what I like about this forum! Luigi |
|
|
|
|
|
#21 |
|
Aug 2002
Texas
5×31 Posts |
The anticipation is killing me
me so excited
A second Christmas
Last fiddled with by Complex33 on 2003-12-05 at 02:04 |
|
|
|
|
|
#22 | |
|
Sep 2003
Borg HQ, Delta Quadrant
1010111110₂ Posts |
Quote:
|
|
|
|
|