mersenneforum.org  

2003-01-21, 04:41   #155
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts
Re: ok now what?

Aga: Thanks for your insights into MySQL replication limitations.

Quote:
Originally Posted by aga
I was thinking that 2 to 5 servers should be ok, with 3 optimal. But data structures should allow up to 30 servers - that should cover any imaginable practical need (remember, stats servers might piggyback on core servers without being included in RAIS), but will not increase database size by more than 4 bytes per record (message).
I was thinking 3 or 4 servers. We pretty much agree that one server can handle the load; the added servers are to ensure high availability and better geographic distribution. Bandwidth will limit the maximum number of servers.

2003-01-21, 04:47   #156
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts
Re: ok now what?

Quote:
Originally Posted by aga
1) Decide which information is stored online, and how long it is maintained there. This includes already tested exponents and return history (information useful for estimating how fast and trustworthy a computer is, which influences whether small exponents get assigned to it).
All results data will be maintained on the servers. Forever. The only exception might be LL results that are proven incorrect and LL results when a factor is later found.

I can see some stats data being aggregated and deleted. For example, we might keep every CPU credit record for a year so that you can run queries about your CPU contribution over the last week, month, or year.
After a year the individual CPU credit records could be deleted and we just remember your CPU contribution for all of 2002, 2003, 2004, etc.
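
A minimal sketch of that aggregate-then-delete step, for illustration only. The `cpu_credit` and `cpu_credit_yearly` tables, their columns, and the `roll_up_year` helper are hypothetical names, not PrimeNet's schema, and SQLite stands in for whatever database the servers end up using.

```python
import sqlite3

def roll_up_year(conn, year):
    """Sum one year's individual credit rows into yearly totals, then delete them."""
    start, end = f"{year}-01-01", f"{year + 1}-01-01"
    with conn:  # single transaction: aggregate and delete succeed or fail together
        conn.execute(
            """INSERT INTO cpu_credit_yearly (user_id, year, credit)
               SELECT user_id, ?, SUM(credit) FROM cpu_credit
               WHERE earned_on >= ? AND earned_on < ?
               GROUP BY user_id""",
            (year, start, end),
        )
        conn.execute(
            "DELETE FROM cpu_credit WHERE earned_on >= ? AND earned_on < ?",
            (start, end),
        )

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cpu_credit (user_id TEXT, earned_on TEXT, credit REAL);
    CREATE TABLE cpu_credit_yearly (user_id TEXT, year INTEGER, credit REAL);
""")
# Made-up sample rows: three from 2002, one from 2003.
conn.executemany(
    "INSERT INTO cpu_credit VALUES (?, ?, ?)",
    [("alice", "2002-03-01", 1.5), ("alice", "2002-11-20", 2.0),
     ("bob", "2002-07-04", 0.5), ("alice", "2003-01-10", 1.0)],
)
roll_up_year(conn, 2002)
yearly = conn.execute(
    "SELECT user_id, year, credit FROM cpu_credit_yearly ORDER BY user_id"
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM cpu_credit").fetchone()[0]
```

After the roll-up, only the 2003 row is left in the detail table, and the 2002 contribution survives as one total per user.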

2003-01-21, 18:41   #157
Old man PrimeNet
Jan 2003
Altitude>12,500 MSL
101 Posts
database & statefulness

One big lesson I learned building PrimeNet v1-v4 was where to put statefulness & related client handling decisions. Despite the notion that we 'just hand out exponents', there are about a dozen real-life corner cases to worry about (I'll enumerate these later).

For a long time I wrote system service or stored procedure code to handle questions of exponent fitness for testing by checking certain builds of Prime95 or CPU capability, or when it was ok to 'override' an assignment by an unexpected move to a different machine, when it was poaching, when to reclaim an exponent as 'probably lost' vs. just a slow machine, etc.

What I'd recommend we do this time is to put more of this into the database tables, and keep the logic above it as stateless & decisionless as possible. PrimeNet presently has 3 main SQL tables: exponents, users, and machines. Instead of the SQL & C++ spaghetti to decide if a particular exponent can be assigned to a particular machine, I think we just need a table, masked for each machine and for each build of Prime95, to join against the exponents and machines tables. Maintaining the masking tables is then operations data management.

We'd have a single SQL statement to get assignments (like today) except we wouldn't need to review the fitness for testing of each exponent, and go back to the database for the next one, etc.
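
One way such a masked join might look, as a rough sketch rather than PrimeNet's design. The `fitness_mask` table, every column name, the sample rows, and the `next_assignment` helper are assumptions made for illustration; SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exponents (exponent INTEGER PRIMARY KEY, assigned_to INTEGER);
    CREATE TABLE machines  (machine_id INTEGER PRIMARY KEY, build INTEGER, cpu_class TEXT);
    -- Hypothetical masking table: the exponent range each build/CPU class may test.
    CREATE TABLE fitness_mask (build INTEGER, cpu_class TEXT,
                               min_exponent INTEGER, max_exponent INTEGER);

    -- Illustrative values only, not real work units.
    INSERT INTO exponents VALUES (11000027, NULL), (14000029, NULL), (17000023, NULL);
    INSERT INTO machines  VALUES (1, 22, 'P4'), (2, 21, 'Celeron');
    INSERT INTO fitness_mask VALUES (22, 'P4', 10000000, 20000000),
                                    (21, 'Celeron', 13000000, 15000000);
""")

# A single statement: the fitness decision lives in the mask rows, not in code.
NEXT_ASSIGNMENT = """
    SELECT MIN(e.exponent)
    FROM machines m
    JOIN fitness_mask f ON f.build = m.build AND f.cpu_class = m.cpu_class
    JOIN exponents e    ON e.exponent BETWEEN f.min_exponent AND f.max_exponent
    WHERE m.machine_id = ? AND e.assigned_to IS NULL
"""

def next_assignment(conn, machine_id):
    """Return the smallest unassigned exponent the mask allows, or None."""
    return conn.execute(NEXT_ASSIGNMENT, (machine_id,)).fetchone()[0]

p4_pick = next_assignment(conn, 1)
celeron_pick = next_assignment(conn, 2)
```

Changing which machines may test which exponents then means editing rows in the mask table; the query and the code around it stay the same.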

2003-01-21, 19:22   #158
aga
Oct 2002
25 Posts

What I was thinking about is to apply some kind of rating to exponents and computers. A rating is an integer in the range 0..100 (or 0..255). Smaller exponents get a higher rating; faster computers get a higher rating; the longer a particular computer has been around, the higher its rating, and so on.

So when assigning an exponent, the core of the algorithm is: a computer with rating N requests an exponent and is assigned the smallest (or 'almost smallest') exponent that has a rating below or equal to N. With a suitable db table structure that can be implemented with a single SQL query (though a 'single SQL query' is not a goal in itself, more of a heuristic).
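
A small sketch of that rule, under assumed names: the `exponents` table layout, the sample ratings, and the `assign` helper are all hypothetical, and SQLite is used only to keep the example runnable on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exponents (exponent INTEGER PRIMARY KEY,
                            rating   INTEGER,
                            assigned INTEGER DEFAULT 0);
    -- Made-up values: smaller exponents carry higher ratings.
    INSERT INTO exponents (exponent, rating) VALUES
        (11000000, 240), (13000000, 180), (15000000, 120), (17000000, 60);
""")

def assign(conn, computer_rating):
    """Give the computer the smallest free exponent rated at or below its own rating."""
    row = conn.execute(
        """SELECT exponent FROM exponents
           WHERE assigned = 0 AND rating <= ?
           ORDER BY exponent LIMIT 1""",
        (computer_rating,),
    ).fetchone()
    if row is None:
        return None  # nothing suitable is left for this rating
    conn.execute("UPDATE exponents SET assigned = 1 WHERE exponent = ?", row)
    return row[0]

fast = assign(conn, 255)    # top-rated computer
slow = assign(conn, 100)    # low-rated computer
medium = assign(conn, 200)  # mid-rated computer, asks after the other two
```

The top-rated computer gets the smallest exponent, while the low-rated one skips everything rated above 100 and ends up with the largest. The 'almost smallest' variant is not shown here.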

All the ratings don't require any particular precision. Each server can run a script once a week or month recalculating ratings of untested exponents and active computers. Note that this update does not require replication over servers, there are no problems if ratings at each server somewhat differ.

So all the 'spaghetti' code will be in a non-realtime script, and if someone is in the mood to play with weights and formulas, the script could be run several times a day. Even if the script screws something up and assigns weird ratings to numbers and computers, it will not cause a disaster. All that's needed to ensure overall processing stability is a few powerful computers running GIMPS client-side software with the highest (255) rating assigned. I think I saw that George runs a 2-way P4 1.6GHz box; that should be sufficient - the box will mostly run small exponents, and thus will be able to process enough exponents per month to ensure a smooth overall GIMPS rate. Formulae and weights should be chosen so that there are at least 2 dozen fast computers with a weight above 200.

2003-01-21, 19:23   #159
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts

In this thread on poaching, http://www.teamprimerib.com/gimps/viewtopic.php?t=308 , I discuss server upgrades necessary to reduce this blight.

In short, the smallest double-checks and first-time tests are handed out to machines that have previously returned results and can complete the assignment in a reasonable time (say 1 to 3 months). This reduces a poacher's temptation to grab an exponent from a slow machine that is holding up a milestone.

We also inform users that the server only guarantees an exponent reservation for one year. The server has smarts that help slow machines get assignments that will complete in a year and are unlikely to hold up a milestone. Paradoxically, this means giving slower machines larger exponents! Giving a part-time Celeron machine an 11M exponent that takes 8 months to complete is more likely to hold up a milestone than giving the same machine a 17M exponent that takes a year to complete.

Scott is right that these time & speed parameters belong in a database so that they can be fine tuned easily.
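
As a rough illustration of keeping those parameters in a table, here is a sketch. The `assignment_params` table, its two row names, and both helper functions are hypothetical; the 90-day and 365-day values simply restate the '1 to 3 months' and 'one year' figures above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignment_params (name TEXT PRIMARY KEY, value INTEGER);
    INSERT INTO assignment_params VALUES
        ('smallest_max_days', 90),   -- upper end of the 'reasonable time' window
        ('reservation_days', 365);   -- how long a reservation is guaranteed
""")

def load_params(conn):
    """Read the tunable limits; retuning means updating rows, not code."""
    return dict(conn.execute("SELECT name, value FROM assignment_params"))

def may_get_smallest(params, has_returned_results, est_days):
    """Smallest exponents go only to proven machines that finish quickly."""
    return has_returned_results and est_days <= params["smallest_max_days"]

def fits_reservation(params, est_days):
    """True if the estimated run time fits inside the guaranteed reservation."""
    return est_days <= params["reservation_days"]

params = load_params(conn)
proven_fast = may_get_smallest(params, True, 60)
proven_slow = may_get_smallest(params, True, 240)  # e.g. an 8-month estimate
```

How the completion estimate itself is computed is left out; the sketch only shows the thresholds living in data.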

P.S. If you want to debate this proposal or other aspects of poaching please post in the poaching thread rather than this thread.

2003-01-21, 20:06   #160
Old man PrimeNet
Jan 2003
Altitude>12,500 MSL
1100101₂ Posts
view from 50,000 ft

I believe we need to proceed in discrete baby steps to avoid a big code-n-go approach that is going to be hard to make work. Taking a more sweeping perspective, let me poke a stick at a few things that have more leverage over design decisions than other things:

1. Migration - First, are we decided that we should migrate clients from the old PrimeNet to the new one? What makes sense?
(a) Migrate the server 'under' the v22 (and earlier) clients into part of a new federation of 3 to 4 servers, and later build a new client and upgrade as many clients as possible?
(b) Or just build a new federation of 3 to 4 servers, and build new client code at the same time, then upgrade as many clients as possible, leaving the old PrimeNet to starve off by slow machine attrition?

Migration path (b) sounds like less work to me if we agree we need a new network protocol (and thus client build) anyway. I believe I will recover ownership of PrimeNet from Entropia, so we have the time for option (b).

2. Federated Servers - What do we need vs. what do we want with redundant servers? Given the fact that currently only one server can handle everything, is load balancing a strong requirement if failover solves the availability issue?
(a) Federated servers with parallel shared database state is cool, but more complicated and adds cross-server update locking latency, which pure mirroring to a standby database does not add. It would make cutting over to a 'plan B' server instantaneous (clients simply redirect themselves to an available alternate server). I confess I like this option but it gives me pause for risk concerns. The only way I personally know how to make this work well w/o pricey OTS productware is via a messaging queue.
(b) A failover database can take advantage of SQL-level replication, but is usually limited to peer database versions; and cutting over from server 'A' to 'B' may require a manual reconfiguration step to activate 'B'. Low risk, but not as cool & flexible moving forward. Only one server is 'online' at any time, but availability is still augmented ok.

3. Client code upgrades or support to run other apps - Are we thinking of addressing this for opt-in users? This can be a big can-o-worms, but I have experience building these a few times already (tho I need to stay clear of my Entropia patents).

I think the 'half-life' of a Prime95 client on a typical GIMPS machine affects this decision. For example, if Prime95 FFTs are good for about 3 years before testing work progresses past their utility, then maybe it's better to leave the client alone and let it drop off through attrition when the machine is upgraded or tossed out. What got me thinking about this is that GIMPS has stayed flat in the number of parallel CPUs applied but has grown with Moore's Law from 1 teraflop in Y2K to 5+ teraflops today. Yet around 30K machines ran in parallel any given day most of that time, many of those being new installs.

Comments? More on other requirements coming up...

2003-01-21, 21:39   #161
QuintLeo
Oct 2002
Lost in the hills of Iowa
2⁶·7 Posts

It doesn't look like load balancing is important for the near term - if PrimeNet growth turns exponential, perhaps in a while.

Server availability seems to be the primary concern among those of us that server downtime annoys....

2003-01-21, 22:20   #162
Old man PrimeNet
Jan 2003
Altitude>12,500 MSL
101 Posts
parallel SQL

Is there a MySQL or an above-ODBC/JDBC-level freeware equivalent of Microsoft DTC or IBM's MQ Series DB2 add-on to keep databases in sync?

2003-01-22, 00:24   #163
Old man PrimeNet
Jan 2003
Altitude>12,500 MSL
101 Posts
availability requirements

Availability (& Redundancy) Requirements

Looking back, PrimeNet has always had occasional availability outages. Prime95 by design queues up work to do and synchronizes itself to the server when connectivity resumes, meaning unattended Prime95 clients automatically continue uninterrupted productivity in the vast majority of cases... unless the outage is sustained for periods of several days or more.

Let's examine whether the annoyance of an unavailable server is centered upon
(a) those wanting to see immediate or frequent web-site feedback on their account's progress or the overall progress of the search, or
(b) by whomever stages testing work and collects results from the server, or
(c) outages that last longer than the average work-to-do queue.

I'm sure all 3 are true, but what's the true pain point?

(c)? Sustained outages of PrimeNet v4 are perhaps in large part due to the fact that only one Entropia employee supports its operation, and upgrades and bug-fixing have been utterly frozen, with few exceptions, since I started growing Entropia's business in late 1999. This is expected to change soon, so at least one source of annoyance (c) should be mitigated.

(b)? I can understand this one readily. But George has lived with this thorn for years so it's hard for me to imagine it being more pressing now, except that when Brad takes off to Europe for a month to install more grid customers (like he just did again Saturday) then PrimeNet's support probably suffers.

(a)? I suspect this is significant. The web site reflects PrimeNet's state in a highly visible way. If the server is down, the best we can do is show partial or stale data on the web site.

We probably agree it would be great if PrimeNet would simply run without any unplanned downtime. Where does it come from? Here's my semi-informed guess/recollection/knowledge:

Sources of PrimeNet v4 downtime (of say, 20 outages):
1. entropia.com proxy dead or jammed up      55%
2. PrimeNet component dead or jammed up      25%
3. ISP line out / long-term power failure     5%
4. DoS attacks / Internet 'storms'            5%
5. web forms abuse                            5%
6. planned maintenance                        5%

Note that data corruption / loss is not even listed (though the v1 & v2 servers had this issue). My point here is that perhaps data integrity via replication is not the main concern to focus upon. Important to manage as a risk factor, but not a burning hot spot.

[Sidebar - Note item (5) has been a point of some headaches in the past, too - it seems impossible to provide a web form for manual database use w/o some party eventually writing a script that irresponsibly 'mines' PrimeNet's database; this issue will reappear among the open-source requirements issues to resolve.]

My second observation is that about 4 in 5 outages are due to failures that could be fixed by our efforts here. Note also that except for disaster risk and control being centralized, a fixed-up single server could avoid 85% of the outages (items 1,2,5).
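
The arithmetic behind those two figures can be checked in a few lines. Reading 'about 4 in 5' as items 1 and 2 together is an assumption on my part; the percentages are the ones from the list above.

```python
# Shares of PrimeNet v4 downtime by list item, in percent.
shares = {1: 55, 2: 25, 3: 5, 4: 5, 5: 5, 6: 5}

total = sum(shares.values())                              # all listed causes
server_side = shares[1] + shares[2]                       # proxy + PrimeNet component (assumed reading of "4 in 5")
fixed_single_server = sum(shares[i] for i in (1, 2, 5))   # items 1, 2 and 5

# Scale the percentages to the nominal sample of 20 outages.
per_20 = {item: pct * 20 // 100 for item, pct in shares.items()}

print(total, server_side, fixed_single_server)  # 100 80 85
print(per_20)
```

Out of the nominal 20 outages that works out to 11 proxy failures, 5 PrimeNet component failures, and 1 each for the remaining causes.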

Therefore, a single server with one or two mirrored failover secondaries would suffice - At this point I want to reiterate support for an open source PrimeNet solution and its shared operating responsibility. After this cursory analysis, however, I believe that a simple fail-over system having mirrored data would be sufficient. Anything beyond that may only be cool or planning ahead, but a want and not a need.

Execute arbitrary app code? - The only thing I can think of that would truly require a parallel shared / federated server would be one requiring high scalability, or low-latency client resyncs (or both), or arbitrary application code deployment & execution - are we going there with this design? Did I miss anything for the analysis?

2003-01-22, 01:22   #164
aga
Oct 2002
25 Posts
Re: view from 50,000 ft

Quote:
Originally Posted by Old man PrimeNet
(a) Migrate the server 'under' the v22 (and earlier) clients into part of a new federation of 3 to 4 servers, and later build a new client and upgrade as many clients as possible?
I think it should be pretty simple to implement a compatible HTTP interface for old clients. It may even be the primary interface.

I'm not sure about the old RPC interface. Is there still a noticeable number of RPC-communication clients? I.e., can they be forgotten about (or left on the PrimeNet v4 server), or will there be a need for an RPC->HTTP gateway? In the latter case, how much work are we going to face?

Quote:
Originally Posted by Old man PrimeNet
(a) Federated servers with parallel shared database state is cool, but more complicated and adds cross-server update locking latency,
Locking latency? All interserver communication should be asynchronous: all messages go into a queue, and replication thread(s) then handle them independently of the front end.
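
A toy sketch of that queue-and-replication-thread arrangement, using only Python's standard `queue` and `threading` modules. The `Server` class and its method names are hypothetical, the 'database' is a plain dict, and peers are updated in-process where a real system would send messages over the network and deal with conflicts.

```python
import queue
import threading

class Server:
    def __init__(self, name):
        self.name = name
        self.data = {}               # stand-in for the local database
        self.outbox = queue.Queue()  # messages waiting to be replicated
        self.peers = []

    def handle_client_update(self, key, value):
        """Front end: apply locally and enqueue; never waits on a peer."""
        self.data[key] = value
        self.outbox.put((key, value))

    def replicate_forever(self):
        """Replication thread: drain the queue and push each message to every peer."""
        while True:
            msg = self.outbox.get()
            if msg is None:          # sentinel: shut the thread down
                self.outbox.task_done()
                break
            key, value = msg
            for peer in self.peers:
                peer.data[key] = value  # a real system would send this over the wire
            self.outbox.task_done()

a, b = Server("A"), Server("B")
a.peers.append(b)
worker = threading.Thread(target=a.replicate_forever, daemon=True)
worker.start()

a.handle_client_update("M11000027", "assigned to machine 1")
a.outbox.join()     # the demo waits here only so it can inspect the result
a.outbox.put(None)  # stop the replication thread
worker.join()
```

The front-end call returns as soon as the message is queued, so no cross-server lock is held while a client waits; the price is that peers briefly lag behind.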

And regarding cool... yeah, that would be cool. Wasn't GIMPS started because it's damn cool? :) At least, no one has yet explained what the Mersenne primes found can be practically used for.

Quote:
Originally Posted by Old man PrimeNet
(b) A failover database can take advantage of SQL-level replication, but is usually limited to peer database versions; and cutting over from server 'A' to 'B' may require a manual reconfiguration step to activate 'B'. Low risk, but not as cool & flexible moving forward. Only one server is 'online' at any time, but availability is still augmented ok.
This is a good note: configuring failover servers is not going to be trivial, and there are still a bunch of problems to resolve. Since all the fruit hangs high enough, why aim for the smaller one?

Quote:
Originally Posted by Old man PrimeNet
3. Client code upgrades or support to run other apps - Are we thinking of addressing this for opt-in users? This can be a big can-o-worms, but I have experience building these a few times already
This might be implemented, but there is no need to include it in the core servers. Doing centralized upgrades needs nothing from the GIMPS database, not even the list of accounts (it would be different if the GIMPS client-side software were commercial).

2003-01-22, 01:40   #165
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL
1D71₁₆ Posts

Migration:

I think we should aim for a server that accepts v22 and earlier CGI requests. The protocol is known and shouldn't be hard to support. The new server will require a new client with new or improved messages. One particularly painful migration will be from the current userid scheme to the new not-yet-designed userid/team scheme. Note that v22 contacts mersenne.org directly whereas v21 goes through entropia.com.

Outages:

The biggest annoyance is the length of the outages. With only one Entropia contact it was not uncommon for the server to be down for an entire weekend, or worse if Brad was overseas. A single server that any of a half dozen authorized people could reboot remotely would solve our problems.

Multiple servers / Mirrored servers:

We do not have an immediate need for load-balancing or even failover servers. Mirroring through replication is a great way for us to avoid writing code to do oodles of logging and disk backups. It also would let us have an emergency backup machine should the main server have a hardware problem that takes extended time to fix. There is some concern that a single server might not have adequate CPU power once the stats reports are beefed up but that remains to be seen. Even though we only need a single server we must design the server code with multiple servers in mind.

Arbitrary apps:

I would like to add ECM factoring to the server. It would be nice if other small math projects could be added easily. The entire stats engine should work without modification for these other projects.
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.