Report of monitoring primenet server unavailability
Recently I wrote a little program to retrieve my stats from primenet using the "Quickstats" link.
I made this run as a cron job in Linux EVERY 15 MINUTES. It displays my stats on an LCD screen, which is cool. Now, occasionally the primenet server is down. What I did not realise at the time (but which is a useful side effect) is that WHENEVER THE JOB FAILS, it sends me an email reporting the time and the fact that the program could not obtain the HTTP connection to primenet for stats. It turns out this is nice information, so I will post it below.
All times are in UTC (i.e. Greenwich Mean Time, uncorrected for summer time). At these times and dates the 15 minute polling was NOT successful. Conversely, at all other 15 minute attempts my computer could connect successfully. It is possible that an outage could be the result of some outage at MY ISP, but they are reliable, and on the occasions when I noticed it down I was able to surf successfully to e.g. mersenneforum but not to the primenet server. |
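The polling arrangement described in this post can be sketched roughly as follows. This is not the original program, which was never posted: the names `poll_once` and `main`, the placeholder URL and the 30 second timeout are all assumptions made for illustration. The sketch relies on the usual cron behaviour of mailing a job's output to its owner, so it stays silent on success and prints one timestamped line on failure.

```python
import sys
import urllib.request
from datetime import datetime, timezone

# Placeholder only: the real Quickstats URL is not given in the post.
STATS_URL = "http://example.invalid/quickstats"


def poll_once(url, opener=urllib.request.urlopen, timeout=30):
    """Return (ok, detail) for a single HTTP fetch attempt."""
    try:
        with opener(url, timeout=timeout) as response:
            return True, response.read().decode("latin-1")
    except OSError as exc:
        # URLError, timeouts and refused connections are OSError subclasses.
        return False, str(exc)


def main(url=STATS_URL, opener=urllib.request.urlopen):
    """Exit status 0 and no output on success; one stderr line on failure."""
    ok, detail = poll_once(url, opener=opener)
    if ok:
        return 0  # nothing printed, so cron has nothing to mail
    stamp = datetime.now(timezone.utc).strftime("%d/%m/%Y %H:%M UTC")
    print(f"{stamp} could not fetch stats: {detail}", file=sys.stderr)
    return 1
```

To use it as a job, the script would end with `sys.exit(main())` and be scheduled with a crontab entry such as `*/15 * * * * /path/to/script` (the path is hypothetical). The `opener` parameter exists only so the function can be exercised without touching the network.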
For the period 26 September to 9 October
The primenet server APPEARED to be down at these 15 minute polling times (date, then hour:minute of each failed poll).
26/09/2005: 16:30, 18:00
29/09/2005: 10:45, 11:00, 11:15, 11:30, 11:45, 12:00, 12:15, 12:45
30/09/2005: 14:30
02/10/2005: 12:45, 13:00, 13:15, 13:45, 14:00
03/10/2005: 08:15, 08:30, 08:45, 09:00, 09:15, 09:45, 10:00, 10:15, 10:30, 10:45, 11:00, 11:15, 11:30, 11:45, 12:00, 12:15, 12:30, 12:45, 13:00, 13:15, 13:45
05/10/2005: 00:45, 01:15, 01:30, 01:45, 03:30, 12:30, 12:45
06/10/2005: 15:15, 21:30, 22:00, 22:15, 22:30, 22:45, 23:00, 23:15
07/10/2005: 07:45, 08:00, 08:30, 08:45, 09:15, 09:45, 10:00, 10:15, 10:30, 10:45, 11:00, 11:15, 11:30, 11:45, 12:00, 12:15, 12:30, 12:45 |
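A list of failed polls like the one above is easier to read once adjacent failures are merged into outage windows. The following is a minimal sketch of that grouping, assuming polls exactly 15 minutes apart; the function name `outage_windows` is made up for illustration, and only the 29/09/2005 entries are used as sample input.

```python
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(minutes=15)


def outage_windows(failed_polls, interval=POLL_INTERVAL):
    """Merge failed polls exactly one interval apart into (start, end, count) runs."""
    windows = []
    for moment in sorted(failed_polls):
        if windows and moment - windows[-1][1] == interval:
            start, _, count = windows[-1]
            windows[-1] = (start, moment, count + 1)
        else:
            windows.append((moment, moment, 1))
    return windows


# Failed polls for 29/09/2005, copied from the table above.
day = datetime(2005, 9, 29)
failed = [day.replace(hour=h, minute=m) for h, m in
          [(10, 45), (11, 0), (11, 15), (11, 30), (11, 45), (12, 0), (12, 15), (12, 45)]]

for start, end, count in outage_windows(failed):
    print(f"{start:%d/%m/%Y %H:%M} to {end:%H:%M}: {count} failed poll(s)")
```

For that day the grouping gives one run of seven consecutive failures from 10:45 to 12:15 and an isolated failure at 12:45. A run only bounds the outage: the server may have recovered and failed again between two polls.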
Comment on downtime
Although there has been anecdotal info such as "I couldn't connect for most of yesterday", I believe this is the first objective analysis of primenet uptime.
Well, what do you think? I am not terribly impressed by the number of times it looks down, and several of these outages are for extended periods. New project participants would not be able to get exponents etc. at these times. I wonder if there are any thoughts on the reasons for such problems? Is the LAN at the hosting location periodically being hit by some backup or downloading traffic? Does the internet link to it get saturated? Does the server run out of memory and need the process restarting? I don't know the answer, but I hope the above is useful in tracking it down. I would be interested to know if there was any scheduled maintenance occurring during this period, as it would explain some of the unavailability. |
Just for info you can see the route of my traffic (on level3 network most of the way).
Tracing route to [url]www.mersenne.org[/url] [64.66.6.250] over a maximum of 30 hops:
MY COMPUTER TRYING TO REACH PRIMENET [n.n.n.n]
1 <1 ms <1 ms <1 ms MY LOCAL ROUTER [n.n.n.n]
2 28 ms 29 ms 29 ms gandhi-dsl1.wh.zen.net.uk [62.3.83.5]
3 30 ms 30 ms 30 ms bolzano-ge-0-0-1-3.wh.zen.net.uk [62.3.80.229]
4 31 ms 29 ms 31 ms 195.16.169.89
5 30 ms 31 ms 29 ms so-6-0-0.mp2.Manchesteruk1.Level3.net [4.68.113.114]
6 171 ms 168 ms 169 ms as-0-0.mp2.SanDiego1.Level3.net [4.68.128.149]
7 169 ms 171 ms 170 ms so-10-0.hsa2.SanDiego1.Level3.net [4.68.113.42]
8 171 ms 170 ms 170 ms unknown.Level3.net [63.212.173.166]
9 372 ms 300 ms 342 ms shawbinary-64-66-6-250.4d.net [64.66.6.250] (that's the primenet server)
Trace complete.
If anything the last hope at YOUR end looks a bit long, given I can reach San Diego across the Atlantic in 170ms. However at the time of this test things were working. |
[QUOTE=Peter Nelson]If anything the last hope at YOUR end looks a bit long[/QUOTE]
You mean hop, or is the Primenet Server really hopeless? :wink: Seriously, I have read several times this is a server issue, something like scripts that freeze and need to be restarted. It may be a memory issue. But in fact I don't know the Hw/Sw configuration of the server, so it is difficult to say whether that is justifiable or not. But it doesn't appear to be the network: I have already PINGed the server successfully while it was refusing access to the application (e.g. not accepting results nor checking out exponents) |
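The distinction drawn here, a host that answers ping while its application refuses connections, can be checked by probing the TCP port directly instead of relying on ICMP. A minimal sketch using only the standard library follows; the function name `tcp_port_open` and the 10 second timeout are assumptions for illustration.

```python
import socket


def tcp_port_open(host, port=80, timeout=10.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

A `True` result only shows that something accepted the connection. It does not prove the service behind the port is actually answering requests, so an application-level check is still needed to catch frozen scripts.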
I believe the server config is a dual P3, and I have also heard that it's just scripts that freeze every little bit. The server is rated at a daily transfer of 500 mbits.
|
[QUOTE=lycorn]
Seriously, I have read several times this is a server issue, something like scripts that freeze and need to be restarted. It may be a memory issue. But in fact I don't know the Hw/Sw configuration of the server, so it is difficult to say whether that is justifiable or not. But it doesn't appear to be the network: I have already PINGed the server successfully while it was refusing access to the application (e.g. not accepting results nor checking out exponents)[/QUOTE]
Lycorn, I agree that most times when it's down, it's a lack of ability to connect to web port 80 to do HTTP. This affects both the mersenne.org website/stats and the client when it sends/receives exponent data. I too have found that you can ping the ICMP stack on the server but cannot get the web service to respond. I suspect that in the majority of cases, or all of them, ping would still have worked. However, this was not part of the records I kept, because the cron job was not intentionally designed to test for server response. Some outages may be due to the network, but as you say, many times it is the service that has died or similar, even though the machine is on and pingable.
I know that a stats job runs on the server every hour, but it doesn't look like this extra load adversely affects availability at those particular on-the-hour times, compared to 15, 30 and 45 minutes past. Maybe some process runs on the server and could be made "nice" priority so that it doesn't take cycles from the web-based I/O of primenet. If possible, website and stats processing could be lower priority. Or perhaps the server memory could be monitored. Or a utility could monitor availability from the server itself and reload the web service if it stopped responding. I think it would be good if something could be done to improve the situation. |
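The last suggestion, a utility that watches the service and reloads it when it stops responding, could look something like the sketch below. Nothing is known about how the real server is managed, so both the probe and the restart action are passed in as plain callables; `watchdog_step` and the threshold of three consecutive failures are assumptions for illustration only.

```python
def watchdog_step(probe, restart, state, threshold=3):
    """Run one check; call restart() after `threshold` consecutive failed probes.

    `state` is a dict that carries the failure count between calls.
    Returns True if a restart was triggered on this step.
    """
    if probe():
        state["failures"] = 0
        return False
    state["failures"] = state.get("failures", 0) + 1
    if state["failures"] >= threshold:
        restart()
        state["failures"] = 0  # start counting afresh after the restart
        return True
    return False
```

Requiring several failures in a row avoids restarting the service over a single slow response. In practice the step would be run on a timer, with `probe` being an HTTP request to the service and `restart` whatever command the server's administrator uses.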
I have had the opportunity to "restart" the PrimeNet when it goes offline... Network connectivity is rarely a problem... The hardware is reliable and the server has plenty of memory... My guess is the volume of transactions causes problems, especially sometimes when they get queued behind a bit... Plus, I expect there are a lot of bots and script kiddies out there hitting the server with malformed input and stuff...
I'm not a server genius or anything, but it appears to me from a quick glance at how it is set up, that it is a pretty robust and well-designed system... Fortunately, the project is designed where we can deal with outages better than most projects... I always keep a few exponents in reserve just in case the server is unavailable... If you want to check that PrimeNet is up, just run this: [url]http://mersenne.org/cgi-bin/pnHttp.exe?ps&4&.&[/url]. You can use wget and grep/cut to check for a 2250 response... |
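The wget and grep check suggested here can also be written in a few lines of Python. This is only a sketch: the format of the server's reply is not shown in the thread, so the code simply looks for the string "2250" anywhere in the body, and the names `looks_up` and `primenet_responds` are made up for illustration. The URL is the one quoted in the post and may no longer be valid.

```python
import urllib.request

# URL as quoted in the post above.
PING_URL = "http://mersenne.org/cgi-bin/pnHttp.exe?ps&4&.&"


def looks_up(body, marker="2250"):
    """Return True if the response body contains the marker string."""
    return marker in body


def primenet_responds(url=PING_URL, timeout=30):
    """Fetch the URL and test the body; connection errors and timeouts count as down."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("latin-1", errors="replace")
    except OSError:
        return False
    return looks_up(body)
```

A plain substring match is the crudest possible test. Anyone who knows the exact reply format would want to match the response code field specifically.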
[QUOTE=Xyzzy]Plus, I expect there are a lot of bots and script kiddies out there hitting the server with malformed input and stuff...
...[/QUOTE] [B]LOL, what are you trying to say about my scripts?[/B] :whistle: There's nothing malformed about my packets! They send exactly what a web browser sends when someone clicks the quick stats URL (I verified this with the Ethereal packet analyser). The term script kiddie generally refers to a lamer who runs scripts written by someone else who is knowledgeable (in my case I wrote this myself). Joking aside, it's quite possible the server gets hit by Denial of Service attacks, worms and trojans, like any host on the internet. Thanks for the URL above. I had already "sniffed" the packets the client sends to do a "Help About Primenet Server" and wrote these into a program too. The reason I only run my stats job once every quarter of an hour (as opposed to, say, every 5 minutes) is to minimise workload on the server, and as it is, it only needs to return a line containing my stats (hardly overworked). I wonder if there is a memory leak in some primenet process, which might be measurable on the server by watching whether memory use gets worse over time. Such leaks are typically caused by a bug in code, e.g. not de-allocating memory after use. Yes, the project as a whole is fairly resilient to downtime by nature, but this is not an excuse for downtime. In particular, new users (who have no work queued up at all) cannot work and will be confused, and anyone interested in their stats will be similarly inconvenienced. If the server runs, say, Apache for its webserver, is it fully patched and up to date? What happens when cgi-bin gets invoked but does not respond? Does this cause the webserver to eventually stop serving? |
Server downtime report continuation 12 Oct - 17 Oct
Primenet server appeared down at these times (UTC)
Oct 12 2005: 1330 1345 2030 2045 2100 2115 2245 2300
Oct 13 2005: 0230 0245 1815 1830 1845 1900 1915 1930 2000 2015 2030 2045 2100 2115 2130 2145 2200 2215 2230 2245 2300 2315 2330 2345
Oct 14 2005: 0000 0030 0045 0730 0745 0800 0815 0830 0845 0915 0930 0945 1000 1015 1030 1045 1100 1115 1130 1145 1200 1315 1330 1345 1400 1415 1430 1445 1500 1515 1530 1545 1615 1630 1645 1700 1715 1730 1745 1800
Oct 15 2005: 0900 0945 1000 1015 1030 1045 1130 1200 1215 1230 1245
Oct 17 2005: 1945 2000 2015 2030 2045 2100 2115 2130 2200 2215 |
Acceptable?
Is it me, or does anyone else think this level of downtime is not healthy for the project?
As well as primenet being down, it looks like the mersenne.org webserver is also offline. Many websites would not tolerate even 1% downtime, and some insist on 99.999% uptime, whilst it appears this one is down rather more often. I know I find it frustrating when I can't look something up on the site, e.g. how long the verification of recent record-breaking primes took. I will have to wait until the server comes back to look it up. |
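To put the percentages in this post next to the poll data, the arithmetic can be sketched as below. It assumes 96 polls per day (one every 15 minutes) and treats each failed poll as a sample, not as a measured 15 minutes of downtime. The function names are made up for illustration, and the figure of 40 failed polls is the count of entries listed above for Oct 14 2005.

```python
POLLS_PER_DAY = 24 * 4  # one poll every 15 minutes


def availability_percent(failed_polls, total_polls):
    """Share of polls that succeeded, as a percentage."""
    return 100.0 * (total_polls - failed_polls) / total_polls


def downtime_budget_minutes(target_percent, days):
    """Minutes of downtime allowed over `days` at a given uptime target."""
    return (100.0 - target_percent) / 100.0 * days * 24 * 60


# Oct 14 2005: 40 failed polls are listed above, out of 96 attempts that day.
print(round(availability_percent(40, POLLS_PER_DAY), 1))  # 58.3
print(round(downtime_budget_minutes(99.0, 1), 1))         # 14.4
print(round(downtime_budget_minutes(99.999, 365), 2))     # 5.26
```

On these assumptions, about 58% of polls succeeded on Oct 14. A 1% downtime allowance corresponds to roughly 14 minutes per day, and 99.999% uptime to a little over 5 minutes per year.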