mersenneforum.org


Peter Nelson 2005-10-10 01:59

Report of monitoring primenet server unavailability
 
Recently I wrote a little program to retrieve my stats from primenet using the "Quickstats" link.

I made this run as a cron job in Linux EVERY 15 MINUTES.

It displays my stats on an LCD screen which is cool.

Now, occasionally the primenet server is down.

What I did not realise at the time (but what is a useful side effect) is that WHENEVER THE JOB FAILS, it sends email to me reporting the time and the fact that the program could not obtain the HTTP connection to primenet for stats.
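A minimal sketch of this kind of job, not the actual program: the Quickstats URL is not given in this post, so `STATS_URL` is a hypothetical placeholder, and the LCD display step is left out. It relies on the standard cron behaviour of mailing any output a job produces.

```python
import sys
import urllib.request
from datetime import datetime, timezone

# Hypothetical placeholder: the real Quickstats URL is not given in this post.
STATS_URL = "http://example.invalid/quickstats"


def poll(url, timeout=30.0):
    """Return (ok, detail); ok is False if the HTTP request could not be completed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return True, resp.read().decode("latin-1")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False, str(exc)


def main(url=STATS_URL):
    ok, detail = poll(url)
    if ok:
        return 0  # stay silent on success, so cron sends no mail
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    # Anything written here ends up in the mail cron sends to the job owner.
    print(f"{stamp}: could not fetch stats: {detail}", file=sys.stderr)
    return 1
```

To use it as a script, end the file with `raise SystemExit(main())` and schedule it with a crontab entry such as `*/15 * * * * /usr/bin/python3 /path/to/poll_stats.py` (the path is illustrative).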

It turns out this is nice information.

I will post it below. All times are in UTC (i.e. Greenwich Mean Time, uncorrected for summer time).

At these times and dates the 15 minute polling was NOT successful.
Conversely at all other 15 minute attempts my computer could connect successfully.

It is possible that an outage could be the result of some problem at MY ISP, but they are reliable, and on the occasions when I noticed the server was down I was still able to surf to e.g. mersenneforum but not to the primenet server.

Peter Nelson 2005-10-10 02:01

For the period 26 September to 9 October
The primenet server APPEARED to be down at these 15 minute polling times.

date hour minute

26/09/2005
16 30

18 0

29/09/2005
10 45
11 0
11 15
11 30
11 45
12 0
12 15

12 45

30/09/2005
14 30

02/10/2005
12 45
13 0
13 15

13 45
14 0

03/10/2005
8 15
8 30
8 45
9 0
9 15

9 45
10 0
10 15
10 30
10 45
11 0
11 15
11 30
11 45
12 0
12 15
12 30
12 45
13 0
13 15

13 45

05/10/2005
0 45
1 15
1 30
1 45

3 30

12 30
12 45

06/10/2005
15 15

21 30

22 0
22 15
22 30
22 45
23 0
23 15

07/10/2005
7 45
8 0

8 30
8 45

9 15

9 45
10 0
10 15
10 30
10 45
11 0
11 15
11 30
11 45
12 0
12 15
12 30
12 45

Peter Nelson 2005-10-10 02:07

Comment on downtime
 
Although there has been anecdotal info such as "I couldn't connect for most of yesterday", I believe this is the first objective analysis of primenet uptime.

Well, what do you think?

I am not terribly impressed by the number of times it appears to be down, and several of these outages last for extended periods.
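To make "extended periods" concrete, here is a rough sketch that groups failed polls into runs of consecutive 15 minute slots. The function name `outage_runs` is made up for illustration; the sample data is the 29/09/2005 entry from the list above.

```python
from datetime import datetime, timedelta

POLL_INTERVAL = timedelta(minutes=15)


def outage_runs(failed_polls, interval=POLL_INTERVAL):
    """Group failed poll times into runs of consecutive polls.

    Returns a list of (first_failed, last_failed, failed_count) tuples.
    """
    runs = []
    for t in sorted(failed_polls):
        if runs and t - runs[-1][1] == interval:
            first, _, count = runs[-1]
            runs[-1] = (first, t, count + 1)  # extend the current run
        else:
            runs.append((t, t, 1))  # start a new run
    return runs


# Failed polls reported above for 29/09/2005 (UTC).
sample = [datetime(2005, 9, 29, h, m) for h, m in
          [(10, 45), (11, 0), (11, 15), (11, 30), (11, 45), (12, 0), (12, 15), (12, 45)]]

for first, last, count in outage_runs(sample):
    print(f"{first:%d/%m/%Y %H:%M} to {last:%H:%M}: {count} failed poll(s)")
```

Polling only samples the server every 15 minutes, so a run of failed polls suggests, but does not prove, that the server was down for the whole interval between them.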

New project participants would not be able to get exponents etc. at these times.

I wonder if there are any thoughts on the reasons for such problems?

Is the LAN at the hosting location periodically being hit by some backup or downloading traffic? Does the internet link to it get saturated? Does the server run out of memory, so that the process needs restarting?

I don't know the answer but I hope the above is useful in tracking it down.

I would be interested to know whether any scheduled maintenance occurred during this period, as it would explain some of the unavailability.

Peter Nelson 2005-10-10 02:29

Just for info you can see the route of my traffic (on level3 network most of the way).

Tracing route to www.mersenne.org [64.66.6.250]

over a maximum of 30 hops:

MY COMPUTER TRYING TO REACH PRIMENET [n.n.n.n]

1 <1 ms <1 ms <1 ms MY LOCAL ROUTER [n.n.n.n]

2 28 ms 29 ms 29 ms gandhi-dsl1.wh.zen.net.uk [62.3.83.5]

3 30 ms 30 ms 30 ms bolzano-ge-0-0-1-3.wh.zen.net.uk [62.3.80.229]

4 31 ms 29 ms 31 ms 195.16.169.89

5 30 ms 31 ms 29 ms so-6-0-0.mp2.Manchesteruk1.Level3.net [4.68.113.114]

6 171 ms 168 ms 169 ms as-0-0.mp2.SanDiego1.Level3.net [4.68.128.149]

7 169 ms 171 ms 170 ms so-10-0.hsa2.SanDiego1.Level3.net [4.68.113.42]

8 171 ms 170 ms 170 ms unknown.Level3.net [63.212.173.166]

9 372 ms 300 ms 342 ms shawbinary-64-66-6-250.4d.net [64.66.6.250]
(that's the primenet server)

Trace complete.

If anything the last hope at YOUR end looks a bit long, given I can reach San Diego across the Atlantic in 170ms.

However at the time of this test things were working.

lycorn 2005-10-10 23:22

[QUOTE=Peter Nelson]If anything the last hope at YOUR end looks a bit long[/QUOTE]

You mean hop, or is the Primenet Server really hopeless? :wink:

Seriously, I have read several times this is a server issue, something like scripts that freeze and need to be restarted. It may be a memory issue. But in fact I don't know the Hw/Sw configuration of the server, so it is difficult to say whether that is justifiable or not. But it doesn't appear to be the network: I have already PINGed the server successfully while it was refusing access to the application (e.g. not accepting results nor checking out exponents).

moo 2005-10-11 03:29

I believe the server config is a dual P3, and I have also heard that it's just scripts that freeze every little bit. The server is rated at a daily transfer of 500 Mbits.

Peter Nelson 2005-10-11 04:16

[QUOTE=lycorn]

Seriously, I have read several times this is a server issue, something like scripts that freeze and need to be restarted. It may be a memory issue. But in fact I don't know the Hw/Sw configuration of the server, so it is difficult to say whether that is justifiable or not. But it doesn't appear to be the network: I have already PINGed the server successfully while it was refusing access to the application (e.g. not accepting results nor checking out exponents).[/QUOTE]

Lycorn, I agree that most times when it's down, the problem is an inability to connect to web port 80 to do HTTP. This affects both the mersenne.org website/stats and the client when it sends/receives exponent data.

I too have found you can ping the ICMP stack on the server but cannot get the web service to respond.

I suspect that in the majority or all cases ping would still have worked. However this was not part of the records I kept because the cron job was not intentionally designed to test for server response.

Some outages may be due to the network, but as you say, many times it is the service that has died or similar, even though the machine is on and pingable.
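The difference between "pingable" and "web service responding" can be checked from a client. A small sketch, with the helper name `tcp_port_open` made up for illustration: it only tests whether a TCP connection to port 80 can be opened. ICMP ping is left out because raw sockets usually need elevated privileges.

```python
import socket


def tcp_port_open(host, port=80, timeout=10.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused connections, timeouts and DNS failures all land here.
        return False
```

A call such as `tcp_port_open("www.mersenne.org")` returning False while ping still works would point at the web service rather than the network. A successful connection does not prove the CGI program behind the web server is answering, so this complements an HTTP-level check rather than replacing it.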

I know that a stats job runs on the server every hour, but it doesn't look like this extra load adversely affects availability at the on-the-hour times compared to 15, 30 or 45 minutes past.

Maybe some process runs on the server and could be made "nice" priority so that it doesn't take cycles from the web based I/O of primenet. If possible, website and stats processing could be lower priority.

Or perhaps the server memory could be monitored. Or a utility could monitor availability from the server itself and reload the web service if it stopped responding.
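A rough sketch of such a watchdog, under stated assumptions: nothing is known here about how the server is actually set up, so `RESTART_CMD` is a hypothetical command, and the health check is passed in as a function. It waits for several consecutive failures before restarting, so that one slow reply does not trigger a restart.

```python
import subprocess
import time

# Hypothetical restart command; the real service name and init system are unknown.
RESTART_CMD = ["/etc/init.d/httpd", "restart"]


def watchdog_step(is_up, restart, failures, threshold=3):
    """Run one health check and restart after `threshold` consecutive failures.

    Returns the updated consecutive-failure count.
    """
    if is_up():
        return 0
    failures += 1
    if failures >= threshold:
        restart()
        return 0  # start counting again after a restart
    return failures


def run_forever(is_up, interval_s=60):
    """Check once per interval, restarting the service when it stays down."""
    failures = 0
    while True:
        failures = watchdog_step(
            is_up, lambda: subprocess.run(RESTART_CMD, check=False), failures)
        time.sleep(interval_s)
```

Keeping the check and the restart action as parameters means the counting logic can be tried out without touching a real service.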

I think it would be good if something could be done to improve the situation.

Xyzzy 2005-10-11 14:23

I have had the opportunity to "restart" the PrimeNet when it goes offline... Network connectivity is rarely a problem... The hardware is reliable and the server has plenty of memory... My guess is the volume of transactions causes problems, especially sometimes when they get queued behind a bit... Plus, I expect there are a lot of bots and script kiddies out there hitting the server with malformed input and stuff...

I'm not a server genius or anything, but it appears to me from a quick glance at how it is set up, that it is a pretty robust and well-designed system...

Fortunately, the project is designed where we can deal with outages better than most projects... I always keep a few exponents in reserve just in case the server is unavailable...

If you want to check that PrimeNet is up, just run this:

http://mersenne.org/cgi-bin/pnHttp.exe?ps&4&.&

You can use wget and grep/cut to check for a 2250 response...
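The same check can be sketched in Python instead of wget and grep. The exact format of the reply is not described in this thread, so the sketch simply assumes that a healthy reply contains the text "2250" somewhere in its body; `looks_up` and `primenet_is_up` are made-up names.

```python
import urllib.request

# URL quoted in the post above.
CHECK_URL = "http://mersenne.org/cgi-bin/pnHttp.exe?ps&4&.&"


def looks_up(body, marker="2250"):
    """Assumption: a healthy reply contains the marker somewhere in its body."""
    return marker in body


def primenet_is_up(url=CHECK_URL, timeout=30.0):
    """Fetch the URL and report whether the reply looks healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("latin-1", errors="replace")
    except OSError:
        # No HTTP connection, a timeout, or an HTTP error status.
        return False
    return looks_up(body)
```

Calling `primenet_is_up()` from a cron job and printing a message when it returns False gives the same kind of record as the stats polling, but tests the PrimeNet CGI program directly.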

Peter Nelson 2005-10-11 15:15

[QUOTE=Xyzzy]Plus, I expect there are a lot of bots and script kiddies out there hitting the server with malformed input and stuff...

...[/QUOTE]

[B]LOL, what are you trying to say about my scripts?[/B] :whistle:

There's nothing malformed about my packets! They do exactly what a web browser sends when someone clicks the quick stats URL (I verified this with the Ethereal packet analyser).

The term script kiddie generally refers to a lamer who runs scripts written by someone else who is knowledgeable (in my case I wrote this myself).

Joking aside, it's quite possible the server gets hit by Denial of Service attacks, worms and trojans, like any host on the internet.

Thanks for the URL above. I had already "sniffed" the packets the client sends to do a "Help About Primenet Server" and wrote these into a program too.

The reason I only run my stats job once every quarter of an hour (as opposed to say 5 minutes) is to minimise workload on the server, and as it is it only needs to return a line containing my stats (hardly overworked).

I wonder if a memory leak occurs in some primenet process, which might be measurable on the server by watching whether memory use gets worse over time. Such leaks are typically caused by a bug in code, e.g. not de-allocating memory after use.

Yes, the project as a whole is fairly resilient to downtime by nature, but this is not an excuse for downtime.

In particular, new users (who have no work queued up at all) cannot work and will be confused, and anyone interested in their stats will also be similarly inconvenienced.

If the server runs say apache for its webserver, is it fully patched up to date?

What happens when cgi-bin gets invoked but does not respond? Does this cause the webserver to eventually stop serving?

Peter Nelson 2005-10-17 22:50

Server downtime report continuation 12 Oct - 17 Oct
 
Primenet server appeared down at these times (UTC)

Oct 12 2005

1330
1345

2030
2045
2100
2115

2245
2300

Oct 13 2005

0230
0245

1815
1830
1845
1900
1915
1930

2000
2015
2030
2045
2100
2115
2130
2145
2200
2215
2230
2245
2300
2315
2330
2345

Oct 14 2005
0000

0030
0045

0730
0745
0800
0815
0830
0845

0915
0930
0945
1000
1015
1030
1045
1100
1115
1130
1145
1200

1315
1330
1345
1400
1415
1430
1445
1500
1515
1530
1545

1615
1630
1645
1700
1715
1730
1745
1800

Oct 15 2005

0900

0945
1000
1015
1030
1045

1130

1200
1215
1230
1245

Oct 17 2005

1945
2000
2015
2030
2045
2100
2115
2130

2200
2215

Peter Nelson 2005-10-17 22:54

Acceptable?
 
Is it me, or does anyone else think this level of downtime is not healthy for the project?

E.g. as well as primenet being down, it also looks like the mersenne.org webserver is offline.

Many websites would not tolerate even, say, 1% downtime, and some insist on 99.999% uptime, whilst it appears this one is down rather more often.
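For scale, the arithmetic behind those percentages can be sketched as follows. The function names are made up for illustration, and the poll counts in the last line are a hypothetical example, not figures taken from the reports above.

```python
def allowed_downtime_minutes(availability_pct, period_days):
    """Minutes of downtime permitted in the period at the given availability."""
    return (1 - availability_pct / 100) * period_days * 24 * 60


def availability_from_polls(failed_polls, total_polls):
    """Percentage of polls that succeeded."""
    return 100 * (total_polls - failed_polls) / total_polls


print(allowed_downtime_minutes(99.0, 1))      # 1% downtime over one day
print(allowed_downtime_minutes(99.999, 365))  # 99.999% uptime over one year

# Hypothetical example: 24 failed polls out of the 96 made in one day.
print(availability_from_polls(24, 96))
```

1% downtime works out to roughly 14.4 minutes a day, which is less than one 15 minute polling interval, and 99.999% uptime allows only about 5.3 minutes of downtime in a whole year.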

I know I find it frustrating when I can't look something up on the site, e.g. how long the verification of recent record breaking primes took. I will have to wait until the server comes back to look it up.


All times are UTC.
