mersenneforum.org

OFFICIAL "SERVER PROBLEMS" THREAD (https://www.mersenneforum.org/showthread.php?t=5758)

chalsall 2014-09-13 17:07

[QUOTE=chalsall;382893]Will keep an eye on it, and let you know if I see anything more (and try to get deeper info on the connectivity situation during the event).[/QUOTE]

Not quite sure what to make of this...

This morning my spiders were again reporting 500 errors. I launched an MTR from GPU72's server to Primenet's, and latency was reasonable (~ 55ms), with 0% packet loss.

Nothing to action; just a data-point. Everything is nominal at the moment.

Madpoo 2014-09-13 22:10

[QUOTE=chalsall;382968]Not quite sure what to make of this...

This morning my spiders were again reporting 500 errors. I launched an MTR from GPU72's server to Primenet's, and latency was reasonable (~ 55ms), with 0% packet loss.

Nothing to action; just a data-point. Everything is nominal at the moment.[/QUOTE]

FYI, we spent a little time analyzing how people are using the manual assignment page over a recent 3 day period.

Roughly 92% of the 10,300 requests (9,512 of them) came from the GPU72 spider, and every one of those asked for only 2 exponents at a time. Another 683 requests were for 10 exponents at a time, 62 requested only 1, 16 requested 1000, etc. It gets pretty nitty gritty at that point.

What we're wondering is why the GPU72 is only requesting 2 at a time? You could probably request 100 at a time which would actually take about the same amount of time as only requesting 2, whereas if you needed 100 and made 50 separate calls, it's taking 50 times longer.

Much of the cost is the basic per-request HTTP overhead; the server backend itself can handle a reasonable batch (maybe 100) in a decent amount of time.

Basically, your ~9500 requests were to get 19,000 assignments and took the server (on the backend, the "total time taken" measurement in IIS) about 350 ms each for a grand total of 3325 seconds, or 55.4 minutes of server processing time.

Now, if you were requesting 100 at a time, similar requests take 2.05 seconds. If you wanted the same 19,000 assignments and asked for them 100 at a time, that would be 190 requests taking about 390 seconds, or just 6.5 minutes.
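
For anyone who wants to play with the numbers, here is a rough sketch of that arithmetic. It only uses the per-request timings quoted above (350 ms for a 2-exponent request, 2.05 s for a 100-exponent request); the function name is made up for illustration, and it assumes server time scales linearly with the number of requests.

```python
import math

def total_server_seconds(assignments: int, batch_size: int, seconds_per_request: float) -> float:
    """Total backend time if `assignments` are fetched `batch_size` at a time.

    Hypothetical helper: assumes every request of a given size costs the same.
    """
    requests = math.ceil(assignments / batch_size)
    return requests * seconds_per_request

# Timings taken from the IIS "total time taken" figures quoted in this post.
small = total_server_seconds(19_000, 2, 0.350)
large = total_server_seconds(19_000, 100, 2.05)

print(f"2 per request:   {small:.1f} s ({small / 60:.1f} min)")
print(f"100 per request: {large:.1f} s ({large / 60:.1f} min)")
print(f"speed-up:        {small / large:.1f}x")
```

Changing the batch size or the per-request timing shows how quickly the fixed overhead dominates when requests are small.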

Something to consider... make several larger requests (we think 100 might be a good starting point for a hard limit on the server) instead of a series of small requests.

Thoughts from a GPU72 perspective?

LaurV 2014-09-14 04:41

[QUOTE=Madpoo;382986]What we're wondering is why the GPU72 is only requesting 2 at a time?[/QUOTE]
This may have something to do with getting the smallest exponents available. Just guessing. Exponents appear and evaporate at random intervals. The "ideal" would be to request one exponent every time unit (every second, every minute, or so), and if it is bigger than all your reservations, unreserve it. If it is smaller than some of your reservations on which you haven't started working, then unreserve the biggest one and keep the smaller. For this you have to request as little work as possible, as often as possible. Of course, this example is deliberately extreme; it would flood the server with a lot of unnecessary requests. But you get the idea.
[QUOTE=Madpoo;382986]You could probably request 100 at a time which would actually take about the same amount of time[/QUOTE]
but it will get larger exponents. And in the time you work your 100 expos, maybe 50 smaller expos become available and evaporate again... Unless you propose to request 100 expos, but as often as you were before when reserving two... :razz:
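
To make the "keep the smallest" idea concrete, here is a minimal sketch of the swap rule described in this post. It is only a guess at the logic, not how GPU72 actually works; the function name and the sample numbers are made up, and `held` stands for exponents reserved but not yet started.

```python
from typing import List, Optional, Tuple

def handle_offer(held: List[int], offered: int) -> Tuple[List[int], Optional[int]]:
    """Decide what to do with a freshly reserved exponent.

    Returns (new list of held exponents, exponent to unreserve or None).
    Hypothetical helper for illustration only.
    """
    if not held:
        # Nothing to compare against, so simply keep the new one.
        return [offered], None
    largest = max(held)
    if offered >= largest:
        # The new exponent is no better than anything held: give it back.
        return sorted(held), offered
    # The new exponent beats the largest held one: swap them.
    kept = [e for e in held if e != largest] + [offered]
    return sorted(kept), largest

# Made-up placeholder values, not real exponents from the server.
held = [101, 107]
held, to_unreserve = handle_offer(held, 103)
print(held, to_unreserve)
```

The pool size never grows under this rule, which is why it only pays off when requests are small and frequent.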

Madpoo 2014-09-14 05:14

[QUOTE=LaurV;382999]This may have something to do with getting the smallest exponents available. Just guessing. Exponents appear and evaporate at random intervals. The "ideal" would be to request one exponent every time unit (every second, every minute, or so), and if it is bigger than all your reservations, unreserve it. If it is smaller than some of your reservations on which you haven't started working, then unreserve the biggest one and keep the smaller. For this you have to request as little work as possible, as often as possible. Of course, this example is deliberately extreme; it would flood the server with a lot of unnecessary requests. But you get the idea.

but it will get larger exponents. And in the time you work your 100 expos, maybe 50 smaller expos become available and evaporate again... Unless you propose to request 100 expos, but as often as you were before when reserving two... :razz:[/QUOTE]

I see your point, that by checking more often for smaller #'s of exponents, it increases the odds that if a low one was just unreserved/expired, GPU72 has a better chance of grabbing it.

Traffic from GPU72 is somewhat consistent. I broke down the requests into 10 minute intervals, and it's typically making 32 requests in any given 10 minute period, all for 2 exponents, so 64 total, or 160 requests per hour for 320 exponents.
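
A breakdown like this is easy to reproduce from a request log. The sketch below is only an illustration of the bucketing step: the timestamps are invented, and how you would actually pull them out of the IIS logs is not shown.

```python
from collections import Counter
from datetime import datetime
from typing import Iterable

def bucket_10min(timestamps: Iterable[datetime]) -> Counter:
    """Count requests per 10-minute interval, keyed by the interval start."""
    counts: Counter = Counter()
    for ts in timestamps:
        # Round the minute down to the nearest multiple of 10.
        start = ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)
        counts[start] += 1
    return counts

# Invented sample timestamps, purely for illustration.
sample = [
    datetime(2014, 9, 10, 12, 10, 5),
    datetime(2014, 9, 10, 12, 15, 7),
    datetime(2014, 9, 10, 12, 20, 2),
]
for start, n in sorted(bucket_10min(sample).items()):
    print(start.strftime("%H:%M"), n)
```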

Now, if I revealed *when* the server task does its hourly run to expire/unassign exponents that have passed their expiration dates, it would make sense that GPU72 could run then and grab as many as it thinks it can snarf up for the next hour.

It really won't help much to check all through the rest of that hour on the off chance that some client out there decides to connect and unassign some work because things are taking longer than expected, or somebody manually unreserves some exponents. I can't say for sure, but I'm guessing those two scenarios don't happen that often. I'd guess most unassigned exponents come from work that has gone past its "fresh until" date and got sent back to the pool of free agents by the server task.

It could probably request all 320 at once that it would normally grab over the course of an hour, and the end result would be pretty much the same. Better, in fact, because at the scheduled time, there's going to be a large batch of exponents that got freed up, but GPU72 is only asking for a handful up front, so if there are other lower exponents out there, GPU72 is leaving them on the table for others to grab.

LaurV 2014-09-14 06:06

What you say makes a lot of sense. You have to talk directly with Chris about GPU72--PrimeNet interaction.
Historically, people (read davieddy, and a few others) were quite upset that GPU72 grabbed all the lower LL assignments, and some smoke came out of it, but that is not the case anymore since GPU72 no longer offers LL and DC assignments. (I mean that, maybe, leaving some exponents for other people was intentional; again, I am guessing.)

Madpoo 2014-09-14 15:48

[QUOTE=LaurV;383002]What you say makes a lot of sense. You have to talk directly with Chris about GPU72--PrimeNet interaction.
Historically, people (read davieddy, and a few others) were quite upset that GPU72 grabbed all the lower LL assignments, and some smoke came out of it, but that is not the case anymore since GPU72 no longer offers LL and DC assignments. (I mean that, maybe, leaving some exponents for other people was intentional; again, I am guessing.)[/QUOTE]

I'll have to look back at some of that conversation... sounds like fun. Almost as much fun as when I started a whole ugly poaching conversation over a decade ago. :smile:

FYI, the task that expires exponents that haven't checked in for XX days is a nightly job, not hourly, just to correct myself.

Aramis Wyler 2014-09-14 21:04

You could probably get rid of some of the spiders altogether if you just start assigning hot numbers to Research or whatever spidey acts as. GPU72 could look at its numbers daily to use for assignments. Maybe put a limit on the number of assignments that could be concurrently assigned to that account to avoid overflow.

chalsall 2014-09-14 22:46

[QUOTE=Madpoo;383000]I see your point, that by checking more often for smaller #'s of exponents, it increases the odds that if a low one was just unreserved/expired, GPU72 has a better chance of grabbing it.[/QUOTE]

Yes, that is the theory. Basically one of my spiders asks for 10 P-1 assignments, 10 Cat 3 assignments (in batches of 2; see below), and 20 Cat 4 assignments (again, in batches of 2). The quantities (10, 10 and 20) were chosen as being most likely to capture candidates sub-optimally TF'ed before being handed out for LL'ing.

This isn't only to collect those candidates not yet appropriately TF'ed which are recycled daily at "the magic hour", but also those that are assigned to others for P-1'ing or TF'ing, and which can be returned to Primenet at any time.

[QUOTE=Madpoo;383000]Traffic from GPU72 is somewhat consistent. I broke down the requests into 10 minute intervals, and it's typically making 32 requests in any given 10 minute period, all for 2 exponents, so 64 total, or 160 requests per hour for 320 exponents.[/QUOTE]

The reason for two candidates being requested at a time could be considered a SPE... The manual assignment form limits LL assignments to two candidates per core per request -- I didn't think to simply increase the number of cores (:duh:). This has now been done. As in, to be explicit, every five minutes (except at HH:00 and HH:05) a total of three manual assignment requests are made.
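
As a quick sanity check on the new load, here is a small sketch of that schedule. It only encodes what is stated above (a run every five minutes, skipping HH:00 and HH:05, three requests per run); the function name is made up.

```python
from typing import List, Tuple

def request_slots(interval_min: int = 5, skipped: Tuple[int, ...] = (0, 5)) -> List[int]:
    """Minutes past the hour at which the spider runs (hypothetical helper)."""
    return [m for m in range(0, 60, interval_min) if m not in skipped]

REQUESTS_PER_RUN = 3  # as stated in this post

slots = request_slots()
requests_per_hour = len(slots) * REQUESTS_PER_RUN

print(slots)              # [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
print(requests_per_hour)  # 30
```

That is 30 requests per hour, compared with the roughly 160 per hour measured before the change.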

However, I would agree that this is still sub-optimal. If a report could be made available to GPU72 of those candidates not currently assigned (in, let's say, a specified range) below a specified bit-level, then "Spidy" could request this report only, let's say, once an hour, and target its requests for such assignments.
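
To illustrate what "Spidy" might do with such a report, here is a minimal sketch. No such report exists yet, so the row format (exponent plus current TF bit-level), the function name, and the sample rows are all assumptions made up for this example.

```python
from typing import Iterable, List, NamedTuple

class Candidate(NamedTuple):
    exponent: int
    tf_bits: int  # bit-level the candidate has been trial factored to

def pick_targets(report: Iterable[Candidate], lo: int, hi: int, target_bits: int) -> List[Candidate]:
    """Unassigned candidates in [lo, hi] factored below target_bits, smallest first."""
    return sorted(
        c for c in report
        if lo <= c.exponent <= hi and c.tf_bits < target_bits
    )

# Invented report rows, purely for illustration.
report = [
    Candidate(exponent=3_000, tf_bits=71),
    Candidate(exponent=1_000, tf_bits=74),
    Candidate(exponent=2_000, tf_bits=72),
    Candidate(exponent=9_000, tf_bits=70),
]
targets = pick_targets(report, lo=1_000, hi=5_000, target_bits=74)
print([c.exponent for c in targets])
```

With one such report per hour, the spider would only need to issue assignment requests for the exponents it actually wants.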

To be very clear, I don't want GPU72 to cause any problems for Primenet. And, it's not the end of the day if a few sub-optimally TF'ed candidates "slip through".

Thoughts?

James Heinrich 2014-09-14 23:30

[QUOTE=chalsall;383045]If a report could be made available to GPU72 of those candidates not currently assigned (in, let's say, a specified range) below a specified bit-level, then "Spidy" could request this report only, let's say, once an hour, and target its requests for such assignments.[/QUOTE]

I think that would be a great idea. I could probably set that up for you, unless George or Aaron would prefer to handle it due to their greater familiarity with the database structure. Perhaps you may want to email the three of us and we can discuss implementation.

Madpoo 2014-09-15 00:39

[QUOTE=chalsall;383045]To be very clear, I don't want GPU72 to cause any problems for Primenet. And, it's not the end of the day if a few sub-optimally TF'ed candidates "slip through".

Thoughts?[/QUOTE]

It's not [I]really[/I] a problem for the server as far as I can tell. Since GPU72 is the #1 user of the manual assignment page, from my perspective it would be more about making that experience better for Primenet *and* GPU72. A way to ease the way they work together.

Now that the server has had its overhaul, it's a little clearer to see where some other improvements can be made. Optimizing the manual assignment page seems ripe for the picking, and since GPU72 makes up most of those requests, it should be pretty easy to work something out, I hope.

LaurV 2014-09-15 03:38

[QUOTE=chalsall;383045]
Thoughts?[/QUOTE]
Well, that seems a very good idea, and of course it would be very nice if PrimeNet could say to GPU72 "hey, I have a sub-optimally TF'ed candidate available, do you want it?", and possibly hold it for some time (2-3 hours?) until either GPU72 or a manual user requests it. :wink:


All times are UTC.
