mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

bayanne 2020-03-04 12:05

Opting to work on CPU tasks as well has meant that I am now getting GPU instances less frequently.
I am wondering whether this is somewhat counterproductive ...

chalsall 2020-03-04 13:06

[QUOTE=bayanne;538855]Opting to work on CPU tasks as well has meant that I am now getting GPU instances less frequently. I am wondering whether this is somewhat counterproductive ...[/QUOTE]

While it is impossible to guess at what Google's algorithms are weighting, I don't /think/ so. More likely what we're observing is the ebb-and-flow of demand vs. availability.

But to test your theory, change your CPU Worktype to "Disabled" and the CPU won't be used (the CPU payload provided is just a sleep(forever) call in such cases).
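The "sleep(forever)" payload mentioned above would presumably be a simple idle loop, since a single sleep call can't literally block forever. A minimal sketch (hypothetical; not the actual GPU72 payload code):

```python
import time

def idle_cpu_payload(poll_seconds: float = 3600.0) -> None:
    """Stand-in for a disabled CPU worker: do no work, just sleep.

    A single time.sleep() cannot block forever, so loop on long sleeps
    until the instance is torn down.
    """
    while True:
        time.sleep(poll_seconds)
```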

James Heinrich 2020-03-04 14:56

Does this make sense? :unsure:
[code]Beginning GPU Trial Factoring Environment Bootstrapping...
Please see https://www.gpu72.com/ for additional details.

20200304_145207: GPU72 TF V0.42 Bootstrap starting (now with CPU support!)...
20200304_145207: Working as "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...

20200304_145207: Installing needed packages
20200304_145213: Fetching initial work...
20200304_145213: Running GPU type Tesla K80

20200304_145214: running a simple selftest...
20200304_145218: Selftest statistics
20200304_145218: number of tests 107
20200304_145219: successfull tests 107
20200304_145219: selftest PASSED!
20200304_145219: Bootstrap finished. Exiting.[/code]It has a GPU, but it's not doing any TF... not sure if it's doing P-1 but I don't see any comment about that either.

My other instance I started at the same time is also borked:[code]Beginning GPU Trial Factoring Environment Bootstrapping...
Please see https://www.gpu72.com/ for additional details.

20200304_145330: GPU72 TF V0.42 Bootstrap starting (now with CPU support!)...
20200304_145330: Working as "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...

20200304_145330: Installing needed packages
20200304_145335: Fetching initial work...
20200304_145336: Running GPU type Tesla T4

20200304_145336: running a simple selftest...
20200304_145340: Selftest statistics
20200304_145340: number of tests 107
20200304_145340: successfull tests 107
20200304_145340: selftest PASSED!
20200304_145340: Bootstrap finished. Exiting.[/code]

chalsall 2020-03-04 15:07

[QUOTE=James Heinrich;538863]Does this make sense?[/QUOTE]

No!!! Grrr...

Please try rerunning the Sections. According to the DB you /were/ issued work.

Edit: Actually, one of your three instances was issued TF work; the other two were only issued P-1 work. This shouldn't happen.

Working theory: a DNS lookup failure could explain this. I'll add a check to the Comms script module to retry if it doesn't successfully get the first batch of work. But simply rerunning your failed sections should fix the issue right now.
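The retry check described above could be a small wrapper around the initial work fetch. A sketch under the stated working theory (function names are illustrative, not the actual Comms module API):

```python
import time

def fetch_with_retry(fetch, attempts: int = 5, base_delay: float = 2.0):
    """Call fetch() until it returns a non-empty batch of work.

    A transient DNS failure typically raises OSError (socket.gaierror)
    or yields nothing; back off exponentially between attempts instead
    of letting the bootstrap exit with no work assigned.
    """
    for attempt in range(attempts):
        try:
            work = fetch()
            if work:                 # got at least one assignment
                return work
        except OSError:              # e.g. DNS lookup failure
            pass
        time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError("no work received after %d attempts" % attempts)
```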

James Heinrich 2020-03-04 15:38

[QUOTE=chalsall;538865]But simply rerunning your failed sections should fix the issue right now.[/QUOTE]I restarted both. The one seems normal:[code]20200304_153523: Exponent TF Level % Done ETA GHzD/D Itr Time | Class #, Seq # | #FCs | SieveRate | SieveP | Uptime
20200304_153538: 100180889 75 to 76 0.1% 1h39m 1102.38 6.236s | 0/4620, 1/960 | 40.81G | 6544.7M/s | 82485 | 0:02
20200304_153538: 100969277 P-1 77 0.00% Stage: 1
20200304_153643: 100180889 75 to 76 1.5% 1h37m 1110.57 6.190s | 52/4620, 14/960 | 40.81G | 6593.3M/s | 82485 | 0:04[/code]The other seems to have a lot more P-1 lines than I expect right on init:[code]20200304_153537: Installing needed packages
20200304_153554: Fetching initial work...
20200304_153556: Running GPU type Tesla T4

20200304_153556: running a simple selftest...
20200304_153605: Selftest statistics
20200304_153605: number of tests 107
20200304_153605: successfull tests 107
20200304_153605: selftest PASSED!
20200304_153605: Starting trial factoring M106899509 from 2^75 to 2^76 (71.58 GHz-days)

20200304_153605: Exponent TF Level % Done ETA GHzD/D Itr Time | Class #, Seq # | #FCs | SieveRate | SieveP | Uptime
20200304_153620: 106899509 75 to 76 0.1% 55m42s 1848.60 3.485s | 0/4620, 1/960 | 38.25G | 10974.9M/s | 82485 | 0:45
20200304_153620: 100968493 P-1 77 0.00% Stage: 1
20200304_153620: 100968493 P-1 77 2.64% Stage: 1
20200304_153620: 100968493 P-1 77 5.29% Stage: 1
20200304_153620: 100968493 P-1 77 7.94% Stage: 1
20200304_153620: 100968493 P-1 77 10.59% Stage: 1
20200304_153620: 100969361 P-1 77 0.00% Stage: 1
20200304_153722: 106899509 75 to 76 2.4% 55m52s 1801.06 3.577s | 111/4620, 23/960 | 38.25G | 10692.6M/s | 82485 | 0:47[/code]

chalsall 2020-03-04 16:12

[QUOTE=James Heinrich;538869]I restarted both. The one seems normal: ... The other seems to have a lot more P-1 lines than I expect right on init:[/QUOTE]

Thanks for the data...

What is happening here is you now have two P-1 jobs running in parallel. An interesting edge case. There's nothing we can do about this, but it shows me some deltas I need to make to the payloads to handle these kinds of rare (but not impossible) edge cases.
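One of the deltas hinted at above might be a dedupe guard when a restarted instance re-fetches work, so an assignment already in progress can't be queued a second time. A sketch (illustrative data shapes; not GPU72's actual payload code):

```python
def merge_assignments(existing, fetched):
    """Merge freshly fetched assignments into the queue, dropping any
    whose (worktype, exponent) pair is already present, so a restart
    that re-fetches work cannot start a duplicate P-1 job."""
    seen = {(a["type"], a["exponent"]) for a in existing}
    merged = list(existing)
    for a in fetched:
        key = (a["type"], a["exponent"])
        if key not in seen:
            seen.add(key)
            merged.append(a)
    return merged
```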

Somewhat amusingly, yesterday Chuck had a similar situation. Even though he did a Factory Reset, two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:

PhilF 2020-03-04 16:46

[QUOTE=chalsall;538871]two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:[/QUOTE]

Maybe that's a new tactic they are employing in order to try to drive us away, lol :spinner:

Chuck 2020-03-04 17:36

[QUOTE=chalsall;538871]Thanks for the data...

What is happening here is you now have two P-1 jobs running in parallel. An interesting edge case. There's nothing we can do about this, but it shows me some deltas I need to make to the payloads to handle these kinds of rare (but not impossible) edge cases.

Somewhat amusingly, yesterday Chuck had a similar situation. Even though he did a Factory Reset, two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:[/QUOTE]

Did I do something wrong that enabled this debug uploading?

chalsall 2020-03-04 17:43

[QUOTE=Chuck;538880]Did I do something wrong that enabled this debug uploading?[/QUOTE]

No... I did.

All you did was stop and restart your Sections. Perfectly reasonable.

But then I had made an assumption that was incorrect, and my code started misbehaving. Then I had the code send back debugging information in such situations, not realizing just how large the data would be nor how often it would be sent...

A classic SPE, working in an environment where it's a "you'd better get this correct, because you have no control over it once it's running" situation.

And then, of course, not getting it correct... DWIM!!! :wink:
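A defensive fix for the runaway debug uploads described above would be to cap both the size and the frequency of each report before it leaves the instance. A hedged sketch (hypothetical; the real uploader and its transport are not shown in this thread):

```python
import time

class DebugUploader:
    """Throttle debug uploads: at most one per min_interval seconds,
    truncated to max_bytes, so a misbehaving instance cannot flood
    the server's /var/ at ~0.6 GB/minute."""

    def __init__(self, send, max_bytes: int = 1_000_000,
                 min_interval: float = 300.0, clock=time.monotonic):
        self.send = send            # callable(payload_bytes) doing the upload
        self.max_bytes = max_bytes
        self.min_interval = min_interval
        self.clock = clock
        self._last = float("-inf")  # time of the last accepted upload

    def upload(self, payload: bytes) -> bool:
        now = self.clock()
        if now - self._last < self.min_interval:
            return False            # too soon since last report; drop it
        self._last = now
        self.send(payload[: self.max_bytes])
        return True
```

Because the payload runs where you "have no control over it once it's running", the cap belongs client-side rather than relying on the server to reject oversized reports.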

Chuck 2020-03-04 17:52

[QUOTE=chalsall;538881]No... I did.

All you did was stop and restart your Sections. Perfectly reasonable.

But then I had made an assumption that was incorrect, and my code started misbehaving. Then I had the code send back debugging information in such situations, not realizing just how large the data would be nor how often it would be sent...

A classic SPE, working in an environment where it's a "you'd better get this correct, because you have no control over it once it's running" situation.

And then, of course, not getting it correct... DWIM!!! :wink:[/QUOTE]

Sometimes when I restart a session, the little spinning indicator in the upper-left corner scrolls up out of sight when I scroll the window to the bottom of the screen. When this happens, I have found that scrolling back to the top of the window and re-clicking "Default" on the logging level corrects the problem, and the indicator no longer scrolls off the top.

I thought I might have accidentally selected "Verbose".

Chuck 2020-03-04 21:04

Colab restarts
 
My sessions expired after 24 hours as usual. There were three sessions and as I restarted each, it went through the bootstrap process and exited immediately. I restarted the three sessions again and they then ran normally.

Evidently I picked up an extra P-1 assignment as each session is displaying two different P-1 progress lines starting at 0.00%.

