Opting to work on CPU tasks as well has meant that I am now getting GPU instances less frequently.
I am wondering whether this is somewhat counterproductive...
[QUOTE=bayanne;538855]Opting to work on CPU tasks as well has meant that I am now getting GPU instances less frequently. I am wondering whether this is somewhat counterproductive...[/QUOTE]
While it is impossible to guess what Google's algorithms are weighting, I don't /think/ so. More likely what we're observing is the ebb and flow of demand vs. availability. But to test your theory, change your CPU Worktype to "Disabled" and the CPU won't be used (in such cases the CPU payload provided is just a sleep(forever) call).
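For illustration only (this is a sketch of the idea, not the actual GPU72 payload, and the function name is hypothetical): a "Disabled" CPU payload can literally be a loop that sleeps forever, keeping the CPU slot idle while the GPU worker runs independently.

```python
# Hypothetical sketch of a "Disabled" CPU payload: it never fetches CPU
# work, it just idles so the GPU worker has the machine to itself.
import time

def cpu_payload_disabled():
    """Idle indefinitely; effectively sleep(forever)."""
    while True:
        time.sleep(3600)  # wake hourly only so the process stays schedulable
```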
Does this make sense? :unsure:
[code]Beginning GPU Trial Factoring Environment Bootstrapping...
Please see https://www.gpu72.com/ for additional details.
20200304_145207: GPU72 TF V0.42 Bootstrap starting (now with CPU support!)...
20200304_145207: Working as "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...
20200304_145207: Installing needed packages
20200304_145213: Fetching initial work...
20200304_145213: Running GPU type Tesla K80
20200304_145214: running a simple selftest...
20200304_145218: Selftest statistics
20200304_145218: number of tests 107
20200304_145219: successfull tests 107
20200304_145219: selftest PASSED!
20200304_145219: Bootstrap finished. Exiting.[/code]It has a GPU, but it's not doing any TF... not sure if it's doing P-1, but I don't see any comment about that either. My other instance, started at the same time, is also borked:[code]Beginning GPU Trial Factoring Environment Bootstrapping...
Please see https://www.gpu72.com/ for additional details.
20200304_145330: GPU72 TF V0.42 Bootstrap starting (now with CPU support!)...
20200304_145330: Working as "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"...
20200304_145330: Installing needed packages
20200304_145335: Fetching initial work...
20200304_145336: Running GPU type Tesla T4
20200304_145336: running a simple selftest...
20200304_145340: Selftest statistics
20200304_145340: number of tests 107
20200304_145340: successfull tests 107
20200304_145340: selftest PASSED!
20200304_145340: Bootstrap finished. Exiting.[/code]
[QUOTE=James Heinrich;538863]Does this make sense?[/QUOTE]
No!!! Grrr... Please try rerunning the Sections. According to the DB you /were/ issued work. Edit: Actually, one of your three instances was issued TF work; the other two were only issued P-1 work. This shouldn't happen. Working theory: a DNS lookup failure could explain this. I'll add a check to the Comms script module to retry if it doesn't successfully get the first batch of work. But simply rerunning your failed sections should fix the issue right now.
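A minimal sketch of that kind of retry, assuming a hypothetical `fetch_work()` call (the function name, retry count, and backoff values are all illustrative, not the actual Comms module code):

```python
# Retry the initial work fetch instead of proceeding with an empty batch,
# e.g. after a transient DNS lookup failure. fetch_work is a stand-in for
# whatever the Comms module actually calls.
import time

def fetch_initial_work(fetch_work, retries=5, delay=10):
    """Return the first non-empty batch of work, retrying on failure."""
    for attempt in range(retries):
        try:
            batch = fetch_work()
            if batch:  # got at least one assignment
                return batch
        except OSError:  # covers DNS lookup / socket errors
            pass
        time.sleep(delay * (attempt + 1))  # simple linear backoff
    raise RuntimeError("no work received after %d attempts" % retries)
```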
[QUOTE=chalsall;538865]But simply rerunning your failed sections should fix the issue right now.[/QUOTE]I restarted both. The one seems normal:[code]20200304_153523: Exponent TF Level % Done ETA GHzD/D Itr Time | Class #, Seq # | #FCs | SieveRate | SieveP | Uptime
20200304_153538: 100180889 75 to 76 0.1% 1h39m 1102.38 6.236s | 0/4620, 1/960 | 40.81G | 6544.7M/s | 82485 | 0:02
20200304_153538: 100969277 P-1 77 0.00% Stage: 1
20200304_153643: 100180889 75 to 76 1.5% 1h37m 1110.57 6.190s | 52/4620, 14/960 | 40.81G | 6593.3M/s | 82485 | 0:04[/code]The other seems to have a lot more P-1 lines than I expect right on init:[code]20200304_153537: Installing needed packages
20200304_153554: Fetching initial work...
20200304_153556: Running GPU type Tesla T4
20200304_153556: running a simple selftest...
20200304_153605: Selftest statistics
20200304_153605: number of tests 107
20200304_153605: successfull tests 107
20200304_153605: selftest PASSED!
20200304_153605: Starting trial factoring M106899509 from 2^75 to 2^76 (71.58 GHz-days)
20200304_153605: Exponent TF Level % Done ETA GHzD/D Itr Time | Class #, Seq # | #FCs | SieveRate | SieveP | Uptime
20200304_153620: 106899509 75 to 76 0.1% 55m42s 1848.60 3.485s | 0/4620, 1/960 | 38.25G | 10974.9M/s | 82485 | 0:45
20200304_153620: 100968493 P-1 77 0.00% Stage: 1
20200304_153620: 100968493 P-1 77 2.64% Stage: 1
20200304_153620: 100968493 P-1 77 5.29% Stage: 1
20200304_153620: 100968493 P-1 77 7.94% Stage: 1
20200304_153620: 100968493 P-1 77 10.59% Stage: 1
20200304_153620: 100969361 P-1 77 0.00% Stage: 1
20200304_153722: 106899509 75 to 76 2.4% 55m52s 1801.06 3.577s | 111/4620, 23/960 | 38.25G | 10692.6M/s | 82485 | 0:47[/code]
[QUOTE=James Heinrich;538869]I restarted both. The one seems normal: ... The other seems to have a lot more P-1 lines than I expect right on init:[/QUOTE]
Thanks for the data... What is happening here is you now have two P-1 jobs running in parallel. An interesting edge case. There's nothing we can do about this, but it shows me some deltas I need to make to the payloads to handle these kinds of rare (but not impossible) edge cases. Somewhat amusingly, yesterday Chuck had a similar situation. Even though he did a Factory Reset, two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:
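One common way to keep two P-1 jobs from running in parallel (purely an illustrative sketch, not the actual payload code; the path and function name are assumptions) is an exclusive, non-blocking lock file: the second worker sees the lock is held and skips the job.

```python
# Take an exclusive lock before starting a P-1 job; if another worker
# already holds it, back off. fcntl.flock is Unix-only (fine for Colab).
import fcntl
import os

def try_acquire_p1_lock(path="/tmp/p1.lock"):
    """Return an open fd holding the lock, or None if already locked."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd  # caller keeps fd open for the lifetime of the job
    except BlockingIOError:
        os.close(fd)
        return None  # another P-1 job is running
```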
[QUOTE=chalsall;538871]two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:[/QUOTE]
Maybe that's a new tactic they're employing to try to drive us away, lol :spinner:
[QUOTE=chalsall;538871]Thanks for the data...
What is happening here is you now have two P-1 jobs running in parallel. An interesting edge case. There's nothing we can do about this, but it shows me some deltas I need to make to the payloads to handle these kinds of rare (but not impossible) edge cases. Somewhat amusingly, yesterday Chuck had a similar situation. Even though he did a Factory Reset, two of his instances continued uploading debugging information for 24 hours; ~0.6 GB/minute... My /var/ was not amused... :wink:[/QUOTE] Did I do something wrong that enabled this debug uploading?
[QUOTE=Chuck;538880]Did I do something wrong that enabled this debug uploading?[/QUOTE]
No... I did. All you did was stop and restart your Sections. Perfectly reasonable. But then I had made an assumption that was incorrect, and my code started misbehaving. Then I had the code send back debugging information in such situations, not realizing just how large the data would be nor how often it would be sent... A classic SPE, working in an environment where it's a "you'd better get this correct, because you have no control over it once it's running" situation. And then, of course, not getting it correct... DWIM!!! :wink:
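The missing guard can be sketched as a rate limiter on debug uploads (illustrative only; the class name, byte limit, and window size are assumptions, not the real code):

```python
# Cap how much debugging data a remote worker may send per time window,
# so a misbehaving instance can't flood the server's /var/ partition.
import time

class DebugUploadLimiter:
    """Allow at most max_bytes of debug payload per window_secs."""

    def __init__(self, max_bytes=10 * 1024 * 1024, window_secs=3600):
        self.max_bytes = max_bytes
        self.window_secs = window_secs
        self.window_start = time.monotonic()
        self.sent = 0

    def allow(self, payload_len):
        now = time.monotonic()
        if now - self.window_start >= self.window_secs:
            self.window_start, self.sent = now, 0  # start a fresh window
        if self.sent + payload_len > self.max_bytes:
            return False  # over budget: drop the upload client-side
        self.sent += payload_len
        return True
```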
[QUOTE=chalsall;538881]No... I did.
All you did was stop and restart your Sections. Perfectly reasonable. But then I had made an assumption that was incorrect, and my code started misbehaving. Then I had the code send back debugging information in such situations, not realizing just how large the data would be nor how often it would be sent... A classic SPE, working in an environment where it's a "you'd better get this correct, because you have no control over it once it's running" situation. And then, of course, not getting it correct... DWIM!!! :wink:[/QUOTE] Sometimes when I restart a session, the little spinning indicator in the upper left corner scrolls up off the screen out of sight when I scroll the window to the bottom of the screen. When this happens, I've found that if I scroll back to the top of the window and re-click "Default" on the logging level, it corrects the problem and the indicator does not scroll off the top. I thought I might have accidentally selected "Verbose".
Colab restarts
My sessions expired after 24 hours as usual. There were three sessions and as I restarted each, it went through the bootstrap process and exited immediately. I restarted the three sessions again and they then ran normally.
Evidently I picked up an extra P-1 assignment, as each session is displaying two different P-1 progress lines starting at 0.00%.