mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

chalsall 2020-03-04 21:18

[QUOTE=Chuck;538893]My sessions expired after 24 hours as usual. There were three sessions and as I restarted each, it went through the bootstrap process and exited immediately. I restarted the three sessions again and they then ran normally.[/QUOTE]

OK... Thanks for the data. Having audited the code, a DNS lookup failure is the only remaining possibility. This is supported by the fact that your instances only asked for P-1 work, but not TF work.

I've got a fall-back contingency worked out in my head, which I'll implement shortly.
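One plausible shape for such a fall-back, sketched as a small shell retry wrapper (the function name, attempt count, and delay are illustrative, not the actual GPU72 Bootstrap code):

```shell
# retry: run a command up to $1 times, sleeping $2 seconds between
# attempts, so a transient DNS failure doesn't abort the whole fetch.
retry() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    sleep "$delay"
  done
  return 1
}

# Hypothetical usage: ten attempts spread over about a minute, e.g.
# retry 10 6 curl -fsS "$WORK_URL" -o worktodo.txt
```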

[QUOTE=Chuck;538893]Evidently I picked up an extra P-1 assignment as each session is displaying two different P-1 progress lines starting at 0.00%.[/QUOTE]

Yes... But the good news is the code is sane (or, at least, not insane).

You'll pick up those initial P-1 assignments in due course. I don't yet have the system expiring abandoned assignments without work done on them. Later.

Chuck 2020-03-04 22:53

I started a CPU-only session just for fun
 
Just to see what it would look like, I started an additional Colab session not requesting a GPU. Now I can see the scrolling P-1 messages.

I see the code stops and restarts the process every hour when it sends the progress message to the server. Interesting to see. I wonder if the session will stop after 24 hours like my other Colab sessions do.

chalsall 2020-03-04 23:17

[QUOTE=Chuck;538898]I see the code stops and restarts the process every hour when it sends the progress message to the server. Interesting to see.[/QUOTE]

Yeah... I'm going to have to figure out how to have it *not* do that.

What is happening is the mprime process is getting "new settings" from the Primenet server (through the GPU72 proxy). I need to intercept those messages and ensure the running client doesn't see any changes. Rather wasteful having it stop and restart, particularly during Stage 2...
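A hypothetical sketch of that interception: if the proxied Primenet response were line-oriented, the proxy could strip settings-change lines before the reply reaches mprime. The "Settings=" prefix below is purely an assumption for illustration; the real protocol differs.

```shell
# Hypothetical filter in the GPU72 proxy: drop server-pushed settings
# lines from the response body so the running mprime client never
# sees a configuration change and therefore doesn't restart.
# ("Settings=" is an assumed marker, not the actual Primenet format.)
strip_settings() {
  grep -v '^Settings='
}
```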

And, indeed... Please let us know how long your CPU-only session lasts.

Prime95 2020-03-05 00:16

Can you turn off auto benchmarking? (Add AutoBench=0 to prime.txt)

chalsall 2020-03-05 00:25

[QUOTE=Prime95;538906]Can you turn off auto benchmarking? (Add AutoBench=0 to prime.txt)[/QUOTE]

Thanks! Done.
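One idempotent way the setting could be applied from a setup script (assuming prime.txt sits in the current working directory):

```shell
# Append AutoBench=0 to prime.txt unless the exact line is already
# there, so rerunning the setup script doesn't duplicate the entry.
grep -qx 'AutoBench=0' prime.txt 2>/dev/null || echo 'AutoBench=0' >> prime.txt
```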

chalsall 2020-03-05 14:59

Kinda cool...
 
1 Attachment(s)
Just to share a screenshot that I think is kinda cool...

It's from my main workstation, where I run four Colab sessions in parallel. For each, I'll often set up reverse SSH and HTTP tunnels in order to observe what's happening in the background.

This is just short of twelve hours of a CPU-only run (just about to finish a 100M P-1).

James Heinrich 2020-03-05 16:12

[QUOTE=James Heinrich;538863]Does this make sense? :unsure:[/quote][QUOTE=chalsall;538865]No!!! Grrr...
Please try rerunning the Sections. According to the DB you /were/ issued work.
Working theory: a DNS lookup failure could explain this. I'll add a check to the Comms script module to retry if it doesn't successfully get the first batch of work. But simply rerunning your failed sections should fix the issue right now.[/QUOTE]If you care, it just happened again, across 3 instances I started at about the same time.

I waited a couple of minutes and restarted each and they all worked this time, each with 2x initial P-1 lines (one halfway through stage 1, one at 0%) so presumably I got assigned work on the first attempt but my instance didn't receive it(?)

chalsall 2020-03-05 16:30

[QUOTE=James Heinrich;538950]If you care it just happened again, across 3 instances I started at about the same time.[/QUOTE]

I care very much!!! Thanks for the data! Important.

This means my attempted fix (looping for ten attempts over a minute) didn't work.

Hmmm...

[QUOTE=James Heinrich;538950]I waited a couple of minutes and restarted each and they all worked this time, each with 2x initial P-1 lines (one halfway through stage 1, one at 0%) so presumably I got assigned work on the first attempt but my instance didn't receive it(?)[/QUOTE]

What this means is your first launch attempts (for all three sessions) successfully got the Bootstrap and the CPU Payload (with the P-1 work). But they /didn't/ ask for TF work (different communications channel).

OK. I'll meditate on this, and come up with an angle of attack... I'm thinking of throwing a fall-back DNS entry into /etc/hosts, and seeing if that fixes this issue.
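A minimal sketch of that /etc/hosts fall-back (the IP and hostname below are placeholders, and HOSTS_FILE is parameterised so the idea can be tried without root):

```shell
# Append a static fall-back entry to the hosts file unless it is
# already present, so name resolution survives a transient DNS outage.
HOSTS_FILE=${HOSTS_FILE:-/etc/hosts}
add_hosts_fallback() {
  local entry="$1 $2"   # e.g. "203.0.113.10 some.server.example"
  grep -qF "$entry" "$HOSTS_FILE" || echo "$entry" >> "$HOSTS_FILE"
}
```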

The good news is even though you saw the initial status line from the first P-1 run, it should have been cleanly killed. As in, you should only be running a single P-1 job on each instance.

Man, tricky stuff. Loving it!!! :smile:

Chuck 2020-03-05 17:11

Two different progress messages
 
1 Attachment(s)
I am seeing these two different messages on one of the Colab instances — almost like it is processing the same exponent twice at the same time?

chalsall 2020-03-05 17:25

[QUOTE=Chuck;538956]I am seeing these two different messages on one of the Colab instances — almost like it is processing the same exponent twice at the same time?[/QUOTE]

Hmmm... I think you might be correct... Sorry about that...

I'm /pretty/ sure this won't happen with the code served out since last night (~0200 UTC). I introduced several redundant shutdown request vectors which should ensure there are no unwanted parallel runs going on.

It's /probably/ safe to try stopping and restarting your GPU72_TF Section. Or, you could just wait for them to expire when expected. The worst that's happening now is you might only be using ~50% of a CPU, rather than 100%.

Uncwilly 2020-03-05 18:35

I have been seeing the same thing as James has been reporting, with the same fix. I have been ill the past few days and haven't felt like posting.



Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.