mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

James Heinrich 2020-03-02 11:53

[QUOTE=chalsall;538547]OK... Fixed the CRON based uploader issue.[/QUOTE]For those of us who were caught by this problem, there seem to be P-1 assignments languishing. If you have checkpoints available for them, please make sure they get processed; or, if (as I presume the problem to be) there is no checkpoint file, I guess you can throw them back into the pool?

bayanne 2020-03-02 13:25

So when the GPU part finishes, how does one keep the CPU part running?

chalsall 2020-03-02 15:58

[QUOTE=James Heinrich;538697]For those of us who were caught by this problem, there seem to be P-1 assignments languishing. If you have checkpoints available for them, please make sure they get processed; or, if (as I presume the problem to be) there is no checkpoint file, I guess you can throw them back into the pool?[/QUOTE]

OK... Just so everyone knows, everyone is running this code now. Version 0.42.

However, only those whose Primenet Username the system knows are assigned P-1'ing work. I'll have the form ready for entering this information into the system later today.

Like with Colab TF assignments, once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion are updated every hour.

The GPU Bootstrap code now has IPC with the CPU Payload, so at the start of the run, and every ~30 minutes thereafter, a line like "100970xxx P-1 77 19.47% Stage: 1" is displayed. This can be changed to every ten minutes if people would prefer more frequent reporting.
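The periodic reporting described above could look something like this minimal Python sketch. The function names and field layout are assumptions inferred from the sample line in the post, not the actual GPU72 code:

```python
import time

def format_status(exponent: int, bits: int, pct: float, stage: int) -> str:
    """Build a one-line progress report like '100970xxx P-1 77 19.47% Stage: 1'."""
    return f"{exponent} P-1 {bits} {pct:.2f}% Stage: {stage}"

def report_loop(get_progress, interval_s: int = 30 * 60):
    """Print a status line at start-up and then every `interval_s` seconds.

    `get_progress` is a hypothetical callback returning the current
    (exponent, bits, percent, stage) tuple read from the CPU Payload.
    """
    while True:
        exponent, bits, pct, stage = get_progress()
        print(format_status(exponent, bits, pct, stage), flush=True)
        time.sleep(interval_s)
```

Switching to the ten-minute cadence some users asked for would just mean calling `report_loop(..., interval_s=10 * 60)`.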

I'm "eating my own dog food" with this, and watching the logs closely. But if anyone sees anything strange please let me know. This is the "edge-case" phase.

"Why did they do that?" That's a two-part question. The answer to the first question is "Why not?" The answer to the second question is "Yes." :wink:

chalsall 2020-03-02 16:06

[QUOTE=bayanne;538720]So when the GPU part finishes, how does one keep the CPU part running?[/QUOTE]

When the GPU Section finishes it means the Instance has been killed. Read: both the GPU and CPU jobs have been terminated.

However, if you're not able to get a GPU backend this can still run CPU only. Just answer "Connect" when asked if you want a CPU only backend, and then run as usual.

Currently, there's a massive amount of debugging output when running in this mode. I'll clean that up later today -- no effect on the background work going on.

My modus operandi with this has been to ask for a GPU Instance, and proceed as usual if I get one. If not, I ask for the CPU Instance and run the Section. Then, every few hours I ask for a GPU from those contexts which are currently running CPU only. If I get one I run the GPU72_TF Section, which then launches the GPU and CPU parallel workers.

Interestingly, I've found that the CPU Instance which is "replaced" by the GPU Instance (as far as the Web GUI is concerned) continues running for up to several hours... :smile:

chalsall 2020-03-02 16:18

[QUOTE=petrw1;538683]Still seeing the sessions die after an hour or two. If the session dies that is running both GPU-TF and CPU-P1 … and I restart it is it safe to assume that both the TF and P1 will be picked up again after the restart?[/QUOTE]

A trick I've found works well for instance longevity...

Once you've been given an instance (CPU or GPU), click on the "RAM / Disk" drop-down menu in the upper right-hand side of the interface, and choose "Connect to hosted runtime".

If you're running a GPU, this will immediately reconnect and then show "Busy". If you're running only a CPU it will then give you the same "Do you want a CPU only?" prompt, but after "Connecting" you're reattached to the same (running) CPU instance.

This seems to basically be a way to tell the system that you know this is going to run for a long time. Don't expect further human interaction.

Someone else figured this out; I can't remember who nor where it was posted. Probably on the long Colab thread.

[QUOTE=petrw1;538683]Any idea if the session would live longer if I ran ONLY P1.
That seems to be more in need anyway.[/QUOTE]

I think the CPU only instances tend to last for 12 hours or so, but the GPUs are coveted and vary considerably in their runtimes and compute kit.

Also, we still need as much GPU TF'ing as we can get. Need to stay ahead of the Cat 3 and 4's! :smile:

Chuck 2020-03-02 23:06

[QUOTE=chalsall;538735]

Like with Colab TF assignments, once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion is updated every hour.

The GPU Bootstrap code now has IPC with the CPU Payload, so at the start of the run and every ~30 minutes a line like "100970xxx P-1 77 19.47% Stage: 1" is displayed. This can be made every ten minutes if people would prefer more frequent reporting.
[/QUOTE]

I'd vote to see the updated P-1 information every ten minutes.

petrw1 2020-03-03 00:21

[QUOTE=chalsall;538735]once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion is updated every hour.
[/QUOTE]

I could be senile (okay more than typical senility for my age) but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2.
When I restarted all the workers this morning they were all stage 1 and no more than 60%.
I have NO P-1 completions.

chalsall 2020-03-03 01:01

[QUOTE=petrw1;538773]I could be senile (okay more than typical senility for my age) but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2. When I restarted all the workers this morning they were all stage 1 and no more than 60%. I have NO P-1 completions.[/QUOTE]

There /might/ be something strange going on, for a /few/ people. I haven't figured out what's happening yet -- extremely tricky debugging what I can't see nor access. And, of course, all of my various tests are running perfectly fine -- exact same code paths as everyone else.

I've added some code to send back to the server the working directory when the checkpointing code doesn't seem sane.
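A sanity check of the kind described above could be sketched in Python as follows; `checkpoint_looks_sane` and `directory_snapshot` are hypothetical names for illustration, not the actual GPU72 code:

```python
import os
import time

def checkpoint_looks_sane(path: str, max_age_s: int = 3600) -> bool:
    """Treat a checkpoint as 'sane' if it exists and was written recently."""
    try:
        return (time.time() - os.path.getmtime(path)) < max_age_s
    except OSError:
        return False

def directory_snapshot(workdir: str) -> str:
    """Produce a one-line-per-file listing (name, size, mtime) of the
    working directory, suitable for shipping back to the server when the
    checkpoint state looks wrong."""
    lines = []
    for name in sorted(os.listdir(workdir)):
        full = os.path.join(workdir, name)
        lines.append(f"{name}\t{os.path.getsize(full)}\t{int(os.path.getmtime(full))}")
    return "\n".join(lines)
```

The idea is simply that when the client can't be inspected directly, a snapshot of its working directory is the next best diagnostic.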

bayanne 2020-03-03 06:26

[QUOTE=petrw1;538773]I could be senile (okay more than typical senility for my age) but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2.
When I restarted all the workers this morning they were all stage 1 and no more than 60%.
I have NO P-1 completions.[/QUOTE]

That has happened to me too.

chalsall 2020-03-03 14:57

[QUOTE=bayanne;538789]That has happened to me too[/QUOTE]

OK... I think I've figured out what's going on... The script does not handle stopping and restarting well. I /thought/ that any forked processes get killed when the Notebook is interrupted, but this isn't true.

Working on a fix now...
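One plausible mechanism for the surviving-children behaviour described above, sketched in Python: a worker launched in its own process group does not receive the SIGINT that an interrupt delivers to the parent's group, so it has to be tracked and signalled explicitly. The helper names here are hypothetical, not the actual GPU72 code:

```python
import signal
import subprocess

def spawn_worker(cmd):
    # start_new_session=True puts the child in its own session and process
    # group, so a Ctrl-C / notebook "interrupt" aimed at the parent's group
    # never reaches it -- the child keeps running until signalled directly.
    return subprocess.Popen(cmd, start_new_session=True)

def stop_worker(proc, grace_s=10):
    """Explicitly SIGINT the worker and give it time to shut down cleanly
    (e.g. to write a final checkpoint) before resorting to SIGKILL."""
    proc.send_signal(signal.SIGINT)
    try:
        proc.wait(timeout=grace_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort
```

The fix, then, is for the parent to keep handles on every worker it forks and to call something like `stop_worker` on each of them when the run is interrupted.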

chalsall 2020-03-03 17:34

[QUOTE=chalsall;538809]Working on a fix now...[/QUOTE]

OK... The Bootstrap module will now SIGINT the CPUWrapper module, which in turn SIGINTs the Payload module, which in turn SIGINTs the mprime process...

The upside of this is after mprime exits the Payload module gives the Checkpointer a chance to upload the just-written checkpoint file.

Please /don't/ stop and then restart a running Colab session to get this new code. But all future runs will pick this up.
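The shutdown cascade described above might be wired roughly like this Python sketch. The module names come from the post, but the handler code itself is an assumption, not the actual GPU72 source:

```python
import signal
import subprocess
import sys

class Payload:
    """Owns the mprime subprocess. On SIGINT (forwarded down the chain
    Bootstrap -> CPUWrapper -> Payload), it SIGINTs mprime, waits for it
    to exit, then gives the Checkpointer a chance to upload the
    just-written checkpoint file before exiting itself."""

    def __init__(self, child: subprocess.Popen, upload_checkpoint):
        self.child = child
        self.upload_checkpoint = upload_checkpoint  # hypothetical callback
        signal.signal(signal.SIGINT, self._on_sigint)

    def _on_sigint(self, signum, frame):
        self.child.send_signal(signal.SIGINT)  # mprime writes its checkpoint
        self.child.wait()
        self.upload_checkpoint()               # push the final checkpoint file
        sys.exit(0)
```

Each layer in the chain would look much the same: catch SIGINT, forward it to the process one level down, and only exit once that process has finished its cleanup.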

