mersenneforum.org > Great Internet Mersenne Prime Search > PrimeNet > GPU to 72

2020-03-02, 11:53   #4687
James Heinrich

Quote:
Originally Posted by chalsall
OK... Fixed the CRON-based uploader issue.
For those of us who were caught by this problem, there seem to be P-1 assignments languishing. If you have checkpoints available for them, please make sure they get processed; or, if (as I presume the problem to be) there is no checkpoint file, I guess you can throw them back into the pool?

2020-03-02, 13:25   #4688
bayanne

So when the GPU part finishes, how does one keep the CPU part running?

2020-03-02, 15:58   #4689
chalsall

Quote:
Originally Posted by James Heinrich
For those of us who were caught by this problem, there seem to be P-1 assignments languishing. If you have checkpoints available for them, please make sure they get processed; or, if (as I presume the problem to be) there is no checkpoint file, I guess you can throw them back into the pool?
OK... Just so everyone knows: everyone is now running this code, Version 0.42.

However, only those whose PrimeNet username the system knows are assigned P-1 work. I'll have the form ready to enter this information into the system later today.

As with Colab TF assignments, once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion time are updated every hour.
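
For the curious, the Checkpointer boils down to a loop like the following (a minimal sketch only -- the file glob pattern and the upload endpoint here are illustrative, not the actual code):

Code:
import glob
import time

import requests

UPLOAD_URL = "https://example.org/checkpoint_upload"  # illustrative endpoint
INTERVAL = 600  # ten minutes, in seconds

def upload_checkpoints():
    # mprime leaves its P-1 save files in the working directory; the
    # "m*" naming pattern is an assumption here, not gospel.
    for path in glob.glob("m*"):
        with open(path, "rb") as fh:
            requests.post(UPLOAD_URL, files={"checkpoint": (path, fh)})

if __name__ == "__main__":
    while True:
        upload_checkpoints()
        time.sleep(INTERVAL)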

The GPU Bootstrap code now has IPC with the CPU Payload, so at the start of the run and every ~30 minutes a line like "100970xxx P-1 77 19.47% Stage: 1" is displayed. This could be made every ten minutes if people would prefer more frequent reporting.
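
Conceptually the channel is trivial -- something like this sketch, assuming a named pipe (the FIFO path and exact message format are illustrative, not the actual mechanism):

Code:
import errno
import os

FIFO = "/tmp/gpu72_status"  # illustrative path

try:
    os.mkfifo(FIFO)
except OSError as e:
    if e.errno != errno.EEXIST:
        raise

# Payload side: write one status line per report.
def report_status(exponent, bits, pct, stage):
    with open(FIFO, "w") as pipe:  # blocks until the Bootstrap opens its end
        pipe.write(f"{exponent} P-1 {bits} {pct:.2f}% Stage: {stage}\n")

# Bootstrap side: read and display whatever the Payload sent.
def show_status():
    with open(FIFO) as pipe:
        print(pipe.read(), end="")

A named pipe keeps the two sides nicely decoupled: neither process needs to know the other's PID.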

I'm "eating my own dog food" with this, and watching the logs closely. But if anyone sees anything strange please let me know. This is the "edge-case" phase.

"Why did they do that?" That's a two part question. The answer to first question is "Why not?" The answer to the second question is "Yes."

2020-03-02, 16:06   #4690
chalsall

Quote:
Originally Posted by bayanne
So when the GPU part finishes, how does one keep the CPU part running?
When the GPU Section finishes, it means the Instance has been killed. Read: both the GPU and CPU jobs have been terminated.

However, if you're not able to get a GPU backend, this can still run CPU-only. Just answer "Connect" when asked if you want a CPU-only backend, and then run as usual.

Currently, there's a massive amount of debugging output when running in this mode. I'll clean that up later today -- it has no effect on the background work going on.

My modus operandi with this has been to ask for a GPU Instance, and proceed as usual if I get one. If not, I ask for the CPU Instance and run the Section. Then, every few hours, I ask for a GPU from those contexts which are currently running CPU-only. If I get one, I run the GPU72_TF Section, which then launches the GPU and CPU parallel workers.

Interestingly, I've found that the CPU Instance which is "replaced" by the GPU Instance (as far as the Web GUI is concerned) continues running for up to several hours...

2020-03-02, 16:18   #4691
chalsall

Quote:
Originally Posted by petrw1
Still seeing the sessions die after an hour or two. If the session that is running both GPU-TF and CPU-P1 dies … and I restart it, is it safe to assume that both the TF and P-1 will be picked up again after the restart?
A trick I've found works well for instance longevity...

Once you've been given an instance (CPU or GPU), click on the "RAM / Disk" drop-down menu in the upper right-hand side of the interface, and choose "Connect to hosted runtime".

If you're running a GPU, this will immediately reconnect and then show "Busy". If you're running only a CPU, it will show the same "Do you want a CPU only?" prompt, but after "Connecting" you're reattached to the same (running) CPU instance.

This seems to basically be a way of telling the system that you know this is going to run for a long time, and that it shouldn't expect further human interaction.

Someone else figured this out; I can't remember who nor where it was posted. Probably on the long Colab thread.

Quote:
Originally Posted by petrw1
Any idea if the session would live longer if I ran ONLY P-1?
That seems to be more in need anyway.
I think the CPU-only instances tend to last for 12 hours or so, but the GPUs are coveted and vary considerably in their runtimes and compute kit.

Also, we still need as much GPU TF'ing as we can get. Need to stay ahead of the Cat 3 and 4s!

2020-03-02, 23:06   #4692
Chuck

Quote:
Originally Posted by chalsall

As with Colab TF assignments, once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion time are updated every hour.

The GPU Bootstrap code now has IPC with the CPU Payload, so at the start of the run and every ~30 minutes a line like "100970xxx P-1 77 19.47% Stage: 1" is displayed. This could be made every ten minutes if people would prefer more frequent reporting.
I'd vote to see the updated P-1 information every ten minutes.

2020-03-03, 00:21   #4693
petrw1

Quote:
Originally Posted by chalsall
once a P-1 assignment has been issued it is held by the user until completion. Checkpoint files are uploaded every ten minutes, and the Percent completed and estimated completion time are updated every hour.
I could be senile (okay, more than typical senility for my age), but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2.
When I restarted all the workers this morning, they were all at stage 1 and no more than 60%.
I have NO P-1 completions.

2020-03-03, 01:01   #4694
chalsall

Quote:
Originally Posted by petrw1
I could be senile (okay, more than typical senility for my age), but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2. When I restarted all the workers this morning, they were all at stage 1 and no more than 60%. I have NO P-1 completions.
There /might/ be something strange going on, for a /few/ people. I haven't figured out what's happening yet -- it's extremely tricky debugging what I can't see or access. And, of course, all of my various tests are running perfectly fine -- the exact same code paths as everyone else's.

I've added some code to send the working directory listing back to the server when the checkpointing code doesn't seem sane.
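
Something along these lines (illustrative only -- the "sane" test and the reporting endpoint are stand-ins for what the real code checks):

Code:
import os

import requests

REPORT_URL = "https://example.org/debug_report"  # illustrative endpoint

def checkpoint_sanity_check(workdir="."):
    files = sorted(os.listdir(workdir))
    # Stand-in sanity test: once a P-1 run is underway, mprime should
    # have left at least one save file in the working directory.
    sane = any(name.startswith("m") for name in files)
    if not sane:
        # Phone home with the listing so the server logs show what
        # the instance's working directory actually looked like.
        requests.post(REPORT_URL, data={"listing": "\n".join(files)})
    return sane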

2020-03-03, 06:26   #4695
bayanne

Quote:
Originally Posted by petrw1
I could be senile (okay, more than typical senility for my age), but I'm quite sure that last night before bed I had a P-1 assignment reporting stage 2.
When I restarted all the workers this morning, they were all at stage 1 and no more than 60%.
I have NO P-1 completions.
That has happened to me too.

2020-03-03, 14:57   #4696
chalsall

Quote:
Originally Posted by bayanne
That has happened to me too.
OK... I think I've figured out what's going on... The script does not handle stopping and restarting well. I /thought/ that any forked processes would get killed when the Notebook is interrupted, but this isn't true.
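
For anyone curious, the fix amounts to keeping track of everything we fork and interrupting it explicitly -- roughly like this (a sketch, assuming the Payload is launched via subprocess; none of this is the verbatim code):

Code:
import atexit
import signal
import subprocess

children = []

def launch(cmd):
    # Remember every child we start so an interrupt can reach it;
    # otherwise it happily outlives the Notebook cell that forked it.
    proc = subprocess.Popen(cmd)
    children.append(proc)
    return proc

def reap(*_):
    # Interrupt each child, then wait so checkpoints get written out.
    for proc in children:
        if proc.poll() is None:
            proc.send_signal(signal.SIGINT)
    for proc in children:
        proc.wait()

def on_interrupt(signum, frame):
    reap()
    raise KeyboardInterrupt

signal.signal(signal.SIGINT, on_interrupt)
atexit.register(reap)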

Working on a fix now...

2020-03-03, 17:34   #4697
chalsall

Quote:
Originally Posted by chalsall
Working on a fix now...
OK... The Bootstrap module will now SIGINT the CPUWrapper module, which in turn SIGINTs the Payload module, which in turn SIGINTs the mprime process...

The upside of this is that, after mprime exits, the Payload module gives the Checkpointer a chance to upload the just-written checkpoint file.
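
Each layer of that chain does the same little dance: forward the SIGINT, wait, clean up, exit. In sketch form (the command path is illustrative, and upload_final_checkpoint is a stand-in for the Checkpointer's last pass):

Code:
import signal
import subprocess
import sys

child = subprocess.Popen(["./mprime", "-d"])  # the next layer down; path illustrative

def upload_final_checkpoint():
    # Stand-in for the Checkpointer's final upload pass
    # (see the earlier sketch).
    pass

def forward_sigint(signum, frame):
    child.send_signal(signal.SIGINT)  # mprime writes its save file on the way out
    child.wait()                      # let that write finish
    upload_final_checkpoint()         # one last chance to upload it
    sys.exit(0)

signal.signal(signal.SIGINT, forward_sigint)
child.wait()  # otherwise, just babysit the child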

Please /don't/ stop and then restart a running Colab session just to get this new code; all future runs will pick it up.