mersenneforum.org  

Great Internet Mersenne Prime Search > PrimeNet > GPU to 72

2020-03-15, 21:37   #4797
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

10C3₁₆ Posts

I can get 2 sessions each day, but only once, late in the evening.
Otherwise as soon as I start 1 session it won't let me start another.

Interestingly, for this 1 session I can start the GPU72 session without first starting the "tunnel" session.
It gives the message "No GPU available" still but lets the CPU code run the P1.

2020-03-15, 22:09   #4798
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5·7·257 Posts

Quote:
Originally Posted by petrw1
Interestingly, for this 1 session I can start the GPU72 session without first starting the "tunnel" session.

You keep mentioning the "Tunnel" session. Are you running an Instance Root reverse-tunnel Section? It's not needed (but fun for the pretty graphs and other data).

2020-03-15, 22:24   #4799
petrw1
1976 Toyota Corona years forever!
 
"Wayne"
Nov 2006
Saskatchewan, Canada

7×613 Posts

Quote:
Originally Posted by chalsall
You keep mentioning the "Tunnel" session. Are you running an Instance Root reverse-tunnel Section? Not needed (but fun for the pretty graphs and other data).

I didn't realize the rules had changed since I started last fall.
1. Start tunnels: sshd.pl
2. Run bootstrap.pl

If step 1 is no longer required why can I not get a GPU without it?

2020-03-15, 22:31   #4800
James Heinrich
 
"James Heinrich"
May 2004
ex-Northern Ontario

2×3³×53 Posts

Quote:
Originally Posted by petrw1
If step 1 is no longer required why can I not get a GPU without it?

I just open https://colab.research.google.com/gi...gpu72_tf.ipynb, plop in my NAK, and click Play, and I have no trouble getting a GPU (most of the time).

2020-03-15, 22:31   #4801
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

5×7×257 Posts

Quote:
Originally Posted by petrw1
I didn't realize the rules changed since I started last fall. ... If step 1 is no longer required why can I not get a GPU without it?

The sshd.pl Section has /never/ been needed for the GPU72_TF Notebook. It's more of a developer's tool.

I have no idea why you're noticing that correlation. But it shouldn't be causal.

Once you're given a Session (read: Connect to a Backend) you'll have a GPU, or you won't. Running an SSH Section won't magically attach you to a GPU.
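
For anyone who wants to confirm which case they landed in, here is a minimal sketch of a check that could be run inside a session. It is not part of the GPU72 Notebook; it assumes only that a backend with a GPU has the standard `nvidia-smi` tool on its path, and the function name `gpu_attached` is made up for illustration.

```python
import shutil
import subprocess


def gpu_attached(run=subprocess.run, which=shutil.which):
    """Return True if nvidia-smi exists and lists at least one GPU.

    `run` and `which` are injectable only so the check can be exercised
    on a machine that has no GPU at all.
    """
    if which("nvidia-smi") is None:
        return False  # tool not installed: treat as "no GPU"
    try:
        # "nvidia-smi -L" prints one line per device, e.g. "GPU 0: Tesla K80 (UUID: ...)"
        result = run(["nvidia-smi", "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.SubprocessError):
        return False
    listed = any(line.startswith("GPU ") for line in result.stdout.splitlines())
    return result.returncode == 0 and listed


if __name__ == "__main__":
    print("GPU available" if gpu_attached() else "No GPU available")
```

Whether or not an SSH Section is running has no effect on the result of a check like this.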

2020-03-17, 04:05   #4802
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

2164₁₆ Posts

Hey Chris, I just upgraded to the new colab script yesterday, the one which uses the CPU too, and there seems to be a bug with reporting results for CPU.

First, I got a P-1 starting at 43% of Stage 1 (??). As I hadn't done any P-1 before (this is a new "notebook" with the ID starting with "b535..."), I assumed that you save the intermediate (full) residues from time to time, just in case colab decides to kick someone's ass unexpectedly, and then you resume next time. But passing me another guy's work (I assume you do it vice versa too?) is wrong, somehow, because assuming I can finish it, I would get the credit for it, therefore robbing the person who did the first 43% of the work. You should keep a record and assign the continuation of the work only to the user who did the first part of it. Not that I complain too much about free resources given to us by Google...

Secondly, colab indeed kicked me off before I succeeded in finishing Stage 1 of that P-1 (last time at almost 98%). When I resumed (starting a new session) I got the same exponent, but.... starting at 43%. I am already doing this for the third time.

"102986021 P-1 77 46.23% Stage: 1 complete."

(The column is confusing there: it looks like Stage 1 is complete, but it is not. The message says that "46% of Stage 1 is complete"; it would be better displayed as "Stage 1 complete: 46.xx%". But this is minor; my pain in the butt is now repeating the same work over and over, with no progress. Am I doing something wrong? Do I need to use some "persistent" storage/drive on my side of colab/google_drive/whatever?)
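
To make the display suggestion concrete, here is a small sketch of the proposed ordering. It is only an illustration, not the Notebook's actual reporting code; the function and parameter names are invented, and the meaning of the third column ("77") is not stated in this thread, so it is passed through unchanged.

```python
def format_progress(exponent, work_type, third_column, stage, percent):
    """Build a status line with the stage label placed before the percentage.

    `third_column` is carried over as-is from the existing display; what it
    represents is an unknown here, so no interpretation is attempted.
    """
    return f"{exponent} {work_type} {third_column} Stage {stage} complete: {percent:.2f}%"


if __name__ == "__main__":
    print(format_progress(102986021, "P-1", 77, 1, 46.23))
    # prints: 102986021 P-1 77 Stage 1 complete: 46.23%
```

Putting the percentage last means the line can no longer be read as "Stage 1 complete" on its own.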

Last fiddled with by LaurV on 2020-03-17 at 04:13

2020-03-17, 06:48   #4803
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

2²·2,137 Posts

Ok, today it seems I got a better CPU (?!?), because after 5 hours it finished Stage 1, tried an unsuccessful GCD, and moved to Stage 2, which is now ~5.5% done. If the instance is killed at 10 hours as expected (or before), it is clear that it won't finish and report in time.

I just backed up the checkpoint files; in case it crashes I will finish it locally, to avoid doing the same work over and over.
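
A manual backup like that can be scripted. The sketch below copies matching files from a working directory to a backup directory. It rests on assumptions: the directory paths are whatever you pass in, and the default file pattern `m*` is a placeholder, since the thread does not say what the checkpoint files are called.

```python
import shutil
from pathlib import Path


def backup_checkpoints(work_dir, backup_dir, pattern="m*"):
    """Copy files matching `pattern` from work_dir into backup_dir.

    Returns the sorted list of file names that were copied. The default
    pattern is a guess; adjust it to the real checkpoint file names.
    """
    work_dir, backup_dir = Path(work_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(work_dir.glob(pattern)):
        if src.is_file():
            shutil.copy2(src, backup_dir / src.name)  # copy2 also keeps timestamps
            copied.append(src.name)
    return copied
```

Pointing `backup_dir` at storage that outlives the instance is what makes the copy useful after a crash.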

What's the plan B? (you see, we didn't really keep in touch with new "inventions" you did there, and most probably we are doing something wrong...)

Last fiddled with by LaurV on 2020-03-17 at 06:49

2020-03-17, 17:04   #4804
chalsall
If I May
 
"Chris Halsall"
Sep 2002
Barbados

2323₁₆ Posts

Quote:
Originally Posted by LaurV
What's the plan B? (you see, we didn't really keep in touch with new "inventions" you did there, and most probably we are doing something wrong...)

OK... I'm /stupidly/ busy at the moment. Getting a company ready to work 100% remotely...

But this should all be sane; many people are using it successfully, including my seven instances running the exact same code as everyone else.

To be clear... The P-1 checkpoint files should be thrown back to the server every ten minutes during the entire run(s). If an instance dies, the last checkpoint is sent out to the next requested instance (that you own, of course).
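
The flow described above can be sketched roughly as follows. This is a toy, in-memory model and not the GPU72 server code; the class, method, and owner names are all hypothetical. It shows only the two rules from the description: upload on a ten-minute interval, and hand a checkpoint back only to an instance with the same owner.

```python
import time

CHECKPOINT_INTERVAL_S = 600  # ten minutes, as stated above


class CheckpointStore:
    """In-memory stand-in for the server side (hypothetical)."""

    def __init__(self):
        self._latest = {}  # (owner, exponent) -> (timestamp, payload)

    def upload(self, owner, exponent, payload, now=None):
        """Keep only the newest checkpoint per (owner, exponent)."""
        ts = time.time() if now is None else now
        self._latest[(owner, exponent)] = (ts, payload)

    def resume_for(self, owner):
        """Return (exponent, payload) of this owner's newest checkpoint, or None."""
        owned = [(ts, exp, data)
                 for (o, exp), (ts, data) in self._latest.items() if o == owner]
        if not owned:
            return None
        _, exp, data = max(owned, key=lambda item: item[0])
        return exp, data


def should_upload(last_upload, now, interval=CHECKPOINT_INTERVAL_S):
    """True once at least `interval` seconds have passed since the last upload."""
    return now - last_upload >= interval
```

Under this model, an instance that dies loses at most one interval of work, and a new instance never receives a checkpoint that belongs to a different owner.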

If you PM me the exponent in question, I can examine the logs and the checkpoint files themselves.

2020-03-17, 19:35   #4805
Uncwilly
6809 > 6502
 
"""""""""""""""""""
Aug 2003
101×103 Posts

2²×3×11×61 Posts

Quote:
Originally Posted by chalsall
OK... I'm /stupidly/ busy at the moment. Getting a company ready to work 100% remotely...

Bless you my son. You are doing work that is vital to keeping the world safe. It will be transparent to most people. But we here know that bits don't move by themselves.

2020-03-18, 04:56   #4806
LaurV
Romulan Interpreter
 
Jun 2011
Thailand

2²×2,137 Posts

Quote:
Originally Posted by chalsall
But this should all be sane; <...>
If you PM me the exponent in question <...>

The exponent was in the first post. It resumed Stage 2 at ~54% normally today, after last night's kick-off, with only about 7-8 minutes of work lost (it seems it is as you said, with a checkpoint interval of around 10 minutes). You don't need to do anything, but if you have time, you can check the fact that we (colab) did Stage 1 more than once from ~43% to ~9x% (assuming the reports reached your server, but they probably did, because TF was reported normally all this time).


Edit: We manually stopped and restarted everything after some time, because we were not satisfied with the K80 we got for TF, and the P-1 Stage 2 resumed again, normally (61%). We are good here.

Last fiddled with by LaurV on 2020-03-18 at 05:06

2020-03-19, 16:18   #4807
Uncwilly
6809 > 6502
 
"""""""""""""""""""
Aug 2003
101×103 Posts

1111101110100₂ Posts

MOD NOTE: BOINC related posts moved here:
https://www.mersenneforum.org/showthread.php?t=25383

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.