
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

petrw1 2020-03-15 21:37

I can get 2 sessions each day, but only once, late in the evening.
Otherwise, as soon as I start 1 session it won't let me start another.

Interestingly, for this 1 session I can start the GPU72 session without first starting the "tunnel" session.
It still gives the "No GPU available" message but lets the CPU code run the P-1.

chalsall 2020-03-15 22:09

[QUOTE=petrw1;539807]Interestingly, for this 1 session I can start the GPU72 session without first starting the "tunnel" session.[/QUOTE]

You keep mentioning the "Tunnel" session. Are you running an Instance Root reverse-tunnel Section? Not needed (but fun for the pretty graphs and other data).

petrw1 2020-03-15 22:24

[QUOTE=chalsall;539810]You keep mentioning the "Tunnel" session. Are you running an Instance Root reverse-tunnel Section? Not needed (but fun for the pretty graphs and other data).[/QUOTE]

I didn't realize the rules changed since I started last fall.
1. Start tunnels: sshd.pl
2. Run bootstrap.pl

If step 1 is no longer required why can I not get a GPU without it?

James Heinrich 2020-03-15 22:31

[QUOTE=petrw1;539812]If step 1 is no longer required why can I not get a GPU without it?[/QUOTE]I just open [url]https://colab.research.google.com/github/chalsall/GPU72_CoLab/blob/master/gpu72_tf.ipynb[/url] plop in my NAK and click Play, and I have no trouble getting a GPU (most of the time).

chalsall 2020-03-15 22:31

[QUOTE=petrw1;539812]I didn't realize the rules changed since I started last fall. ... If step 1 is no longer required why can I not get a GPU without it?[/QUOTE]

The sshd.pl Section has /never/ been needed for the GPU72_TF Notebook. It's more of a developer's tool.

I have no idea why you're noticing that correlation. But it shouldn't be causal.

Once you're given a Session (read: Connect to a Backend) you'll have a GPU, or you won't. Running an SSH Section won't magically attach you to a GPU.
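Whether a GPU actually got attached can be checked directly from a notebook cell. The sketch below is my own illustration, not part of the GPU72 notebook; it parses the output of `nvidia-smi -L`:

```python
import subprocess

def parse_gpu_list(smi_output):
    """Extract GPU model names from `nvidia-smi -L` lines such as
    'GPU 0: Tesla K80 (UUID: GPU-...)'."""
    names = []
    for line in smi_output.splitlines():
        if line.startswith("GPU ") and ":" in line:
            # Keep the model name, drop the index and the UUID suffix.
            names.append(line.split(":", 1)[1].split("(")[0].strip())
    return names

def attached_gpus():
    """Return the GPUs visible to this backend, or [] if none."""
    try:
        out = subprocess.run(["nvidia-smi", "-L"],
                             capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []  # no driver on this backend: a CPU-only session
    return parse_gpu_list(out.stdout)
```

On a CPU-only backend `attached_gpus()` returns an empty list, which matches the "No GPU available" message petrw1 describes above.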

LaurV 2020-03-17 04:05

Hey Chris, I just upgraded to the new colab script yesterday, the one which uses the CPU too, and there seems to be a bug with reporting results for CPU.

First, I got a P-1 starting at 43% of Stage 1 (??). As I hadn't done any P-1 before (this is a new "notebook" with the ID starting with "b535..."), I assumed that you save the intermediate (full) residues from time to time, just in case colab decides to kick someone's ass unexpectedly, and then resume next time. But passing me the other guy's work (I assume you do it vice versa too?) is somehow wrong, because assuming I can finish it, I would get the credit for it, thereby robbing the person who did the first 43% of the work. You should keep a record and assign the continuation of the work only to the user who did the first part. Not that I complain too much about free resources given to us by Google...

Secondly, colab indeed kicked me off before I succeeded in finishing Stage 1 of that P-1 (last time at almost 98% :rant:). When I resume (starting a new session) I get the same exponent, but.... starting at 43%. I am already doing this for the third time.

"102986021 P-1 77 46.23% Stage: 1 complete."

(The column is confusing there; it looks like Stage 1 is complete, but it is not. The message means "46% of Stage 1 is complete"; it would be better displayed as "Stage 1: 46.xx% complete". But this is minor; my pain in the butt now is repeating the same work over and over, with no progress. Am I doing something wrong? Do I need to use some "persistent" storage/drive on my side of colab/google_drive/whatever?)
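For what it's worth, the clearer wording asked for above is a one-line formatting change. The function below is a hypothetical sketch, not the notebook's actual code, and the field names are guesses from the quoted status line:

```python
def format_p1_status(exponent, tf_bits, stage, pct):
    """Render a P-1 progress line so the percentage clearly refers
    to progress *within* the stage, not to stage completion."""
    return "{} P-1 {} Stage {}: {:.2f}% complete".format(
        exponent, tf_bits, stage, pct)

print(format_p1_status(102986021, 77, 1, 46.23))
# prints "102986021 P-1 77 Stage 1: 46.23% complete"
```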

LaurV 2020-03-17 06:48

Ok, today it seems I got a better CPU (?!?), because after 5 hours it finished Stage 1, tried an unsuccessful GCD, and moved to Stage 2, which is now ~5.5% done. If the instance is killed at 10 hours as expected (or before), it is clear that it won't finish and report in time.

I just backed up the checkpoint files; in case it crashes I will finish it locally, to avoid doing the same work over and over.

What's the plan B? (you see, we didn't really keep in touch with new "inventions" you did there, and most probably we are doing something wrong...)

chalsall 2020-03-17 17:04

[QUOTE=LaurV;539905]What's the plan B? (you see, we didn't really keep in touch with new "inventions" you did there, and most probably we are doing something wrong...)[/QUOTE]

OK... I'm /stupidly/ busy at the moment. Getting a company ready to work 100% remotely...

But this should all be sane; many people are using it successfully, including my seven instances running the exact same code as everyone else.

To be clear... The P-1 checkpoint files should be thrown back to the server every ten minutes during the entire run(s). If an instance dies, the last checkpoint is sent out to the next requested instance (that you own, of course).

If you PM me the exponent in question, I can examine the logs and the checkpoint files themselves.
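The ten-minute checkpoint cadence described above can be sketched as a simple timer check inside the work loop. Everything here (the names, the upload callback) is hypothetical illustration, not the notebook's real code:

```python
import time

CHECKPOINT_INTERVAL = 600.0  # ten minutes, per the description above

def should_checkpoint(last_sent, now, interval=CHECKPOINT_INTERVAL):
    """True once `interval` seconds have elapsed since the last upload."""
    return now - last_sent >= interval

def run_with_checkpoints(do_work_slice, upload_checkpoint, is_done):
    """Hypothetical worker loop: after each slice of P-1 work, throw
    the checkpoint file back to the server if ten minutes have passed."""
    last = time.monotonic()
    while not is_done():
        do_work_slice()
        if should_checkpoint(last, time.monotonic()):
            upload_checkpoint()
            last = time.monotonic()
    upload_checkpoint()  # final state, so a successor instance can resume
```

Under this scheme an instance that dies between uploads loses at most ten minutes of work, consistent with the "7-8 minutes of work lost" LaurV reports later in the thread.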

Uncwilly 2020-03-17 19:35

[QUOTE=chalsall;539961]OK... I'm /stupidly/ busy at the moment. Getting a company ready to work 100% remotely...[/QUOTE][SIZE="3"][FONT="Lucida Sans Unicode"][COLOR="Green"][B]Bless you my son. You are doing work that is vital to keeping the world safe. It will be transparent to most people. But we here know that bits don't move by themselves.[/B][/COLOR][/FONT][/SIZE]
:awesome:
:bow wave:

LaurV 2020-03-18 04:56

[QUOTE=chalsall;539961]But this should all be sane; <...>
If you PM me the exponent in question <...>[/QUOTE]
The exponent was in the first post. It resumed Stage 2 at ~54% [U]normally[/U] today, after last night's kick-off, with only about 7-8 minutes of work lost (it seems, as you said, the checkpoint interval is around 10 minutes). You don't need to do anything, but if you have time, you can check the fact that we (colab) did Stage 1 more than once, from ~43% to ~9x% (assuming the reports reached your server, but they probably did, because TF was reported normally all this time).


Edit: We manually stopped and restarted everything after some time, because we were not satisfied with the K80 we got for TF, and the P-1 Stage 2 resumed again, normally (61%). We are good here.

Uncwilly 2020-03-19 16:18

[FONT="Arial Black"][COLOR="Red"][SIZE="3"]MOD NOTE: BOINC related posts moved here:[/SIZE][/COLOR][/FONT]
[url]https://www.mersenneforum.org/showthread.php?t=25383[/url]


All times are UTC.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.