mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU to 72 (https://www.mersenneforum.org/forumdisplay.php?f=95)
-   -   GPU to 72 status... (https://www.mersenneforum.org/showthread.php?t=16263)

chalsall 2020-03-24 23:36

[QUOTE=linament;540806]Not a big deal, but I thought I would let you know.[/QUOTE]

Ah... Thanks. Stupid Programmer Error -- I thought I had trapped for that. Fixed.

LaurV 2020-03-25 10:26

Hey Chris, can you spin a "colab toy" that launches [U]two[/U] copies of mfaktc when a K80 is detected? I am pretty sure we are only using half of it on colab. Of course, this has yet to be tested, but it seems that for P4 and P100 we get about 95%-110% of the theoretical performance (probably explainable by their clocks not being standard), while for T4 and K80 we get less. One colab T4 only gives us about 65% of the theoretical performance, while the K80 is capped at about 45%. This also matches James' tables (well... somehow). I don't know the issue with the T4 (it may indeed be running underclocked in colab's servers, or something else may be taking place that we don't know about), but for the K80 one explanation may be the "dual chip". So, I assume we only use half of it (or only half is made available by colab?). Could you try to play with it? Two folders, "-d 0", "-d 1", whatever (I don't know how that goes under linux). Trying two instances would be interesting. In the worst case, we get half of the speed in each instance and we learn that more is not possible... But in the best case we may gain a few percent of GHzDays/Day more (up to 100% in the best case).
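The "two folders, -d 0 / -d 1" idea can be sketched as a small dry-run launcher. This is only an illustration of the suggestion, not chalsall's actual script: the directory names and the mfaktc binary path are assumptions, and on Colab `nvidia-smi` would first have to expose more than one device for it to matter.

```shell
#!/bin/bash
# Hypothetical dry-run sketch of LaurV's suggestion: one mfaktc instance
# per visible CUDA device, each in its own directory so worktodo.txt and
# checkpoint files don't collide.  Directory names and the mfaktc binary
# path are assumptions, not taken from the real Colab payload.
launch_cmds() {            # $1 = number of visible CUDA devices
  local d
  for d in $(seq 0 $(( $1 - 1 ))); do
    mkdir -p "gpu$d"
    # -d pins an mfaktc instance to one device; echoed here as a dry run
    echo "(cd gpu$d && ./mfaktc.exe -d $d &)"
  done
}
launch_cmds 2    # a K80 would need Colab to expose both halves first
```

Dropping the `echo` would actually start the two background instances; as the post says, the worst case is simply half speed in each.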

kriesel 2020-03-25 13:52

[QUOTE=LaurV;540839]I assume we only use half of it (or only half is made available by colab?).[/QUOTE]Running nvidia-smi in Colab (in a script not using chalsall's code) shows only one GPU device available, regardless of GPU model. I think it was established early on that only one half of the physical dual-GPU card is made available by Colab in the VM, just as only one CPU core (with HT) is. The following nvidia-smi output is obtained about 12 seconds after the mfaktc run is launched as a background process, to give it time to get going and show power and memory utilization from a run in progress.

K80 before background process launch, during Colab script startup:
[CODE]Mon Feb 17 14:56:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+[/CODE]
Gpuowl P-1 run on K80 (might have still been ramping up; note the GPU utilization of 0%):
[CODE]Sun Mar 15 19:04:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    69W / 149W |     69MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+[/CODE]
For comparison, gpuowl P-1 runs on other models:
[CODE]Fri Mar 13 08:51:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    39W /  75W |    1111MiB / 7611MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+[/CODE]
[CODE]Mon Mar 23 16:57:09 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0   148W / 250W |  16183MiB / 16280MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+[/CODE]
[CODE]Sat Feb 29 19:47:48 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    64W /  70W |   2517MiB / 15079MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+[/CODE]
I'll increase the 12 seconds and see what shows up in the logs. It could take weeks to get another try on a K80. The script is a version of the first attachment at [URL]https://www.mersenneforum.org/showpost.php?p=537155&postcount=16[/URL]


edit: On a different account, which had already been running with an 18-second sleep, I found one instance of this for a gpuowl P-1 at ~97M. Judging by the memory usage, that is P-1 stage 2:
[CODE]Tue Mar 10 13:43:36 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0   147W / 149W |  11406MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+[/CODE]

mrk74 2020-03-27 18:49

I haven't gotten a GPU in a few days. I've had to use TPU. That connects pretty much right away every time.

chalsall 2020-03-27 19:03

[QUOTE=mrk74;541087]I haven't gotten a GPU in a few days. I've had to use TPU. That connects pretty much right away every time.[/QUOTE]

I've been consistently getting a GPU once per day across each of my eight (8) front ends (I added another one to one of my VPN'ed virtual humans as a test), each session lasting between 7 and 7.5 hours.

I've found that Colab seems to settle on this kind of allotment within a day or two. Interestingly, each front end is given a GPU at approximately the same time of the day for each individual (Gmail) account.

Further, I've found that when I'm given a GPU, if I get a K80 or a P4 I can do a "Factory Reset" and after two to five attempts, I will be given a T4 or a P100.

Uncwilly 2020-03-27 21:24

For the GPU72 implementation I keep getting sessions that want to do P-1 on the same exponent at the same time. I have noticed this several times. Today, two sessions were working on 100982867 at the same time.

chalsall 2020-03-27 21:45

[QUOTE=Uncwilly;541100]Today, two sessions were working on 100982867 at the same time.[/QUOTE]

Hmmm... This should really be over on the GPU72 Status thread, but...

I see this candidate was assigned to you at 19:15 and then again at 21:06 (UTC). The checkpoint file issued was the same for both. Did the first run actually run for more than ten minutes? I see the second run only lasted for about 22 minutes.

Please let me know. If it did actually run, the only explanation would be that the "apt install cron" didn't "take", and thus the "cpoints.pl" script wasn't being launched by the crontab entry.

And, of course, I've never seen this before myself... Has anyone else noticed this kind of behavior? The code hasn't changed for a couple of weeks (not that that necessarily means it's entirely sane).
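The moving parts being debugged here are small. A hypothetical cron entry of the sort described would look like the fragment below; the script name comes from the thread, but the schedule and path are assumptions:

```shell
# /etc/cron.d/cpoints -- hypothetical reconstruction; the 5-minute
# schedule and the /content path are assumptions, not from the payload.
# If "apt install cron" silently fails, this entry never fires and no
# checkpoint files get reported -- the failure mode suspected above.
*/5 * * * * root /content/cpoints.pl
```

This is a config fragment, not runnable on its own; in the Colab VM it would require `apt install cron` to have succeeded and the cron daemon to be running.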

James Heinrich 2020-03-27 22:06

[QUOTE=chalsall;541103]Has anyone else noticed this kind of behavior?[/QUOTE]I haven't, but LaurV reported something similar 10 days ago:[QUOTE=LaurV;539899]colab indeed kicked me off before succeeding in finishing the Stage 1 of that P-1 (last time at almost 98% :rant:). When resumed (starting new session) I am getting the same exponent, but.... starting at 43%. I am already doing this third time.[/QUOTE]

Uncwilly 2020-03-27 22:21

[QUOTE=chalsall;541103]Hmmm... This should really be over on the GPU72 Status thread, but...[/quote]For some reason I didn't seem to find the right one. I will look later and move the posts.

[quote]I see this candidate was assigned to you at 19:15 and then again at 21:06 (UTC). The checkpoint file issued was the same for both. Did the first run actually run for more than ten minutes? I see the second run only lasted for about 22 minutes.

Please let me know. If it did actually run, the only explanation would be that the "apt install cron" didn't "take", and thus the "cpoints.pl" script wasn't being launched by the crontab entry.

And, of course, I've never seen this before myself... Has anyone else noticed this kind of behavior? The code hasn't changed for a couple of weeks (not that that necessarily means it's entirely sane).[/QUOTE]I killed the run that started second. The other one is still up.
[CODE]20200327_221830 ( 3:11): 100982867 P-1 77 65.02% Stage: 2 complete. Time: 411.603 sec.[/CODE] I have seen this happen at least 2 times before.

petrw1 2020-03-27 22:41

[QUOTE=chalsall;541088]Further, I've found that when I'm given a GPU, if I get a K80 or a P4 I can do a "Factory Reset" and after two to five attempts, I will be given a T4 or a P100.[/QUOTE]

Five restarts in a row, all P4.
Bad luck... or do I need to wait a few minutes between restarts?

chalsall 2020-03-27 22:47

[QUOTE=Uncwilly;541105]The other one is still up.
[CODE]20200327_221830 ( 3:11): 100982867 P-1 77 65.02% Stage: 2 complete. Time: 411.603 sec.[/CODE] I have seen this happen at least 2 times before.[/QUOTE]

OK, thanks very much for the report. I am ***not*** seeing any CP files from the first instance, which can only be explained by the cron sub-system not being installed.

I'll look at making this more resilient. Perhaps have the Checkpointer script also be launched from the CPU Payload script, as well as collect some debugging information as to whether the "apt install" actually works.

Interesting... This is new(ish) behaviour. And/or an extremely rare edge-case.

