mersenneforum.org GPU to 72 status...

2020-03-24, 23:36   #4830
chalsall
If I May

"Chris Halsall"
Sep 2002

5·7·257 Posts

Quote:
 Originally Posted by linament Not a big deal, but I thought I would let you know.
Ah... Thanks. Stupid Programmer Error -- I thought I had trapped for that. Fixed.

 2020-03-25, 10:26 #4831 LaurV Romulan Interpreter     Jun 2011 Thailand 2²·2,137 Posts

Hey Chris, can you spin a "colab toy" that launches two copies of mfaktc when a K80 is detected? I am pretty sure we are only using half of it on Colab. Of course, this has yet to be tested, but it seems that for the P4 and P100 we get about 95%-110% of the theoretical performance (probably explainable by their clocks not being standard), while for the T4 and K80 we get less. One Colab T4 only gives us about 65% of the theoretical performance, while the K80 is capped at about 45%. This also matches James' tables (well... somewhat).

I don't know what the issue is with the T4 (it may indeed be running underclocked on Colab's servers, or something else may be going on that we don't know about), but for the K80 one explanation may be the "dual chip" design. So I assume we only use half of it (or only half is made available by Colab?). Could you try to play with it? Two folders, "-d 0", "-d 1", whatever (I don't know how that goes under Linux). Trying two instances would be interesting. In the worst case we get half the speed in each instance, and we learn that more is not possible... But in the best case we may gain a few percent more GHzDays/Day (up to 100% in the best case). Last fiddled with by LaurV on 2020-03-25 at 10:29
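
LaurV's two-instance idea can be sketched as a small launcher. This is a hypothetical dry run only: it prints the commands it would execute rather than running them, and the per-device folder layout and binary name are my assumptions, not anything confirmed in the thread.

```shell
#!/bin/sh
# Hypothetical dry run of LaurV's "two copies of mfaktc" idea: one
# instance per CUDA device, each in its own folder. Folder names and
# the binary name are assumptions; this only prints the commands.
NUM_DEVICES=2   # a fully exposed K80 presents two CUDA devices
d=0
while [ "$d" -lt "$NUM_DEVICES" ]; do
    echo "cd ~/mfaktc$d && ./mfaktc -d $d >> mfaktc.log 2>&1 &"
    d=$((d + 1))
done
```

Whether Colab exposes a second device at all is exactly the open question kriesel answers in the next post.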
2020-03-25, 13:52   #4832
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3,851 Posts

Quote:
 Originally Posted by LaurV I assume we only use half of it (or only half is made available by colab?).
Running nvidia-smi in Colab (in a script not using chalsall's code) shows only one GPU device available. This is true regardless of GPU model. I think it was established early on that only one half of the physical dual-GPU card is made available by Colab in the VM, just as only one CPU core (with HT) is. The following nvidia-smi output is obtained about 12 seconds after the mfaktc run is launched as a background process, to give it time to get going and show power and memory utilization from a run in progress.
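The launch-then-sample pattern described above might look like the following sketch. The mfaktc invocation is a placeholder, and nvidia-smi is guarded so the script degrades gracefully on a machine without a GPU.

```shell
#!/bin/sh
# Sketch of the launch-then-sample pattern: start the worker in the
# background, wait for it to ramp up, then take one nvidia-smi snapshot.
# The worker invocation is hypothetical; the post waited ~12 s.
sample_gpu() {
    sleep "${1:-12}"   # ramp-up delay before taking the snapshot
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi     # power/memory/utilization snapshot
    else
        echo "nvidia-smi unavailable (not a GPU VM)"
    fi
}
( ./mfaktc -d 0 2>/dev/null || true ) &   # placeholder background worker
sample_gpu 1                               # shortened delay for illustration
wait
```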

K80 before background process launch, during Colab script startup:
Code:
Mon Feb 17 14:56:29 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Gpuowl P-1 run on a K80 (might still have been ramping up; note the 0% GPU utilization indicated):
Code:
Sun Mar 15 19:04:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    69W / 149W |     69MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
For comparison, gpuowl P-1 runs on other models:
Code:
Fri Mar 13 08:51:49 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    39W /  75W |   1111MiB /  7611MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+
Code:
Mon Mar 23 16:57:09 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0   148W / 250W |  16183MiB / 16280MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
Code:
Sat Feb 29 19:47:48 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    64W /  70W |   2517MiB / 15079MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
I'll increase the 12 seconds and see what shows up in the logs. It could take weeks to get another try on a K80. Script is a version of the first attachment at https://www.mersenneforum.org/showpo...5&postcount=16

edit: On a different account, which had already been running with an 18-second sleep, I found one instance of this for a gpuowl P-1 at ~97M. Judging by the memory usage, that is P-1 stage 2:
Code:
Tue Mar 10 13:43:36 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0   147W / 149W |  11406MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Last fiddled with by kriesel on 2020-03-25 at 14:20

 2020-03-27, 18:49 #4833 mrk74   Jan 2020 2×3×5 Posts I haven't gotten a GPU in a few days. I've had to use a TPU instead; that connects pretty much right away every time.
2020-03-27, 19:03   #4834
chalsall
If I May

"Chris Halsall"
Sep 2002

5×7×257 Posts

Quote:
 Originally Posted by mrk74 I haven't gotten GPU in a few days. I've had to use TPU. That connects pretty much right away every time.
I've been consistently getting a GPU once per day on each of my eight (8) front ends (I added another one to one of my VPN'ed virtual humans as a test), each session lasting between 7 and 7.5 hours.

I've found that Colab seems to settle on this kind of allotment within a day or two. Interestingly, each front end is given a GPU at approximately the same time of the day for each individual (Gmail) account.

Further, I've found that when I'm given a GPU, if I get a K80 or a P4 I can do a "Factory Reset" and after two to five attempts, I will be given a T4 or a P100.
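The keep-or-reset rule chalsall describes might be sketched as below. In a live session the model name would come from `nvidia-smi --query-gpu=name --format=csv,noheader`; here a sample value is passed in so the snippet runs anywhere, and the function name is my invention.

```shell
#!/bin/sh
# Sketch of the keep-or-reset decision: keep the session only on the
# faster models (T4/P100); otherwise do a Factory Reset and try again.
decide_session() {
    case "$1" in
        *T4*|*P100*) echo "keep session" ;;
        *)           echo "factory-reset and retry" ;;
    esac
}
decide_session "Tesla K80"   # sample value; see lead-in for the real query
```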

 2020-03-27, 21:24 #4835 Uncwilly 6809 > 6502     """"""""""""""""""" Aug 2003 1111101110100₂ Posts For the GPU72 implementation I keep getting sessions that want to do P-1 on the same exponent at the same time. I've noticed this several times. Today two sessions were working on 100982867 at the same time.
2020-03-27, 21:45   #4836
chalsall
If I May

"Chris Halsall"
Sep 2002

5×7×257 Posts

Quote:
 Originally Posted by Uncwilly Today two sessions were working on 100982867 at the same time.
Hmmm... This should really be over on the GPU72 Status thread, but...

I see this candidate was assigned to you at 19:15 and then again at 21:06 (UTC). The checkpoint file issued was the same for both. Did the first run actually run for more than ten minutes? I see the second run only lasted for about 22 minutes.

Please let me know. If it did actually run, the only explanation would be that the "apt install cron" didn't "take", and thus the "cpoints.pl" script wasn't being launched by the crontab entry.

And, of course, I've never seen this before myself... Has anyone else noticed this kind of behavior? The code hasn't changed for a couple of weeks (not that that necessarily means it's entirely sane).

2020-03-27, 22:06   #4837
James Heinrich

"James Heinrich"
May 2004
ex-Northern Ontario

2×3³×53 Posts

Quote:
 Originally Posted by chalsall Has anyone else noticed this kind of behavior?
I haven't, but LaurV reported something similar 10 days ago:
Quote:
 Originally Posted by LaurV Colab indeed kicked me off before it succeeded in finishing Stage 1 of that P-1 (last time at almost 98%). When resumed (starting a new session) I am getting the same exponent, but... starting at 43%. I am already doing this for the third time.

2020-03-27, 22:21   #4838
Uncwilly
6809 > 6502

"""""""""""""""""""
Aug 2003
2²×3×11×61 Posts

Quote:
 Originally Posted by chalsall Hmmm... This should really be over on the GPU72 Status thread, but...
For some reason I didn't seem to find the right one. I will look later and move the posts.

Quote:
 I see this candidate was assigned to you at 19:15 and then again at 21:06 (UTC). The checkpoint file issued was the same for both. Did the first run actually run for more than ten minutes? I see the second run only lasted for about 22 minutes. Please let me know. If it did actually run, the only explanation would be that the "apt install cron" didn't "take", and thus the "cpoints.pl" script wasn't being launched by the crontab entry. And, of course, I've never seen this before myself... Has anyone else noticed this kind of behavior? The code hasn't changed for a couple of weeks (not that that necessarily means it's entirely sane).
I killed the run that started second. The other one is still up.
Code:
20200327_221830 ( 3:11): 100982867 P-1    77   65.02%  Stage: 2   complete. Time: 411.603 sec.
I have seen this happen at least 2 times before.

2020-03-27, 22:41   #4839
petrw1
1976 Toyota Corona years forever!

"Wayne"
Nov 2006

7×613 Posts

Quote:
 Originally Posted by chalsall Further, I've found that when I'm given a GPU, if I get a K80 or a P4 I can do a "Factory Reset" and after two to five attempts, I will be given a T4 or a P100.
5 restarts in a row, all P4.
Bad luck... or do I need to wait a few minutes between restarts?

2020-03-27, 22:47   #4840
chalsall
If I May

"Chris Halsall"
Sep 2002

5·7·257 Posts

Quote:
 Originally Posted by Uncwilly The other one is still up. Code: 20200327_221830 ( 3:11): 100982867 P-1 77 65.02% Stage: 2 complete. Time: 411.603 sec. I have seen this happen at least 2 times before.
OK, thanks very much for the report. I am ***not*** seeing any CP files from the first instance, which can only be explained by the cron sub-system not being installed.

I'll look at making this more resilient. Perhaps have the Checkpointer script also be launched from the CPU Payload script, as well as collect some debugging information as to whether the apt install actually worked.

Interesting... This is new(ish) behaviour. And/or an extremely rare edge-case.
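
The belt-and-braces idea above might look like this sketch: prefer the crontab entry, but have the CPU payload fall back to its own loop if the cron daemon never came up. Only the `cpoints.pl` name comes from the thread; the 600-second period, paths, and function name are assumptions.

```shell
#!/bin/sh
# Sketch of the resilience idea: detect whether "apt install cron" took,
# and if not, have the payload run the checkpointer itself.
launch_checkpointer() {
    if command -v cron >/dev/null 2>&1 && pgrep -x cron >/dev/null 2>&1; then
        echo "cron is running: rely on the crontab entry for cpoints.pl"
    else
        # cron didn't install or start: run the checkpointer ourselves
        echo "cron missing: would loop 'perl cpoints.pl; sleep 600' instead"
    fi
}
launch_checkpointer
```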

