GPU to 72

Old 2020-03-24, 23:36   #4830
chalsall -- If I May ("Chris Halsall", Sep 2002, Barbados)

Quote:
Originally Posted by linament
Not a big deal, but I thought I would let you know.
Ah... Thanks. Stupid Programmer Error -- I thought I had trapped for that. Fixed.
Old 2020-03-25, 10:26   #4831
LaurV -- Romulan Interpreter (Jun 2011, Thailand)

Hey Chris, can you spin up a "colab toy" that launches two copies of mfaktc when a K80 is detected? I am pretty sure we are only using half of it on Colab. Of course, this has yet to be tested, but it seems that for the P4 and P100 we get about 95-110% of the theoretical performance (probably explained by the fact that their clocks are not standard), while for the T4 and K80 we get less. One Colab T4 only gives us about 65% of the theoretical performance, while the K80 is capped at about 45%. This also matches James' tables (well... somehow). I don't know what the issue is with the T4 (it may indeed be running underclocked on Colab's servers, or something else may be going on that we don't know about), but for the K80 one explanation may be the "dual chip". So, I assume we only use half of it (or only half is made available by Colab?).

Could you try to play with it? Two folders, "-d 0", "-d 1", whatever (I don't know how that goes under Linux). Trying two instances would be interesting. In the worst case, we get half the speed in each instance and learn that more is not possible... but in the best case we may gain a few percent more GHzDays/Day (up to 100% more, ideally).
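
Something like this, perhaps (just a sketch of the idea -- I'm assuming the second die actually shows up as device 1, and the folder names are placeholders):
Code:
# Sketch only: one mfaktc instance per K80 die. Each instance needs its
# own folder with its own mfaktc.ini and worktodo.txt.
mkdir -p gpu0 gpu1
cp mfaktc mfaktc.ini gpu0/
cp mfaktc mfaktc.ini gpu1/
(cd gpu0 && ./mfaktc -d 0 > mfaktc.out 2>&1 &)   # first die
(cd gpu1 && ./mfaktc -d 1 > mfaktc.out 2>&1 &)   # second die
If the second instance simply can't open device 1, that answers the question too.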

Last fiddled with by LaurV on 2020-03-25 at 10:29
Old 2020-03-25, 13:52   #4832
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by LaurV
I assume we only use half of it (or only half is made available by colab?).
Running nvidia-smi in Colab (in a script not using chalsall's code) shows only one GPU device available. This is true regardless of GPU model. I think it was established early on that only one half of the physical dual-GPU card is made available by Colab in the VM, just as only one CPU core (with HT) is. The following nvidia-smi output is obtained about 12 seconds after the mfaktc run is launched as a background process, to give it time to get going and show power and memory utilization from a run in progress.
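
(In outline, the launch sequence is roughly this -- a paraphrase of the timing just described, not the actual GPU72 script:)
Code:
# Paraphrase of the timing described above; not the actual script.
./mfaktc > mfaktc.out 2>&1 &   # start the TF run in the background
sleep 12                       # give it time to ramp up
nvidia-smi                     # snapshot power and memory utilization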

K80 before background process launch, during Colab script startup:
Code:
Mon Feb 17 14:56:29 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   68C    P8    33W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
Gpuowl P-1 run on a K80 (it might still have been ramping up; note the indicated GPU utilization of 0%):
Code:
Sun Mar 15 19:04:07 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   50C    P0    69W / 149W |     69MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
For comparison, gpuowl P-1 runs on other models:
Code:
Fri Mar 13 08:51:49 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    39W /  75W |   1111MiB /  7611MiB |     67%      Default |
+-------------------------------+----------------------+----------------------+
Code:
Mon Mar 23 16:57:09 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0   148W / 250W |  16183MiB / 16280MiB |     99%      Default |
+-------------------------------+----------------------+----------------------+
Code:
Sat Feb 29 19:47:48 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.48.02    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P0    64W /  70W |   2517MiB / 15079MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
I'll increase the 12 seconds and see what shows up in the logs. It could take weeks to get another try on a K80. The script is a version of the first attachment at https://www.mersenneforum.org/showpo...5&postcount=16


Edit: on a different account that had already been running with an 18-second sleep, I found one instance of this for a gpuowl P-1 at ~97M. Judging by the memory usage, that is P-1 stage 2:
Code:
Tue Mar 10 13:43:36 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.59       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0   147W / 149W |  11406MiB / 11441MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Last fiddled with by kriesel on 2020-03-25 at 14:20
Old 2020-03-27, 18:49   #4833
mrk74 (Jan 2020)

I haven't gotten a GPU in a few days; I've had to use a TPU. That connects pretty much right away every time.
Old 2020-03-27, 19:03   #4834
chalsall -- If I May ("Chris Halsall", Sep 2002, Barbados)

Quote:
Originally Posted by mrk74
I haven't gotten a GPU in a few days; I've had to use a TPU. That connects pretty much right away every time.
I've been consistently getting a GPU once per day on each of my eight (8) front ends (I added another one to one of my VPN'ed virtual humans as a test), each session lasting between 7 and 7.5 hours.

I've found that Colab seems to settle on this kind of allotment within a day or two. Interestingly, each front end is given a GPU at approximately the same time of day for each individual (Gmail) account.

Further, I've found that when I'm given a K80 or a P4, I can do a "Factory Reset" and, after two to five attempts, be given a T4 or a P100.
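
(For anyone playing the same lottery: a quick way to see what you've been dealt before deciding to reset. This is just a standard nvidia-smi query, nothing GPU72-specific:)
Code:
# Print only the GPU model for the current session:
nvidia-smi --query-gpu=name --format=csv,noheader
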
Old 2020-03-27, 21:24   #4835
Uncwilly -- 6809 > 6502 (Aug 2003)

With the GPU72 implementation, I keep getting sessions that want to do P-1 on the same exponent at the same time. I have noticed this several times. Today two sessions were working on 100982867 at the same time.
Old 2020-03-27, 21:45   #4836
chalsall -- If I May ("Chris Halsall", Sep 2002, Barbados)

Quote:
Originally Posted by Uncwilly
Today two sessions were working on 100982867 at the same time.
Hmmm... This should really be over on the GPU72 Status thread, but...

I see this candidate was assigned to you at 19:15 and then again at 21:06 (UTC). The checkpoint file issued was the same for both. Did the first run actually run for more than ten minutes? I see the second run only lasted for about 22 minutes.

Please let me know. If it did actually run, the only explanation would be that the "apt install cron" didn't "take", and thus the "cpoints.pl" script wasn't being launched by the crontab entry.
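
(For reference, the crontab entry is of this general form -- the interval and path here are illustrative, not the exact production values:)
Code:
# Illustrative /etc/cron.d-style entry; the real interval and path may differ.
*/10 * * * * root /content/cpoints.pl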

And, of course, I've never seen this before myself... Has anyone else noticed this kind of behavior? The code hasn't changed for a couple of weeks (not that that necessarily means it's entirely sane).
Old 2020-03-27, 22:06   #4837
James Heinrich (May 2004, ex-Northern Ontario)

Quote:
Originally Posted by chalsall
Has anyone else noticed this kind of behavior?
I haven't, but LaurV reported something similar 10 days ago:
Quote:
Originally Posted by LaurV
Colab indeed kicked me off before it succeeded in finishing Stage 1 of that P-1 (last time at almost 98%). When I resumed (starting a new session) I got the same exponent, but... starting at 43%. I am already doing this for the third time.
Old 2020-03-27, 22:21   #4838
Uncwilly -- 6809 > 6502 (Aug 2003)

Quote:
Originally Posted by chalsall
Hmmm... This should really be over on the GPU72 Status thread, but...
For some reason I couldn't find the right one. I will look later and move the posts.

Quote:
I see this candidate was assigned to you at 19:15 and then again at 21:06 (UTC). The checkpoint file issued was the same for both. Did the first run actually run for more than ten minutes? I see the second run only lasted for about 22 minutes.

Please let me know. If it did actually run, the only explanation would be that the "apt install cron" didn't "take", and thus the "cpoints.pl" script wasn't being launched by the crontab entry.

And, of course, I've never seen this before myself... Has anyone else noticed this kind of behavior? The code hasn't changed for a couple of weeks (not that that necessarily means it's entirely sane).
I killed the run that started second. The other one is still up.
Code:
20200327_221830 ( 3:11): 100982867 P-1    77   65.02%  Stage: 2   complete. Time: 411.603 sec.
I have seen this happen at least twice before.
Old 2020-03-27, 22:41   #4839
petrw1 -- 1976 Toyota Corona years forever! ("Wayne", Nov 2006, Saskatchewan, Canada)

Quote:
Originally Posted by chalsall
Further, I've found that when I'm given a GPU, if I get a K80 or a P4 I can do a "Factory Reset" and after two to five attempts, I will be given a T4 or a P100.
Five restarts in a row, all P4.
Bad luck... or do I need to wait a few minutes between restarts?
Old 2020-03-27, 22:47   #4840
chalsall -- If I May ("Chris Halsall", Sep 2002, Barbados)

Quote:
Originally Posted by Uncwilly
The other one is still up.
Code:
20200327_221830 ( 3:11): 100982867 P-1    77   65.02%  Stage: 2   complete. Time: 411.603 sec.
I have seen this happen at least 2 times before.
OK, thanks very much for the report. I am ***not*** seeing any CP files from the first instance, which can only be explained by the cron sub-system not being installed.

I'll look at making this more resilient. Perhaps have the Checkpointer script also be launched from the CPU Payload script, and collect some debugging information as to whether the "apt install" actually works.
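
Something along these lines, perhaps (a sketch of the fallback, not the final code):
Code:
# Sketch of the planned fallback: log the apt result, then loop the
# Checkpointer ourselves if the cron daemon never came up.
apt install -y cron > /tmp/apt_cron.log 2>&1
service cron start >> /tmp/apt_cron.log 2>&1
if ! pgrep -x cron > /dev/null; then
    # Assumes cpoints.pl does one upload pass per invocation.
    while true; do ./cpoints.pl; sleep 600; done &
fi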

Interesting... This is new(ish) behaviour. And/or an extremely rare edge-case.