mersenneforum.org > GPU Computing > GpuOwl

2020-06-06, 04:26   #2278
Xyzzy

Quote:
Originally Posted by Xyzzy
We will test CentOS 7 later tonight.
CentOS 7 needs a newer compiler:

g++: error: unrecognized command line option ‘-std=c++17’
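
One way around this on CentOS 7 is a Software Collections devtoolset compiler. Roughly (a sketch - it assumes the standard SCL repo, and devtoolset-8 is just one GCC version new enough for -std=c++17):

Code:
sudo yum install centos-release-scl     # enable the Software Collections repo
sudo yum install devtoolset-8-gcc-c++   # GCC 8, which understands -std=c++17
scl enable devtoolset-8 bash            # shell with the newer g++ on PATH; build gpuowl from here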

2020-06-06, 09:14   #2279
preda

Quote:
Originally Posted by kriesel
No joy there either. Thanks for trying.
Please try the new commit; I hope it's fixed this time.

2020-06-06, 18:20   #2280
kriesel
gpuowl-win: v6.11-314-gde84f41 not yet there

Quote:
Originally Posted by preda
Please try the new commit; I hope it's fixed this time.
Now the same error message appears, at line 199; see the attached build log file.
Attached Files
build-log.txt (10.6 KB)

2020-06-06, 20:17   #2281
ewmayer

OK, here's a bug - last night I noticed the wall wattmeter on the new 3 x R7 build was reading ~250W below its normal 750-800W range ... I checked the status of the 3 cards and saw that device 1 was basically idling, despite 2 active gpuowl processes running on it:
Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%  
0    69.0c  162.0W  1373Mhz  1151Mhz  63.92%  manual  250.0W    4%   100%  
1    36.0c  64.0W   1547Mhz   351Mhz  40.0%   manual  250.0W   99%   100%  
2    76.0c  163.0W  1373Mhz  1151Mhz  62.75%  manual  250.0W    4%   100%
I'd seen similar behavior before on my older Haswell-system R7, where a reboot typically solved the problem. (I've never gotten the ROCm --gpureset -d [device id] option to work ... it just hangs.) But this one persisted through multiple reboot/restart-jobs attempts. A look at the tail ends of the 2 logfiles for the jobs on this GPU revealed the cause:
Code:
2020-06-05 22:20:47 95b2786172df888e 105809299 P2 using 253 buffers of 44.0 MB each
2020-06-05 22:21:27 95b2786172df888e 105809299 P1 GCD: no factor
2020-06-05 22:20:31 95b2786172df888e 105809351 P2 using blocks [33 - 999] to cover 1476003 primes
2020-06-05 22:20:32 95b2786172df888e 105809351 P2 using 342 buffers of 44.0 MB each
Job1's p-1 assignment has just begun stage 2 of the factoring attempt; while firing up stage 2 it also does the stage 1 GCD, which has just completed and reports 'no factor'. Job2 started its own p-1 stage 2 a few minutes before. I'm not sure why the 2 jobs ended up using a different number of buffers (chunks of memory storing small even powers of the stage 1 residue - generally the more of these we can fit in memory the faster stage 2 will be, but there are diminishing returns once we get above ~100 such precomputed powers), but those buffers reveal the bug. Let's see how much memory they represent: (253+342)*44 = 26180 MB, about 26GB, nearly double the 16GB of HBM available on the R7. I'm not sure what ROCm does in such circumstances - does it start swapping to regular RAM, or to disk? - but in any event the result is clear: processing on the card basically comes to a halt. If I hadn't noticed the problem shortly before going to bed, I suspect that card would have idled all night.

To test the hypothesis I killed one of the two p-1 jobs; voila, the wall wattage immediately shot back up into the normal range, and the ROCm display showed the temp and MCLK settings rising back to normal-under-load:
Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%  
0    73.0c  158.0W  1373Mhz  1151Mhz  62.75%  manual  250.0W    4%   100%  
1    61.0c  192.0W  1547Mhz  1151Mhz  38.82%  manual  250.0W   96%   100%  
2    77.0c  159.0W  1373Mhz  1151Mhz  61.96%  manual  250.0W    4%   100%
So that VRAM = 99% in the first display was in fact saying "VRAM is maxed out, we're going on strike". The default stage 2 memalloc of 96% of 16GB is so close to the limit that even a modest additional memalloc puts us over the limit.

Oddly, I'm near-100% sure that I've had two p-1 jobs running on the same card, both in stage 2, before on my Haswell R7 - so I'm surprised to be hitting this issue only now. The fact that job1 above, which started stage 2 after job2 did, only alloc'ed 253 buffers indicates to me that some kind of how-much-HBM-is-available check is being done at runtime, but somehow we still ended up over 16GB. Could two p-1 stage 2s starting on the same card at nearly the same time cause the mem-available computation to be fubar? No - looking at the logs of the 2 jobs, job2 started its stage 2 a full 15 minutes before job1.

Mihai, George, do you have any sense of the stage 2 performance hit from cutting the memalloc to half the current "use all available HBM"? That would mean ~170 buffers for exponents in the above range.

2020-06-06, 21:33   #2282
preda

Quote:
Originally Posted by ewmayer
OK, here's a bug - last night I noticed the wall wattmeter on the new 3 x R7 build was reading ~250W below its normal 750-800W range ... I checked the status of the 3 cards and saw that device 1 was basically idling, despite 2 active gpuowl processes running on it:
Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%  
0    69.0c  162.0W  1373Mhz  1151Mhz  63.92%  manual  250.0W    4%   100%  
1    36.0c  64.0W   1547Mhz   351Mhz  40.0%   manual  250.0W   99%   100%  
2    76.0c  163.0W  1373Mhz  1151Mhz  62.75%  manual  250.0W    4%   100%
I'd seen similar behavior before on my older Haswell-system R7, where a reboot typically solved the problem. (I've never gotten the ROCm --gpureset -d [device id] option to work ... it just hangs.) But this one persisted through multiple reboot/restart-jobs attempts. A look at the tail ends of the 2 logfiles for the jobs on this GPU revealed the cause:
Code:
2020-06-05 22:20:47 95b2786172df888e 105809299 P2 using 253 buffers of 44.0 MB each
2020-06-05 22:21:27 95b2786172df888e 105809299 P1 GCD: no factor
2020-06-05 22:20:31 95b2786172df888e 105809351 P2 using blocks [33 - 999] to cover 1476003 primes
2020-06-05 22:20:32 95b2786172df888e 105809351 P2 using 342 buffers of 44.0 MB each
Job1's p-1 assignment has just begun stage 2 of the factoring attempt; while firing up stage 2 it also does the stage 1 GCD, which has just completed and reports 'no factor'. Job2 started its own p-1 stage 2 a few minutes before. I'm not sure why the 2 jobs ended up using a different number of buffers (chunks of memory storing small even powers of the stage 1 residue - generally the more of these we can fit in memory the faster stage 2 will be, but there are diminishing returns once we get above ~100 such precomputed powers), but those buffers reveal the bug. Let's see how much memory they represent: (253+342)*44 = 26180 MB, about 26GB, nearly double the 16GB of HBM available on the R7. I'm not sure what ROCm does in such circumstances - does it start swapping to regular RAM, or to disk? - but in any event the result is clear: processing on the card basically comes to a halt. If I hadn't noticed the problem shortly before going to bed, I suspect that card would have idled all night.

To test the hypothesis I killed one of the two p-1 jobs; voila, the wall wattage immediately shot back up into the normal range, and the ROCm display showed the temp and MCLK settings rising back to normal-under-load:
Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%  
0    73.0c  158.0W  1373Mhz  1151Mhz  62.75%  manual  250.0W    4%   100%  
1    61.0c  192.0W  1547Mhz  1151Mhz  38.82%  manual  250.0W   96%   100%  
2    77.0c  159.0W  1373Mhz  1151Mhz  61.96%  manual  250.0W    4%   100%
So that VRAM = 99% in the first display was in fact saying "VRAM is maxed out, we're going on strike". The default stage 2 memalloc of 96% of 16GB is so close to the limit that even a modest additional memalloc puts us over the limit.

Oddly, I'm near-100% sure that I've had two p-1 jobs running on the same card, both in stage 2, before on my Haswell R7 - so I'm surprised to be hitting this issue only now. The fact that job1 above, which started stage 2 after job2 did, only alloc'ed 253 buffers indicates to me that some kind of how-much-HBM-is-available check is being done at runtime, but somehow we still ended up over 16GB. Could two p-1 stage 2s starting on the same card at nearly the same time cause the mem-available computation to be fubar? No - looking at the logs of the 2 jobs, job2 started its stage 2 a full 15 minutes before job1.

Mihai, George, do you have any sense of the stage 2 performance hit from cutting the memalloc to half the current "use all available HBM"? That would mean ~170 buffers for exponents in the above range.
In OpenCL there's no way to get the "free/available memory on the GPU" (there is a mechanism to get the *total* memory, but that's not useful when that total is shared among an unknown number of actors). That lack comes from a philosophical choice OpenCL made to "abstract away" the hardware as much as possible, including the amount of memory available. It acts like this: you want to allocate memory, the GPU doesn't have any left, so fine, we'll allocate it on the host behind your back and lie to you that the alloc succeeded. The difference you see in the second process is because, after a while of allocating on the host, ROCm finally decided that too much is too much and reported a first alloc failure.
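
To make that concrete, here is a minimal host-side sketch (plain OpenCL C API, not gpuowl code; error checking omitted). The standard queries only report device-wide totals, so a second process sees exactly the same numbers and has no idea what is already in use:

Code:
// query_mem.cpp: standard OpenCL can report totals, but not "free right now"
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_ulong totalMem = 0, maxAlloc = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof totalMem, &totalMem, nullptr);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof maxAlloc, &maxAlloc, nullptr);

    printf("total global mem: %llu MB, max single alloc: %llu MB\n",
           (unsigned long long)(totalMem >> 20), (unsigned long long)(maxAlloc >> 20));
    // Neither value says how much is currently unused by other processes,
    // which is exactly the gap described above.
    return 0;
}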

Anyway, by default P-1 will assume it runs as a single process and attempt to allocate "all it can" for itself, but no more than 16GB. When running two processes you're supposed to use -maxAlloc to indicate how much memory *you* allocate to each process (e.g. -maxAlloc 7500 for about a 7.5GB limit).
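
Something like this, for example (just a sketch - the directory names and any device selection are whatever your setup uses; the only point here is capping each instance with -maxAlloc):

Code:
# two gpuowl instances sharing one 16GB card, each run from its own work directory
(cd ~/gpuowl-run1 && ./gpuowl -maxAlloc 7500) &
(cd ~/gpuowl-run2 && ./gpuowl -maxAlloc 7500) &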

BTW, the processes were not idling, they were just running extremely slowly because the buffers were on host memory instead of GPU (because that's a wise ROCm choice, oh yes).

2020-06-06, 22:42   #2283
ewmayer

Quote:
Originally Posted by preda
In OpenCL there's no way to get the "free/available memory on the GPU" (there is a mechanism to get the *total* memory, but that's not useful when that total is shared among an unknown number of actors). That lack comes from a philosophical choice OpenCL made to "abstract away" the hardware as much as possible, including the amount of memory available. It acts like this: you want to allocate memory, the GPU doesn't have any left, so fine, we'll allocate it on the host behind your back and lie to you that the alloc succeeded. The difference you see in the second process is because, after a while of allocating on the host, ROCm finally decided that too much is too much and reported a first alloc failure.
But for gpuowl, based on tests on a variety of GPUs, we know that 2 tasks per GPU is the maximum that makes sense from a performance perspective - correct?

Quote:
Anyway, by default P-1 will assume it runs as a single process and attempt to allocate "all it can" for itself, but no more than 16GB. When running two processes you're supposed to use -maxAlloc to indicate how much memory *you* allocate to each process (e.g. -maxAlloc 7500 for about a 7.5GB limit).

BTW, the processes were not idling, they were just running extremely slowly because the buffers were on host memory instead of GPU (because that's a wise ROCm choice, oh yes).
Thanks, I've added that to my setup scripts.

BTW, once you get beyond the 100x-slowdown range, the difference between 'idling' and 'running very slowly' becomes more or less philosophical. :)

2020-06-06, 22:57   #2284
Xyzzy

Does gpuowl run P-1 automatically or do you have to tell it to?

If you get "Smallest available first-time PRP tests", will that work have P-1 done before it is issued to you?

2020-06-06, 23:36   #2285
paulunderwood

Quote:
Originally Posted by Xyzzy
Does gpuowl run P-1 automatically or do you have to tell it to?

If you get "Smallest available first-time PRP tests", will that work have P-1 done before it is issued to you?
The later versions of gpuowl are auto-geared for P-1.

The server issues the right code if P-1 is required.
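
For illustration only (this is my reading of the worktodo formats - the AID placeholder and the trailing "trial-factored to" and "tests saved" values here are made up), the difference in "code" looks like this:

Code:
PFactor=<AID>,1,2,105809299,-1,77,2
PRP=<AID>,1,2,105809299,-1,77,2

A PFactor line is an explicit P-1 assignment; a PRP line carries the same trailing fields, which is how a client that does its own P-1 can judge whether P-1 is still worthwhile before starting the PRP test.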

2020-06-08, 16:40   #2286
Xyzzy

Some thoughts:

ROCm doesn't support the 5500/5600/5700 "Navi" GPUs, so we are forced to use the amdgpu-pro drivers. Hopefully amdgpu-pro will continue to be supported by gpuowl.

Is there a stable branch of gpuowl for people like us who want to be certain that things work properly? (We also don't try to tune gpuowl. We just use whatever the defaults are.)

We haven't noticed any "kworker" hijacks yet. Is this something rare?

Our GPU does not have a unique device ID.

We just read this entire thread from start to finish. It is full of great information. Summarizing it all would be a monumental task!

2020-06-08, 16:42   #2287
kriesel

Quote:
Originally Posted by preda
No, I would expect that the availability of unique_id depends on the GPU model. RadeonVII has it, others may not have it. If the file /sys/class/drm/cardN/device/unique_id is there it's likely to have the id information, otherwise not.
It turns out that some NVIDIA GPUs also have queryable serial numbers and unique IDs. Per http://on-demand.gputechconf.com/gtc...ement-Apis.pdf both nvidia-smi (command-line) and NVML (an interface for C and Python) can access these. Possibly it's available through OpenCL too. nvidia-smi serial-number availability applies to all Tesla models and most Quadro models (4000 and above yes; 2000 no), but not the popular GTX or RTX cards.
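
For reference, the quick command-line checks (the sysfs path is the one Mihai quoted; the nvidia-smi query fields are standard, though whether they return anything depends on the model as noted above):

Code:
# AMD (Radeon VII has it; other models may not):
cat /sys/class/drm/card0/device/unique_id
# NVIDIA (Tesla and most Quadro 4000+; not GTX/RTX):
nvidia-smi --query-gpu=name,serial,uuid --format=csv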

2020-06-08, 16:52   #2288
kriesel

Quote:
Originally Posted by ewmayer
OK, here's a bug - last night I noticed the wall wattmeter on the new 3 x R7 build was reading ~250W below its normal 750-800W range ... I checked the status of the 3 cards and saw that device 1 was basically idling, despite 2 active gpuowl processes running on it
I have a Radeon VII on Windows 10 that, after a hang, kill-process, and launch of a new process, has decided to run at 570MHz, which is below the nominal minimum. It seems to be doing OK there, in an odd sort of ultra-power-saving mode. Indicated power consumption in GPU-Z is 61W on that GPU.
Code:
2020-06-08 11:38:18 asr2/radeonvii 96495283 OK 36200000  37.51%; 1534 us/it; ETA 1d 01:42; 9b8147bf22397183 (check 0.89s)
2020-06-08 11:43:26 asr2/radeonvii 96495283 OK 36400000  37.72%; 1534 us/it; ETA 1d 01:36; 50d84f8ddd0a49b5 (check 0.88s)
Compare to timings prior to the hang at over triple the watts:
Code:
2020-06-08 00:44:19 asr2/radeonvii 96495283 OK 16200000  16.79%;  733 us/it; ETA 0d 16:21; 3535cc0cfc329a0a (check 0.61s)
2020-06-08 00:46:46 asr2/radeonvii 96495283 OK 16400000  17.00%;  732 us/it; ETA 0d 16:17; d829cbbe1e03c698 (check 0.56s)