mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

2020-06-06, 04:26   #2278
Xyzzy

"Mike"
Aug 2002

7,561 Posts

Quote:
 Originally Posted by Xyzzy We will test Centos 7 later tonight.
CentOS 7 needs a newer compiler:

Code:
g++: error: unrecognized command line option ‘-std=c++17’

2020-06-06, 09:14   #2279
preda

"Mihai Preda"
Apr 2015

2×19×29 Posts

Quote:
 Originally Posted by kriesel No joy there either. Thanks for trying.
Please try the new commit, I hope it's fixed this time.

2020-06-06, 18:20   #2280
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3×29×47 Posts
gpuowl-win: v6.11-314-gde84f41 not yet there

Quote:
 Originally Posted by preda Please try the new commit, I hope it's fixed this time.
Now it's line 199 with the same error message; see the attached build log file.
Attached Files
 build-log.txt (10.6 KB, 102 views)

2020-06-06, 20:17   #2281
ewmayer

∂2ω=0
Sep 2002
República de California

2^4·7·101 Posts

OK, here's a bug. Last night I noticed my wall wattmeter on the new 3 x R7 build was running ~250W below its normal 750-800W range ... checked status of the 3 cards and saw that device 1 was basically idling, despite 2 active gpuowl processes running on it:

Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
0    69.0c  162.0W  1373Mhz  1151Mhz  63.92%  manual  250.0W  4%     100%
1    36.0c  64.0W   1547Mhz  351Mhz   40.0%   manual  250.0W  99%    100%
2    76.0c  163.0W  1373Mhz  1151Mhz  62.75%  manual  250.0W  4%     100%

I'd seen similar behavior before on my older Haswell-system R7; there a reboot typically solved the problem. (I've never gotten the ROCm --gpureset -d [device id] option to work ... it just hangs.) But this one persisted through multiple reboot/restart-jobs attempts. A look at the tail ends of the 2 logfiles for the jobs on this GPU revealed the problem:

Code:
2020-06-05 22:20:47 95b2786172df888e 105809299 P2 using 253 buffers of 44.0 MB each
2020-06-05 22:21:27 95b2786172df888e 105809299 P1 GCD: no factor
2020-06-05 22:20:31 95b2786172df888e 105809351 P2 using blocks [33 - 999] to cover 1476003 primes
2020-06-05 22:20:32 95b2786172df888e 105809351 P2 using 342 buffers of 44.0 MB each

Job 1's P-1 assignment has just begun stage 2 of the factoring attempt; while firing up stage 2 it also does the stage 1 GCD, which has just completed and reports 'no factor'. Job 2 started its own P-1 stage 2 a few minutes earlier. I'm not sure why the 2 jobs ended up using a different number of buffers (chunks of memory storing small even powers of the stage 1 residue - generally the more of these we can fit in memory, the faster stage 2 will be, though there are diminishing returns above ~100 such precomputed powers), but those buffers reveal the bug. Let's see how much memory they represent: (253+342)*44 = 26180 MB, roughly 26 GB - over 1.5x the 16 GB of HBM available on the R7.

I'm not sure what ROCm does in such circumstances - does it start swapping to regular RAM, or to disk? - but in any event the result is clear: processing on the card basically comes to a halt. If I hadn't noticed the problem shortly before going to bed, I suspect that card would have idled all night. To test the hypothesis I killed one of the two P-1 jobs; voilà, the wall wattage immediately shot back up into the normal range, and the ROCm display showed the temp and MCLK settings rising back to normal-under-load:

Code:
GPU  Temp   AvgPwr  SCLK     MCLK     Fan     Perf    PwrCap  VRAM%  GPU%
0    73.0c  158.0W  1373Mhz  1151Mhz  62.75%  manual  250.0W  4%     100%
1    61.0c  192.0W  1547Mhz  1151Mhz  38.82%  manual  250.0W  96%    100%
2    77.0c  159.0W  1373Mhz  1151Mhz  61.96%  manual  250.0W  4%     100%

So that VRAM = 99% in the first display was in fact saying "VRAM is maxed out, we're going on strike". The default stage 2 memalloc of 96% of 16 GB is so close to the limit that even a modest additional memalloc puts us over it. Oddly, I'm near-100% sure that I've had two P-1 jobs running on the same card, both in stage 2, before on my Haswell R7 - surprised to only have hit this issue now. The fact that job 1 above, which started stage 2 after job 2 did, only alloc'ed 253 buffers indicates to me that some kind of how-much-HBM-is-available check is being done at runtime, but somehow we still ended up over 16 GB. Perhaps if two P-1 stage 2s running on the same card start at nearly the same time, the mem-available computation gets fubar'ed? No - looking at the logs of the 2 jobs, job 2 started its stage 2 a full 15 minutes before job 1.

Mihai, George, do you have any sense of the stage 2 performance hit from cutting the memalloc to half the current "use all available HBM"? That would mean ~170 buffers for exponents in the above range.
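The over-allocation arithmetic from the two log excerpts above can be checked directly; the buffer counts and 44 MB buffer size come straight from the logs, and the 16 GB figure is the Radeon VII's HBM2 capacity:

```python
# Buffer counts and size, taken from the two P-1 stage 2 job logs above.
BUFFER_MB = 44.0      # size of each stage 2 buffer, MB
job1_buffers = 253
job2_buffers = 342
HBM_GB = 16           # Radeon VII HBM2 capacity

total_mb = (job1_buffers + job2_buffers) * BUFFER_MB
total_gb = total_mb / 1024
print(f"combined stage 2 buffers: {total_mb:.0f} MB ~= {total_gb:.1f} GB")
print(f"oversubscription vs {HBM_GB} GB HBM: {total_gb / HBM_GB:.2f}x")
```

So the two jobs together asked for about 25.6 GB against 16 GB of HBM, roughly 1.6x oversubscribed.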
2020-06-06, 21:33   #2282
preda

"Mihai Preda"
Apr 2015

2·19·29 Posts

Quote:
 Originally Posted by ewmayer OK, here's a bug - last night noticed my wall wattmeter on the new 3 x R7 build was running ~250W below its normal 750-800W range ... [snip] ... Mihai, George, do you have any sense of the stage 2 performance hit from cutting the memalloc to half the current "use all available HBM"? That would mean ~170 buffers for exponents in the above range.
In OpenCL there's no way to query the free/available memory on the GPU (there is a mechanism to get the *total* memory, but that's not useful when that total is shared among an unknown number of actors). That lack comes from a philosophical choice OpenCL made to "abstract away" the hardware as much as possible, including the amount of memory available. It acts like this: you want to allocate memory, the GPU doesn't have any more, so fine, it allocates on the host behind your back and lies to you that the alloc succeeded. The difference you see in the second process is because, after a while of allocating on the host, ROCm finally decided that too much is too much and reported a first alloc failure.

Anyway, by default P-1 will assume it runs single-process and attempt to allocate "all it can" for itself, but no more than 16GB. When running 2-process you're supposed to use the -maxAlloc option to indicate how much memory *you* allocate to each process (e.g. -maxAlloc 7500 for about a 7.5 GB limit).

BTW, the processes were not idling; they were just running extremely slowly because the buffers were in host memory instead of on the GPU (because that's a wise ROCm choice, oh yes).
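As a back-of-the-envelope sketch of picking a -maxAlloc value when splitting one GPU between processes: the helper name and the 6% headroom reserved for the driver/runtime are my assumptions for illustration, not gpuowl behavior.

```python
# Hypothetical helper for choosing a per-process -maxAlloc value (in MB)
# when N gpuowl processes share one GPU, per preda's advice above: cap each
# process explicitly instead of letting both assume single-process defaults.
def per_process_alloc_mb(hbm_gb: float, n_processes: int,
                         headroom_frac: float = 0.06) -> int:
    """Split usable HBM evenly, reserving some headroom for the runtime."""
    usable_mb = hbm_gb * 1024 * (1.0 - headroom_frac)
    return int(usable_mb / n_processes)

# Radeon VII (16 GB), two P-1 jobs: close to the -maxAlloc 7500 example above.
print(per_process_alloc_mb(16, 2))
```

With these assumptions the result lands near 7700 MB, in the same ballpark as the -maxAlloc 7500 suggestion quoted above.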

2020-06-06, 22:42   #2283
ewmayer
∂2ω=0

Sep 2002
República de California

11,312 Posts

Quote:
 Originally Posted by preda In OpenCL there's no way to get the "free/available memory on the GPU" (there is a mechanism to get the *total* memory, but that's not useful when that total is shared among an unknown number of actors). [snip]
But for gpuowl, based on tests on a variety of GPUs, we know that 2 tasks is the maximum that makes sense from a performance perspective, correct?

Quote:
 Anyway, by default P-1 will assume it runs in single-process and attempt to allocate "all it can" for itself, but no more than 16GB. When running 2-process you're supposed to use the -maxAlloc to indicate how much memory *you* allocate to each process. [snip]
Thanks, I've added that to my setup scripts.

BTW, once you get beyond the 100x-slowdown range, the difference between 'idling' and 'running very slowly' becomes more or less philosophical. :)

2020-06-06, 22:57   #2284
Xyzzy

"Mike"
Aug 2002

7,561 Posts

Does gpuowl run P-1 automatically, or do you have to tell it to? If you get "Smallest available first-time PRP tests", will that work have P-1 done before it is issued to you?
2020-06-06, 23:36   #2285
paulunderwood

Sep 2002
Database er0rr

3,286 Posts

Quote:
 Originally Posted by Xyzzy Does gpuowl run P-1 automatically or do you have to tell it to? If you get "Smallest available first-time PRP tests" will that work have p-1 done before it is issued to you?
The later versions of gpuowl are auto-geared for P-1.

The server issues the right code if P-1 is required.

Last fiddled with by paulunderwood on 2020-06-06 at 23:39

2020-06-08, 16:40   #2286
Xyzzy

"Mike"
Aug 2002

7,561 Posts

Some thoughts:

ROCm doesn't support the 5500/5600/5700 "Navi" GPUs, so we are forced to use the amdgpu-pro drivers. Hopefully amdgpu-pro will continue to be supported by gpuowl.

Is there a stable branch of gpuowl for people like us who want to be certain that things work properly? (We also don't try to tune gpuowl. We just use whatever the defaults are.)

We haven't noticed any "kworker" hijacks yet. Is this something rare?

Our GPU does not have a unique device ID.

We just read this entire thread from start to finish. It is full of great information. Summarizing it all would be a monumental task!
2020-06-08, 16:42   #2287
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3·29·47 Posts

Quote:
 Originally Posted by preda No, I would expect that the availability of unique_id depends on the GPU model. RadeonVII has it, others may not have it. If the file /sys/class/drm/cardN/device/unique_id is there it's likely to have the id information, otherwise not.
It turns out that some NVIDIA GPUs also have queryable serial numbers and unique IDs. Per http://on-demand.gputechconf.com/gtc...ement-Apis.pdf, nvidia-smi (command-line) and NVML (an interface for C and Python) can access these. Possibly it's available through OpenCL too. nvidia-smi serial number availability applies to all Tesla models and most Quadro models (4000 and above yes; 2000 no), but not the popular GTX or RTX lines.
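For the AMD side discussed in the quote above, here is a small sketch of reading the sysfs unique_id files. The function name is mine; the /sys/class/drm/cardN/device/unique_id path comes from the quoted post, and the base directory is parameterized so the scan can be exercised without a GPU present.

```python
from pathlib import Path

def gpu_unique_ids(base: str = "/sys/class/drm") -> dict:
    """Collect cardN -> unique_id for GPUs that expose one via sysfs.

    Per the quoted post, the unique_id file is present on Radeon VII but
    may be absent on other GPU models, so missing files are simply skipped.
    """
    ids = {}
    for dev in sorted(Path(base).glob("card[0-9]*")):
        id_file = dev / "device" / "unique_id"
        if id_file.is_file():
            ids[dev.name] = id_file.read_text().strip()
    return ids

print(gpu_unique_ids())  # empty dict on machines with no GPU exposing unique_id
```

On a multi-GPU Linux box this gives a stable way to tell otherwise-identical cards apart across reboots.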

2020-06-08, 16:52   #2288
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

3×29×47 Posts

Quote:
 Originally Posted by ewmayer OK, here's a bug - last night noticed my wall wattmeter on the new 3 x R7 build was running ~250W below its normal 750-800W range ... checked status of the 3 cards, saw that device 1 was basically idling, despite 2 active gpuowl processes running on it
I have a Radeon VII on Windows 10 that has decided, after a hang, a process kill, and the launch of a new process, to run at 570 MHz, which is below the nominal minimum. It seems to be doing OK there, in an odd sort of ultra-power-saving mode. Indicated power consumption in GPU-Z is 61 W on that GPU.
Code:
2020-06-08 11:38:18 asr2/radeonvii 96495283 OK 36200000  37.51%; 1534 us/it; ETA 1d 01:42; 9b8147bf22397183 (check 0.89s)
2020-06-08 11:43:26 asr2/radeonvii 96495283 OK 36400000  37.72%; 1534 us/it; ETA 1d 01:36; 50d84f8ddd0a49b5 (check 0.88s)
Compare to timings prior to the hang at over triple the watts:
Code:
2020-06-08 00:44:19 asr2/radeonvii 96495283 OK 16200000  16.79%;  733 us/it; ETA 0d 16:21; 3535cc0cfc329a0a (check 0.61s)
2020-06-08 00:46:46 asr2/radeonvii 96495283 OK 16400000  17.00%;  732 us/it; ETA 0d 16:17; d829cbbe1e03c698 (check 0.56s)
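The per-iteration timings in the two log excerpts above quantify the throttling. The 1534 and 733 us/it figures are straight from the logs; the ~190 W "normal" power is my estimate, inferred only from "over triple the watts" against the 61 W GPU-Z reading.

```python
# Throughput and power comparison for the throttled (570 MHz) vs normal state.
normal_us, throttled_us = 733, 1534   # us/iteration, from the log excerpts
throttled_w = 61                      # GPU-Z indicated power while throttled
normal_w = 190                        # assumption: "over triple the watts"

slowdown = throttled_us / normal_us
print(f"slowdown: {slowdown:.2f}x slower per iteration")
print(f"power ratio: {normal_w / throttled_w:.1f}x")
```

So the card ran a bit over 2x slower at roughly a third of the power - slow, but not proportionally power-efficient for finishing the exponent.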

