mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

Xyzzy 2020-06-06 04:26

[QUOTE=Xyzzy;547254]We will test Centos 7 later tonight.[/QUOTE]CentOS 7 needs a newer compiler:

[C]g++: error: unrecognized command line option ‘-std=c++17’[/C]

preda 2020-06-06 09:14

[QUOTE=kriesel;547256]No joy there either. Thanks for trying.[/QUOTE]

Please try the new commit, I hope it's fixed this time.

kriesel 2020-06-06 18:20

gpuowl-win: v6.11-314-gde84f41 not yet there
 
1 Attachment(s)
[QUOTE=preda;547271]Please try the new commit, I hope it's fixed this time.[/QUOTE]Now the same error message appears, at line 199; see the attached build log file.

ewmayer 2020-06-06 20:17

OK, here's a bug - last night I noticed the wall wattmeter on the new 3 x R7 build was reading ~250W below its normal 750-800W range ... checked the status of the 3 cards and saw that device 1 was basically idling, despite 2 active gpuowl processes running on it:
[code]GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 69.0c 162.0W 1373Mhz 1151Mhz 63.92% manual 250.0W 4% 100%
1 36.0c 64.0W 1547Mhz 351Mhz 40.0% manual 250.0W 99% 100%
2 76.0c 163.0W 1373Mhz 1151Mhz 62.75% manual 250.0W 4% 100% [/code]
I'd seen similar behavior before on my older Haswell-system R7; there a reboot typically solved the problem. (I've never gotten the ROCm --gpureset -d [device id] option to work ... it just hangs.) But this one persisted through multiple reboot/restart-jobs attempts. A look at the tail ends of the 2 logfiles for the jobs on this GPU revealed the problem:
[code]2020-06-05 22:20:47 95b2786172df888e 105809299 P2 using 253 buffers of 44.0 MB each
2020-06-05 22:21:27 95b2786172df888e 105809299 P1 GCD: no factor
2020-06-05 22:20:31 95b2786172df888e 105809351 P2 using blocks [33 - 999] to cover 1476003 primes
2020-06-05 22:20:32 95b2786172df888e 105809351 P2 using 342 buffers of 44.0 MB each[/code]
Job1's P-1 assignment has just begun stage 2 of the factoring attempt; while firing up stage 2 it also does the stage 1 GCD, which has just completed and reported 'no factor'. Job2 started its own P-1 stage 2 a few minutes before. I'm not sure why the 2 jobs ended up using a different number of buffers (chunks of memory storing small even powers of the stage 1 residue - generally the more of these we can fit in memory, the faster stage 2 will be, though there are diminishing returns once we get above ~100 such precomputed powers), but those buffers reveal the bug.

Let's see how much memory they represent: (253+342)*44 = 26180 MB, i.e. ~26 GB, nearly double the 16GB of HBM available on the R7. I'm not sure what ROCm does in such circumstances - does it start swapping to regular RAM, or to disk? - but in any event the result is clear: processing on the card basically comes to a halt. If I hadn't noticed the problem shortly before going to bed, I suspect that card would have idled all night.

To test the hypothesis I killed one of the two P-1 jobs, and voila, the wall wattage immediately shot back up into the normal range, with the ROCm display showing the temp and MCLK settings rising back to normal-under-load values:
[code]GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 73.0c 158.0W 1373Mhz 1151Mhz 62.75% manual 250.0W 4% 100%
1 61.0c 192.0W 1547Mhz 1151Mhz 38.82% manual 250.0W 96% 100%
2 77.0c 159.0W 1373Mhz 1151Mhz 61.96% manual 250.0W 4% 100% [/code]
So that VRAM = 99% in the first display was in fact saying "VRAM is maxed out, we're going on strike". The default stage 2 memalloc of 96% of 16GB is so close to the limit that even a modest additional memalloc puts us over the limit.

Oddly, I'm near-100% sure I've had two P-1 jobs running on the same card, both in stage 2, before on my Haswell R7 - surprised to only be hitting this issue now. The fact that job1 above, which started stage 2 after job2 did, only alloc'ed 253 buffers indicates to me that some kind of how-much-HBM-is-available check is being done at runtime, but somehow we still ended up over 16GB. Could two P-1 stage 2s starting on the same card at nearly the same time cause the mem-available computation to be fubar? No - looking at the logs of the 2 jobs, job2 started its stage 2 a full 15 minutes before job1.

Mihai, George, do you have any sense of the stage 2 performance hit from cutting the memalloc to half the current "use all available HBM"? That would mean ~170 buffers for exponents in the above range.

preda 2020-06-06 21:33

[QUOTE=ewmayer;547307]OK, here's a bug - last night noticed my wall wattmeter on the new 3 x R7 build was running ~250W below its normal 750-800W range ... [snip] ... Mihai, George, do you have any sense of the stage 2 performance hit from cutting the memalloc to half the current "use all available HBM"? That would mean ~170 buffers for exponents in the above range.[/QUOTE]

In OpenCL there's no way to query the "free/available memory on the GPU" (there is a mechanism to get the *total* memory, but that's not useful when that total is shared among an unknown number of actors). That lack comes from a philosophical choice OpenCL made to "abstract away" the hardware as much as possible, including the amount of memory available. It acts like this: you ask to allocate memory, the GPU doesn't have any left, so it allocates on the host behind your back and lies to you that the alloc succeeded. The difference you see in the second process is because, after a while of allocating on the host, ROCm finally decided that too much is too much and reported a first alloc failure.

Anyway, by default P-1 assumes it is the only process running and attempts to allocate "all it can" for itself, but no more than 16GB. When running two processes you're supposed to use -maxAlloc to indicate how much memory *you* allocate to each process (e.g. -maxAlloc 7500 for about a 7.5GB limit).

BTW, the processes were not idling, they were just running extremely slowly because the buffers were on host memory instead of GPU (because that's a wise ROCm choice, oh yes).

ewmayer 2020-06-06 22:42

[QUOTE=preda;547310]In OpenCL there's no way to query the "free/available memory on the GPU" (there is a mechanism to get the *total* memory, but that's not useful when that total is shared among an unknown number of actors). That lack comes from a philosophical choice OpenCL made to "abstract away" the hardware as much as possible, including the amount of memory available. It acts like this: you ask to allocate memory, the GPU doesn't have any left, so it allocates on the host behind your back and lies to you that the alloc succeeded. The difference you see in the second process is because, after a while of allocating on the host, ROCm finally decided that too much is too much and reported a first alloc failure.[/quote]
But for gpuowl, based on tests on a variety of GPUs, we know that 2 tasks per card is the maximum that makes sense from a performance perspective - correct?

[quote]Anyway, by default P-1 will assume it runs in single-process and attempt to allocate "all it can" for itself, but no more than 16GB. When running 2-process you're supposed to use the -maxAlloc to indicate how much memory *you* allocate to each process. (e.g.: -maxAlloc 7500 for about 7.5G limit).

BTW, the processes were not idling, they were just running extremely slowly because the buffers were on host memory instead of GPU (because that's a wise ROCm choice, oh yes).[/QUOTE]
Thanks, I've added that to my setup scripts.

BTW, once you get beyond the 100x-slowdown range, the difference between 'idling' and 'running very slowly' becomes more or less philosophical. :)

Xyzzy 2020-06-06 22:57

Does gpuowl run P-1 automatically or do you have to tell it to?

If you get "Smallest available first-time PRP tests" work, will it have had P-1 done before it is issued to you?

paulunderwood 2020-06-06 23:36

[QUOTE=Xyzzy;547317]Does gpuowl run P-1 automatically or do you have to tell it to?

If you get "Smallest available first-time PRP tests" will that work have p-1 done before it is issued to you?[/QUOTE]

The later versions of gpuowl are auto-geared for P-1.

The server issues the right code if P-1 is required.

Xyzzy 2020-06-08 16:40

Some thoughts:

ROCm doesn't support the 5500/5600/5700 "Navi" GPUs, so we are forced to use the amdgpu-pro drivers. Hopefully amdgpu-pro will continue to be supported by gpuowl.

Is there a stable branch of gpuowl for people like us who want to be certain that things work properly? (We also don't try to tune gpuowl; we just use whatever the defaults are.)

We haven't noticed any "kworker" hijacks yet. Is this something rare?

Our GPU does not have a unique device ID.

We just read this entire thread from start to finish. It is full of great information. Summarizing it all would be a monumental task!

:mike:

kriesel 2020-06-08 16:42

[QUOTE=preda;547233]No, I would expect that the availability of unique_id depends on the GPU model. RadeonVII has it, others may not have it. If the file /sys/class/drm/cardN/device/unique_id is there it's likely to have the id information, otherwise not.[/QUOTE]It turns out that some NVIDIA GPUs also have queryable serial numbers and unique IDs. Per [URL]http://on-demand.gputechconf.com/gtc/2012/presentations/S0238-Tesla-Cluster-Monitoring-Management-Apis.pdf[/URL], nvidia-smi (command-line) and NVML (an interface for C and Python) can access these. Possibly it's available through OpenCL too. nvidia-smi serial number availability applies to all Tesla models and most Quadro models (4000 and above yes; 2000 no), though not the popular GTX or RTX lines.

kriesel 2020-06-08 16:52

[QUOTE=ewmayer;547307]OK, here's a bug - last night noticed my wall wattmeter on the new 3 x R7 build was running ~250W below its normal 750-800W range ... checked status of the 3 cards, saw that device 1 was basically idling, despite 2 active gpuowl processes running on it[/QUOTE]I have a Radeon VII on Windows 10 that, after a hang, kill-process, and launch of a new process, has decided to run at 570MHz, below the nominal minimum. It seems to be doing OK there, in an odd sort of ultra-power-saving mode. Indicated power consumption in GPU-Z is 61W on that GPU.
[CODE]2020-06-08 11:38:18 asr2/radeonvii 96495283 OK 36200000 37.51%; 1534 us/it; ETA 1d 01:42; 9b8147bf22397183 (check 0.89s)
2020-06-08 11:43:26 asr2/radeonvii 96495283 OK 36400000 37.72%; 1534 us/it; ETA 1d 01:36; 50d84f8ddd0a49b5 (check 0.88s)[/CODE]
Compare to timings prior to the hang, at over triple the watts:
[CODE]2020-06-08 00:44:19 asr2/radeonvii 96495283 OK 16200000 16.79%; 733 us/it; ETA 0d 16:21; 3535cc0cfc329a0a (check 0.61s)
2020-06-08 00:46:46 asr2/radeonvii 96495283 OK 16400000 17.00%; 732 us/it; ETA 0d 16:17; d829cbbe1e03c698 (check 0.56s)
[/CODE]

