mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2019-09-21 14:52

[QUOTE=xx005fs;526217]It looks like using an Nvidia GPU to do any work with GPUOWL maxes out 1 of my CPU cores, which has a decent impact on my CPU crunching performance and wastes unnecessary heat and power. I have found some links that supposedly fix the issue (only for CUDA), but I don't know if it is going to work with OpenCL apps like GPUOWL. If one of the CPU cores can be freed, that would waste fewer compute cycles overall.
Here's one of the links anyway.
[URL]https://devtalk.nvidia.com/default/topic/755859/cpu-core-is-busy-while-gpu-runs-its-kernel/[/URL][/QUOTE]Thanks for the confirmation. It's a known problem. By stalling a prime95 or mprime worker, it can impact more than one core of throughput. See also [URL]https://www.mersenneforum.org/showpost.php?p=525251&postcount=1325[/URL] near the end;
[URL]https://www.mersenneforum.org/showpost.php?p=525335&postcount=1334[/URL]
[URL]https://www.mersenneforum.org/showpost.php?p=525346&postcount=1340[/URL]
The last one contains links to possible mitigation approaches.

preda 2019-09-21 15:01

[QUOTE=xx005fs;526217]It looks like using an Nvidia GPU to do any work with GPUOWL maxes out 1 of my CPU cores, which has a decent impact on my CPU crunching performance and wastes unnecessary heat and power. I have found some links that supposedly fix the issue (only for CUDA), but I don't know if it is going to work with OpenCL apps like GPUOWL. If one of the CPU cores can be freed, that would waste fewer compute cycles overall.
Here's one of the links anyway.
[url]https://devtalk.nvidia.com/default/topic/755859/cpu-core-is-busy-while-gpu-runs-its-kernel/[/url][/QUOTE]

What is needed is the equivalent of cudaSetDeviceFlags(cudaDeviceScheduleYield) for OpenCL on Nvidia. Could you maybe open a bug report or feature request with Nvidia, asking them to enable this? (i.e. to offer some way of getting the same result as cudaSetDeviceFlags(cudaDeviceScheduleYield) when using OpenCL). For example, setting some environment variable would be one way. Who knows, maybe Nvidia already offers something like that, only nobody knows about it.
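For reference, the CUDA knob being discussed looks like this on the host side. This is a sketch, not GPUOWL code: the flag must be set before the CUDA context is created, and Nvidia documents no OpenCL counterpart, which is exactly the gap being described. (Requires the CUDA toolkit to build.)

```c
#include <cuda_runtime.h>

int main(void) {
    /* Ask the CUDA runtime to yield the calling CPU thread while it
       waits for GPU work to finish, instead of spin-waiting on a core.
       This is the knob that Nvidia's OpenCL runtime lacks. */
    cudaSetDeviceFlags(cudaDeviceScheduleYield);

    /* cudaDeviceScheduleBlockingSync goes further and blocks on an OS
       synchronization primitive instead of polling at all. */

    /* ... normal CUDA work follows ... */
    return 0;
}
```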

preda 2019-09-21 15:10

-time
 
In recent commits I revamped the kernel profiling (enabled with "-time"). The new profiling (which uses OpenCL events to measure the time spent in each kernel execution) should be more precise and have much less overhead when enabled, thus closer to real-life behavior.

I also enabled -time for P-1.
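For readers curious how event-based kernel timing works, here is a minimal sketch of the standard OpenCL pattern. The API calls are real OpenCL, but the surrounding setup is assumed: `queue` must have been created with CL_QUEUE_PROFILING_ENABLE, and error handling is omitted. This is not GPUOWL's actual code, just the general technique.

```c
#include <CL/cl.h>
#include <stdio.h>

/* Time one kernel execution using an OpenCL profiling event.
   Assumes `queue` was created with CL_QUEUE_PROFILING_ENABLE. */
void time_kernel(cl_command_queue queue, cl_kernel kernel,
                 size_t global_size) {
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong start = 0, end = 0;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof end, &end, NULL);
    clReleaseEvent(ev);

    /* Timestamps are nanoseconds on the device's own clock, so the
       measurement excludes host-side queuing overhead. */
    printf("kernel took %.1f us\n", (end - start) / 1000.0);
}
```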

xx005fs 2019-09-21 16:45

[QUOTE=preda;526231]What is needed is the equivalent of cudaSetDeviceFlags(cudaDeviceScheduleYield) for OpenCL on Nvidia. Could you maybe open a bug report or feature request with Nvidia, asking them to enable this? (i.e. to offer some way of getting the same result as cudaSetDeviceFlags(cudaDeviceScheduleYield) when using OpenCL). For example, setting some environment variable would be one way. Who knows, maybe Nvidia already offers something like that, only nobody knows about it.[/QUOTE]

Since this bug has been around for so long, I seriously doubt Nvidia will listen to customers complaining about it and propose a fix. So there may have to be some other way to mitigate this.

I found this possible workaround proposed by the Hashcat people; I don't know whether something similar could be implemented in GPUOWL without losing performance.

[CODE]Support to utilize multiple different OpenCL device types in parallel
When I redesigned the core that handles workload distribution to multiple different GPUs in the same system (which oclHashcat v2.01 already supported), I thought it would be nice to support not just GPUs of different kinds and speeds but also different device types. What I'm talking about is running a GPU and CPU (and even FPGA) all in parallel and within the same hashcat session.

Beware! This is not always a clever thing to do. For example, the OpenCL runtime of NVidia still has a 5-year-old known bug which creates 100% CPU load on a single core per NVidia GPU (NVidia's OpenCL busy-wait). If you've been using oclHashcat for a while, you may remember the same bug happening with AMD years ago.

Basically, what NVidia is missing here is that they spin instead of yield. Their goal was to increase performance, but in our case there's no gain from a CPU-burning loop. The hashcat kernels run for ~100ms, which is quite a long time for an OpenCL kernel. At that scale, spinning brings only disadvantages, and there's no way to turn it off (only CUDA supports that).

But why is this a problem? If the OpenCL runtime spins on a core to find out whether a GPU kernel has finished, it creates 100% CPU load. Now imagine you have another OpenCL device, e.g. your CPU, also creating 100% CPU load; it will cause problems even though it's legitimate here. The GPU's CPU-burning thread will slow down by 50%, and you end up with a slower GPU rate just by enabling your CPU too (--opencl-device-type 1). For AMD GPUs that's not the case (they fixed that bug years ago).

To help mitigate this issue, I've implemented the following behavior:

Hashcat will try to workaround the problem by sleeping for some precalculated time after the kernel was queued and flushed. This will decrease the CPU load down to less than 10% with almost no impact on cracking performance.
By default, if hashcat detects both CPU and GPU OpenCL devices in your system, the CPU will be disabled. If you really want to run them both in parallel, you can still set the option --opencl-device-types to 1,2 to utilize both device types, CPU and GPU.
Here's some related information:

[URL="https://devtalk.nvidia.com/default/topic/494659/execute-kernels-without-100-cpu-busy-wait-/"]Execute kernels without 100% CPU busy-wait[/URL]
[URL="https://devtalk.nvidia.com/default/topic/507360/increased-cpu-usage-with-last-drivers-starting-from-270-xx-and-continue-with-285-xx/?offset=5"]Increased CPU usage with last drivers starting from 270.xx[/URL]
[/CODE]

kriesel 2019-09-21 17:08

[QUOTE=preda;526231]What is needed is the equivalent of cudaSetDeviceFlags(cudaDeviceScheduleYield) for OpenCL on Nvidia. Could you maybe open a bug report or feature request with Nvidia, asking them to enable this? (i.e. to offer some way of getting the same result as cudaSetDeviceFlags(cudaDeviceScheduleYield) when using OpenCL). For example, setting some environment variable would be one way. Who knows, maybe Nvidia already offers something like that, only nobody knows about it.[/QUOTE]As stated back at post 1340, [URL]https://github.com/openmm/openmm/issues/1541[/URL] mentions getting it down to 10% or 4% of a core via a workaround. See lines 1415-1431 of [URL]https://github.com/hashcat/hashcat/blob/2bc65c2c4d5fc2dfd18f14382bef8a1627e9e2e1/src/opencl.c#L1415-L1431[/URL]

Note, it's reportedly currently a whole core PER INSTANCE; so with 3 NVIDIA GPUs and 3 gpuowl instances, 3 CPU cores are wasted.

Sure, it would be better if NVIDIA fixed the OpenCL performance issue. They've had 8 years and haven't done it yet. Some expect they never will, since (a) they're less interested in supporting OpenCL than CUDA, and (b) newer standards such as Vulkan will take priority. They also created a problem for CUDA compute capability 2.0 cards on Windows back at driver version 306 and have never fixed that either, leaving users of old NVIDIA GPUs to hack the Windows registry and employ wrapper batch files to reduce its impact.

preda 2019-09-22 00:31

-yield
 
In [url]https://github.com/preda/gpuowl/commit/5cca90dab8a817054620cd3eef8a1b016b87d3b8[/url]
I added a new argument -yield
to work around the CUDA busy-wait.

In my testing on AMD it works nicely :), please let me know how it works on Nvidia.

What to watch for:
- time-per-iteration difference when using -yield (yield could be slower; by how much?)
- CPU time taken by gpuowl when using -yield (should be less than 100% of a core; how much less?)
and any other bugs.

[QUOTE=xx005fs;526217]It looks like using an Nvidia GPU to do any work with GPUOWL maxes out 1 of my CPU cores, which has a decent impact on my CPU crunching performance and wastes unnecessary heat and power. I have found some links that supposedly fix the issue (only for CUDA), but I don't know if it is going to work with OpenCL apps like GPUOWL. If one of the CPU cores can be freed, that would waste fewer compute cycles overall.
Here's one of the links anyway.
[url]https://devtalk.nvidia.com/default/topic/755859/cpu-core-is-busy-while-gpu-runs-its-kernel/[/url][/QUOTE]

xx005fs 2019-09-22 01:50

[QUOTE=preda;526245]In [url]https://github.com/preda/gpuowl/commit/5cca90dab8a817054620cd3eef8a1b016b87d3b8[/url]
I added a new argument -yield
to work around the CUDA busy-wait.

In my testing on AMD it works nicely :), please let me know how it works on Nvidia.

What to watch for:
- time-per-iteration difference when using -yield (i.e. yield could be slower, how much?)
- CPU time taken by gpuowl when using -yield (should be less (than 100%), how much?)
and other possible bugs.[/QUOTE]

Awesome! I'll test as soon as I can get on my Linux system, or if I can somehow figure out how to build with MSYS2 :D

xx005fs 2019-09-22 03:57

Here's the run without the -yield argument; as expected, CPU usage is 100% on a single core, on an Nvidia Titan V with 1040MHz HBM and 1355MHz core clock. Performance is as expected for this GPU: I'm getting the same throughput as on Windows at the same clock speeds.

[CODE]2019-09-21 21:02:05 gpuowl v6.11-5-g5cca90d
2019-09-21 21:02:05 Note: no config.txt file found
2019-09-21 21:02:05 config: -use ORIG_X2
2019-09-21 21:02:05 90015581 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.17 bits/word
2019-09-21 21:02:05 OpenCL args "-DEXP=90015581u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.c75e516d40cbdp+0 -DIWEIGHT_STEP=0x1.1fd656809b73bp-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 21:02:05

2019-09-21 21:02:05 OpenCL compilation in 3 ms
2019-09-21 21:02:08 90015581 OK 409500 0.45%; 825 us/sq; ETA 0d 20:32; 8192f49fec60e30e (check 0.50s)
2019-09-21 21:02:41 90015581 450000 0.50%; 825 us/sq; ETA 0d 20:32; b4da35d30644db86
2019-09-21 21:03:23 90015581 OK 500000 0.56%; 825 us/sq; ETA 0d 20:30; 2f704aae47125430 (check 0.50s)
2019-09-21 21:04:04 90015581 550000 0.61%; 825 us/sq; ETA 0d 20:30; be1e1cfa749a826b
2019-09-21 21:04:27 Stopping, please wait..
2019-09-21 21:04:27 90015581 OK 577000 0.64%; 825 us/sq; ETA 0d 20:31; c64f67f7c2a1ca00 (check 0.50s)
2019-09-21 21:04:27 Exiting because "stop requested"
2019-09-21 21:04:27 Bye[/CODE] Now with the -yield workaround, on the same GPU at the same clocks. CPU usage dropped from a fully maxed-out core to around 88% with my Ryzen 1700 clocked at 3.85GHz, suggesting that it sort of works. [STRIKE]However, the throughput seemed to improve by around 5%, which is definitely odd since I was actually expecting a reduction in speed rather than an increase.[/STRIKE] Looking forward to someone testing on an older GPU, since I won't be dual-booting my 1070 system.

[CODE]2019-09-21 21:05:59 gpuowl v6.11-5-g5cca90d
2019-09-21 21:05:59 Note: no config.txt file found
2019-09-21 21:05:59 config: -use ORIG_X2 -yield
2019-09-21 21:05:59 90015581 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.17 bits/word
2019-09-21 21:05:59 OpenCL args "-DEXP=90015581u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.c75e516d40cbdp+0 -DIWEIGHT_STEP=0x1.1fd656809b73bp-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 21:05:59

2019-09-21 21:05:59 OpenCL compilation in 3 ms
2019-09-21 21:06:02 90015581 OK 578000 0.64%; 824 us/sq; ETA 0d 20:28; 7ad026792b5e2e37 (check 0.50s)
2019-09-21 21:06:20 90015581 600000 0.67%; 824 us/sq; ETA 0d 20:27; 496d287691eb3176
2019-09-21 21:07:01 90015581 650000 0.72%; 823 us/sq; ETA 0d 20:25; e04e150f2e8bee83
2019-09-21 21:07:42 90015581 700000 0.78%; 823 us/sq; ETA 0d 20:25; 818852407c468067
2019-09-21 21:07:55 Stopping, please wait..
2019-09-21 21:07:56 90015581 OK 716000 0.80%; 824 us/sq; ETA 0d 20:26; e38f4cd309745649 (check 0.50s)
2019-09-21 21:07:56 Exiting because "stop requested"
2019-09-21 21:07:56 Bye[/CODE]

kriesel 2019-09-22 04:05

[QUOTE=preda;526245]In [URL]https://github.com/preda/gpuowl/commit/5cca90dab8a817054620cd3eef8a1b016b87d3b8[/URL]
I added a new argument -yield
to work around the CUDA busy-wait.

In my testing on AMD it works nicely :), please let me know how it works on Nvidia.

What to watch for:
- time-per-iteration difference when using -yield (i.e. yield could be slower, how much?)
- CPU time taken by gpuowl when using -yield (should be less (than 100%), how much?)
and other possible bugs.[/QUOTE]
On GTX1080Ti, 226M P-1,

without -yield, 99 seconds between updates;

with -yield: GPU idle, CPU as busy as before, no progress shown in 25 minutes, and no response to Ctrl-C in a further 10 minutes. Terminating the process and restarting shows no iteration advance.[CODE]2019-09-21 22:23:37 226000127 P1 30000 1.15%; 10149 us/sq; ETA 0d 07:17; 61772c9af6a02736
2019-09-21 22:23:37 37.42% tailFused : 3706 us/call x 10000 calls
2019-09-21 22:23:37 17.66% carryFused : 3485 us/call x 5021 calls
2019-09-21 22:23:37 15.98% carryFusedMul : 3180 us/call x 4978 calls
2019-09-21 22:23:37 7.44% fftMiddleIn : 737 us/call x 10000 calls
2019-09-21 22:23:37 7.40% fftMiddleOut : 733 us/call x 10000 calls
2019-09-21 22:23:37 7.11% transposeW : 704 us/call x 10000 calls
2019-09-21 22:23:37 6.98% transposeH : 692 us/call x 10000 calls
2019-09-21 22:23:37 Total time 99.049 s
2019-09-21 22:25:20 226000127 P1 40000 1.53%; 10257 us/sq; ETA 0d 07:20; 0bb8613655726c69
2019-09-21 22:25:20 37.45% tailFused : 3726 us/call x 10000 calls
2019-09-21 22:25:20 17.57% carryFused : 3504 us/call x 4989 calls
2019-09-21 22:25:20 16.09% carryFusedMul : 3197 us/call x 5009 calls
2019-09-21 22:25:20 7.45% fftMiddleIn : 741 us/call x 10000 calls
2019-09-21 22:25:20 7.40% fftMiddleOut : 737 us/call x 10000 calls
2019-09-21 22:25:20 7.08% transposeW : 704 us/call x 10000 calls
2019-09-21 22:25:20 6.95% transposeH : 692 us/call x 10000 calls
2019-09-21 22:25:20 Total time 99.499 s
2019-09-21 22:25:28 Stopping, please wait..
2019-09-21 22:25:29 Exiting because "stop requested"
2019-09-21 22:25:29 Bye
2019-09-21 22:25:52 Note: no config.txt file found
2019-09-21 22:25:52 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -time -yield
2019-09-21 22:25:52 226000127 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.40 bits/word
2019-09-21 22:25:52 OpenCL args "-DEXP=226000127u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xc.2ae2830a9093p-3 -DIWEIGHT_STEP=0xa.85125811a707p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 22:25:53

2019-09-21 22:25:53 OpenCL compilation in 31 ms
2019-09-21 22:25:57 226000127 P1 B1=1810000, B2=41630000; 2611059 bits; starting at 40801
2019-09-21 23:02:52 Note: no config.txt file found
2019-09-21 23:02:52 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -time -yield
2019-09-21 23:02:52 226000127 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.40 bits/word
2019-09-21 23:02:53 OpenCL args "-DEXP=226000127u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xc.2ae2830a9093p-3 -DIWEIGHT_STEP=0xa.85125811a707p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 23:02:53

2019-09-21 23:02:53 OpenCL compilation in 15 ms
2019-09-21 23:02:57 226000127 P1 B1=1810000, B2=41630000; 2611059 bits; starting at 40801[/CODE]

preda 2019-09-22 12:14

[QUOTE=kriesel;526256]On GTX1080Ti, 226M P-1,
with -yield, gpu idle, cpu as busy as before, no progress shown in 25 minutes, does not respond to Ctrl-C in a further 10 minutes. Terminate process and restart shows no iterations advance.[/QUOTE]

Yes, that seems pretty broken. I'm not sure why yet; I did push a new commit -- could you try it and tell me how it works? (please check both with and without -time)

There's no need to wait 10 minutes -- if it doesn't make the usual progress, or doesn't react to Ctrl-C, it's broken.

preda 2019-09-22 12:18

[QUOTE=xx005fs;526254][...] The CPU usage dropped from a core fully maxed out to around 88%[/QUOTE]

I increased the sleep time on yield to attempt to reduce CPU usage more. Could you try again please? (with the newest revision)

