#1387
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts
Quote:
https://www.mersenneforum.org/showpo...postcount=1334
https://www.mersenneforum.org/showpo...postcount=1340
The last one contains links to possible mitigation approaches.
#1388
"Mihai Preda"
Apr 2015
3×457 Posts
Quote:
#1389
"Mihai Preda"
Apr 2015
3·457 Posts
In recent commits I revamped the kernel profiling (enabled with "-time"). The new profiling uses OpenCL events to measure the time spent in each kernel execution. It should be more precise and add much less overhead when enabled, so the measured timings are closer to real-life runs.
I also enabled -time for P-1.
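For readers unfamiliar with event-based profiling, here is a minimal sketch of the bookkeeping involved. It is not gpuowl's code. It assumes each kernel launch yields a start and an end timestamp in nanoseconds (OpenCL exposes such counters through clGetEventProfilingInfo on a queue created with CL_QUEUE_PROFILING_ENABLE) and folds them into a per-kernel summary: share of total time, average time per call, and call count. The function name `summarize` and the sample tuples are hypothetical.

```python
from collections import defaultdict


def summarize(samples):
    """Fold (kernel_name, start_ns, end_ns) samples into a per-kernel report.

    Returns (rows, total_seconds); each row is
    (name, percent_of_total, avg_us_per_call, calls), busiest kernel first.
    """
    total_ns = defaultdict(int)
    calls = defaultdict(int)
    for name, start_ns, end_ns in samples:
        total_ns[name] += end_ns - start_ns
        calls[name] += 1

    grand_ns = sum(total_ns.values())
    rows = []
    for name in sorted(total_ns, key=total_ns.get, reverse=True):
        percent = 100.0 * total_ns[name] / grand_ns
        avg_us = total_ns[name] / calls[name] / 1000.0
        rows.append((name, percent, avg_us, calls[name]))
    return rows, grand_ns / 1e9


# Hypothetical timestamps, not measurements from any real GPU.
samples = [
    ("tailFused", 0, 3000),
    ("carryFused", 3000, 5000),
    ("tailFused", 5000, 8000),
]
rows, total_s = summarize(samples)
for name, percent, avg_us, n in rows:
    print(f"{percent:6.2f}% {name:<12}: {avg_us:.0f} us/call x {n} calls")
```

Because the timestamps come from the device rather than from host-side timers around each launch, the host does not have to synchronize after every kernel just to measure it, which is one plausible reason for the lower overhead.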
#1390
"Eric"
Jan 2018
USA
2²×53 Posts
Quote:
I found this possible workaround proposed by the Hashcat people. I don't know whether something similar can be implemented in GPUOWL without losing performance.
Code:
Support to utilize multiple different OpenCL device types in parallel

When I've redesigned the core that handles the workload distribution to multiple different GPUs in the same system, which oclHashcat v2.01 already supported. I thought it would be nice to not just support for GPUs of different kinds and speed but also support different device types. What I'm talking about is running a GPU and CPU (and even FPGA) all in parallel and within the same hashcat session.

Beware! This is not always a clever thing to do. For example with the OpenCL runtime of NVidia, they still have a 5-year-old-known-bug which creates 100% CPU load on a single core per NVidia GPU (NVidia's OpenCL busy-wait). If you're using oclHashcat for quite a while you may remember the same bug happened to AMD years ago.

Basically, what NVidia is missing here is that they use spinning instead of yielding. Their goal was to increase the performance but in our case there's actually no gain from having a CPU burning loop. The hashcat kernels run for ~100ms and that's quite a long time for an OpenCL kernel. At such a scale, spinning creates only disadvantages and there's no way to turn it off (Only CUDA supports that).

But why is this a problem? If the OpenCL runtime spins on a core to find out if a GPU kernel is finished it creates 100% CPU load. Now imagine you have another OpenCL device, e.g. your CPU, creating also 100% CPU load, it will cause problems even if it's legitimate to do that here. The GPU's CPU-burning thread will slow down by 50%, and you end up with a slower GPU rate just by enabling your CPU too (--opencl-device-type 1). For AMD GPU that's not the case (they fixed that bug years ago.)

To help mitigate this issue, I've implemented the following behavior:
- Hashcat will try to workaround the problem by sleeping for some precalculated time after the kernel was queued and flushed. This will decrease the CPU load down to less than 10% with almost no impact on cracking performance.
- By default, if hashcat detects both CPU and GPU OpenCL devices in your system, the CPU will be disabled. If you really want to run them both in parallel, you can still set the option --opencl-device-types to 1,2 to utilize both device types, CPU and GPU.

Here's some related information:
- Execute kernels without 100% CPU busy-wait
- Increased CPU usage with last drivers starting from 270.xx
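The mitigation described in the quoted text (sleep for a precalculated time after the kernel is queued and flushed, then check for completion) can be sketched as below. This is a hedged illustration only, not hashcat's or gpuowl's code: `wait_sleep_then_poll`, `is_done`, `pre_sleep` and `poll_interval` are hypothetical names, and `is_done` stands in for whatever completion query the runtime offers. The demo uses a fake clock so that it is deterministic.

```python
import time


def wait_sleep_then_poll(is_done, pre_sleep, poll_interval, sleep=time.sleep):
    """Wait for is_done() to become true without spinning at full speed.

    pre_sleep:     how long to sleep up front (e.g. a bit less than the
                   kernel's measured run time), in the unit `sleep` expects.
    poll_interval: pause between completion checks after the first sleep.
    Returns the number of checks that found the work still unfinished.
    """
    sleep(pre_sleep)          # the CPU core is free during this time
    misses = 0
    while not is_done():      # short, sleepy polling instead of a hot loop
        misses += 1
        sleep(poll_interval)
    return misses


# Simulated run with a fake clock in microseconds: the pretend kernel
# finishes at t = 100000 us, i.e. roughly the ~100 ms mentioned above.
clock = [0]


def fake_sleep(us):
    clock[0] += us


misses = wait_sleep_then_poll(
    lambda: clock[0] >= 100_000,
    pre_sleep=90_000,
    poll_interval=500,
    sleep=fake_sleep,
)
print(misses, clock[0])
```

The trade-off in this sketch is latency: completion may be noticed up to one `poll_interval` late, so the approach fits long-running kernels better than very short ones.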
#1391
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts
Quote:
Last fiddled with by kriesel on 2019-09-21 at 17:18
#1392
"Mihai Preda"
Apr 2015
3×457 Posts
In https://github.com/preda/gpuowl/comm...8a1b016b87d3b8 I added a new argument -yield to work around the CUDA busy-wait. In my testing on AMD it works nicely :), please let me know how it works on Nvidia.
What to watch for:
- time-per-iteration difference when using -yield (i.e. yield could be slower; by how much?)
- CPU time taken by gpuowl when using -yield (should be less than 100%; by how much?)
- any other possible bugs.
Quote:
#1393
"Eric"
Jan 2018
USA
212₁₀ Posts
Quote:
#1394
"Eric"
Jan 2018
USA
2²×53 Posts
Here's the run without the -yield argument. As expected, CPU usage is 100% on a single core, on an Nvidia Titan V with 1040 MHz HBM and 1355 MHz core. This is the expected performance for this GPU: I get the same throughput as on Windows at the same clock speeds.
Code:
2019-09-21 21:02:05 gpuowl v6.11-5-g5cca90d
2019-09-21 21:02:05 Note: no config.txt file found
2019-09-21 21:02:05 config: -use ORIG_X2
2019-09-21 21:02:05 90015581 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.17 bits/word
2019-09-21 21:02:05 OpenCL args "-DEXP=90015581u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.c75e516d40cbdp+0 -DIWEIGHT_STEP=0x1.1fd656809b73bp-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 21:02:05
2019-09-21 21:02:05 OpenCL compilation in 3 ms
2019-09-21 21:02:08 90015581 OK 409500 0.45%; 825 us/sq; ETA 0d 20:32; 8192f49fec60e30e (check 0.50s)
2019-09-21 21:02:41 90015581 450000 0.50%; 825 us/sq; ETA 0d 20:32; b4da35d30644db86
2019-09-21 21:03:23 90015581 OK 500000 0.56%; 825 us/sq; ETA 0d 20:30; 2f704aae47125430 (check 0.50s)
2019-09-21 21:04:04 90015581 550000 0.61%; 825 us/sq; ETA 0d 20:30; be1e1cfa749a826b
2019-09-21 21:04:27 Stopping, please wait..
2019-09-21 21:04:27 90015581 OK 577000 0.64%; 825 us/sq; ETA 0d 20:31; c64f67f7c2a1ca00 (check 0.50s)
2019-09-21 21:04:27 Exiting because "stop requested"
2019-09-21 21:04:27 Bye
Code:
2019-09-21 21:05:59 gpuowl v6.11-5-g5cca90d
2019-09-21 21:05:59 Note: no config.txt file found
2019-09-21 21:05:59 config: -use ORIG_X2 -yield
2019-09-21 21:05:59 90015581 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.17 bits/word
2019-09-21 21:05:59 OpenCL args "-DEXP=90015581u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x1.c75e516d40cbdp+0 -DIWEIGHT_STEP=0x1.1fd656809b73bp-1 -DWEIGHT_BIGSTEP=0x1.ae89f995ad3adp+0 -DIWEIGHT_BIGSTEP=0x1.306fe0a31b715p-1 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 21:05:59
2019-09-21 21:05:59 OpenCL compilation in 3 ms
2019-09-21 21:06:02 90015581 OK 578000 0.64%; 824 us/sq; ETA 0d 20:28; 7ad026792b5e2e37 (check 0.50s)
2019-09-21 21:06:20 90015581 600000 0.67%; 824 us/sq; ETA 0d 20:27; 496d287691eb3176
2019-09-21 21:07:01 90015581 650000 0.72%; 823 us/sq; ETA 0d 20:25; e04e150f2e8bee83
2019-09-21 21:07:42 90015581 700000 0.78%; 823 us/sq; ETA 0d 20:25; 818852407c468067
2019-09-21 21:07:55 Stopping, please wait..
2019-09-21 21:07:56 90015581 OK 716000 0.80%; 824 us/sq; ETA 0d 20:26; e38f4cd309745649 (check 0.50s)
2019-09-21 21:07:56 Exiting because "stop requested"
2019-09-21 21:07:56 Bye
Last fiddled with by xx005fs on 2019-09-22 at 04:10
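As a rough cross-check of the figures in these logs, the sketch below recomputes bits/word and the ETA from the exponent, the FFT size and the us/sq value. It assumes that 5120K means 5120 × 1024 words and that the ETA is simply the remaining iterations multiplied by the time per iteration. gpuowl's own formula and rounding are not shown in this thread, and the logged us/sq is itself rounded, so agreement to within about a minute is all one should expect. The function names are hypothetical.

```python
def bits_per_word(exponent, fft_k):
    """Average number of exponent bits per FFT word (fft_k in units of 1024 words)."""
    return exponent / (fft_k * 1024)


def eta(exponent, iteration, us_per_sq):
    """Remaining time formatted as 'Dd HH:MM', rounded to the nearest minute."""
    remaining_s = (exponent - iteration) * us_per_sq / 1e6
    minutes = round(remaining_s / 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    return f"{days}d {hours:02d}:{mins:02d}"


print(f"{bits_per_word(90015581, 5120):.2f} bits/word")  # log shows 17.17
print(eta(90015581, 409500, 825))                        # log shows 0d 20:32
print(eta(90015581, 578000, 824))                        # log shows 0d 20:28
```

On these numbers, the two runs differ by about 1 us/sq out of roughly 825, which is on the order of 0.1%.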
#1395
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts
Quote:
Without -yield: 99 seconds between updates. With -yield: GPU idle, CPU as busy as before, no progress shown in 25 minutes, and no response to Ctrl-C in a further 10 minutes. Terminating the process and restarting shows no advance in iterations.
Code:
2019-09-21 22:23:37 226000127 P1 30000 1.15%; 10149 us/sq; ETA 0d 07:17; 61772c9af6a02736
2019-09-21 22:23:37 37.42% tailFused : 3706 us/call x 10000 calls
2019-09-21 22:23:37 17.66% carryFused : 3485 us/call x 5021 calls
2019-09-21 22:23:37 15.98% carryFusedMul : 3180 us/call x 4978 calls
2019-09-21 22:23:37 7.44% fftMiddleIn : 737 us/call x 10000 calls
2019-09-21 22:23:37 7.40% fftMiddleOut : 733 us/call x 10000 calls
2019-09-21 22:23:37 7.11% transposeW : 704 us/call x 10000 calls
2019-09-21 22:23:37 6.98% transposeH : 692 us/call x 10000 calls
2019-09-21 22:23:37 Total time 99.049 s
2019-09-21 22:25:20 226000127 P1 40000 1.53%; 10257 us/sq; ETA 0d 07:20; 0bb8613655726c69
2019-09-21 22:25:20 37.45% tailFused : 3726 us/call x 10000 calls
2019-09-21 22:25:20 17.57% carryFused : 3504 us/call x 4989 calls
2019-09-21 22:25:20 16.09% carryFusedMul : 3197 us/call x 5009 calls
2019-09-21 22:25:20 7.45% fftMiddleIn : 741 us/call x 10000 calls
2019-09-21 22:25:20 7.40% fftMiddleOut : 737 us/call x 10000 calls
2019-09-21 22:25:20 7.08% transposeW : 704 us/call x 10000 calls
2019-09-21 22:25:20 6.95% transposeH : 692 us/call x 10000 calls
2019-09-21 22:25:20 Total time 99.499 s
2019-09-21 22:25:28 Stopping, please wait..
2019-09-21 22:25:29 Exiting because "stop requested"
2019-09-21 22:25:29 Bye
2019-09-21 22:25:52 Note: no config.txt file found
2019-09-21 22:25:52 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -time -yield
2019-09-21 22:25:52 226000127 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.40 bits/word
2019-09-21 22:25:52 OpenCL args "-DEXP=226000127u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xc.2ae2830a9093p-3 -DIWEIGHT_STEP=0xa.85125811a707p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 22:25:53
2019-09-21 22:25:53 OpenCL compilation in 31 ms
2019-09-21 22:25:57 226000127 P1 B1=1810000, B2=41630000; 2611059 bits; starting at 40801
2019-09-21 23:02:52 Note: no config.txt file found
2019-09-21 23:02:52 config: -device 0 -use ORIG_X2 -user kriesel -cpu dodo/gtx1080ti -maxAlloc 10240 -time -yield
2019-09-21 23:02:52 226000127 FFT 14336K: Width 256x4, Height 256x4, Middle 7; 15.40 bits/word
2019-09-21 23:02:53 OpenCL args "-DEXP=226000127u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0xc.2ae2830a9093p-3 -DIWEIGHT_STEP=0xa.85125811a707p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-21 23:02:53
2019-09-21 23:02:53 OpenCL compilation in 15 ms
2019-09-21 23:02:57 226000127 P1 B1=1810000, B2=41630000; 2611059 bits; starting at 40801
Last fiddled with by kriesel on 2019-09-22 at 04:18
#1396
"Mihai Preda"
Apr 2015
3×457 Posts
Quote:
There's no need to wait 10 minutes: if it doesn't show the usual progress, or does not react to Ctrl-C, it's broken.
#1397
"Mihai Preda"
Apr 2015
3×457 Posts