ROCm 2.10 sclk 3 mem 1150 FFT 5632K gpuowl v6.11-134-g1e0ce1d 800 us/it 194.00W 93C
I am still a bit scared to upgrade to ROCm 3.1... Well, I tried to upgrade and now I get:
[code]
./gpuowl -device 1
2020-03-16 17:10:53 gpuowl v6.11-197-g3886a11
2020-03-16 17:10:53 Note: not found 'config.txt'
2020-03-16 17:10:53 config: -device 1
2020-03-16 17:10:53 device 1, unique id 'f582388172fd5d41'
2020-03-16 17:10:53 f582388172fd5d41 worktodo.txt line ignored: ""
2020-03-16 17:10:53 f582388172fd5d41 999xxxxx FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word
Segmentation fault
[/code]
[code]
uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
[/code] |
[QUOTE=paulunderwood;539853]ROCm 2.10 sclk 3 mem 1150 FFT 5632K gpuowl v6.11-134-g1e0ce1d 800 us/it 194.00W 93C
I am still a bit scared to upgrade to ROCm 3.1... Well, I tried to upgrade and now I get:
[code]
./gpuowl -device 1
2020-03-16 17:10:53 gpuowl v6.11-197-g3886a11
2020-03-16 17:10:53 Note: not found 'config.txt'
2020-03-16 17:10:53 config: -device 1
2020-03-16 17:10:53 device 1, unique id 'f582388172fd5d41'
2020-03-16 17:10:53 f582388172fd5d41 worktodo.txt line ignored: ""
2020-03-16 17:10:53 f582388172fd5d41 999xxxxx FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word
Segmentation fault
[/code]
[code]
uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
[/code][/QUOTE] I am willing to try a different distro for ROCm 3.1 -- any suggestions? |
[QUOTE=preda;539852]ROCm 3.1, sclk 3, mem 1180, FFT 5M: 708us/it. (150W)[/QUOTE]
[QUOTE=paulunderwood;539853]ROCm 2.10 sclk 3 mem 1150 FFT 5632K gpuowl v6.11-134-g1e0ce1d 800 us/it 194.00W 93C[/QUOTE] Are those with 1 worker or 2? Also, what OS distro are you guys running? As I noted, I am not allowed, even as su, to fiddle the mem-clock settings in my ROCm 2.10 setup under Ubuntu 19.10. |
[QUOTE=ewmayer;539872]Are those with 1 worker or 2? Also, what OS distro are you guys running? As I noted, I am not allowed, even as su, to fiddle the mem-clock settings in my ROCm 2.10 setup under Ubuntu 19.10.[/QUOTE]
I wrestled back the ROCm 2.10.0 driver... With 2 gpuowl instances (with same settings (sclk 3 etc, but with 5 extra Watts and gpuowl v6.11-197-g3886a11-dirty)) I am getting ~1475 us/it each. Thanks for prompting me Ernst -- a great speed-up :tu: |
[QUOTE=paulunderwood;539862]I am willing to try a different distro for ROCm 3.1 -- any suggestions?[/QUOTE]
I'm using Ubuntu 19.10 with Linux kernel 5.4.24. I also tried kernels 5.5.x, 5.6.x and they work too. |
[QUOTE=preda;539879]I'm using Ubuntu 19.10 with Linux kernel 5.4.24. I also tried kernels 5.5.x, 5.6.x and they work too.[/QUOTE]
Anyone get rocm 3.1 to work on Ubuntu 19.04? I've tried 3 times without success. |
-O2 or not
[QUOTE=paulunderwood;539876]
With 2 gpuowl instances (with same settings (sclk 3 etc, but with 5 extra Watts and gpuowl v6.11-197-g3886a11-dirty)) I am getting ~1475 us/it each. Thanks for prompting me Ernst -- a great speed-up :tu:[/QUOTE] This was compiled without -O2 in the Makefile. With it, the iterations are 1487us and the power usage is 1 or 2 Watts lower. |
[QUOTE=Prime95;539882]Anyone get rocm 3.1 to work on Ubuntu 19.04? I've tried 3 times without success.[/QUOTE]
I would expect Ubuntu 19.04 to be pretty similar to 19.10 from ROCm POV (more important would be the kernel version). What step is failing? I mentioned here [url]https://github.com/RadeonOpenCompute/ROCm/issues/977[/url] that I had to install libncurses5 too. |
Using not yet committed code:
Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it. Running 2 instances: 604us/it (200W measured by rocm-smi) I love this GPU. |
[QUOTE=Prime95;539912]Using not yet committed code:
Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it. Running 2 instances: 604us/it (200W measured by rocm-smi) I love this GPU.[/QUOTE] How many GHzD/d does that translate to? 400-ish? :shock: EDIT:- Probably more like high 400, low 500! |
[QUOTE=axn;539927]How many GHzD/d does that translate to? 400-ish? :shock:
EDIT:- Probably more like high 400, low 500![/QUOTE] 510 PRP-GHzD/d |
[QUOTE=Prime95;539955]510 PRP-GHzD/d[/QUOTE]where did you buy that beauty, and whose brand is it? (Just completed the RMA/refund process on my second one.)
|
[QUOTE=kriesel;539958]where did you buy that beauty, and whose brand is it? (Just completed the RMA/refund process on my second one.)[/QUOTE]
All GPUs except one range from 602us to 615us. The one outlier is 630us running in I7-860 (not a Sandy Bridge as I originally reported) which is not PCIE 3.0. |
[QUOTE=Prime95;539955]510 PRP-GHzD/d[/QUOTE]
20 GPUs = 1 Curtis Cooper |
[QUOTE=Prime95;539960]20 GPUs = 1 Curtis Cooper[/QUOTE]
Is that the next unit like a P90 year? |
[QUOTE=Prime95;539912]Using not yet committed code:
Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it. Running 2 instances: 604us/it (200W measured by rocm-smi) I love this GPU.[/QUOTE] Nice - how much % gain is that over current commit, and did you also do timings @5632K? And, have you tried running > 2 instances to see if there is any further marginal throughput gain to be had that way? [b]Edit:[/b] Just tried the latter experiment - but not using George's uncommitted code, obviously - on my own machine, here the timing/throughput figure for 1-3 workers, all @5632K FFT, sclk = 5:
1: 754 us/iter => 1326 iter/sec
2: 1405 us/iter => 1423 iter/sec
3: 2174 us/iter => 1380 iter/sec
So, deterioration above 2 workers. |
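For anyone redoing this comparison on their own card: with N workers each taking t microseconds per iteration, the combined rate is N * 1e6 / t iterations per second. A minimal Python sketch of that arithmetic, fed with the three timings above (the function name is made up for illustration, nothing from gpuowl):

```python
def total_iters_per_sec(workers: int, us_per_iter: float) -> float:
    """Combined throughput when each of `workers` jobs needs
    `us_per_iter` microseconds per iteration."""
    return workers * 1_000_000 / us_per_iter


# Per-worker timings at 5632K FFT, sclk = 5, as reported in the post above.
timings = {1: 754, 2: 1405, 3: 2174}

for workers, us in timings.items():
    rate = total_iters_per_sec(workers, us)
    print(f"{workers}: {us} us/iter => {rate:.0f} iter/sec")

# Worker count with the highest combined throughput.
best = max(timings, key=lambda w: total_iters_per_sec(w, timings[w]))
print("best worker count:", best)
```

The same function works for any other sclk/mclk setting; only the `timings` dictionary changes.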
[QUOTE=ewmayer;539973]Nice - how much % gain is that over current commit, and did you also do timings @5632K?
And, have you tried running > 2 instances to see if there is any further marginal throughput gain to be had that way? [b]Edit:[/b] Just tried the latter experiment - but not using George's uncommitted code, obviously - on my own machine, here the timing/throughput figure for 1-3 workers, all @5632K FFT, sclk = 5:
1: 754 us/iter => 1326 iter/sec
2: 1405 us/iter => 1423 iter/sec
3: 2174 us/iter => 1380 iter/sec
So, deterioration above 2 workers.[/QUOTE] 5632K is the FFT size I can relate to also. 5632K, sclk=5, 1000MHz memclk, 185W as measured by rocm, one worker, older version of gpuowl: 872 us/iter |
[QUOTE=PhilF;539983]5632K is the FFT size I can relate to also.
5632K, sclk=5, 1000MHz memclk, 185W as measured by rocm, one worker, older version of gpuowl: 872 us/iter[/QUOTE] Your mem-downclock is likely the reason you run both slower and at significantly lower power than I do, at the same sclk and 1-worker setting. But why not fire up a second worker? |
gpuowl-win v6.11-198-g build and initial speed checks
2 Attachment(s)
The usual warning shower, see build log, but it runs.
Win7 64 Pro, dual E5645, prime95 maxed, 12GB ram
RX550: V6.11-134[CODE]2020-03-17 13:25:39 condorella/rx550 93873049 OK 60600000 64.56%; 14442 us/it; ETA 5d 13:29; 7c9ef8f79b678f5e (check 5.82s)
2020-03-17 14:13:57 condorella/rx550 93873049 OK 60800000 64.77%; 14458 us/it; ETA 5d 12:50; cf3a3470216c1801 (check 5.86s)
2020-03-17 15:02:12 condorella/rx550 93873049 OK 61000000 64.98%; 14448 us/it; ETA 5d 11:56; 975f4c7dd6bd8513 (check 5.81s)
2020-03-17 15:50:28 condorella/rx550 93873049 OK 61200000 65.19%; 14452 us/it; ETA 5d 11:10; f2e39e36a1a45ad8 (check 5.83s)
2020-03-17 15:53:43 condorella/rx550 Stopping, please wait..
2020-03-17 15:53:55 condorella/rx550 93873049 OK 61214000 65.21%; 14379 us/it; ETA 5d 10:27; cb89cd1c515d4a12 (check 5.82s)
2020-03-17 15:53:55 condorella/rx550 Exiting because "stop requested"
2020-03-17 15:53:55 condorella/rx550 Bye[/CODE][CODE]C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-198-g628f3cd\rx550>gpuowl-win
2020-03-17 15:54:11 gpuowl v6.11-198-g628f3cd
2020-03-17 15:54:11 config: -device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM,UNROLL_HEIGHT,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,CARRY32,ORIGINAL_METHOD,LESS_ACCURATE
2020-03-17 15:54:11 device 1, unique id ''
2020-03-17 15:54:11 condorella/rx550 93873049 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.90 bits/word
2020-03-17 15:54:13 condorella/rx550 Warning: -use LESS_ACCURATE has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use MERGED_MIDDLE has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use ORIGINAL_METHOD has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use T2_SHUFFLE_HEIGHT has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use T2_SHUFFLE_MIDDLE has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use WORKINGOUT2 has no effect
2020-03-17 15:54:13 condorella/rx550 OpenCL args "-DEXP=93873049u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.8b9afd7da35e8p-3 -DIWEIGHT_STEP=0xe.fa9b7f6844848p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY32=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIGINAL_METHOD=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_MIDDLE=1 -DUNROLL_HEIGHT=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-17 15:54:17 condorella/rx550 OpenCL compilation in 4.21 s
2020-03-17 15:54:24 condorella/rx550 93873049 OK 61200000 loaded: blockSize 400, f2e39e36a1a45ad8
2020-03-17 15:54:42 condorella/rx550 93873049 OK 61200800 65.20%; 15423 us/it; ETA 5d 19:58; 978b866258bcc6ff (check 6.02s)
2020-03-17 16:44:59 condorella/rx550 93873049 OK 61400000 65.41%; 15117 us/it; ETA 5d 16:22; bba7f7db066343fb (check 6.13s)[/CODE]Same exponent, v6.11-198 is slower than v6.11-134 on RX550: 1-14453/15117 = 4.4% slower.
RX480: V6.11-134[CODE]2020-03-17 14:49:29 condorella/rx480 94073297 OK 36600000 38.91%; 3580 us/it; ETA 2d 09:10; 8fe30d16593f9dd3 (check 1.51s)
2020-03-17 15:01:47 condorella/rx480 94073297 OK 36800000 39.12%; 3685 us/it; ETA 2d 10:37; d3eed3f44f9e2a97 (check 1.47s)
2020-03-17 15:14:07 condorella/rx480 94073297 OK 37000000 39.33%; 3691 us/it; ETA 2d 10:31; 416999f765247350 (check 1.50s)
2020-03-17 15:26:15 condorella/rx480 94073297 OK 37200000 39.54%; 3631 us/it; ETA 2d 09:22; c8d58ee219203f21 (check 1.53s)
2020-03-17 15:38:16 condorella/rx480 94073297 OK 37400000 39.76%; 3600 us/it; ETA 2d 08:41; 94e4dcc24e50b2bb (check 1.50s)
2020-03-17 15:50:12 condorella/rx480 94073297 OK 37600000 39.97%; 3574 us/it; ETA 2d 08:04; 062e7035793975f0 (check 1.56s)
2020-03-17 16:02:07 condorella/rx480 94073297 OK 37800000 40.18%; 3570 us/it; ETA 2d 07:48; 30335ecba13bf8e9 (check 1.47s)
2020-03-17 16:14:16 condorella/rx480 94073297 OK 38000000 40.39%; 3637 us/it; ETA 2d 08:39; 650cb74a2a044221 (check 1.48s)
2020-03-17 16:15:34 condorella/rx480 Stopping, please wait..
2020-03-17 16:15:37 condorella/rx480 94073297 OK 38022400 40.42%; 3536 us/it; ETA 2d 07:03; 5f6a5d73eb842b21 (check 1.47s)
2020-03-17 16:15:37 condorella/rx480 Exiting because "stop requested"
2020-03-17 16:15:37 condorella/rx480 Bye[/CODE]V6.11-198:[CODE]2020-03-17 16:17:37 gpuowl v6.11-198-g628f3cd
2020-03-17 16:17:37 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG
2020-03-17 16:17:37 config:
2020-03-17 16:17:37 config: :4.5m fft NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE
2020-03-17 16:17:37 config: :5m fft NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG
2020-03-17 16:17:37 device 0, unique id ''
2020-03-17 16:17:37 condorella/rx480 94073297 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.94 bits/word
2020-03-17 16:17:40 condorella/rx480 Warning: -use CHEBYSHEV_MIDDLEMUL2 has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use MERGED_MIDDLE has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use MORE_SQUARES_MIDDLEMUL1 has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use T2_SHUFFLE_HEIGHT has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use T2_SHUFFLE_WIDTH has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use WORKINGOUT1 has no effect
2020-03-17 16:17:40 condorella/rx480 OpenCL args "-DEXP=94073297u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.52733b011536p-3 -DIWEIGHT_STEP=0xf.617b45b852608p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNEW_SLOWTRIG=1 -DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_WIDTH=1 -DUNROLL_HEIGHT=1 -DUNROLL_WIDTH=1 -DWORKINGIN1=1 -DWORKINGOUT1=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-17 16:17:44 condorella/rx480 OpenCL compilation in 4.10 s
2020-03-17 16:17:46 condorella/rx480 94073297 OK 37826000 loaded: blockSize 400, febfc14d0f24e5fa
2020-03-17 16:17:50 condorella/rx480 94073297 OK 37826800 40.21%; 3461 us/it; ETA 2d 06:04; b5a114da61bfa21e (check 1.55s)
2020-03-17 16:28:31 condorella/rx480 94073297 OK 38000000 40.39%; 3692 us/it; ETA 2d 09:31; 650cb74a2a044221 (check 1.51s)
2020-03-17 16:40:53 condorella/rx480 94073297 OK 38200000 40.61%; 3703 us/it; ETA 2d 09:28; 39ce25c654678cec (check 1.51s)[/CODE]1-3617/3697 = .0216 = 2.16% slower v6.11-198 than v6.11-134 on RX480
Downclocked radeon VII: v6.11-134[CODE]2020-03-17 17:43:24 roa/radeonvii 655685803 OK 626640000 95.57%; 10363 us/it; ETA 3d 11:37; 6fcd5d380e08a691 (check 5.99s) 20 errors
2020-03-17 17:46:57 roa/radeonvii 655685803 OK 626660000 95.57%; 10378 us/it; ETA 3d 11:40; 08e9287564d158a0 (check 5.94s) 20 errors
2020-03-17 17:50:30 roa/radeonvii 655685803 OK 626680000 95.58%; 10363 us/it; ETA 3d 11:30; cc909f1a064cff84 (check 5.88s) 20 errors
2020-03-17 17:51:49 roa/radeonvii Stopping, please wait..
2020-03-17 17:51:59 roa/radeonvii 655685803 OK 626688000 95.58%; 10369 us/it; ETA 3d 11:31; ceaa28d44748f0e7 (check 5.89s) 20 errors
2020-03-17 17:51:59 roa/radeonvii Exiting because "stop requested"
2020-03-17 17:51:59 roa/radeonvii Bye[/CODE]V6.11-198-g628f3cd[CODE]C:\Users\ken\Documents\gpuowl-v6.11-198-g628f3cd>gpuowl-win
2020-03-17 17:54:57 gpuowl v6.11-198-g628f3cd
2020-03-17 17:54:57 config: -device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use -device 1 -user kriesel -cpu roa/radeonvii -use NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-17 17:54:57 config: ;NO_ASM,ORIG_SLOWTRIG
2020-03-17 17:54:57 device 1, unique id ''
2020-03-17 17:54:57 roa/radeonvii 655685803 FFT 40960K: Width 256x4, Height 256x8, Middle 10; 15.63 bits/word
2020-03-17 17:55:07 roa/radeonvii Warning: -use CHEBYSHEV_METHOD has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use LESS_ACCURATE has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use MERGED_MIDDLE has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use ORIG_MIDDLEMUL2 has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use T2_SHUFFLE_HEIGHT has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use T2_SHUFFLE_REVERSELINE has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use UNROLL_MIDDLEMUL2 has no effect
2020-03-17 17:55:07 roa/radeonvii OpenCL args "-DEXP=655685803u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=10u -DWEIGHT_STEP=0xa.51aa7280d93dp-3 -DIWEIGHT_STEP=0xc.677fd3dfd408p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-17 17:55:22 roa/radeonvii OpenCL compilation in 15.11 s
2020-03-17 17:55:31 roa/radeonvii 655685803 OK 626680000 loaded: blockSize 400, cc909f1a064cff84
2020-03-17 17:55:45 roa/radeonvii 655685803 OK 626680800 95.58%; 10753 us/it; ETA 3d 14:38; 47b82db06405ee2a (check 6.07s) 20 errors
2020-03-17 17:59:18 roa/radeonvii 655685803 OK 626700000 95.58%; 10769 us/it; ETA 3d 14:42; 810fe9130840b14d (check 6.08s) 20 errors
2020-03-17 18:02:59 roa/radeonvii 655685803 OK 626720000 95.58%; 10757 us/it; ETA 3d 14:33; 3294d8e8770b2911 (check 6.22s) 20 errors
[/CODE]1- 10370/10763 = 3.65% slower v6.11-198 than v6.11-134 on the same Radeon VII and exponent, same clock settings. |
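The "1 - old/new" comparisons above can be repeated on any pair of logs without averaging by hand. A rough Python sketch, assuming only that progress lines contain a field of the form "NNNN us/it" as in the logs above; the helper names are invented here, and it averages every reported line, including the short first interval after a restart, so its result may differ slightly from a hand-picked average:

```python
import re

# Matches the per-iteration timing field in gpuowl progress lines, e.g. "10363 us/it".
US_PER_IT = re.compile(r"(\d+) us/it")


def mean_us_per_it(log_text: str) -> float:
    """Average of all 'NNNN us/it' values found in a block of log text."""
    values = [int(v) for v in US_PER_IT.findall(log_text)]
    if not values:
        raise ValueError("no 'us/it' timings found")
    return sum(values) / len(values)


def slowdown(old_us: float, new_us: float) -> float:
    """Fraction of throughput lost going from old_us to new_us (1 - old/new)."""
    return 1 - old_us / new_us


# Radeon VII lines copied from the post above.
old_log = """
2020-03-17 17:46:57 roa/radeonvii 655685803 OK 626660000 95.57%; 10378 us/it; ETA 3d 11:40; 08e9287564d158a0 (check 5.94s) 20 errors
2020-03-17 17:50:30 roa/radeonvii 655685803 OK 626680000 95.58%; 10363 us/it; ETA 3d 11:30; cc909f1a064cff84 (check 5.88s) 20 errors
"""
new_log = """
2020-03-17 17:59:18 roa/radeonvii 655685803 OK 626700000 95.58%; 10769 us/it; ETA 3d 14:42; 810fe9130840b14d (check 6.08s) 20 errors
2020-03-17 18:02:59 roa/radeonvii 655685803 OK 626720000 95.58%; 10757 us/it; ETA 3d 14:33; 3294d8e8770b2911 (check 6.22s) 20 errors
"""

pct = 100 * slowdown(mean_us_per_it(old_log), mean_us_per_it(new_log))
print(f"{pct:.2f}% slower")
```

Note that "1 - old/new" measures lost throughput; the time per iteration grows by the slightly larger "new/old - 1".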
[QUOTE=ewmayer;539987]Your mem-downclock is likely the reason you both run slower and at significantly lower power than I, at the same sclk and 1-worker setting. But why not fire up a second worker?[/QUOTE]
I will, once I can make time for it. :smile: |
[QUOTE=PhilF;539992]I will, once I can make time for it. :smile:[/QUOTE]
I used the low-tech method:
1. open 2nd term;
2. create 2nd rundir under gpuowl-dir, cd into same;
3. populate worktodo by running primenet.py (which I see is based on the same-name py-script of another GIMPS coder who shall remain nameless, but who restored the same "run just once and quit rather than recurring by adding '-t 0'" option which exists in his original script to the gpuowl one :);
4. ../gpuowl
I use same system sclk and fan settings for 2-worker-running as for 1. |
[QUOTE=ewmayer;539996]I used the low-tech method:
1. open 2nd term;
2. create 2nd rundir under gpuowl-dir, cd into same;
3. populate worktodo by running primenet.py (which I see is based on the same-name py-script of another GIMPS coder who shall remain nameless, but who restored the same "run just once and quit rather than recurring by adding '-t 0'" option which exists in his original script to the gpuowl one :);
4. ../gpuowl
I use same system sclk and fan settings for 2-worker-running as for 1.[/QUOTE] Do you know about -pool option? You create a single folder where you put the output of primenet.py, and run multiple instances, each in its own folder (indicated with -dir) all feeding from the common pool (indicated with -pool). You can also put config files in the pool dir, for shared config. Next logical step is to put the -pool option in the config file of each individual instance, and now you can start it with only -dir. I'm afraid all that is not very clear, so let's see an example config:
~/gpuowl-xfx/config.txt contains:
-cpu XFX -uid 780c28cffffffeee -pool /home/user/pool
(note that the gpu is indicated not by -d <position> but by UID, which is very useful when shuffling GPUs around. A symbolic name "XFX" is associated to it).
~/pool/config.txt contains:
-user name
(because the user is shared by all instances)
primenet.py only knows about the pool dir:
~/gpuowl/tools/primenet.py -u <user> -p <password> --dirs ~/pool --tasks 4 -w PRP & |
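The layout described above (one shared pool dir plus one dir per instance, each with its own config.txt) is easy to generate with a script. A minimal Python sketch, using only the -user, -cpu, -uid and -pool options quoted in the post; the function name and the "gpuowl-<name>" directory naming are assumptions for illustration, and the demo writes into a temporary directory rather than your home dir:

```python
import tempfile
from pathlib import Path


def write_pool_layout(base: Path, user: str, instances: dict) -> None:
    """Create <base>/pool/config.txt with the shared -user option, and one
    <base>/gpuowl-<name>/config.txt per GPU holding -cpu, -uid and -pool.

    `instances` maps a symbolic name (e.g. "XFX") to the GPU unique id.
    """
    pool = base / "pool"
    pool.mkdir(parents=True, exist_ok=True)
    (pool / "config.txt").write_text(f"-user {user}\n")

    for name, uid in instances.items():
        inst = base / f"gpuowl-{name.lower()}"
        inst.mkdir(parents=True, exist_ok=True)
        (inst / "config.txt").write_text(f"-cpu {name} -uid {uid} -pool {pool}\n")


# Demo in a throwaway directory, with the name and UID from the example above.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    write_pool_layout(base, "name", {"XFX": "780c28cffffffeee"})
    for cfg in sorted(base.glob("*/config.txt")):
        print(cfg.relative_to(base), "->", cfg.read_text().strip())
```

Each instance would then be started with only -dir pointing at its own folder, as described in the post, while primenet.py is pointed at the pool dir.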
[QUOTE=kriesel;539991]1- 10370/10763 = 3.65% slower v6.11-198 than v6.11-134 on the same Radeon VII and exponent, same clock settings.[/QUOTE]
There are warnings there telling you that you are passing -use options that don't exist (because they have been removed). Some have been replaced by more general ones, e.g. there is still a T2_SHUFFLE that you may want to try. But a bigger benefit would be to move to ROCm (either 2.10 or 3.1). |
[QUOTE=preda;540029]There are warnings there telling you that you are passing -use options that don't exist (because they have been removed), some have been replaced by more general e.g. there is still a T2_SHUFFLE that you may want to try. But a bigger benefit would be to move to ROCm (either 2.10 or 3.1).[/QUOTE]Moving to rocm isn't an option on Windows. [url]https://rocm.github.io/[/url]
Documentation of what -use options are available would be helpful. |
[QUOTE=preda;540027]Do you know about -pool option? You create a single folder where you put the output of primenet.py , and run multiple instances, each in its own folder (indicated with -dir) all feeding from the common pool (indicated with -pool). You can also put config files in the pool dir, for shared config. Next logical step is to put the -pool option in the config file of each individual instance, and now you can start it with only -dir.
[snip][/QUOTE] Thanks, Mihai, but after a few more lines your recipe was already more complex than my neoluddite recipe. :) Perhaps I'll find it useful when I build my multi-GPU dream system later this year. To paraphrase the late great SNL comedian Phil Hartman by way of his recurring [i]Unfrozen Caveman Lawyer[/i] sketches: "I'm just a *caveman* - the ways of you modern humans are strange and unfathomable to me. (But what I do know is that my client deserves at least a $5 million triple-damages settlement for injuries and psychological trauma resulting from the defendant's spilling ketchup on him.)" |
[QUOTE=ewmayer;539872]Are those with 1 worker or 2? Also, what OS distro are you guys running? As I noted, I am not allowed, even as su, to fiddle the mem-clock settings in my ROCm 2.10 setup under Ubuntu 19.10.[/QUOTE]
See posts in [url]https://mersenneforum.org/showthread.php?t=22204&page=149[/url] I am running at sclk 3 and have just bumped up my ASUS's memory to 1200 and voltage at 830. 187W and no errors yet... I will lower the voltage if I can over time. 2 instances @ 1467 us/it each for FFT 5632K |
[QUOTE=paulunderwood;540097]See posts in [url]https://mersenneforum.org/showthread.php?t=22204&page=149[/url]
I am running at sclk 3 and have just bumped up my ASUS's memory to 1200 and voltage at 830. 187W and no errors yet... I will lower the voltage if I can over time. 2 instances @ 1467 us/it each for FFT 5632K[/QUOTE] Thanks - I long ago did the featuremask fiddle. My issue is not a missing pp_od_clk_voltage entry; it is that Ubuntu is not allowing me to modify it. Not a huge deal, just eating the extra Watts and running with mclk at stock. Could become an issue when I build that multi-GPU dream system later this year, though - have an 850W[sup]*[/sup] PS laid in, intended to drive something along the lines of an 8-core AMD CPU plus 3 Radeon VIIs. That will likely need sclk = 3 undervolting of the latter to get the wattage within what the PS can stably handle, would be nice to be able to tune mclk to help maximize throughput of the setup by way of another "tuning dial".
-------------
[sup]*[/sup]That seemed to be the sweet spot in terms of $/watt at $120, all the >= 1KW PSs I looked at cost well over $200. Plus I don't want to be running a system needing more than a kW, our household circuit breakers start tripping at that level when anything else that is power-hungry (e.g. toaster, hair dryer) running off the same part of the "household grid" gets turned on. |
[QUOTE=kriesel;540034]Moving to rocm isn't an option on Windows. [url]https://rocm.github.io/[/url]
Documentation of what -use options are available would be helpful.[/QUOTE] There is some brief documentation at the top of gpuowl.cl, that we try to maintain in sync with the code as it changes: [QUOTE]
DEBUG : enable asserts. Slow, but allows to verify that all asserts hold.
NO_ASM : request to not use any inline __asm()
NO_OMOD: do not use GCN output modifiers in __asm()
OUT_WG,OUT_SIZEX,OUT_SPACING <AMD default is 256,32,4> <nVidia default is 256,4,1 but needs testing>
IN_WG,IN_SIZEX,IN_SPACING <AMD default is 256,32,1> <nVidia default is 256,4,1 but needs testing>
ORIG_X2 <nVidia default>
INLINE_X2 <AMD default>
UNROLL_ALL <nVidia default>
UNROLL_NONE
UNROLL_WIDTH
UNROLL_HEIGHT <AMD default>
T2_SHUFFLE <nVidia default>
NO_T2_SHUFFLE <AMD default>
OLD_FFT8 <default>
NEWEST_FFT8
NEW_FFT8
OLD_FFT5
NEW_FFT5 <default>
NEWEST_FFT5
NEW_FFT10 <default>
OLD_FFT10
CARRY32 <AMD default> // This is potentially dangerous option for large FFTs. Carry may not fit in 31 bits.
CARRY64 <nVidia default>
ORIG_SLOWTRIG // Use the compiler's implementation of sin/cos functions
NEW_SLOWTRIG <default> // Our own sin/cos implementation
---- P-1 below ----
NO_P2_FUSED_TAIL // Do not use the big kernel tailFusedMulDelta
[/QUOTE] |
[QUOTE=Prime95;539912]Using not yet committed code:
Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it. Running 2 instances: 604us/it (200W measured by rocm-smi) I love this GPU.[/QUOTE] Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us. |
[QUOTE=Prime95;540173]Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us.[/QUOTE]
Single instance or two instance combined? |
[QUOTE=axn;540195]Single instance or two instance combined?[/QUOTE]
The two instance combined. |
[QUOTE=Prime95;540173]Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us.[/QUOTE]
So just 'git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make', then halt/restart ongoing runs? |
[QUOTE=ewmayer;540279]So just 'git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make', then halt/restart ongoing runs?[/QUOTE]
Yes. "git clone" only the first time, afterwards you can "git pull" in the existing dir. "scons" can be used as an alternative to "make" (I build myself with scons), but either should work. |
[QUOTE=preda;540285]Yes.
"git clone" only the first time, afterwards you can "git pull" in the existing dir. "scons" can be used as an alternative to "make" (I build myself with scons), but either should work.[/QUOTE] Thanks - timing for my pair of side-by-side jobs at 5632K FFT and sclk=4 dropped from 1475 us/iter (for each job) to 1387 us/iter, 6.3% faster. So now I'm getting slightly better throughput at sclk=4 than I was before at sclk=5. Comparing apples-to-apples at sclk=5, before was 2 jobs each @1405 us/iter, with the new build down to 1331, 5.6% faster. But sclk=4 saves 60 watts ... hmm, tough choice. I'll probably run at sclk=4 on warm days, sclk=5 otherwise and at night. Nice work, guys! I hope to begin contributing more substantively later this year, rather than just running code and cheerleading. |
__attribute__(overloadable) support
I would like to start using __attribute__((overloadable)) in gpuowl OpenCL source, but before that I'd like to find out whether it's supported everywhere we care.
The attribute is described here: [url]https://clang.llvm.org/docs/AttributeReference.html#overloadable[/url]
I would like confirmation that it works on these platforms:
- windows (with whatever OpenCL windows uses for AMD GPUs -- catalyst?)
- Nvidia
- amdgpuPro (the other driver for Linux vs. ROCm)
To check the attribute, simply add "__attribute__((overloadable))" to some function between the return type and function name, e.g.: in gpuowl.cl
Replace
T2 mul(T2 a, T2 b) ...
with
T2 __attribute__((overloadable)) mul(T2 a, T2 b) ...
And recompile, and afterwards *run* the resulting gpuowl to check the OpenCL compilation that happens at startup. Thanks!
Note: the title should read "__attribute__((overloadable))", double parens. |
:tu:
With latest commit, running 1 1200 800 1050 3 @ FFT 5632K, timings have sped up for 2 instances from 1473 us/it to 1400 us/it each |
[QUOTE=paulunderwood;540308]:tu:
With latest commit, running 1 1200 800 1050 3 @ FFT 5632K, timings have sped up for 2 instances from 1473 us/it to 1400 us/it each[/QUOTE] You're gonna need that extra speed - I'm a mere 60,000 GHz-days behind you in the Top500, roughly equivalent to 150 PRP-tests @5632K. :) |
Possible bug- -cleanup works for PRP tests but not P-1 for me.
|
Had 2 side-by-side runs going on my Radeon VII - a PRP run, and a p-1 ... the latter just crashed:
[code]2020-03-23 15:03:11 gfx906+sram-ecc-0 102958243 P2 2394/2880: 174743 primes; setup 4.24 s, 2.271 ms/prime
Memory access fault by GPU node-1 (Agent handle: 0x562cd3ec2150) on address 0xb5f9ee28000. Reason: Unknown.
Aborted (core dumped)[/code]
Both jobs now appear to be halted as result ... wait, the PRP is still *running* but somehow got tripped into super-low priority - I saw the same kind of MCLK-suddenly-gets-cut-by-2/3 yesterday, result of some kind of GPU glitch, that needed reboot to resolve at the time:
[code]GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
1 37.0c 24.0W 809Mhz 351Mhz 21.96% manual 250.0W 2% 0%[/code]
Just restarted the p-1 run, it's re-doing the entire stage 2 ... MCLK now back to normal (1001 MHz). Will update re. what happens with the p-1 stage 2 retry once it finishes. This is gpuowl v6.11-211-gca63aa9-dirty.
[b]Edit:[/b] Misspoke - p-1 stage 2 picked up at ~90% of the way through ("P2 2736/2880"), and retry completed successfully. So it appears to have been a one-off glitch in the matrix. Also oddly, on restart of the p-1 job MCLK got reset back to normal, but SCLK, which I had manually downclocked to 4, somehow got reset not to its default level (IIRC, 7) but to 5, as reflected by wall wattage, fan noise and GPU temperature. Reset to 4, all appears back to normal. |
[QUOTE=ewmayer;540693]Had 2 side-by-side runs going on my Radeon VII - a PRP run, and a p-1.[/QUOTE]I'd be skeptical about the performance advantage of running too disparate parallel runs. I've seen it reduce throughput. PRP & LL in tandem, for example, which is different code from different versions.
Did you use -maxAlloc for your P-1 run? If not, start, and if doing parallel runs the limit will need to be lower than if the P-1 stage 2 has the gpu ram to itself. So can we count you as another fan of P-1 save files? |
[QUOTE=kriesel;540709]I'd be skeptical about the performance advantage of running too disparate parallel runs. I've seen it reduce throughput. PRP & LL in tandem, for example, which is different code from different versions.
Did you use -maxAlloc for your P-1 run? If not, start, and if doing parallel runs the limit will need to be lower than if the P-1 stage 2 has the gpu ram to itself.[/QUOTE] I was just running 2 separate PRP-assignment jobs - for the PRPs there is a marked throughput boost from 2-job-running (cf. my timings in post #1956) - one of which just happened to start on a PRP-assignment for which p-1 had not yet been done. Not using -maxAlloc. [QUOTE]So can we count you as another fan of P-1 save files?[/QUOTE] I'm a fan of doing whatever works for increasing users' overall throughput! :) That of course includes minimizing wasted time resulting from run-crashes/BSODs/system-resets/etc. |
GpuOwl P-1 error detection and handling
Gpuowl stage 1 needs a res64 error check. This was in v6.11-134.[CODE]2020-03-25 00:57:17 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word
2020-03-25 00:57:25 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-25 00:57:31 roa/radeonvii OpenCL compilation in 5.88 s
2020-03-25 00:57:34 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 0
2020-03-25 00:59:08 roa/radeonvii 550000007 P1 10000 0.14%; 9433 us/it; ETA 0d 19:09; c8b2127abc38054b
2020-03-25 01:00:43 roa/radeonvii 550000007 P1 20000 0.27%; 9433 us/it; ETA 0d 19:07; 6f401486d14cb20f
2020-03-25 01:02:17 roa/radeonvii 550000007 P1 30000 0.41%; 9431 us/it; ETA 0d 19:05; 18a926611c75b118
2020-03-25 01:02:37 roa/radeonvii saved
...
2020-03-25 15:08:30 roa/radeonvii saved
2020-03-25 15:09:08 roa/radeonvii 550000007 P1 5390000 73.68%; 9582 us/it; ETA 0d 05:07; 4cfe624d31a00e27
2020-03-25 15:10:42 roa/radeonvii 550000007 P1 5400000 73.82%; 9428 us/it; ETA 0d 05:01; [COLOR=Red][B]0000000000000000[/B][/COLOR]
2020-03-25 15:12:16 roa/radeonvii 550000007 P1 5410000 73.95%; 9424 us/it; ETA 0d 04:59; 0000000000000000
2020-03-25 15:13:32 roa/radeonvii saved[/CODE]Fourteen hours into the computation, an error occurred that zeroed the residue. The program does not detect the error. It continued powering the zero residue for the remaining iteration count, and periodically updating its two save files with bad interim results, for 5 more hours. It then appears to skip the stage 1 GCD under the error condition detected at the end of the set of iterations. 
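Until the program checks this itself, an all-zero res64 can at least be caught from outside by watching the log. A rough Python sketch of such a watcher, assuming only the P1 progress-line layout shown above (iteration count, percentage, us/it, ETA, then a 16-hex-digit res64 at the end of the line). This is a hypothetical external helper, not anything gpuowl does, and the names are invented:

```python
import re
from typing import Iterable, Optional

# P1 progress line: "... P1 <iteration> <pct>%; <n> us/it; ETA ...; <res64>"
P1_LINE = re.compile(r"P1\s+(\d+)\s+[\d.]+%;.*;\s*([0-9a-f]{16})\s*$")


def first_zero_res64(lines: Iterable[str]) -> Optional[int]:
    """Return the iteration of the first P1 progress line whose res64 is
    all zeros, or None if no such line is seen."""
    for line in lines:
        match = P1_LINE.search(line)
        if match and int(match.group(2), 16) == 0:
            return int(match.group(1))
    return None


# Lines copied from the log above (forum colour markup removed).
sample = [
    "2020-03-25 00:57:34 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 0",
    "2020-03-25 15:09:08 roa/radeonvii 550000007 P1 5390000 73.68%; 9582 us/it; ETA 0d 05:07; 4cfe624d31a00e27",
    "2020-03-25 15:10:42 roa/radeonvii 550000007 P1 5400000 73.82%; 9428 us/it; ETA 0d 05:01; 0000000000000000",
    "2020-03-25 15:12:16 roa/radeonvii 550000007 P1 5410000 73.95%; 9424 us/it; ETA 0d 04:59; 0000000000000000",
]

bad_at = first_zero_res64(sample)
if bad_at is not None:
    print(f"zero res64 first reported at iteration {bad_at}; save files after this point are suspect")
```

Run against a live log, a hit would be the cue to stop the job and go back to a save file written before the reported iteration.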
Resume proceeds despite the bad input from the latter part of stage 1, also skipping the stage 1 GCD.[CODE]2020-03-25 20:10:24 roa/radeonvii saved 2020-03-25 20:11:58 roa/radeonvii 550000007 P1 7310000 99.93%; 9581 us/it; ETA 0d 00:01; 0000000000000000 2020-03-25 20:12:50 roa/radeonvii saved 2020-03-25 20:12:51 roa/radeonvii 550000007 P1 7315345 100.00%; 9913 us/it; ETA 0d 00:00; [COLOR=red][B]0000000000000000[/B][/COLOR] 2020-03-25 20:12:56 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%) 2020-03-25 20:12:56 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes 2020-03-25 20:12:57 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each 2020-03-25 20:31:18 roa/radeonvii 550000007 P2 38/2880: 92454 primes; setup 2.16 s, 11.881 ms/prime 2020-03-25 20:31:18 roa/radeonvii Exception St12domain_error: GCD invalid input 2020-03-25 20:31:18 roa/radeonvii waiting for background GCDs.. 
2020-03-25 20:31:18 roa/radeonvii Bye C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>g611 C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>gpuowl-win 2020-03-26 09:35:49 gpuowl v6.11-134-g1e0ce1d 2020-03-26 09:35:49 config: -device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE 2020-03-26 09:35:49 config: 2020-03-26 09:35:49 config: ;NO_ASM,ORIG_SLOWTRIG 2020-03-26 09:35:49 config: ;40M NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE 2020-03-26 09:35:49 device 1, unique id '' 2020-03-26 09:35:49 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word 2020-03-26 09:35:58 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. 
-cl-fast-relaxed-math -cl-std=CL2.0" 2020-03-26 09:36:05 roa/radeonvii OpenCL compilation in 6.68 s 2020-03-26 09:36:08 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 7315344 2020-03-26 09:36:09 roa/radeonvii 550000007 P2 B1=5070000, B2=152100000, starting at 38 2020-03-26 09:36:14 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%) 2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes 2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each 2020-03-26 09:54:35 roa/radeonvii 550000007 P2 76/2880: 92460 primes; setup 2.30 s, 11.868 ms/prime [/CODE]Since there is no periodic permanently retained save file from before the error occurred, and both the stage 1 save files are from after the unhandled error, the entire run is a loss (~33 hours wall clock). Stage 2 should not proceed from bad input from stage 1, but it does, without warning. Error checks, and a field for "passed last error check" in the save file could handle that. |
[QUOTE=kriesel;540929]Gpuowl stage 1 needs a res64 error check.[/QUOTE]
Hi Ken, I agree that was a loss. I'll look into improving this. |
Gpuowl-win v6.11-219-ge70ec99 build
Built, produced a help output, no other testing yet.
|
[QUOTE=preda;540307]I would like to start using __attribute__((overloadable)) in gpuowl OpenCL source, but before that I'd like to find out whether it's supported everywhere we care.
The attribute is described here: [URL]https://clang.llvm.org/docs/AttributeReference.html#overloadable[/URL] I would like confirmation that it works on these platforms: - windows (with whatever OpenCL windows uses for AMD GPUs -- catalyst?) - Nvidia - amdgpuPro (the other driver for Linux vs. ROCm) To check the attribute, simply add "__attribute__((overloadable))" to some function between the return type and function name, e.g.: in gpuowl.cl Replace T2 mul(T2 a, T2 b) ... with T2 __attribute__((overloadable)) mul(T2 a, T2 b) ... And recompile, and afterwards *run* the resulting gpuowl to check the OpenCL compilation that happens at startup. Thanks! Note: the title should read "__attribute__((overloadable))", double parens.[/QUOTE] AOK on AMD RX480 /Win7 x64: [CODE]// complex mul T2 __attribute__((overloadable)) mul(T2 a, T2 b) { return U2(mad1(a.x, b.x, -a.y * b.y), mad1(a.x, b.y, a.y * b.x)); } Driver version as indicated by GPU-Z: 25.20.14007.1000 (Adrenalin 18.10.21/Win 64) [/CODE][CODE]2020-03-26 17:16:48 gpuowl v6.11-219-ge70ec99-dirty 2020-03-26 17:16:48 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 2020-03-26 17:16:48 device 0, unique id '' 2020-03-26 17:16:48 condorella/rx480 97685813 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word 2020-03-26 17:16:51 condorella/rx480 OpenCL args "-DEXP=97685813u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598a6d26b0dap-3 -DIWEIGHT_STE P=0xf.546b91e1254f8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DAMDGPU=1 -cl-fast-relaxed-math -cl-std=CL2.0 " 2020-03-26 17:16:55 condorella/rx480 OpenCL compilation in 4.81 s 2020-03-26 17:16:56 condorella/rx480 97685813 P1 B1=1000000, B2=27000000; 1442134 bits; starting at 0 2020-03-26 17:17:34 condorella/rx480 97685813 P1 10000 0.69%; 3785 us/it; ETA 0d 01:30; 6bd301fd8aadd98a[/CODE]Also on Win7 x64, NVIDIA GTX1080, NVIDIA driver version 
378.92:[CODE]C:\Users\ken\Documents\gpuowl-v6.11-219-ge70ec99\overloadable test>gpuowl-win 2020-03-26 18:05:21 gpuowl v6.11-219-ge70ec99-dirty 2020-03-26 18:05:21 config: -device 0 -user kriesel -cpu emu/gtx1080 -yield -maxAlloc 7500 -use NO_ASM 2020-03-26 18:05:21 device 0, unique id '' 2020-03-26 18:05:21 emu/gtx1080 97685953 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word 2020-03-26 18:05:23 emu/gtx1080 OpenCL args "-DEXP=97685953u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598138082486p-3 -DIWEIGHT_STEP=0xf .547c79820ff18p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 " 2020-03-26 18:05:28 emu/gtx1080 2020-03-26 18:05:28 emu/gtx1080 OpenCL compilation in 5.26 s 2020-03-26 18:05:29 emu/gtx1080 97685953 P1 B1=1000000, B2=270000000; 1442134 bits; starting at 0 2020-03-26 18:06:18 emu/gtx1080 97685953 P1 10000 0.69%; 4908 us/it; ETA 0d 01:57; 4577ae6cbb52f038 2020-03-26 18:07:07 emu/gtx1080 97685953 P1 20000 1.39%; 4917 us/it; ETA 0d 01:57; fc2022db22907e71[/CODE] |
[QUOTE=preda;539360]Yes. All gpuowl does on savefile is write the file and close it. From this point on, it's the OS's job to persist the file to disk. It turns out often the OS is lazy and prefers to keep the data in RAM for a while longer, and if an OS crash happens in this window, the savefile isn't properly persisted.[/QUOTE]This page on fflush suggests that you could force the commit to disk for critical info: [url]https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fflush?view=vs-2019[/url]
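For illustration only, here is a minimal sketch of the "force the commit" idea in Python (not gpuowl's C++, and not its actual save code). The function name `save_durably` and the `.tmp` suffix are made up for the example. It flushes the runtime buffer, asks the OS to commit the data, and only then renames the new file over the old one.

```python
import os
import tempfile


def save_durably(path, data):
    """Write bytes to `path` so a crash leaves the old or the new file, not a partial one.

    Hypothetical helper for illustration; gpuowl's real save logic may differ.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()               # push runtime buffers down to the OS
        os.fsync(f.fileno())    # ask the OS to commit the file contents to storage
    os.replace(tmp_path, path)  # rename over the previous save (atomic on POSIX)


if __name__ == "__main__":
    folder = tempfile.mkdtemp()
    target = os.path.join(folder, "example.save")
    save_durably(target, b"checkpoint bytes")
    print(os.path.getsize(target))
```

The cost is an extra synchronous disk write per save, which is presumably small next to the interval between saves.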
|
I recommend no P-1 testing until further notice. I'm investigating a bug.
|
[QUOTE=Prime95;540993]I recommend no P-1 testing until further notice. I'm investigating a bug.[/QUOTE]Do you have any guidance on what versions are thought affected or unaffected?
|
[QUOTE=preda;540961]Hi Ken, I agree that was a loss. I'll look into improving this.[/QUOTE]
Thanks. For reference, [URL]https://mersenneforum.org/showpost.php?p=537396&postcount=1838[/URL] [URL]https://mersenneforum.org/showpost.php?p=537580&postcount=1853[/URL] [URL]https://mersenneforum.org/showpost.php?p=537628&postcount=1856[/URL] [URL]https://mersenneforum.org/showpost.php?p=537647&postcount=1859[/URL] [URL]https://mersenneforum.org/showpost.php?p=540929&postcount=1982[/URL] |
[QUOTE=kriesel;540995]Do you have any guidance on what versions are thought affected or unaffected?[/QUOTE]
In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option. |
[QUOTE=Prime95;541007]In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option.[/QUOTE]
George, any update on the exponent ranges in question? |
[QUOTE=ewmayer;541093]George, any update on the exponent ranges in question?[/QUOTE]
Sort of.
Short answer is "maximum exponent for the FFT length" - "log2(3) = 1.585 fewer bits per word" compensates for the mul-by-3 in P-1. Long answer is you can go a little higher than that because the maximum exponent has some carry32 "head room" during PRP.

Here is the long answer: What you are worried about is the absolute value of the 32-bit carry exceeding 0x80000000. I studied 500K iterations of 24518003 in a 1.25M FFT (18.706 bits-per-word). The maximum carry32 was 0x32420000. Fine for PRP, not so for the mul-by-3 step in P-1.

Next (well actually first) I tried to calculate a reasonable max exponent for 1.25M, 2.5M, 5M, 10M, 20M, 40M, 80M FFT lengths. We can store roughly 0.261 fewer bits per FFT word for each doubling of the FFT length.

The formula for expected max carry32 during the mul-by-3 P-1 step should be:
3 * 0x32420000 * 2^(BPW - 18.706) * 2^(log2(FFTLEN/1.25M) * .261)

If this max exceeds 0x70000000 I'd be worried. I'm thinking less than 0x67000000 should be very safe. It's all a matter of how much protection you want from an outlier value (much the same as protecting against outlier round off errors).

Let's see if an example works. Going for a fairly safe max carry32 of 0x70000000 in a 5M FFT:
0x70000000 = 3 * 0x32420000 * 2^BPW * 2^-18.706 * 2^(2 * .261)
BPW = log2 (0x70000000 / 3 / 0x32420000) + 18.706 - .522
BPW = 17.755
max exp for 5M FFT = 93.1M
similarly for a 5.5M FFT, max exp = 102.2M |
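The arithmetic in the post above can be reproduced with a few lines of Python. This is only a sketch of the formula as stated in that post. The constant and function names are invented for the example, and the reference values (0x32420000 at 18.706 bits per word in a 1.25M FFT, 0.261 bits per doubling) are taken straight from the post rather than measured independently.

```python
import math

REF_CARRY = 0x32420000    # max carry32 seen at the reference point (from the post)
REF_BPW = 18.706          # bits per word at the reference point
REF_FFT_M = 1.25          # reference FFT length, in M words
BPW_PER_DOUBLING = 0.261  # bits per word lost per doubling of the FFT length


def expected_max_carry32(bpw, fft_m, mul=3):
    """Expected max carry32 for the mul-by-`mul` step at `bpw` bits/word, FFT of `fft_m` M words."""
    doublings = math.log2(fft_m / REF_FFT_M)
    return mul * REF_CARRY * 2 ** (bpw - REF_BPW) * 2 ** (doublings * BPW_PER_DOUBLING)


def max_bpw(limit, fft_m, mul=3):
    """Solve the formula above for the bits/word that gives exactly `limit`."""
    doublings = math.log2(fft_m / REF_FFT_M)
    return math.log2(limit / mul / REF_CARRY) + REF_BPW - doublings * BPW_PER_DOUBLING


def max_exponent(limit, fft_m, mul=3):
    """Largest exponent for the given carry limit: bits/word times FFT words."""
    return max_bpw(limit, fft_m, mul) * fft_m * 2 ** 20


if __name__ == "__main__":
    for fft_m in (5, 5.5):
        print(fft_m, round(max_bpw(0x70000000, fft_m), 3),
              round(max_exponent(0x70000000, fft_m) / 1e6, 1))
```

Swapping 0x70000000 for the more cautious 0x67000000 shows how much exponent range the extra safety margin costs.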
[QUOTE=Prime95;541101]Sort of.
Short answer is "maximum exponent for the FFT length" - "log2(3) = 1.585 fewer bits per word" compensates for the mul-by-3 in P-1. Long answer is you can go a little higher than that because the maximum exponent has some carry32 "head room" during PRP.

Here is the long answer: What you are worried about is the absolute value of the 32-bit carry exceeding 0x80000000. I studied 500K iterations of 24518003 in a 1.25M FFT (18.706 bits-per-word). The maximum carry32 was 0x32420000. Fine for PRP, not so for the mul-by-3 step in P-1.

Next (well actually first) I tried to calculate a reasonable max exponent for 1.25M, 2.5M, 5M, 10M, 20M, 40M, 80M FFT lengths. We can store roughly 0.261 fewer bits per FFT word for each doubling of the FFT length.

The formula for expected max carry32 during the mul-by-3 P-1 step should be:
3 * 0x32420000 * 2^(BPW - 18.706) * 2^(log2(FFTLEN/1.25M) * .261)

If this max exceeds 0x70000000 I'd be worried. I'm thinking less than 0x67000000 should be very safe. It's all a matter of how much protection you want from an outlier value (much the same as protecting against outlier round off errors).

Let's see if an example works. Going for a fairly safe max carry32 of 0x70000000 in a 5M FFT:
0x70000000 = 3 * 0x32420000 * 2^BPW * 2^-18.706 * 2^(2 * .261)
BPW = log2 (0x70000000 / 3 / 0x32420000) + 18.706 - .522
BPW = 17.755
max exp for 5M FFT = 93.1M[/QUOTE]
Thanks for the explainer. Looking at my own scalar-double carry macro - here all vars are doubles, x is the convolution output we are normalizing, wi_re is the inverse DWT weight (the 1/n is absorbed into that), prp_mult is your 3, cy is carryin from next-lower iFFT term (and re-used for carryout):
[code]x *= wi_re;\
temp = DNINT(x);\
frac = fabs(x-temp);\
temp = temp*prp_mult + cy;\
cy = DNINT(temp*baseinv[i]);\
x = (temp-cy*base[i])*wt_re;\[/code]
I'm guessing using all-doubles is not a good option for your target hardware.
Question: In the 4th of those 6 lines, both temp and cy can be of either sign, but temp, the rounded-but-not-yet-wordsize-normalized iFFT term, is nearly always going to be much larger in magnitude than the +cy carryin, i.e. we should be able to infer the expected sign of the next line's carryout computation from it, to see whether - in your case - the integer result is overflowing into the sign bit, yes? [QUOTE]similarly for a 5.5M FFT, max exp = 102.2M[/QUOTE] Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled? |
[QUOTE=ewmayer;541111]
Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled?[/QUOTE] Yes, but I would not redo those P-1. |
[QUOTE=Prime95;541112]Yes, but I would not redo those P-1.[/QUOTE]
I don't see CARRY64 in the readme - is that an undocumented cmd-line flag? One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K? Oh, small UI suggestion: -fft 6144 for the above gave "FFT too small" error, i.e. the UI needs raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo to the effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K? |
[QUOTE=ewmayer;541113]I don't see CARRY64 in the readme - is that an undocumented cmd-line flag?[/QUOTE]
-use CARRY64 |
[QUOTE=ewmayer;541113]One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?[/quote]
I don't know. [quote]Oh, small UI suggestion: -fft 6144 for the above gave "FFT too small" error, i.e. the UI needs raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo to effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K?[/QUOTE] -fft 6M works. As well as -fft 6144K |
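To make the suffix convention concrete, here is a small Python sketch of how such an argument could be interpreted. It is a hypothetical parser written for this example, not gpuowl's actual option handling. It only captures the behaviour described above, where a bare number is taken as a raw word count and K or M scale it by 1024 or 1024*1024.

```python
def parse_fft_size(text):
    """Hypothetical parser: '6M', '6144K' and '6291456' all yield 6291456 words."""
    text = text.strip().upper()
    multipliers = {"K": 1024, "M": 1024 * 1024}
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)  # no suffix: the value is a raw FFT length in words


if __name__ == "__main__":
    for arg in ("6M", "6144K", "6291456", "6144"):
        print(arg, parse_fft_size(arg))
```

Under this reading, a bare `6144` means an FFT of only 6144 words, which would explain the "FFT too small" message.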
[QUOTE=ewmayer;541113]
One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K? [/QUOTE] I suppose you passed a -fft command line argument. Then it will affect all the tasks, thus will affect the PRP as well. (i.e. will not switch back to the default FFT) |
3 strikes you're out, game over until tomorrow
gpuowl could handle error cases more gracefully. Luckily I stumbled across this one while handling something else. Otherwise it could have cost nearly a day's throughput on that gpu.
Please consider commenting out a problematic worktodo line and continuing on with the next in such a case, instead of killing the run. Also, since config.txt optimization content is fft length dependent, what's optimal for one fft length can be fatal for another. Please consider fft-length-specific enhancement to config.txt, as mentioned before.[CODE]2020-03-28 10:23:18 condorella/rx480 CC 94418041 / 94418041, 4d816a6edf6393__ 2020-03-28 10:23:20 condorella/rx480 {"exponent":"94418041", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "v ersion":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-03-28 15:23:20 UTC", "user":"kriesel", "computer":"condorella/rx480", "aid": "(redacted)", "fft-length":5242880, "res64":"4d816a6edf6393__", "residue-type":1, "errors":{"gerbicz":0 }}2020-03-28 10:23:21 condorella/rx480 131500093 FFT 7168K: Width 256x4, Height 64x8, Middle 7; 17.92 bits/word 2020-03-28 10:23:22 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=7u -DWEIGHT_STE P=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f05 18db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. 
-cl-fast-relaxed-math -cl-std=CL2.0" 2020-03-28 10:23:25 condorella/rx480 OpenCL compilation in 3.68 s 2020-03-28 10:23:28 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003 2020-03-28 10:23:35 condorella/rx480 131500093 EE 800 0.00%; 5251 us/it; ETA 7d 23:49; 6781adfa7991c92a (check 2.29s) 2020-03-28 10:23:37 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003 2020-03-28 10:23:44 condorella/rx480 131500093 EE 800 0.00%; 5251 us/it; ETA 7d 23:48; 6781adfa7991c92a (check 2.29s) 1 errors 2020-03-28 10:23:46 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003 2020-03-28 10:23:53 condorella/rx480 131500093 EE 800 0.00%; 5255 us/it; ETA 7d 23:58; 6781adfa7991c92a (check 2.30s) 2 errors 2020-03-28 10:23:53 condorella/rx480 3 sequential errors, will stop. 2020-03-28 10:23:53 condorella/rx480 Exiting because "too many errors" 2020-03-28 10:23:53 condorella/rx480 Bye C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611 C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>title gpuowl-v6.11-134-g1e0ce1d/rx480 C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win 2020-03-28 11:27:31 gpuowl v6.11-134-g1e0ce1d[/CODE] |
@George - thanks, I missed the K and M suffix options in my perusal of the readme.
[QUOTE=preda;541145]I suppose you passed a -fft command line argument. Then it will affect all the tasks, thus will affect the PRP as well. (i.e. will not switch back to the default FFT)[/QUOTE] I checked at the end of the forced-6144K p-1 run; it indeed started the ensuing PRP at the same FFT length, so I killed it and restarted sans -fft flag.

All runs now restarted using -use CARRY64 -- thanks, George. Also, you'll be pleased to hear that after the latest BSOD-style crash of the Haswell system which hosts my Radeon VII I finally got round to trying the disable-C-states trick you recommend in the BIOS Overclock submenu - seems to work like a charm, system has been rock-stable since, uptime 4 days and counting, which is really long for this system.

More details on what happens for me with -use CARRY64, in the context of 2 side-by-side PRP runs @5632K:
o Initially, each run going at a steady 1386 us/iter at my sclk=4 setting;
o Stop run 0 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1402 us/iter, which seems weird since only one is using the slower-but-safe carry option;
o Stop run 1 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1420 us/iter, a 2.5% hit to throughput.

Since the bug only affects p-1 runs, would it be difficult to tweak things so that -use CARRY64 invoked-by-user is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g. -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs? |
[QUOTE=ewmayer;541193]
Since the bug only affects p-1 runs, would it be difficult to tweak things so that -use CARRY64 invoked-by-user is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs?[/QUOTE] I just committed a change that makes CARRY64 the default for P-1, and CARRY32 the default for PRP on AMD. The rationale being that, if CARRY32 is not appropriate, this fact will be visible for PRP, thus safe; on P-1 we use the safe default (i.e. CARRY64) until we have a better solution there. |
[QUOTE=kriesel;540929]Gpuowl stage 1 needs a res64 error check.[/QUOTE]
But Ken, what is the appropriate action to take on error? Let's say that during P-1 stage1, a residue==0 is detected. This is not the result of inappropriate FFT-size, it indicates a hardware error. But what to do in this situation, given there's no way to check P-1 -- basically what makes sense is for gpuowl to simply stop doing any P-1 (on that GPU). It can't reliably roll back to any trusted point. Assuming that a residue that is != 0 is a correct one, in the situation where the GPU produces res==0 sometimes, is not a good way to go. I would just discard the whole test as corrupted. |
[QUOTE=preda;541235]But Ken, what is the appropriate action to take on error?
Let's say that during P-1 stage1, a residue==0 is detected. This is not the result of inappropriate FFT-size, it indicates a hardware error. But what to do in this situation, given there's no way to check P-1 -- basically what makes sense is for gpuowl to simply stop doing any P-1 (on that GPU). It can't reliably roll back to any trusted point. Assuming that a residue that is != 0 is a correct one, in the situation where the GPU produces res==0 sometimes, is not a good way to go. I would just discard the whole test as corrupted.[/QUOTE]Count the error. Roll back to the last save file written without a detected error. That's the same as CUDALucas does when it detects an error, or prime95 does, or did before addition of Jacobi or Gerbicz checks. Or suspend effort on that worktodo line and go to the next entry; then the user can decide later whether to resume or abandon the item that had an issue.

There are several res64-based checks possible:
- res64=0 at any iteration; check more of the res to see if it's zero too. If it is, it's an error, probably a failure to copy the full residue. (There's a tiny chance that res128>0 but res64=0 occurs and is correct.)
- res=1 at any iteration; res=3 after the first iteration.
- res64 repeating from one iteration to the next.
- res64 cycling among a very small list of values.

[URL]https://www.mersenneforum.org/showpost.php?p=515641&postcount=10[/URL] |
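As a rough illustration of three of the checks listed above (zero, repeat, short cycle), here is a Python sketch. It is not code from gpuowl, CUDALucas or prime95. The function name, the `window` and `max_distinct` parameters and the returned strings are all invented for the example, and it looks only at the 64-bit values printed in the log, so it cannot do the "check more of the res" follow-up.

```python
def res64_suspicious(history, window=4, max_distinct=2):
    """Return a short reason if the recent res64 values look wrong, else None.

    `history` is a list of res64 integers, oldest first, one per log interval.
    Hypothetical helper; the thresholds are illustrative guesses.
    """
    if not history:
        return None
    last = history[-1]
    if last == 0:
        return "res64 is zero"
    if len(history) >= 2 and history[-2] == last:
        return "res64 repeated"
    recent = history[-window:]
    if len(recent) == window and len(set(recent)) <= max_distinct:
        return "res64 cycling among few values"
    return None


if __name__ == "__main__":
    # Values taken from the stage 1 log earlier in the thread.
    log = [0xC8B2127ABC38054B, 0x6F401486D14CB20F, 0x18A926611C75B118]
    print(res64_suspicious(log))
    print(res64_suspicious(log + [0]))
```

A caller could refuse to overwrite the older of the two save files whenever this returns a reason, which would leave something to roll back to.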
[QUOTE=preda;541232]I just committed a change that makes CARRY64 the default for P-1, and CARRY32 the default for PRP on AMD. The rationale being that, if CARRY32 is not appropriate, this fact will be visible for PRP, thus safe; on P-1 we use the safe default (i.e. CARRY64) until we have a better solution there.[/QUOTE]
Awesome - just pulled, built, and switched my runs to it. George, never heard your thoughts on whether checking the relative signs of the signed-int x and the result of the 3*x might be a useful diagnostic here. |
Just noticed something curious - since I didn't know how long it might be until a fix for the carry issue, yesterday I edited my 2 worktodo files - 10 entries each - and moved all PRPs not preceded by a p-1 of the same exponent to the top. Except that I buggered one such edit, and a PRP got moved into the top slot, while its accompanying p-1 remained below. Caught the 'doh!' with the PRP ~20% done, halted, moved the p-1 to its proper place at top of the file, resumed. The exponent is 103939597. The PRP was using 5632K, the just-started p-1 is using 6144K. Is that expected?
|
[QUOTE=ewmayer;541268]Just noticed something curious - since I didn't know how long it might be until a fix for the carry issue, yesterday I edited my 2 worktodo files - 10 entries each - and moved all PRPs not preceded by a p-1 of the same exponent to the top. Except that I buggered one such edit, and a PRP got moved into the top slot, while its accompanying p-1 remained below. Caught the 'doh!' with the PRP ~20% done, halted, moved the p-1 to its proper place at top of the file, resumed. The exponent is 103939597. The PRP was using 5632K, the just-started p-1 is using 6144K. Is that expected?[/QUOTE]
Yes, the FFT bounds are a bit more conservative for P-1. I think this area (FFT bounds) is under investigation currently. But yes, what you see is expected given the current code. |
[QUOTE=preda;541269]Yes, the FFT bounds are a bit more conservative for P-1. I think this area (FFT bounds) is under investigation currently. But yes, what you see is expected given the current code.[/QUOTE]
Has the "more conservative threshold for p-1" been changed in the latest commit? Because in my 2 results files I see p as large as 103985003 using 5632 for the p-1 step, using older builds. |
I compiled gpuowl on the Colab pro and want to test it on the Tesla P100.
Anyone have a list of all the different options that can be tweaked to find the fastest combination? |
[QUOTE=ewmayer;541273]Has the "more conservative threshold for p-1" been changed in the latest commit? Because in my 2 results files I see p as large as 103985003 using 5632 for the p-1 step, using older builds.[/QUOTE]
No, the different FFT bounds between PRP and P-1 have been there for about 1 month, I'm not aware of recent changes. So it may be something else. |
[QUOTE=preda;541281]No, the different FFT bounds between PRP and P-1 have been there for about 1 month, I'm not aware of recent changes. So it may be something else.[/QUOTE]
FYI, the most recent case I see in my logs of an expo > than the recent ones using 5632K is 16. Feb, v6.11-142-gf54af2e. More weirdness, this time hardware related - current pair of runs suffered drastic slowing-down ~30 mins ago, despite temps having been well below the usual 'caution' threshold, no odd fan noises or any other sign of amiss-ness. SMI showed both s-and-m-clocks well below their normal for my default sclk=4 setting. This happened once before, and a quick 'rocm-smi --gpureset -d 1' resolved it. To cover all the bases I first rebooted, verified the slowness persisted, then tried the reset - this time no joy. |
[QUOTE=ewmayer;541284]FYI, the most recent case I see in my logs of an expo > than the recent ones using 5632K is 16. Feb, v6.11-142-gf54af2e.
More weirdness, this time hardware related - current pair of runs suffered drastic slowing-down ~30 mins ago, despite temps having been well below the usual 'caution' threshold, no odd fan noises or any other sign of amiss-ness. SMI showed both s-and-m-clocks well below their normal for my default sclk=4 setting. This happened once before, and a quick 'rocm-smi --gpureset -d 1' resolved it. To cover all the bases I first rebooted, verified the slowness persisted, then tried the reset - this time no joy.[/QUOTE] No idea, sorry. Did you check dmesg for errors? Did you try without setting sclk after reboot, to see the behavior? Can you see the power use, and how did that change? |
[QUOTE=ATH;541280]I compiled gpuowl on the Colab pro and want to test it on the Tesla P100.
Anyone have a list of all the different options that can be tweaked to find the fastest combination?[/QUOTE]See [url]https://mersenneforum.org/showpost.php?p=540152&postcount=1968[/url] and the source code. |
[QUOTE=ewmayer;541111]
Question: In the 4th of those 6 lines, both temp and cy can be of either sign, but temp, the rounded-but-not-yet-wordsize-normalized iFFT term, is nearly always going to be much larger in magnitude than the +cy carryin, i.e. we should be able to infer the expected sign of the next line's carryout computation from it, to see whether - in your case - the integer result is overflowing into the sign bit, yes?[/QUOTE] We could detect the error condition with about 3 or 4 instructions. However, we try to create the fastest code possible and pick default settings that should safely avoid dangerous situations. Sometimes we don't quite succeed -- especially with day-to-day development. The current code that selects CARRY64 for all P-1 work is overkill. I know how to fix that. |
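For anyone who wants to play with the sign-comparison idea, here is a Python emulation of 32-bit wraparound. It is a sketch for experimenting, not the OpenCL carry code, and the function names are invented. It compares the proposed sign test against a plain range test on the exact product.

```python
INT32_MIN, INT32_MAX = -(1 << 31), (1 << 31) - 1


def wrap_int32(value):
    """Reduce a Python int to what a signed 32-bit register would hold."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def mul3_overflows(x):
    """Exact test: does 3*x fall outside the signed 32-bit range?"""
    return not (INT32_MIN <= 3 * x <= INT32_MAX)


def sign_flipped(x):
    """Sign test: does the wrapped 3*x have a different sign from x?"""
    y = wrap_int32(3 * x)
    return x != 0 and (x < 0) != (y < 0)


if __name__ == "__main__":
    for x in (0x20000000, 0x32420000, -0x32420000, 0x60000000):
        print(hex(x), mul3_overflows(x), sign_flipped(x))
```

One thing the emulation brings out: the sign test catches a single wrap, but for |x| at or above 0x55555556 the product wraps all the way round and the sign matches again, so the sign test would miss carries that large. With carries around 0x32420000 that is not a practical concern, but a magnitude limit is the stricter test.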
[QUOTE=Prime95;541293]We could detect the error condition with about 3 or 4 instructions. However, we try to create the fastest code possible and pick default settings that should safely avoid dangerous situations. Sometimes we don't quite succeed -- especially with day-to-day development.
The current code that selects CARRY64 for all P-1 work is overkill. I know how to fix that.[/QUOTE] I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis. For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur? |
[QUOTE=ewmayer;541297]I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.[/quote]
Yes, that is the goal. [quote]For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?[/QUOTE] In the latest code, I set -use DEBUG,CARRY32_LIMIT=0x70000000 to print any iterations where 32-bit carry is getting close to the limit. This is slow code, useful for analysis, not for production runs. |
Update on my Radeon VII sudden-onset-slowdown yesterday: more weirdness. Haven't yet spotted anything in the dmesg logs of note, but I'm still getting familiar with which AMD-GPU-related messages are normal and which not. Now to the weirdness.
My usual post-reboot procedure is:
1. Fire up gpuOwl job in each of two terminal windows, each job in a separate working dir;
2. Open 3rd window, fiddle settings to sclk=4 and fan=120 (or higher if interior temps warrant it);
3. Fire up LL/PRP job on the CPU;
4. Look at rocm-smi output to check GPU state.

Last night, again rebooted system (just to cover all bases), fired up first gpuOwl job, but then skipped to [4] above - all looked normal, Wattage ~200, temp nearing 70C, SCLK and MCLK at their expected values, fan noise ramping up nicely. Thought "yay! I fixed it!" Fired up second gpuOwl job - within seconds fan noise starts dropping fast, check of rocm-smi shows dreaded "the workers have gone on strike" numbers. Kill second job, things revert back to normal. No clue why the GPU is suddenly balking at running 2 jobs, but it being late figured better quit while I'm ahead - set sclk=5 to help compensate for the throughput hit from 1-job running, went to bed.

Even more weirdness - Just fired up 2nd job to see if the issue is reproducible, now all seems back to normal. I believe the technical term is "gremlins". |
From gpuowl.cl:
[QUOTE]OUT_WG,OUT_SIZEX,OUT_SPACING <AMD default is 256,32,4> <nVidia default is 256,4,1 but needs testing> IN_WG,IN_SIZEX,IN_SPACING <AMD default is 256,32,1> <nVidia default is 256,4,1 but needs testing>[/QUOTE] What are the possible values and range to test for these variables? On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration: -use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4 |
[QUOTE=ATH;541364]From gpuowl.cl:
What are the possible values and range to test for these variables? On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration: -use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4[/QUOTE] IN/OUT_WG=64,128,256,512 IN/OUT_SIZEX=4,8,16,32,64,128 (gpuowl will whine when the combination does not make sense) IN/OUT_SPACING=4,8,16,32,64,128 You are the first nVidia user to test all these combinations. Alas, previously the colab GPUs showed only minor differences in these settings whereas nVidia consumer GPUs benefitted much more from an optimal setting. BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG. |
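If it helps with a systematic sweep, here is a Python sketch that enumerates the value lists given above as `-use` strings. The helper name `use_strings` is invented for the example. It only builds the strings, it does not know which combinations gpuowl will reject, and it does not run anything.

```python
from itertools import product

# Value lists as given in the post above.
WG = (64, 128, 256, 512)
SIZEX = (4, 8, 16, 32, 64, 128)
SPACING = (4, 8, 16, 32, 64, 128)


def use_strings(prefix, base_flags=""):
    """Yield one -use value per (WG, SIZEX, SPACING) combination for prefix 'IN' or 'OUT'."""
    for wg, sizex, spacing in product(WG, SIZEX, SPACING):
        tuned = f"{prefix}_WG={wg},{prefix}_SIZEX={sizex},{prefix}_SPACING={spacing}"
        yield f"{base_flags},{tuned}" if base_flags else tuned


if __name__ == "__main__":
    combos = list(use_strings("OUT", "ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32"))
    print(len(combos))
    print("-use " + combos[0])
```

Tuning the OUT and IN triples one after the other keeps the search at 144 runs each. Trying every OUT/IN pairing would multiply the two counts.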
[QUOTE=Prime95;541384]BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.[/QUOTE]
I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again. This is the one to use right? git clone [url]https://github.com/preda/gpuowl[/url] |
[QUOTE=ATH;541410]I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again.
This is the one to use right? git clone [url]https://github.com/preda/gpuowl[/url][/QUOTE] Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate. For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64. Yes, that is the correct source. |
[QUOTE=Prime95;541411]Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.[/QUOTE] Just grabbed (v. 62a3025) and built. Switched one of my 2 runs to it to give it a spin, and see this on start (PRP of p = 103937143 @5632K): [i] Expected maximum carry32: 47840000 [/i] Aside - before switching that run to the new version, both were getting ~1335 us/iter (total 1498 iter/sec). With 1 run using new version, that run is now @ 1580 us/iter and the other has sped up to 1168 us/iter (total 1490 iter/sec). With both runs using new version, both are at 1333 us/iter (total 1500 iter/sec). Probably some weird rocm-process-priority thing. Aside #2: I've been doing near-daily price checks of new XFX Radeon VII cards on Amazon - they fluctuate interestingly. Couple days ago, $580. Yesterday, back to the same $550 I paid for mine in Feb. Just now, $600. |
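The "total iter/sec" figures in the post above follow from summing 1e6 / (us per iteration) over the jobs sharing the GPU. A minimal sketch of that arithmetic, with the hypothetical helper name `totalItersPerSec`:

```cpp
#include <cassert>
#include <vector>

// Combined throughput in iterations/second for jobs sharing one GPU,
// given each job's time per iteration in microseconds.
double totalItersPerSec(const std::vector<double>& usPerIter) {
  double total = 0.0;
  for (double us : usPerIter) {
    assert(us > 0.0);
    total += 1e6 / us;  // one job's iterations per second
  }
  return total;
}
```

For example, `totalItersPerSec({1580.0, 1168.0})` comes out just under 1490, which is why the lopsided split costs almost nothing in total throughput compared with two jobs at about 1335 us/iter each.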
Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.
This version is a bit faster on default settings at 5M FFT: 895µs/iteration
Got down to 832 µs with: -use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32
I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs:
[QUOTE]-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=64,IN_SIZEX=8,IN_SPACING=2
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=128,IN_SIZEX=16,IN_SPACING=4[/QUOTE]
Many other combinations run at 810-820µs and many more at 820-850 µs, and a few rare bad ones ran at 980-990µs.
Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings, changed the speed from 809µs to 814-815µs, so not a big difference. |
[QUOTE=ATH;541441]Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.
This version is a bit faster on default settings at 5M FFT: 895µs/iteration Got down to 832 µs with: -use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32 I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs: Many other combinations run at 810-820µs and many more at 820-850 µs, and a few rare bad ones ran at 980-990µs. Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings, changed the speed from 809µs to 814-815µs, so not a big difference.[/QUOTE] ORIG_X2 and INLINE_X2 do not exist anymore; setting them has no effect whatsoever.
This seems to suggest these changes to Nvidia defaults:
- handle T2_SHUFFLE like on AMD (i.e. default to NO_T2_SHUFFLE)
- handle CARRY like on AMD (i.e. default to CARRY32)
Could you run with -use ROUNDOFF paired in turn with ORIG_SLOWTRIG/NEW_SLOWTRIG and look at the average roundoff error to evaluate their respective accuracy? If ORIG_SLOWTRIG is similarly accurate to NEW_SLOWTRIG we may consider making it the default on Nvidia.
Could other Nvidia users speak up if those proposed Nvidia defaults have adverse performance effects for them (due to different hardware)? |
Windows compilation:
[code]
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
                 from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
   33 |       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
      |                        ~^                  ~~~~~~~~~~~~
      |                         |                             |
      |                         char*                         const value_type* {aka const wchar_t*}
      |                        %hs
Gpu.cpp: In member function 'std::tuple<bool, long long unsigned int, unsigned int> Gpu::isPrimePRP(u32, const Args&, std::atomic<unsigned int>&)':
Gpu.cpp:881:51: warning: left shift count >= width of type [-Wshift-count-overflow]
  881 |         constexpr float roundScale = 1.0 / (1L << 32);
      |                                                   ^~
Gpu.cpp:881:42: warning: division by zero [-Wdiv-by-zero]
  881 |         constexpr float roundScale = 1.0 / (1L << 32);
      |                                      ~~~~^~~~~~~~~~~~
Gpu.cpp:881:48: error: right operand of shift expression '(1 << 32)' is >= than the precision of the left operand [-fpermissive]
  881 |         constexpr float roundScale = 1.0 / (1L << 32);
      |                                            ~~~~^~~~~~
make: *** [Makefile:30: Gpu.o] Error 1
[/code] |
The error should be fixed (the printf warning remains).
Your compiler is strange: - it has 32-bit long (we talked about this before). This is allowed, but unusual. You can verify this by e.g. checking sizeof(long) - it seems that std::string is basic_string<wchar_t> (from the warning message). This is, in my understanding of the C++ standard, not allowed. [QUOTE=kracker;541453]Windows compilation: [code] g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp In file included from ProofSet.h:6, from Gpu.cpp:4: File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)': File.h:33:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=] 33 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode); | ~^ ~~~~~~~~~~~~ | | | | char* const value_type* {aka const wchar_t*} | %hs Gpu.cpp: In member function 'std::tuple<bool, long long unsigned int, unsigned int> Gpu::isPrimePRP(u32, const Args&, std::atomic<unsigned int>&)': Gpu.cpp:881:51: warning: left shift count >= width of type [-Wshift-count-overflow] 881 | constexpr float roundScale = 1.0 / (1L << 32); | ^~ Gpu.cpp:881:42: warning: division by zero [-Wdiv-by-zero] 881 | constexpr float roundScale = 1.0 / (1L << 32); | ~~~~^~~~~~~~~~~~ Gpu.cpp:881:48: error: right operand of shift expression '(1 << 32)' is >= than the precision of the left operand [-fpermissive] 881 | constexpr float roundScale = 1.0 / (1L << 32); | ~~~~^~~~~~ make: *** [Makefile:30: Gpu.o] Error 1 [/code][/QUOTE] |
ROCm 3.1 is now the recommended platform for AMD GPUs -- it is the fastest, and it is the version gpuowl is actively tuned for.
|
[QUOTE=preda;541454]The error should be fixed (the printf warning remains).
Your compiler is strange: - it has 32-bit long (we talked about this before). This is allowed, but unusual. You can verify this by e.g. checking sizeof(long) - it seems that std::string is basic_string<wchar_t> (from the warning message). This is, in my understanding of the C++ standard, not allowed.[/QUOTE] long is apparently 4 byte in Visual Studio or gcc typically on Windows: [url]https://docs.microsoft.com/en-us/cpp/cpp/fundamental-types-cpp?view=vs-2019[/url] [url]https://stackoverflow.com/questions/22344388/size-of-long-int-and-int-in-c-showing-4-bytes[/url] |
No, that's [QUOTE]Visual Studio 2008 on a 32-bit architecture[/QUOTE].
On a 64-bit architecture (as is common), I expect even VS has sizeof(long)==8, nowadays. As I said, the size of long is not mandated by C++. It's just normal for long to be 64bit (on 64-bit architectures), but if it isn't that's within the rules. OK, reading more on the first link, I see that they state that long is 32bit even on 64bit arch. That's a MS-VS idiosyncrasy then. Good to know. [QUOTE=kriesel;541459]long is apparently 4 byte in Visual Studio or gcc typically on Windows: [url]https://docs.microsoft.com/en-us/cpp/cpp/fundamental-types-cpp?view=vs-2019[/url] [url]https://stackoverflow.com/questions/22344388/size-of-long-int-and-int-in-c-showing-4-bytes[/url][/QUOTE] |
[QUOTE=preda;541462]No, that's "Visual Studio 2008 on a 32-bit architecture".
On a 64-bit architecture (as is common), I expect even VS has sizeof(long)==8, nowadays. As I said, the size of long is not mandated by C++. It's just normal for long to be 64bit (on 64-bit architectures), but if it isn't that's within the rules. OK, reading more on the first link, I see that they state that long is 32bit even on 64bit arch. That's a MS-VS idiosyncrasy then. Good to know.[/QUOTE] long long is typically how you get the 8 byte version on Windows. |
[QUOTE=preda;541462]No, that's "Visual Studio 2008 on a 32-bit architecture".
On a 64-bit architecture (as is common), I expect even VS has sizeof(long)==8, nowadays. As I said, the size of long is not mandated by C++. It's just normal for long to be 64bit (on 64-bit architectures), but if it isn't that's within the rules. OK, reading more on the first link, I see that they state that long is 32bit even on 64bit arch. That's a MS-VS idiosyncrasy then. Good to know.[/QUOTE] Also 4-byte on the gcc compile performed on Windows, per the second link. |
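To make the failure mode in the compile log concrete: where `long` is 32 bits, `1L << 32` shifts by the full width of the type, hence the error. A minimal sketch of a width-independent way to write the same constant follows. It keeps the name `roundScale` from the log, but this is only an illustration, not necessarily the fix that was actually committed to gpuowl; `kLongIs64Bit` is a hypothetical name.

```cpp
#include <climits>
#include <cstdint>

// True where long is 64 bits (typical 64-bit Linux), false where it is
// 32 bits (64-bit Windows with MSVC or MinGW gcc).
constexpr bool kLongIs64Bit = sizeof(long) * CHAR_BIT >= 64;

// 2^-32, computed with an operand that is 64 bits wide on every platform,
// so the shift count never reaches the width of the type.
constexpr float roundScale = 1.0 / (std::uint64_t{1} << 32);

// Compile-time check that the constant is exactly 2^-32.
static_assert(roundScale * 4294967296.0f == 1.0f, "roundScale must equal 2^-32");
```

Writing `1LL << 32` would work as well, since `long long` is guaranteed to be at least 64 bits.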
robust fail to start PRP
I don't know why, but -fft 0 through -fft +5 all hit EE in 800 iterations on this exponent 131500093. Gpuowl v6.11-134-g1e0ce1d chose the initial 7M fft length on its own. After finding it reproducible, I successively incremented -fft to seek a reliable run case. It wasn't until it reached 9M fft that it succeeded in the GEC. The resulting speed penalty is considerable, 7.5 msec/iter versus 5.3 on an RX480. From the program's help output,
[CODE]
FFT 7M [ 11.01M - 132.46M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7
FFT 8M [ 12.58M - 150.85M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 169.18M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9
[/CODE]
[CODE]
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:47:57 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:47:57 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM
2020-04-01 07:47:57 device 0, unique id ''
2020-04-01 07:47:57 condorella/rx480 131500093 FFT 7168K: Width 256x4, Height 64x8, Middle 7; 17.92 bits/word
2020-04-01 07:47:59 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:48:03 condorella/rx480 OpenCL compilation in 3.97 s
2020-04-01 07:48:06 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:48:13 condorella/rx480 131500093 EE 800 0.00%; 5272 us/it; ETA 8d 00:34; 6781adfa7991c92a (check 2.31s)
2020-04-01 07:48:15 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:48:22 condorella/rx480 131500093 EE 800 0.00%; 5309 us/it; ETA 8d 01:56; 6781adfa7991c92a (check 2.31s) 1 errors
2020-04-01 07:48:24 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:48:31 condorella/rx480 131500093 EE 800 0.00%; 5298 us/it; ETA 8d 01:32; 6781adfa7991c92a (check 2.33s) 2 errors
2020-04-01 07:48:31 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:48:31 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:48:31 condorella/rx480 Bye

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:48:50 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:48:50 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +1
2020-04-01 07:48:50 device 0, unique id ''
2020-04-01 07:48:50 condorella/rx480 131500093 FFT 7168K: Width 64x4, Height 256x8, Middle 7; 17.92 bits/word
2020-04-01 07:48:53 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=256u -DSMALL_HEIGHT=2048u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:48:57 condorella/rx480 OpenCL compilation in 4.67 s
2020-04-01 07:49:01 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:49:11 condorella/rx480 131500093 EE 800 0.00%; 7714 us/it; ETA 11d 17:46; 55f854bea6c1cecf (check 3.28s)
2020-04-01 07:49:14 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:49:24 condorella/rx480 131500093 EE 800 0.00%; 7697 us/it; ETA 11d 17:10; 55f854bea6c1cecf (check 3.29s) 1 errors
2020-04-01 07:49:27 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:49:37 condorella/rx480 131500093 EE 800 0.00%; 7687 us/it; ETA 11d 16:46; 55f854bea6c1cecf (check 3.27s) 2 errors
2020-04-01 07:49:37 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:49:37 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:49:37 condorella/rx480 Bye

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:50:25 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:50:25 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +2
2020-04-01 07:50:25 device 0, unique id ''
2020-04-01 07:50:25 condorella/rx480 131500093 FFT 7168K: Width 64x8, Height 256x4, Middle 7; 17.92 bits/word
2020-04-01 07:50:27 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=512u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:50:31 condorella/rx480 OpenCL compilation in 3.72 s
2020-04-01 07:50:34 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:50:42 condorella/rx480 131500093 EE 800 0.00%; 6286 us/it; ETA 9d 13:37; 6f8253cbb2fe58e9 (check 2.71s)
2020-04-01 07:50:45 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:50:53 condorella/rx480 131500093 EE 800 0.00%; 6283 us/it; ETA 9d 13:29; 6f8253cbb2fe58e9 (check 2.71s) 1 errors
2020-04-01 07:50:56 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:51:03 condorella/rx480 131500093 EE 800 0.00%; 6299 us/it; ETA 9d 14:05; 6f8253cbb2fe58e9 (check 2.71s) 2 errors
2020-04-01 07:51:03 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:51:03 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:51:03 condorella/rx480 Bye

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:51:29 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:51:29 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +3
2020-04-01 07:51:29 device 0, unique id ''
2020-04-01 07:51:29 condorella/rx480 131500093 FFT 7168K: Width 256x8, Height 64x4, Middle 7; 17.92 bits/word
2020-04-01 07:51:29 condorella/rx480 using long carry kernels
2020-04-01 07:51:32 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=2048u -DSMALL_HEIGHT=256u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:51:36 condorella/rx480 OpenCL compilation in 3.97 s
2020-04-01 07:51:39 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:51:46 condorella/rx480 131500093 EE 800 0.00%; 5275 us/it; ETA 8d 00:42; cfbd904e74b67aae (check 2.31s)
2020-04-01 07:51:48 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:51:54 condorella/rx480 131500093 EE 800 0.00%; 5249 us/it; ETA 7d 23:44; cfbd904e74b67aae (check 2.29s) 1 errors
2020-04-01 07:51:57 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:03 condorella/rx480 131500093 EE 800 0.00%; 5239 us/it; ETA 7d 23:23; cfbd904e74b67aae (check 2.29s) 2 errors
2020-04-01 07:52:03 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:52:03 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:52:03 condorella/rx480 Bye

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:52:07 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:52:07 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +4
2020-04-01 07:52:07 device 0, unique id ''
2020-04-01 07:52:07 condorella/rx480 131500093 FFT 8192K: Width 256x8, Height 256x8; 15.68 bits/word
2020-04-01 07:52:07 condorella/rx480 using long carry kernels
2020-04-01 07:52:10 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=2048u -DSMALL_HEIGHT=2048u -DMIDDLE=1u -DWEIGHT_STEP=0xa.039f00d8f95f8p-3 -DIWEIGHT_STEP=0xc.c82be96a7181p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:52:15 condorella/rx480 OpenCL compilation in 5.16 s
2020-04-01 07:52:18 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:27 condorella/rx480 131500093 EE 800 0.00%; 6583 us/it; ETA 10d 00:28; 05252a7f59574e37 (check 2.85s)
2020-04-01 07:52:30 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:38 condorella/rx480 131500093 EE 800 0.00%; 6587 us/it; ETA 10d 00:36; 05252a7f59574e37 (check 2.85s) 1 errors
2020-04-01 07:52:41 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:49 condorella/rx480 131500093 EE 800 0.00%; 6594 us/it; ETA 10d 00:53; 05252a7f59574e37 (check 2.86s) 2 errors
2020-04-01 07:52:49 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:52:49 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:52:49 condorella/rx480 Bye

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:53:21 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:53:21 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +5
2020-04-01 07:53:21 device 0, unique id ''
2020-04-01 07:53:21 condorella/rx480 131500093 FFT 8192K: Width 512x8, Height 256x4; 15.68 bits/word
2020-04-01 07:53:21 condorella/rx480 using long carry kernels
2020-04-01 07:53:23 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=4096u -DSMALL_HEIGHT=1024u -DMIDDLE=1u -DWEIGHT_STEP=0xa.039f00d8f95f8p-3 -DIWEIGHT_STEP=0xc.c82be96a7181p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:53:26 condorella/rx480 OpenCL compilation in 3.53 s
2020-04-01 07:53:30 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:53:39 condorella/rx480 131500093 EE 800 0.00%; 7196 us/it; ETA 10d 22:51; 6df742314b82f841 (check 3.11s)
2020-04-01 07:53:42 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:53:51 condorella/rx480 131500093 EE 800 0.00%; 7219 us/it; ETA 10d 23:43; 6df742314b82f841 (check 3.11s) 1 errors
2020-04-01 07:53:54 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:54:03 condorella/rx480 131500093 EE 800 0.00%; 7190 us/it; ETA 10d 22:38; 6df742314b82f841 (check 3.10s) 2 errors
2020-04-01 07:54:03 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:54:03 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:54:03 condorella/rx480 Bye

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:54:08 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:54:08 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +6
2020-04-01 07:54:08 device 0, unique id ''
2020-04-01 07:54:08 condorella/rx480 131500093 FFT 9216K: Width 256x4, Height 64x8, Middle 9; 13.93 bits/word
2020-04-01 07:54:08 condorella/rx480 using long carry kernels
2020-04-01 07:54:12 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=9u -DWEIGHT_STEP=0x8.5f7e7ead6051p-3 -DIWEIGHT_STEP=0xf.498539ec95fe8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:54:16 condorella/rx480 OpenCL compilation in 4.11 s
2020-04-01 07:54:20 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:54:29 condorella/rx480 131500093 OK 800 0.00%; 7461 us/it; ETA 11d 08:32; bbe24bd13cd73020 (check 3.26s)
2020-04-01 08:19:33 condorella/rx480 131500093 OK 200000 0.15%; 7541 us/it; ETA 11d 11:03; 190bb27ff665f83b (check 3.25s)
[/CODE] |
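The "bits/word" and "ETA" fields in the log above can be reproduced with simple arithmetic, which is handy when comparing FFT choices by hand. The sketch below rests on two assumptions inferred from the logged numbers rather than from gpuowl's source: the FFT length in words is WIDTH × SMALL_HEIGHT × MIDDLE × 2, and a PRP test takes about one iteration per bit of the exponent. All function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// FFT length in words, as implied by the logged geometry, e.g.
// WIDTH=1024, SMALL_HEIGHT=512, MIDDLE=7 gives 7168K. The factor of 2 is
// inferred from the logged sizes, not taken from gpuowl's source.
std::uint64_t fftWords(std::uint64_t width, std::uint64_t smallHeight,
                       std::uint64_t middle) {
  return width * smallHeight * middle * 2;
}

// Average number of exponent bits carried per FFT word.
double bitsPerWord(std::uint64_t exponent, std::uint64_t words) {
  assert(words > 0);
  return static_cast<double>(exponent) / static_cast<double>(words);
}

// Remaining run time in seconds, assuming one iteration per exponent bit.
double etaSeconds(std::uint64_t exponent, std::uint64_t doneIters,
                  double usPerIter) {
  assert(doneIters <= exponent);
  return static_cast<double>(exponent - doneIters) * usPerIter * 1e-6;
}
```

For the first run in the log, `bitsPerWord(131500093, fftWords(1024, 512, 7))` is about 17.92, and `etaSeconds(131500093, 800, 5272)` is about 693,264 seconds, a little over 8 days, in line with the logged "ETA 8d 00:34".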
[QUOTE=kriesel;541479]I don't know why, but -fft 0 through -fft +5 all hit EE in 800 iterations on this exponent 131500093.[/QUOTE]
Works for me on Linux rocm 3.1. Maybe a Windows compiler / driver bug? |
[QUOTE=henryzz;541463]long long is typically how you get the 8 byte version on windows.[/QUOTE]
Lack of unambiguous-length basic data types was one of the great blunders of the C language standard. Here is the snip of preprocessor code I use in my Mlucas types.h file, the 64-bit section includes a small hack from George. I'm sure even this is not completely portable, but it has served me well. 'MSVC' refers to Visual Studio, OS_BITS is an Mlucas predef reflecting the bit-ness (32 or 64) of the OS:
[code]
typedef char int8;
typedef char sint8;
typedef unsigned char uint8;
typedef short int16;
typedef short sint16;
typedef unsigned short uint16;
typedef int int32;
typedef int sint32;
typedef unsigned int uint32;
/* 64-bit int: */
/* MSVC doesn't like 'long long', and of course MS has their own completely non-portable substitute: */
#if(defined(OS_TYPE_WINDOWS) && defined(COMPILER_TYPE_MSVC))
  typedef signed __int64 int64;
  typedef signed __int64 sint64;
  typedef unsigned __int64 uint64;
  typedef const signed __int64 int64c;
  typedef const signed __int64 sint64c;
  typedef const unsigned __int64 uint64c;
  /* GW: In many cases where the C code is interfacing with the assembly code */
  /* we must declare variables that are exactly 32-bits wide. This is the */
  /* portable way to do this, as the linux x86-64 C compiler defines the */
  /* long data type as 64 bits. We also use portable definitions for */
  /* values that can be either an integer or a pointer. */
  #if OS_BITS == 64
    typedef int64 intptr_t;
    typedef uint64 uintptr_t;
  #else
    typedef int32 intptr_t;
    typedef uint32 uintptr_t;
  #endif
#else
  typedef long long int64;
  typedef long long sint64;
  typedef unsigned long long uint64;
  typedef const long long int64c;
  typedef const long long sint64c;
  typedef const unsigned long long uint64c;
#endif
[/code] |
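For comparison, in C++11 and later the same aliases can be built on the fixed-width types from `<cstdint>`, which are available on the mainstream compilers discussed in this thread. This is only a sketch, not a drop-in replacement for the Mlucas header: the namespace `fixedw` and the helper `join32` are hypothetical, and `std::int8_t` is always signed, whereas the signedness of plain `char` in the original depends on the compiler.

```cpp
#include <climits>
#include <cstdint>

namespace fixedw {  // hypothetical namespace, to avoid clashing with other headers

using int8   = std::int8_t;
using uint8  = std::uint8_t;
using int16  = std::int16_t;
using uint16 = std::uint16_t;
using int32  = std::int32_t;
using uint32 = std::uint32_t;
using int64  = std::int64_t;
using uint64 = std::uint64_t;

// Combine two 32-bit halves into one 64-bit value. The cast before the
// shift matters: shifting a 32-bit operand by 32 would be undefined.
constexpr uint64 join32(uint32 hi, uint32 lo) {
  return (static_cast<uint64>(hi) << 32) | lo;
}

}  // namespace fixedw

// The widths hold regardless of what long happens to be on the platform.
static_assert(sizeof(fixedw::int32) * CHAR_BIT == 32, "int32 must be 32 bits");
static_assert(sizeof(fixedw::uint64) * CHAR_BIT == 64, "uint64 must be 64 bits");
```

`<cstdint>` also provides `std::intptr_t` and `std::uintptr_t`, which cover the "integer or pointer" case that the original handles with OS_BITS.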
[QUOTE=Prime95;541498]Works for me on Linux rocm 3.1. Maybe a Windows compiler / driver bug?[/QUOTE]
Reproduced here, on gpuowl-win v6.11-134, -fft 0 and -fft +5:
- RX550 on same system as previous report; driver 25.20.14007.1000 (Adrenalin 18.10.2) / Win7 64
- An RX550 on separate system, driver 26.20.12028.2 (Adrenalin 19.20)/Win10 64
- A Radeon VII, on driver 26.20.12028.2 (Adrenalin 19.20)/Win10 64
Not reproduced here, on gpuowl-win v6.11-198, same Radeon VII as above, same system, same driver.
Also not reproduced, on gpuowl-win v6.11-219-ge70ec99, on the same RX480, system, driver combo that led this off with the previous post 2031.
All my gpuowl-win builds are done on the one system where the RX480 resides, with the same step by step build process:
mkdir latest
cd latest
git clone [URL]https://github.com/preda/gpuowl[/URL]
cd gpuowl
make gpuowl-win.exe |
I'm investigating.
[QUOTE=kriesel;541479]I don't know why, but -fft 0 through -fft +5 all hit EE in 800 iterations on this exponent 131500093. Gpuowl v6.11-134-g1e0ce1d[/QUOTE] |
[QUOTE=preda;541444]Could you run with -use ROUNDOFF paired in turn with ORIG_SLOWTRIG/NEW_SLOWTRIG and look at the average roundoff error to evaluate their respective accuracy. If ORIG_SLOWTRIG is similarly accurate to NEW_SLOWTRIG we may consider making it the default on Nvidia.
Could other Nvidia users speak up if those proposed Nvidia defaults have adverse performance effects for them (due to different hardware).[/QUOTE] I tested the first 100K iterations of 95,000,011: ORIG_SLOWTRIG: Roundoff: N=10374, max 0.312500, avg 0.212775 NEW_SLOWTRIG: Roundoff: N=10374, max 0.312500, avg 0.214292 I can try and test on my own 2080, if I can compile gpuowl in Windows, or find a new compiled version. |
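To put the two roundoff averages above side by side: with identical maxima (0.3125), the averages differ by well under one percent. A small sketch of that comparison, with the hypothetical helper name `relativeDiff`:

```cpp
#include <cassert>

// Relative difference of b with respect to a; 0.01 means b is 1% larger.
double relativeDiff(double a, double b) {
  assert(a != 0.0);
  return (b - a) / a;
}
```

`relativeDiff(0.212775, 0.214292)` is roughly 0.007, so NEW_SLOWTRIG's average roundoff is about 0.7% higher than ORIG_SLOWTRIG's in this 100K-iteration sample.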
RTX 2080 is so bad at double precision and the timings are very inconsistent.
But NEW_SLOWTRIG is better at 3520µs/it vs 3680µs/it for ORIG_SLOWTRIG. T2_SHUFFLE is slightly better at 3520µs vs 3553µs for NO_T2_SHUFFLE. Otherwise CARRY64 and CARRY32 are about the same. I'm not going to test all those 6 variables on this, since it is very slow and the inconsistencies in the timings are larger than the differences. Btw UNROLL_NONE, UNROLL_WIDTH and UNROLL_HEIGHT do not work at all on either the Tesla P100 or the RTX 2080. |