gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

paulunderwood 2020-03-16 15:14

ROCm 2.10 sclk 3 mem 1150 FFT 5632K gpuowl v6.11-134-g1e0ce1d 800 us/it 194.00W 93C

I am still a bit scared to upgrade to ROCm 3.1...

Well, I tried to upgrade and now I get:

[code]
./gpuowl -device 1
2020-03-16 17:10:53 gpuowl v6.11-197-g3886a11
2020-03-16 17:10:53 Note: not found 'config.txt'
2020-03-16 17:10:53 config: -device 1
2020-03-16 17:10:53 device 1, unique id 'f582388172fd5d41'
2020-03-16 17:10:53 f582388172fd5d41 worktodo.txt line ignored: ""
2020-03-16 17:10:53 f582388172fd5d41 999xxxxx FFT 5632K: Width 256x4, Height 64x4, Middle 11; 17.34 bits/word
Segmentation fault
[/code]

[code]
uname -a
Linux honeypot9 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux
[/code]

paulunderwood 2020-03-16 17:49

[QUOTE=paulunderwood;539853]Well, I tried to upgrade and now I get:
[snip][/QUOTE]


I am willing to try a different distro for ROCm 3.1 -- any suggestions?

ewmayer 2020-03-16 19:20

[QUOTE=preda;539852]ROCm 3.1, sclk 3, mem 1180, FFT 5M: 708us/it. (150W)[/QUOTE]

[QUOTE=paulunderwood;539853]ROCm 2.10 sclk 3 mem 1150 FFT 5632K gpuowl v6.11-134-g1e0ce1d 800 us/it 194.00W 93C[/QUOTE]

Are those with 1 worker or 2? Also, what OS distro are you guys running? As I noted, I am not allowed, even as su, to fiddle the mem-clock settings in my ROCm 2.10 setup under Ubuntu 19.10.

paulunderwood 2020-03-16 20:07

[QUOTE=ewmayer;539872]Are those with 1 worker or 2? Also, what OS distro are you guys running? As I noted, I am not allowed, even as su, to fiddle the mem-clock settings in my ROCm 2.10 setup under Ubuntu 19.10.[/QUOTE]

I wrestled back the ROCm 2.10.0 driver...

With 2 gpuowl instances (with same settings (sclk 3 etc, but with 5 extra Watts and gpuowl v6.11-197-g3886a11-dirty)) I am getting ~1475 us/it each. Thanks for prompting me Ernst -- a great speed-up :tu:

preda 2020-03-16 20:46

[QUOTE=paulunderwood;539862]I am willing to try a different distro for ROCm 3.1 -- any suggestions?[/QUOTE]

I'm using Ubuntu 19.10 with Linux kernel 5.4.24. I also tried kernels 5.5.x, 5.6.x and they work too.

Prime95 2020-03-16 22:04

[QUOTE=preda;539879]I'm using Ubuntu 19.10 with Linux kernel 5.4.24. I also tried kernels 5.5.x, 5.6.x and they work too.[/QUOTE]

Anyone get rocm 3.1 to work on Ubuntu 19.04? I've tried 3 times without success.

paulunderwood 2020-03-17 06:23

-O2 or not
 
[QUOTE=paulunderwood;539876]

With 2 gpuowl instances (with same settings (sclk 3 etc, but with 5 extra Watts and gpuowl v6.11-197-g3886a11-dirty)) I am getting ~1475 us/it each. Thanks for prompting me Ernst -- a great speed-up :tu:[/QUOTE]

This was compiled without -O2 in the Makefile. With it, the iterations are 1487us and the power usage is 1 or 2 Watts lower.

preda 2020-03-17 10:01

[QUOTE=Prime95;539882]Anyone get rocm 3.1 to work on Ubuntu 19.04? I've tried 3 times without success.[/QUOTE]

I would expect Ubuntu 19.04 to be pretty similar to 19.10 from ROCm POV (more important would be the kernel version). What step is failing?

I mentioned here [url]https://github.com/RadeonOpenCompute/ROCm/issues/977[/url] that I had to install libncurses5 too.

Prime95 2020-03-17 10:47

Using not yet committed code:

Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it.
Running 2 instances: 604us/it (200W measured by rocm-smi)

I love this GPU.

axn 2020-03-17 12:45

[QUOTE=Prime95;539912]Using not yet committed code:

Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it.
Running 2 instances: 604us/it (200W measured by rocm-smi)

I love this GPU.[/QUOTE]

How many GHzD/d does that translate to? 400-ish? :shock:

EDIT:- Probably more like high 400, low 500!

Prime95 2020-03-17 16:14

[QUOTE=axn;539927]How many GHzD/d does that translate to? 400-ish? :shock:

EDIT:- Probably more like high 400, low 500![/QUOTE]

510 PRP-GHzD/d

kriesel 2020-03-17 16:35

[QUOTE=Prime95;539955]510 PRP-GHzD/d[/QUOTE]where did you buy that beauty, and whose brand is it? (Just completed the RMA/refund process on my second one.)

Prime95 2020-03-17 16:58

[QUOTE=kriesel;539958]where did you buy that beauty, and whose brand is it? (Just completed the RMA/refund process on my second one.)[/QUOTE]

All GPUs except one range from 602us to 615us. The one outlier is 630us running in I7-860 (not a Sandy Bridge as I originally reported) which is not PCIE 3.0.

Prime95 2020-03-17 17:03

[QUOTE=Prime95;539955]510 PRP-GHzD/d[/QUOTE]

20 GPUs = 1 Curtis Cooper

Uncwilly 2020-03-17 19:37

[QUOTE=Prime95;539960]20 GPUs = 1 Curtis Cooper[/QUOTE]
Is that the next unit like a P90 year?

ewmayer 2020-03-17 19:45

[QUOTE=Prime95;539912]Using not yet committed code:

Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it.
Running 2 instances: 604us/it (200W measured by rocm-smi)

I love this GPU.[/QUOTE]

Nice - how much % gain is that over current commit, and did you also do timings @5632K?

And, have you tried running > 2 instances to see if there is any further marginal throughput gain to be had that way?

[b]Edit:[/b] Just tried the latter experiment - but not using George's uncommitted code, obviously - on my own machine; here are the timing/throughput figures for 1-3 workers, all @5632K FFT, sclk = 5:

1: 754 us/iter => 1326 iter/sec
2: 1405 us/iter => 1423 iter/sec
3: 2174 us/iter => 1380 iter/sec

So, deterioration above 2 workers.
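
A minimal Python sketch of the throughput arithmetic above, assuming each of N workers runs at the quoted per-job us/iter, so that aggregate throughput is N * 10^6 / (us/iter). The function and variable names are illustrative only and not part of gpuowl.

```python
def throughput_iters_per_sec(us_per_iter: float, workers: int) -> float:
    """Aggregate iterations/second when `workers` jobs each run at us_per_iter."""
    return workers * 1_000_000 / us_per_iter


# Per-job timings (us/iter) at 5632K FFT, sclk = 5, keyed by worker count.
timings = {1: 754, 2: 1405, 3: 2174}

rates = {n: throughput_iters_per_sec(us, n) for n, us in timings.items()}
best = max(rates, key=rates.get)  # worker count with the highest aggregate rate

for n, rate in rates.items():
    print(f"{n}: {timings[n]} us/iter => {rate:.0f} iter/sec")
print(f"best worker count: {best}")
```

With these timings the aggregate rate peaks at 2 workers, which matches the conclusion above.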

PhilF 2020-03-17 21:53

[QUOTE=ewmayer;539973]Nice - how much % gain is that over current commit, and did you also do timings @5632K?

And, have you tried running > 2 instances to see if there is any further marginal throughput gain to be had that way?

[b]Edit:[/b] Just tried the latter experiment - but not using George's uncommitted code, obviously - on my own machine; here are the timing/throughput figures for 1-3 workers, all @5632K FFT, sclk = 5:

1: 754 us/iter => 1326 iter/sec
2: 1405 us/iter => 1423 iter/sec
3: 2174 us/iter => 1380 iter/sec

So, deterioration above 2 workers.[/QUOTE]
5632K is the FFT size I can relate to also.

5632K, sclk=5, 1000 MHz memclk, 185W as measured by rocm, one worker, older version of gpuowl:

872 us/iter

ewmayer 2020-03-17 22:16

[QUOTE=PhilF;539983]5632K is the FFT size I can relate to also.

5632K, sclk=5, 1000 MHz memclk, 185W as measured by rocm, one worker, older version of gpuowl:

872 us/iter[/QUOTE]

Your mem-downclock is likely the reason you both run slower and at significantly lower power than I, at the same sclk and 1-worker setting. But why not fire up a second worker?

kriesel 2020-03-17 22:36

gpuowl-win v6.11-198-g build and initial speed checks
 
2 Attachment(s)
The usual warning shower, see build log, but it runs.
Win7 64 Pro, dual E5645, prime95 maxed, 12GB ram
RX550: V6.11-134[CODE]2020-03-17 13:25:39 condorella/rx550 93873049 OK 60600000 64.56%; 14442 us/it; ETA 5d 13:29; 7c9ef8f79b678f5e (check 5.82s)
2020-03-17 14:13:57 condorella/rx550 93873049 OK 60800000 64.77%; 14458 us/it; ETA 5d 12:50; cf3a3470216c1801 (check 5.86s)
2020-03-17 15:02:12 condorella/rx550 93873049 OK 61000000 64.98%; 14448 us/it; ETA 5d 11:56; 975f4c7dd6bd8513 (check 5.81s)
2020-03-17 15:50:28 condorella/rx550 93873049 OK 61200000 65.19%; 14452 us/it; ETA 5d 11:10; f2e39e36a1a45ad8 (check 5.83s)
2020-03-17 15:53:43 condorella/rx550 Stopping, please wait..
2020-03-17 15:53:55 condorella/rx550 93873049 OK 61214000 65.21%; 14379 us/it; ETA 5d 10:27; cb89cd1c515d4a12 (check 5.82s)
2020-03-17 15:53:55 condorella/rx550 Exiting because "stop requested"
2020-03-17 15:53:55 condorella/rx550 Bye[/CODE][CODE]C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-198-g628f3cd\rx550>gpuowl-win
2020-03-17 15:54:11 gpuowl v6.11-198-g628f3cd
2020-03-17 15:54:11 config: -device 1 -user kriesel -cpu condorella/rx550 -yield -maxAlloc 3600 -use NO_ASM,UNROLL_HEIGHT,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_MIDDLE,CARRY32,ORIGINAL_METHOD,LESS_ACCURATE
2020-03-17 15:54:11 device 1, unique id ''
2020-03-17 15:54:11 condorella/rx550 93873049 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.90 bits/word
2020-03-17 15:54:13 condorella/rx550 Warning: -use LESS_ACCURATE has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use MERGED_MIDDLE has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use ORIGINAL_METHOD has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use T2_SHUFFLE_HEIGHT has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use T2_SHUFFLE_MIDDLE has no effect
2020-03-17 15:54:13 condorella/rx550 Warning: -use WORKINGOUT2 has no effect
2020-03-17 15:54:13 condorella/rx550 OpenCL args "-DEXP=93873049u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.8b9afd7da35e8p-3 -DIWEIGHT_STEP=0xe.fa9b7f6844848p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY32=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIGINAL_METHOD=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_MIDDLE=1 -DUNROLL_HEIGHT=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-17 15:54:17 condorella/rx550 OpenCL compilation in 4.21 s
2020-03-17 15:54:24 condorella/rx550 93873049 OK 61200000 loaded: blockSize 400, f2e39e36a1a45ad8
2020-03-17 15:54:42 condorella/rx550 93873049 OK 61200800 65.20%; 15423 us/it; ETA 5d 19:58; 978b866258bcc6ff (check 6.02s)
2020-03-17 16:44:59 condorella/rx550 93873049 OK 61400000 65.41%; 15117 us/it; ETA 5d 16:22; bba7f7db066343fb (check 6.13s)[/CODE]Same exponent, v6.11-198 is slower than v6.11-134 on RX550: 1-14453/15117 = 4.4% slower.

RX480: V6.11-134[CODE]2020-03-17 14:49:29 condorella/rx480 94073297 OK 36600000 38.91%; 3580 us/it; ETA 2d 09:10; 8fe30d16593f9dd3 (check 1.51s)
2020-03-17 15:01:47 condorella/rx480 94073297 OK 36800000 39.12%; 3685 us/it; ETA 2d 10:37; d3eed3f44f9e2a97 (check 1.47s)
2020-03-17 15:14:07 condorella/rx480 94073297 OK 37000000 39.33%; 3691 us/it; ETA 2d 10:31; 416999f765247350 (check 1.50s)
2020-03-17 15:26:15 condorella/rx480 94073297 OK 37200000 39.54%; 3631 us/it; ETA 2d 09:22; c8d58ee219203f21 (check 1.53s)
2020-03-17 15:38:16 condorella/rx480 94073297 OK 37400000 39.76%; 3600 us/it; ETA 2d 08:41; 94e4dcc24e50b2bb (check 1.50s)
2020-03-17 15:50:12 condorella/rx480 94073297 OK 37600000 39.97%; 3574 us/it; ETA 2d 08:04; 062e7035793975f0 (check 1.56s)
2020-03-17 16:02:07 condorella/rx480 94073297 OK 37800000 40.18%; 3570 us/it; ETA 2d 07:48; 30335ecba13bf8e9 (check 1.47s)
2020-03-17 16:14:16 condorella/rx480 94073297 OK 38000000 40.39%; 3637 us/it; ETA 2d 08:39; 650cb74a2a044221 (check 1.48s)
2020-03-17 16:15:34 condorella/rx480 Stopping, please wait..
2020-03-17 16:15:37 condorella/rx480 94073297 OK 38022400 40.42%; 3536 us/it; ETA 2d 07:03; 5f6a5d73eb842b21 (check 1.47s)
2020-03-17 16:15:37 condorella/rx480 Exiting because "stop requested"
2020-03-17 16:15:37 condorella/rx480 Bye[/CODE]V6.11-198:[CODE]2020-03-17 16:17:37 gpuowl v6.11-198-g628f3cd
2020-03-17 16:17:37 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG
2020-03-17 16:17:37 config:
2020-03-17 16:17:37 config: :4.5m fft NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE
2020-03-17 16:17:37 config: :5m fft NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG
2020-03-17 16:17:37 device 0, unique id ''
2020-03-17 16:17:37 condorella/rx480 94073297 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.94 bits/word
2020-03-17 16:17:40 condorella/rx480 Warning: -use CHEBYSHEV_MIDDLEMUL2 has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use MERGED_MIDDLE has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use MORE_SQUARES_MIDDLEMUL1 has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use T2_SHUFFLE_HEIGHT has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use T2_SHUFFLE_WIDTH has no effect
2020-03-17 16:17:40 condorella/rx480 Warning: -use WORKINGOUT1 has no effect
2020-03-17 16:17:40 condorella/rx480 OpenCL args "-DEXP=94073297u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.52733b011536p-3 -DIWEIGHT_STEP=0xf.617b45b852608p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=0 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNEW_SLOWTRIG=1 -DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_WIDTH=1 -DUNROLL_HEIGHT=1 -DUNROLL_WIDTH=1 -DWORKINGIN1=1 -DWORKINGOUT1=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-17 16:17:44 condorella/rx480 OpenCL compilation in 4.10 s
2020-03-17 16:17:46 condorella/rx480 94073297 OK 37826000 loaded: blockSize 400, febfc14d0f24e5fa
2020-03-17 16:17:50 condorella/rx480 94073297 OK 37826800 40.21%; 3461 us/it; ETA 2d 06:04; b5a114da61bfa21e (check 1.55s)
2020-03-17 16:28:31 condorella/rx480 94073297 OK 38000000 40.39%; 3692 us/it; ETA 2d 09:31; 650cb74a2a044221 (check 1.51s)
2020-03-17 16:40:53 condorella/rx480 94073297 OK 38200000 40.61%; 3703 us/it; ETA 2d 09:28; 39ce25c654678cec (check 1.51s)[/CODE]1-3617/3697 = .0216 = 2.16% slower v6.11-198 than v6.11-134 on RX480


Downclocked radeon VII:
v6.11-134[CODE]2020-03-17 17:43:24 roa/radeonvii 655685803 OK 626640000 95.57%; 10363 us/it; ETA 3d 11:37; 6fcd5d380e08a691 (check 5.99s) 20 errors
2020-03-17 17:46:57 roa/radeonvii 655685803 OK 626660000 95.57%; 10378 us/it; ETA 3d 11:40; 08e9287564d158a0 (check 5.94s) 20 errors
2020-03-17 17:50:30 roa/radeonvii 655685803 OK 626680000 95.58%; 10363 us/it; ETA 3d 11:30; cc909f1a064cff84 (check 5.88s) 20 errors
2020-03-17 17:51:49 roa/radeonvii Stopping, please wait..
2020-03-17 17:51:59 roa/radeonvii 655685803 OK 626688000 95.58%; 10369 us/it; ETA 3d 11:31; ceaa28d44748f0e7 (check 5.89s) 20 errors
2020-03-17 17:51:59 roa/radeonvii Exiting because "stop requested"
2020-03-17 17:51:59 roa/radeonvii Bye[/CODE]V6.11-198-g628f3cd[CODE]C:\Users\ken\Documents\gpuowl-v6.11-198-g628f3cd>gpuowl-win
2020-03-17 17:54:57 gpuowl v6.11-198-g628f3cd
2020-03-17 17:54:57 config: -device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use -device 1 -user kriesel -cpu roa/radeonvii -use NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-17 17:54:57 config: ;NO_ASM,ORIG_SLOWTRIG
2020-03-17 17:54:57 device 1, unique id ''
2020-03-17 17:54:57 roa/radeonvii 655685803 FFT 40960K: Width 256x4, Height 256x8, Middle 10; 15.63 bits/word
2020-03-17 17:55:07 roa/radeonvii Warning: -use CHEBYSHEV_METHOD has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use LESS_ACCURATE has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use MERGED_MIDDLE has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use ORIG_MIDDLEMUL2 has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use T2_SHUFFLE_HEIGHT has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use T2_SHUFFLE_REVERSELINE has no effect
2020-03-17 17:55:07 roa/radeonvii Warning: -use UNROLL_MIDDLEMUL2 has no effect
2020-03-17 17:55:07 roa/radeonvii OpenCL args "-DEXP=655685803u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=10u -DWEIGHT_STEP=0xa.51aa7280d93dp-3 -DIWEIGHT_STEP=0xc.677fd3dfd408p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DPM1=0 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-17 17:55:22 roa/radeonvii OpenCL compilation in 15.11 s
2020-03-17 17:55:31 roa/radeonvii 655685803 OK 626680000 loaded: blockSize 400, cc909f1a064cff84
2020-03-17 17:55:45 roa/radeonvii 655685803 OK 626680800 95.58%; 10753 us/it; ETA 3d 14:38; 47b82db06405ee2a (check 6.07s) 20 errors
2020-03-17 17:59:18 roa/radeonvii 655685803 OK 626700000 95.58%; 10769 us/it; ETA 3d 14:42; 810fe9130840b14d (check 6.08s) 20 errors
2020-03-17 18:02:59 roa/radeonvii 655685803 OK 626720000 95.58%; 10757 us/it; ETA 3d 14:33; 3294d8e8770b2911 (check 6.22s) 20 errors
[/CODE]1- 10370/10763 = 3.65% slower v6.11-198 than v6.11-134 on the same Radeon VII and exponent, same clock settings.
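
A minimal Python sketch of the "percent slower" arithmetic used in this post, assuming the convention 1 - old/new applied to the averaged us/it figures quoted above for each GPU. The names are illustrative; nothing here comes from gpuowl itself.

```python
def fraction_slower(old_us: float, new_us: float) -> float:
    """Fraction of throughput lost going from old_us to new_us per iteration."""
    return 1.0 - old_us / new_us


# (v6.11-134 us/it, v6.11-198 us/it), averages as quoted in the post.
runs = {
    "RX550": (14453, 15117),
    "RX480": (3617, 3697),
    "Radeon VII": (10370, 10763),
}

slowdowns = {gpu: 100 * fraction_slower(old, new) for gpu, (old, new) in runs.items()}
for gpu, pct in slowdowns.items():
    print(f"{gpu}: {pct:.2f}% slower")
```

Note that 1 - old/new measures lost throughput; the increase in time per iteration, new/old - 1, comes out slightly larger for the same pair of timings.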

PhilF 2020-03-17 22:43

[QUOTE=ewmayer;539987]Your mem-downclock is likely the reason you both run slower and at significantly lower power than I, at the same sclk and 1-worker setting. But why not fire up a second worker?[/QUOTE]

I will, once I can make time for it. :smile:

ewmayer 2020-03-17 22:56

[QUOTE=PhilF;539992]I will, once I can make time for it. :smile:[/QUOTE]

I used the low-tech method:

1. open 2nd term;
2. create 2nd rundir under gpuowl-dir, cd into same;
3. populate worktodo by running primenet.py (which I see is based on the same-name py-script of another GIMPS coder who shall remain nameless, but who restored the same "run just once and quit rather than recurring by adding '-t 0'" option which exists in his original script to the gpuowl one :);
4. ../gpuowl

I use same system sclk and fan settings for 2-worker-running as for 1.

preda 2020-03-18 09:59

[QUOTE=ewmayer;539996]I used the low-tech method:

1. open 2nd term;
2. create 2nd rundir under gpuowl-dir, cd into same;
3. populate worktodo by running primenet.py (which I see is based on the same-name py-script of another GIMPS coder who shall remain nameless, but who restored the same "run just once and quit rather than recurring by adding '-t 0'" option which exists in his original script to the gpuowl one :);
4. ../gpuowl

I use same system sclk and fan settings for 2-worker-running as for 1.[/QUOTE]

Do you know about -pool option? You create a single folder where you put the output of primenet.py , and run multiple instances, each in its own folder (indicated with -dir) all feeding from the common pool (indicated with -pool). You can also put config files in the pool dir, for shared config. Next logical step is to put the -pool option in the config file of each individual instance, and now you can start it with only -dir.

I'm afraid all that is not very clear, so let's see an example config:

~/gpuowl-xfx/config.txt contains:
-cpu XFX -uid 780c28cffffffeee -pool /home/user/pool

(note that the gpu is indicated not by -d <position> but by UID, which is very useful when shuffling GPUs around. A symbolic name "XFX" is associated to it).

~/pool/config.txt contains:
-user name

(because the user is shared by all instances)

primenet.py only knows about the pool dir:
~/gpuowl/tools/primenet.py -u <user> -p <password> --dirs ~/pool --tasks 4 -w PRP &
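
The layout above can be sketched in Python as follows. This is only an illustration under assumptions: it writes the same config.txt contents shown in the example into a scratch directory, the helper name write_pool_layout is made up, and the second instance ("ASUS" with an all-zero UID) is a hypothetical placeholder, not a real device.

```python
from pathlib import Path
import tempfile


def write_pool_layout(root: Path, user: str, instances: dict) -> list:
    """Create a shared pool dir plus one dir per instance; return launch commands.

    `instances` maps a symbolic name (passed to -cpu) to a GPU unique id (-uid).
    """
    pool = root / "pool"
    pool.mkdir(parents=True, exist_ok=True)
    # Settings shared by all instances live in the pool's config.txt.
    (pool / "config.txt").write_text(f"-user {user}\n")

    commands = []
    for name, uid in instances.items():
        inst = root / f"gpuowl-{name.lower()}"
        inst.mkdir(parents=True, exist_ok=True)
        # Per-instance config picks the GPU by UID and points at the pool.
        (inst / "config.txt").write_text(f"-cpu {name} -uid {uid} -pool {pool}\n")
        # With -pool in the config file, only -dir is needed at launch.
        commands.append(f"gpuowl -dir {inst}")
    return commands


root = Path(tempfile.mkdtemp())  # scratch location, stands in for the home dir
cmds = write_pool_layout(
    root,
    user="name",
    instances={
        "XFX": "780c28cffffffeee",   # UID from the example above
        "ASUS": "0000000000000000",  # hypothetical second GPU
    },
)
print("\n".join(cmds))
```

primenet.py would then be pointed at the pool directory only, as in the command above.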

preda 2020-03-18 10:14

[QUOTE=kriesel;539991]1- 10370/10763 = 3.65% slower v6.11-198 than v6.11-134 on the same Radeon VII and exponent, same clock settings.[/QUOTE]

There are warnings there telling you that you are passing -use options that don't exist (because they have been removed). Some have been replaced by more general options; e.g. there is still a T2_SHUFFLE that you may want to try. But a bigger benefit would be to move to ROCm (either 2.10 or 3.1).

kriesel 2020-03-18 12:27

[QUOTE=preda;540029]There are warnings there telling you that you are passing -use options that don't exist (because they have been removed). Some have been replaced by more general options; e.g. there is still a T2_SHUFFLE that you may want to try. But a bigger benefit would be to move to ROCm (either 2.10 or 3.1).[/QUOTE]Moving to ROCm isn't an option on Windows. [url]https://rocm.github.io/[/url]
Documentation of what -use options are available would be helpful.

ewmayer 2020-03-18 19:29

[QUOTE=preda;540027]Do you know about -pool option? You create a single folder where you put the output of primenet.py , and run multiple instances, each in its own folder (indicated with -dir) all feeding from the common pool (indicated with -pool). You can also put config files in the pool dir, for shared config. Next logical step is to put the -pool option in the config file of each individual instance, and now you can start it with only -dir.
[snip][/QUOTE]

Thanks, Mihai, but after a few more lines your recipe was already more complex than my neoluddite recipe. :) Perhaps I'll find it useful when I build my multi-GPU dream system later this year.

To paraphrase the late great SNL comedian Phil Hartman by way of his recurring [i]Unfrozen Caveman Lawyer[/i] sketches: "I'm just a *caveman* - the ways of you modern human are strange and unfathomable to me. (But what I do know is that my client deserves at least a $5 million triple-damages settlement for injuries and psychological trauma resulting from the defendant's spilling ketchup on him.)"

paulunderwood 2020-03-18 21:54

[QUOTE=ewmayer;539872]Are those with 1 worker or 2? Also, what OS distro are you guys running? As I noted, I am not allowed, even as su, to fiddle the mem-clock settings in my ROCm 2.10 setup under Ubuntu 19.10.[/QUOTE]

See posts in [url]https://mersenneforum.org/showthread.php?t=22204&page=149[/url]

I am running at sclk 3 and have just bumped up my ASUS's memory to 1200 and voltage at 830. 187W and no errors yet... I will lower the voltage if I can over time. 2 instances @ 1467 us/it each for FFT 5632K

ewmayer 2020-03-18 22:06

[QUOTE=paulunderwood;540097]See posts in [url]https://mersenneforum.org/showthread.php?t=22204&page=149[/url]

I am running at sclk 3 and have just bumped up my ASUS's memory to 1200 and voltage at 830. 187W and no errors yet... I will lower the voltage if I can over time. 2 instances @ 1467 us/it each for FFT 5632K[/QUOTE]

Thanks - I long ago did the featuremask fiddle. My issue is not a missing pp_od_clk_voltage entry; it is that Ubuntu is not allowing me to modify it. Not a huge deal, just eating the extra Watts and running with mclk at stock. Could become an issue when I build that multi-GPU dream system later this year, though - have an 850W[sup]*[/sup] PS laid in, intended to drive something along the lines of an 8-core AMD CPU plus 3 Radeon VIIs. That will likely need sclk = 3 undervolting of the latter to get the wattage within what the PS can stably handle; it would be nice to be able to tune mclk to help maximize throughput of the setup by way of another "tuning dial".

-------------

[sup]*[/sup]That seemed to be the sweet spot in terms of $/watt at $120, all the >= 1KW PSs I looked at cost well over $200. Plus I don't want to be running a system needing more than a kW, our household circuit breakers start tripping at that level when anything else that is power-hungry (e.g. toaster, hair dryer) running off the same part of the "household grid" gets turned on.

preda 2020-03-19 13:37

[QUOTE=kriesel;540034]Moving to rocm isn't an option on Windows. [url]https://rocm.github.io/[/url]
Documentation of what -use options are available would be helpful.[/QUOTE]

There is some brief documentation at the top of gpuowl.cl, that we try to maintain in sync with the code as it changes:
[QUOTE]
DEBUG : enable asserts. Slow, but allows to verify that all asserts hold.

NO_ASM : request to not use any inline __asm()
NO_OMOD: do not use GCN output modifiers in __asm()

OUT_WG,OUT_SIZEX,OUT_SPACING <AMD default is 256,32,4> <nVidia default is 256,4,1 but needs testing>
IN_WG,IN_SIZEX,IN_SPACING <AMD default is 256,32,1> <nVidia default is 256,4,1 but needs testing>

ORIG_X2 <nVidia default>
INLINE_X2 <AMD default>

UNROLL_ALL <nVidia default>
UNROLL_NONE
UNROLL_WIDTH
UNROLL_HEIGHT <AMD default>

T2_SHUFFLE <nVidia default>
NO_T2_SHUFFLE <AMD default>

OLD_FFT8 <default>
NEWEST_FFT8
NEW_FFT8

OLD_FFT5
NEW_FFT5 <default>
NEWEST_FFT5

NEW_FFT10 <default>
OLD_FFT10

CARRY32 <AMD default> // This is a potentially dangerous option for large FFTs. Carry may not fit in 31 bits.
CARRY64 <nVidia default>

ORIG_SLOWTRIG // Use the compiler's implementation of sin/cos functions
NEW_SLOWTRIG <default> // Our own sin/cos implementation

---- P-1 below ----

NO_P2_FUSED_TAIL // Do not use the big kernel tailFusedMulDelta
[/QUOTE]

Prime95 2020-03-19 21:36

[QUOTE=Prime95;539912]Using not yet committed code:

Rocm 2.10, sclk 4, mem 1200, FFT 5M; 662us/it.
Running 2 instances: 604us/it (200W measured by rocm-smi)

I love this GPU.[/QUOTE]

Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us.

axn 2020-03-20 03:06

[QUOTE=Prime95;540173]Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us.[/QUOTE]

Single instance or two instances combined?

Prime95 2020-03-20 05:03

[QUOTE=axn;540195]Single instance or two instances combined?[/QUOTE]

The two instances combined.

ewmayer 2020-03-20 19:11

[QUOTE=Prime95;540173]Way to go Mihai!! With his latest commit this GPU has broken through the 600us barrier -- 597us.[/QUOTE]

So just 'git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make', then halt/restart ongoing runs?

preda 2020-03-20 20:04

[QUOTE=ewmayer;540279]So just 'git clone [url]https://github.com/preda/gpuowl[/url] && cd gpuowl && make', then halt/restart ongoing runs?[/QUOTE]

Yes.

"git clone" only the first time, afterwards you can "git pull" in the existing dir.

"scons" can be used as an alternative to "make" (I build myself with scons), but either should work.

ewmayer 2020-03-20 20:54

[QUOTE=preda;540285]Yes.

"git clone" only the first time, afterwards you can "git pull" in the existing dir.

"scons" can be used as an alternative to "make" (I build myself with scons), but either should work.[/QUOTE]

Thanks - timing for my pair of side-by-side jobs at 5632K FFT and sclk=4 dropped from 1475 us/iter (for each job) to 1387 us/iter, 6.3% faster. So now I'm getting slightly better throughput at sclk=4 than I was before at sclk=5.

Comparing apples-to-apples at sclk=5, before was 2 jobs each @1405 us/iter, with the new build down to 1331, 5.6% faster. But sclk=4 saves 60 watts ... hmm, tough choice. I'll probably run at sclk=4 on warm days, sclk=5 otherwise and at night.

Nice work, guys! I hope to begin contributing more substantively later this year, rather than just running code and cheerleading.

preda 2020-03-20 21:46

__attribute__(overloadable) support
 
I would like to start using __attribute__((overloadable)) in gpuowl OpenCL source, but before that I'd like to find out whether it's supported everywhere we care about.

The attribute is described here:
[url]https://clang.llvm.org/docs/AttributeReference.html#overloadable[/url]

I would like confirmation that it works on these platforms:
- windows (with whatever OpenCL windows uses for AMD GPUs -- catalyst?)
- Nvidia
- amdgpuPro (the other driver for Linux vs. ROCm)

To check the attribute, simply add "__attribute__((overloadable))" to some function between the return type and function name, e.g.:

in gpuowl.cl
Replace

T2 mul(T2 a, T2 b) ...
with
T2 __attribute__((overloadable)) mul(T2 a, T2 b) ...

And recompile, and afterwards *run* the resulting gpuowl to check the OpenCL compilation that happens at startup.
Thanks!

Note: the title should read "__attribute__((overloadable))", double parens.
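
For anyone who prefers to script the edit, here is a minimal Python sketch of the text substitution described above. It only automates the search-and-replace on a string; the helper name add_overloadable and the sample source line are made up for illustration, and you still have to write the result back to gpuowl.cl, recompile and *run* gpuowl to exercise the OpenCL compiler.

```python
def add_overloadable(source: str, signature: str = "T2 mul(T2 a, T2 b)") -> str:
    """Insert __attribute__((overloadable)) between return type and function name."""
    if signature not in source:
        raise ValueError(f"signature not found: {signature!r}")
    return_type, rest = signature.split(" ", 1)
    patched = f"{return_type} __attribute__((overloadable)) {rest}"
    # Patch only the first occurrence; one function is enough for the check.
    return source.replace(signature, patched, 1)


# Stand-in for the contents of gpuowl.cl (the body is a placeholder).
sample = "T2 mul(T2 a, T2 b) { /* body elided */ }\n"
result = add_overloadable(sample)
print(result, end="")
```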

paulunderwood 2020-03-20 21:48

:tu:

With latest commit, running 1 1200 800 1050 3 @ FFT 5632K, timings have sped up for 2 instances from 1473 us/it to 1400 us/it each

ewmayer 2020-03-20 21:53

[QUOTE=paulunderwood;540308]:tu:

With latest commit, running 1 1200 800 1050 3 @ FFT 5632K, timings have sped up for 2 instances from 1473 us/it to 1400 us/it each[/QUOTE]

You're gonna need that extra speed - I'm a mere 60,000 GHz-days behind you in the Top500, roughly equivalent to 150 PRP-tests @5632K. :)

kracker 2020-03-23 13:10

Possible bug: -cleanup works for PRP tests but not for P-1, at least for me.

ewmayer 2020-03-23 22:19

Had 2 side-by-side runs going on my Radeon VII - a PRP run, and a p-1 ... the latter just crashed:
[code]2020-03-23 15:03:11 gfx906+sram-ecc-0 102958243 P2 2394/2880: 174743 primes; setup 4.24 s, 2.271 ms/prime
Memory access fault by GPU node-1 (Agent handle: 0x562cd3ec2150) on address 0xb5f9ee28000. Reason: Unknown.
Aborted (core dumped)[/code]
Both jobs now appear to be halted as result ... wait, the PRP is still *running* but somehow got tripped into super-low priority - I saw the same kind of MCLK-suddenly-gets-cut-by-2/3 yesterday, result of some kind of GPU glitch, that needed reboot to resolve at the time:
[code]GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
1 37.0c 24.0W 809Mhz 351Mhz 21.96% manual 250.0W 2% 0%[/code]
Just restarted the p-1 run, it's re-doing the entire stage 2 ... MCLK now back to normal (1001 MHz). Will update re. what happens with the p-1 stage 2 retry once it finishes.

This is gpuowl v6.11-211-gca63aa9-dirty.

[b]Edit:[/b] Misspoke - p-1 stage 2 picked up at ~90% of the way through ("P2 2736/2880"), and retry completed successfully. So it appears to have been a one-off glitch in the matrix. Also oddly, on restart of the p-1 job MCLK got reset back to normal, but SCLK, which I had manually downclocked to 4, somehow got reset not to its default level (IIRC, 7) but to 5, as reflected by wall wattage, fan noise and GPU temperature. Reset to 4, all appears back to normal.

kriesel 2020-03-24 00:40

[QUOTE=ewmayer;540693]Had 2 side-by-side runs going on my Radeon VII - a PRP run, and a p-1.[/QUOTE]I'd be skeptical about the performance advantage of running too disparate parallel runs. I've seen it reduce throughput. PRP & LL in tandem, for example, which is different code from different versions.
Did you use -maxAlloc for your P-1 run? If not, start, and if doing parallel runs the limit will need to be lower than if the P-1 stage 2 has the gpu ram to itself.

So can we count you as another fan of P-1 save files?

ewmayer 2020-03-24 01:30

[QUOTE=kriesel;540709]I'd be skeptical about the performance advantage of running too disparate parallel runs. I've seen it reduce throughput. PRP & LL in tandem, for example, which is different code from different versions.
Did you use -maxAlloc for your P-1 run? If not, start, and if doing parallel runs the limit will need to be lower than if the P-1 stage 2 has the gpu ram to itself.[/QUOTE]
I was just running 2 separate PRP-assignment jobs - for the PRPs there is a marked throughput boost from 2-job-running (cf. my timings in post #1956) - one of which just happened to start on a PRP-assignment for which p-1 had not yet been done. Not using -maxAlloc.

[QUOTE]So can we count you as another fan of P-1 save files?[/QUOTE]

I'm a fan of doing whatever works for increasing users' overall throughput! :) That of course includes minimizing wasted time resulting from run-crashes/BSODs/system-resets/etc.

kriesel 2020-03-26 15:35

GpuOwl P-1 error detection and handling
 
Gpuowl stage 1 needs a res64 error check. This was in v6.11-134.[CODE]2020-03-25 00:57:17 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word
2020-03-25 00:57:25 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-25 00:57:31 roa/radeonvii OpenCL compilation in 5.88 s
2020-03-25 00:57:34 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 0
2020-03-25 00:59:08 roa/radeonvii 550000007 P1 10000 0.14%; 9433 us/it; ETA 0d 19:09; c8b2127abc38054b
2020-03-25 01:00:43 roa/radeonvii 550000007 P1 20000 0.27%; 9433 us/it; ETA 0d 19:07; 6f401486d14cb20f
2020-03-25 01:02:17 roa/radeonvii 550000007 P1 30000 0.41%; 9431 us/it; ETA 0d 19:05; 18a926611c75b118
2020-03-25 01:02:37 roa/radeonvii saved
...
2020-03-25 15:08:30 roa/radeonvii saved
2020-03-25 15:09:08 roa/radeonvii 550000007 P1 5390000 73.68%; 9582 us/it; ETA 0d 05:07; 4cfe624d31a00e27
2020-03-25 15:10:42 roa/radeonvii 550000007 P1 5400000 73.82%; 9428 us/it; ETA 0d 05:01; [COLOR=Red][B]0000000000000000[/B][/COLOR]
2020-03-25 15:12:16 roa/radeonvii 550000007 P1 5410000 73.95%; 9424 us/it; ETA 0d 04:59; 0000000000000000
2020-03-25 15:13:32 roa/radeonvii saved[/CODE]Fourteen hours into the computation, an error occurred that zeroed the residue. The program does not detect the error. It continued powering the zero residue for the remaining iteration count, and periodically updating its two save files with bad interim results, for 5 more hours. It then appears to skip the stage 1 GCD under the error condition detected at the end of the set of iterations.
Resume proceeds despite the bad input from the latter part of stage 1, also skipping the stage 1 GCD.[CODE]2020-03-25 20:10:24 roa/radeonvii saved
2020-03-25 20:11:58 roa/radeonvii 550000007 P1 7310000 99.93%; 9581 us/it; ETA 0d 00:01; 0000000000000000
2020-03-25 20:12:50 roa/radeonvii saved
2020-03-25 20:12:51 roa/radeonvii 550000007 P1 7315345 100.00%; 9913 us/it; ETA 0d 00:00; [COLOR=red][B]0000000000000000[/B][/COLOR]
2020-03-25 20:12:56 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%)
2020-03-25 20:12:56 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes
2020-03-25 20:12:57 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each
2020-03-25 20:31:18 roa/radeonvii 550000007 P2 38/2880: 92454 primes; setup 2.16 s, 11.881 ms/prime
2020-03-25 20:31:18 roa/radeonvii Exception St12domain_error: GCD invalid input
2020-03-25 20:31:18 roa/radeonvii waiting for background GCDs..
2020-03-25 20:31:18 roa/radeonvii Bye
C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>g611

C:\Users\ken\Documents\gpuowl-v6.11-134-g1e0ce1d>gpuowl-win
2020-03-26 09:35:49 gpuowl v6.11-134-g1e0ce1d
2020-03-26 09:35:49 config: -device 1 -user kriesel -cpu roa/radeonvii -yield -maxAlloc 16000 -use NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-26 09:35:49 config:
2020-03-26 09:35:49 config: ;NO_ASM,ORIG_SLOWTRIG
2020-03-26 09:35:49 config: ;40M NO_ASM,UNROLL_MIDDLEMUL2,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT3,T2_SHUFFLE_REVERSELINE,T2_SHUFFLE_HEIGHT,CARRY32,CHEBYSHEV_METHOD,ORIG_MIDDLEMUL2,LESS_ACCURATE
2020-03-26 09:35:49 device 1, unique id ''
2020-03-26 09:35:49 roa/radeonvii 550000007 FFT 36864K: Width 256x4, Height 256x8, Middle 9; 14.57 bits/word
2020-03-26 09:35:58 roa/radeonvii OpenCL args "-DEXP=550000007u -DWIDTH=1024u -DSMALL_HEIGHT=2048u -DMIDDLE=9u -DWEIGHT_STEP=0xa.c7166b9401b18p-3 -DIWEIGHT_STEP=0xb.e05b1786463ap-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_METHOD=1 -DLESS_ACCURATE=1 -DMERGED_MIDDLE=1 -DNO_ASM=1 -DORIG_MIDDLEMUL2=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_REVERSELINE=1 -DUNROLL_MIDDLEMUL2=1 -DWORKINGIN5=1 -DWORKINGOUT3=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-26 09:36:05 roa/radeonvii OpenCL compilation in 6.68 s
2020-03-26 09:36:08 roa/radeonvii 550000007 P1 B1=5070000, B2=152100000; 7315345 bits; starting at 7315344
2020-03-26 09:36:09 roa/radeonvii 550000007 P2 B1=5070000, B2=152100000, starting at 38
2020-03-26 09:36:14 roa/radeonvii P-1 (B1=5070000, B2=152100000, D=30030): primes 8202674, expanded 8746218, doubles 1277965 (left 5804395), singles 5646744, total 6924709 (84%)
2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using blocks [169 - 5065] to cover 6924709 primes
2020-03-26 09:36:14 roa/radeonvii 550000007 P2 using 38 buffers of 288.0 MB each
2020-03-26 09:54:35 roa/radeonvii 550000007 P2 76/2880: 92460 primes; setup 2.30 s, 11.868 ms/prime
[/CODE]Since there is no periodic permanently retained save file from before the error occurred, and both the stage 1 save files are from after the unhandled error, the entire run is a loss (~33 hours wall clock).
Stage 2 should not proceed from bad input from stage 1, but it does, without warning. Error checks, and a field for "passed last error check" in the save file could handle that.

preda 2020-03-26 19:41

[QUOTE=kriesel;540929]Gpuowl stage 1 needs a res64 error check.[/QUOTE]

Hi Ken, I agree that was a loss. I'll look into improving this.

kriesel 2020-03-26 21:52

Gpuowl-win v6.11-219-ge70ec99 build
 
2 Attachment(s)
Built, produced a help output, no other testing yet.

kriesel 2020-03-26 23:09

1 Attachment(s)
[QUOTE=preda;540307]I would like to start using __attribute__((overloadable)) in gpuowl OpenCL source, but before that I'd like to find out whether it's supported everywhere we care.

The attribute is described here:
[URL]https://clang.llvm.org/docs/AttributeReference.html#overloadable[/URL]

I would like confirmation that it works on these platforms:
- windows (with whatever OpenCL windows uses for AMD GPUs -- catalyst?)
- Nvidia
- amdgpuPro (the other driver for Linux vs. ROCm)

To check the attribute, simply add "__attribute__((overloadable))" to some function between the return type and function name, e.g.:

in gpuowl.cl
Replace

T2 mul(T2 a, T2 b) ...
with
T2 __attribute__((overloadable)) mul(T2 a, T2 b) ...

And recompile, and afterwards *run* the resulting gpuowl to check the OpenCL compilation that happens at startup.
Thanks!

Note: the title should read "__attribute__((overloadable))", double parens.[/QUOTE]


AOK on AMD RX480 /Win7 x64:
[CODE]// complex mul
T2 __attribute__((overloadable)) mul(T2 a, T2 b) { return U2(mad1(a.x, b.x, -a.y * b.y), mad1(a.x, b.y, a.y * b.x)); }

Driver version as indicated by GPU-Z: 25.20.14007.1000 (Adrenalin 18.10.21/Win 64)
[/CODE][CODE]2020-03-26 17:16:48 gpuowl v6.11-219-ge70ec99-dirty
2020-03-26 17:16:48 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500
2020-03-26 17:16:48 device 0, unique id ''
2020-03-26 17:16:48 condorella/rx480 97685813 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word
2020-03-26 17:16:51 condorella/rx480 OpenCL args "-DEXP=97685813u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598a6d26b0dap-3 -DIWEIGHT_STEP=0xf.546b91e1254f8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DAMDGPU=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-26 17:16:55 condorella/rx480 OpenCL compilation in 4.81 s
2020-03-26 17:16:56 condorella/rx480 97685813 P1 B1=1000000, B2=27000000; 1442134 bits; starting at 0
2020-03-26 17:17:34 condorella/rx480 97685813 P1 10000 0.69%; 3785 us/it; ETA 0d 01:30; 6bd301fd8aadd98a[/CODE]Also on Win7 x64, NVIDIA GTX1080, NVIDIA driver version 378.92:[CODE]C:\Users\ken\Documents\gpuowl-v6.11-219-ge70ec99\overloadable test>gpuowl-win
2020-03-26 18:05:21 gpuowl v6.11-219-ge70ec99-dirty
2020-03-26 18:05:21 config: -device 0 -user kriesel -cpu emu/gtx1080 -yield -maxAlloc 7500 -use NO_ASM
2020-03-26 18:05:21 device 0, unique id ''
2020-03-26 18:05:21 emu/gtx1080 97685953 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.94 bits/word
2020-03-26 18:05:23 emu/gtx1080 OpenCL args "-DEXP=97685953u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x8.598138082486p-3 -DIWEIGHT_STEP=0xf.547c79820ff18p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DPM1=1 -DNO_ASM=1 -cl-fast-relaxed-math -cl-std=CL2.0 "
2020-03-26 18:05:28 emu/gtx1080

2020-03-26 18:05:28 emu/gtx1080 OpenCL compilation in 5.26 s
2020-03-26 18:05:29 emu/gtx1080 97685953 P1 B1=1000000, B2=270000000; 1442134 bits; starting at 0
2020-03-26 18:06:18 emu/gtx1080 97685953 P1 10000 0.69%; 4908 us/it; ETA 0d 01:57; 4577ae6cbb52f038
2020-03-26 18:07:07 emu/gtx1080 97685953 P1 20000 1.39%; 4917 us/it; ETA 0d 01:57; fc2022db22907e71[/CODE]

kriesel 2020-03-26 23:47

[QUOTE=preda;539360]Yes. All gpuowl does on savefile is write the file and close it. From this point on, it's the OS's job to persist the file to disk. It turns out often the OS is lazy and prefers to keep the data in RAM for a while longer, and if a OS crash happens in this window, the savefile isn't properly persisted.[/QUOTE]This on fflush sounds like you could force the commit to disk for critical info. [url]https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/fflush?view=vs-2019[/url]
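
To illustrate the point about forcing the commit: flushing the stdio buffer generally only hands the data to the OS, and a separate sync call is what asks the OS to write its cache to the device. The following is a minimal Python sketch of that pattern, not gpuowl's actual save code; the function name `save_durably` and the write-to-temp-then-rename approach are assumptions chosen for illustration.

```python
import os
import tempfile

def save_durably(path, data):
    """Write bytes to path, asking the OS to commit them before the old file is replaced.

    Hypothetical helper for illustration; gpuowl itself is C++ and does not use this.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()             # move data from the user-space buffer to the OS
            os.fsync(f.fileno())  # ask the OS to commit its cached data to the device
        os.replace(tmp_path, path)  # rename the finished file over the old save file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The sync call costs some time per save, so whether it is worth it depends on how often save files are written and how often the host crashes.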

Prime95 2020-03-26 23:48

I recommend no P-1 testing until further notice. I'm investigating a bug.

kriesel 2020-03-27 00:04

[QUOTE=Prime95;540993]I recommend no P-1 testing until further notice. I'm investigating a bug.[/QUOTE]Do you have any guidance on what versions are thought affected or unaffected?

kriesel 2020-03-27 00:34

[QUOTE=preda;540961]Hi Ken, I agree that was a loss. I'll look into improving this.[/QUOTE]
Thanks. For reference,
[URL]https://mersenneforum.org/showpost.php?p=537396&postcount=1838[/URL]
[URL]https://mersenneforum.org/showpost.php?p=537580&postcount=1853[/URL]
[URL]https://mersenneforum.org/showpost.php?p=537628&postcount=1856[/URL]
[URL]https://mersenneforum.org/showpost.php?p=537647&postcount=1859[/URL]
[URL]https://mersenneforum.org/showpost.php?p=540929&postcount=1982[/URL]

Prime95 2020-03-27 01:19

[QUOTE=kriesel;540995]Do you have any guidance on what versions are thought affected or unaffected?[/QUOTE]

In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option.

ewmayer 2020-03-27 19:41

[QUOTE=Prime95;541007]In the current version CARRYM64 is broken. If testing near the upper limit of an FFT (I'm working on what "near" means), use CARRY64 option.[/QUOTE]

George, any update on the exponent ranges in question?

Prime95 2020-03-27 21:32

[QUOTE=ewmayer;541093]George, any update on the exponent ranges in question?[/QUOTE]

Sort of.

Short answer is "maximum exponent for the FFT length" - "log2(3) = 1.585 fewer bits per word" compensates for the mul-by-3 in P-1. Long answer is you can go a little higher than that because the maximum exponent has some carry32 "head room" during PRP.

Here is the long answer:

What you are worried about is the absolute value of the 32-bit carry exceeding 0x80000000. I studied 500K iterations of 24518003 in a 1.25M FFT (18.706 bits-per-word). The maximum carry32 was 0x32420000. Fine for PRP, not so for the mul-by-3 step in P-1.

Next (well actually first) I tried to calculate a reasonable max exponent for 1.25M, 2.5M, 5M, 10M, 20M, 40M, 80M exponents. We can store roughly 0.261 fewer bits per FFT word for each doubling of the FFT length.

The formula for expected max carry32 during the mul-by-3 P-1 step should be:

3 * 0x32420000 * 2^(BPW - 18.706) * 2 ^ (log2(FFTLEN/1.25M) * .261)

If this max exceeds 0x70000000 I'd be worried. I'm thinking less than 0x67000000 should be very safe. It's all a matter of how much protection you want from an outlier value (much the same as protecting against outlier round off errors).

Let's see if an example works. Going for a fairly safe max carry32 of 0x70000000 in a 5M FFT:

0x70000000 = 3 * 0x32420000 * 2^BPW * 2^-18.706 * 2^(2 * .261)
BPW = log2 (0x70000000 / 3 / 0x32420000) + 18.706 - .522
BPW = 17.755
max exp for 5M FFT = 93.1M

similarly for a 5.5M FFT, max exp = 102.2M
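
For anyone who wants to try other FFT lengths or carry limits, the formula above can be sketched in Python as follows. The constants are the ones quoted in this post; the function names are made up for illustration, and 1M is assumed to mean 2^20 words (consistent with the 5242880 reported for a 5M FFT in the logs).

```python
import math

REF_CARRY = 0x32420000          # max carry32 observed at the reference point
REF_BPW = 18.706                # bits per word at the reference point
REF_FFT = 1.25 * 2**20          # reference FFT length, 1.25M words (assumed 2^20 per M)
BPW_LOSS_PER_DOUBLING = 0.261   # fewer bits per word for each doubling of FFT length

def expected_max_carry32(bpw, fft_len):
    """Expected max carry32 during the mul-by-3 P-1 step."""
    doublings = math.log2(fft_len / REF_FFT)
    return 3 * REF_CARRY * 2 ** (bpw - REF_BPW) * 2 ** (doublings * BPW_LOSS_PER_DOUBLING)

def max_bpw(fft_len, carry_limit=0x70000000):
    """Solve the formula for bits per word at a chosen carry limit."""
    doublings = math.log2(fft_len / REF_FFT)
    return math.log2(carry_limit / (3 * REF_CARRY)) + REF_BPW - doublings * BPW_LOSS_PER_DOUBLING

def max_exponent(fft_len, carry_limit=0x70000000):
    return max_bpw(fft_len, carry_limit) * fft_len

for label, fft_len in (("5M", 5 * 2**20), ("5.5M", 5.5 * 2**20)):
    print(f"{label}: BPW {max_bpw(fft_len):.3f}, max exp {max_exponent(fft_len) / 1e6:.1f}M")
```

With the 0x70000000 limit this reproduces the 93.1M and 102.2M figures above; passing `carry_limit=0x67000000` gives the more cautious bound.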

ewmayer 2020-03-27 23:33

[QUOTE=Prime95;541101]Sort of.

Short answer is "maximum exponent for the FFT length" - "log2(3) = 1.585 fewer bits per word" compensates for the mul-by-3 in P-1. Long answer is you can go a little higher than that because the maximum exponent has some carry32 "head room" during PRP.

Here is the long answer:

What you are worried about is the absolute value of the 32-bit carry exceeding 0x80000000. I studied 500K iterations of 24518003 in a 1.25M FFT (18.706 bits-per-word). The maximum carry32 was 0x32420000. Fine for PRP, not so for the mul-by-3 step in P-1.

Next (well actually first) I tried to calculate a reasonable max exponent for 1.25M, 2.5M, 5M, 10M, 20M, 40M, 80M exponents. We can store roughly 0.261 fewer bits per FFT word for each doubling of the FFT length.

The formula for expected max carry32 during the mul-by-3 P-1 step should be:

3 * 0x32420000 * 2^(BPW - 18.706) * 2 ^ (log2(FFTLEN/1.25M) * .261)

If this max exceeds 0x70000000 I'd be worried. I'm thinking less than 0x67000000 should be very safe. It's all a matter of how much protection you want from an outlier value (much the same as protecting against outlier round off errors).

Let's see if an example works. Going for a fairly safe max carry32 of 0x70000000 in a 5M FFT:

0x70000000 = 3 * 0x32420000 * 2^BPW * 2^-18.706 * 2^(2 * .261)
BPW = log2 (0x70000000 / 3 / 0x32420000) + 18.706 - .522
BPW = 17.755
max exp for 5M FFT = 93.1M[/QUOTE]
Thanks for the explainer. Looking at my own scalar-double carry macro - here all vars are doubles, x is the convolution output we are normalizing, wi_re is the inverse DWT weight (the 1/n is absorbed into that), prp_mult is your 3, cy is carryin from next-lower iFFT term (and re-used for carryout):
[code]x *= wi_re;\
temp = DNINT(x);\
frac = fabs(x-temp);\
temp = temp*prp_mult + cy;\
cy = DNINT(temp*baseinv[i]);\
x = (temp-cy*base[i])*wt_re;\[/code]
I'm guessing using all-doubles is not a good option for your target hardware.
Question: In the 4th of those 6 lines, both temp and cy can be of either sign, but temp, the rounded-but-not-yet-wordsize-normalized iFFT term, is nearly always going to be much larger in magnitude than the +cy carryin, i.e. we should be able to infer the expected sign of the next line's carryout computation from it, to see whether - in your case - the integer result is overflowing into the sign bit, yes?
[QUOTE]similarly for a 5.5M FFT, max exp = 102.2M[/QUOTE]
Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled?

Prime95 2020-03-27 23:39

[QUOTE=ewmayer;541111]
Ruh-roh - I've been doing p-1 work at 5.5M for exp ~= 103M, using the checkin of last week. Should I halt my runs and restart with CARRY64 enabled?[/QUOTE]

Yes, but I would not redo those P-1.

ewmayer 2020-03-28 00:16

[QUOTE=Prime95;541112]Yes, but I would not redo those P-1.[/QUOTE]

I don't see CARRY64 in the readme - is that an undocumented cmd-line flag?

One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?

Oh, small UI suggestion: -fft 6144 for the above gave "FFT too small" error, i.e. the UI needs raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo to effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K?

Prime95 2020-03-28 01:45

[QUOTE=ewmayer;541113]I don't see CARRY64 in the readme - is that an undocumented cmd-line flag?[/QUOTE]

-use CARRY64

Prime95 2020-03-28 01:48

[QUOTE=ewmayer;541113]One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?[/quote]

I don't know.

[quote]Oh, small UI suggestion: -fft 6144 for the above gave "FFT too small" error, i.e. the UI needs raw FFT length, in this case 6291456. It was a little annoying to have the resulting run immediately echo to effect of "starting run with FFT length 6144K". Could the -fft option be fiddled to use FFT length in K?[/QUOTE]

-fft 6M works. As well as -fft 6144K
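
To make the three accepted spellings concrete, here is a small Python sketch of how a size argument with an optional K or M suffix could be interpreted. It is an illustration only, not gpuowl's actual parser; the name `parse_fft_size` is hypothetical, and K and M are assumed to mean 1024 and 1024*1024 words.

```python
def parse_fft_size(spec):
    """Convert '6M', '6144K' or '6291456' to an FFT length in words (illustrative only)."""
    spec = spec.strip().upper()
    multipliers = {"K": 1024, "M": 1024 * 1024}
    if spec and spec[-1] in multipliers:
        return int(float(spec[:-1]) * multipliers[spec[-1]])
    return int(spec)  # no suffix: taken as the raw length

print(parse_fft_size("6M"), parse_fft_size("6144K"), parse_fft_size("6291456"))
```

Under this reading a bare 6144 means an FFT of 6144 words, which would explain the "FFT too small" error.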

preda 2020-03-28 11:48

[QUOTE=ewmayer;541113]
One of my 2 runs just finished a PRP and started p-1 on the next expo - I killed, deleted savefiles and restarted the p-1 job @6144K - assuming that finds no factor, will the ensuing PRP of the same expo automatically switch back to 5632K?
[/QUOTE]

I suppose you passed a -fft command line argument. Then it will affect all the tasks, thus will affect the PRP as well. (i.e. will not switch back to the default FFT)

kriesel 2020-03-28 16:49

3 strikes you're out, game over until tomorrow
 
gpuowl could handle error cases more gracefully. Luckily I stumbled across this one while handling something else. Otherwise it could have cost nearly a day's throughput on that gpu.

Please consider commenting out a problematic worktodo line and continuing on with the next in such a case, instead of killing the run.

Also, since config.txt optimization content is fft length dependent, what's optimal for one fft length can be fatal for another.

Please consider fft-length-specific enhancement to config.txt, as mentioned before.[CODE]2020-03-28 10:23:18 condorella/rx480 CC 94418041 / 94418041, 4d816a6edf6393__
2020-03-28 10:23:20 condorella/rx480 {"exponent":"94418041", "worktype":"PRP-3", "status":"C", "program":{"name":"gpuowl", "version":"v6.11-134-g1e0ce1d"}, "timestamp":"2020-03-28 15:23:20 UTC", "user":"kriesel", "computer":"condorella/rx480", "aid":"(redacted)", "fft-length":5242880, "res64":"4d816a6edf6393__", "residue-type":1, "errors":{"gerbicz":0}}
2020-03-28 10:23:21 condorella/rx480 131500093 FFT 7168K: Width 256x4, Height 64x8, Middle 7; 17.92 bits/word
2020-03-28 10:23:22 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-28 10:23:25 condorella/rx480 OpenCL compilation in 3.68 s
2020-03-28 10:23:28 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:35 condorella/rx480 131500093 EE 800 0.00%; 5251 us/it; ETA 7d 23:49; 6781adfa7991c92a (check 2.29s)
2020-03-28 10:23:37 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:44 condorella/rx480 131500093 EE 800 0.00%; 5251 us/it; ETA 7d 23:48; 6781adfa7991c92a (check 2.29s)
1 errors
2020-03-28 10:23:46 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-03-28 10:23:53 condorella/rx480 131500093 EE 800 0.00%; 5255 us/it; ETA 7d 23:58; 6781adfa7991c92a (check 2.30s)
2 errors
2020-03-28 10:23:53 condorella/rx480 3 sequential errors, will stop.
2020-03-28 10:23:53 condorella/rx480 Exiting because "too many errors"
2020-03-28 10:23:53 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>title gpuowl-v6.11-134-g1e0ce1d/rx480

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-03-28 11:27:31 gpuowl v6.11-134-g1e0ce1d[/CODE]

ewmayer 2020-03-28 20:49

@George - thanks, I missed the K and M suffix options in my perusal of the readme.

[QUOTE=preda;541145]I suppose you passed a -fft command line argument. Then it will affect all the tasks, thus will affect the PRP as well. (i.e. will not switch back to the default FFT)[/QUOTE]

I checked at end of the forced-6144K p-1 run; it indeed started the ensuing PRP at the same FFT length, so killed and restarted sans -fft flag.

All runs now restarted using -use CARRY64 -- thanks, George. Also, you'll be pleased to hear that after the latest BSOD-style crash of the Haswell system which hosts my Radeon VII I finally got round to trying the disable-C-states trick you recommend in the BIOS Overclock submenu - seems to work like a charm, system has been rock-stable since, uptime 4 days and counting, which is really long for this system.

More details on what happens for me with -use CARRY64, in the context of 2 side-by-side PRP runs @5632K:

o Initially, each run going at a steady 1386 us/iter at my sclk=4 setting;
o Stop run 0 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1402 us/iter, which seems weird since only one is using the slower-but-safe carry option;
o Stop run 1 & restart with -use CARRY64, after a few more 200Kiter intervals, both jobs are up to 1420 us/iter, a 2.5% hit to throughput.

Since the bug only affects p-1 runs, would it be difficult to tweak things so that -use CARRY64 invoked-by-user is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g. -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs?

preda 2020-03-29 09:04

[QUOTE=ewmayer;541193]
Since the bug only affects p-1 runs, would it be difficult to tweak things so that -use CARRY64 invoked-by-user is only operative in p-1 testing? Or maybe allow separate specification by job type, e.g. -use CARRY64 means both worktypes, -pm1 CARRY64 means apply to p-1, -prp CARRY64 means apply to prp runs?[/QUOTE]

I just committed a change that makes CARRY64 the default for P-1, and CARRY32 the default for PRP on AMD. The rationale being that, if CARRY32 is not appropriate, this fact will be visible for PRP, thus safe; on P-1 we use the safe default (i.e. CARRY64) until we have a better solution there.

preda 2020-03-29 10:10

[QUOTE=kriesel;540929]Gpuowl stage 1 needs a res64 error check.[/QUOTE]

But Ken, what is the appropriate action to take on error?

Let's say that during P-1 stage1, a residue==0 is detected. This is not the result of inappropriate FFT-size, it indicates a hardware error. But what to do in this situation, given there's no way to check P-1 -- basically what makes sense is for gpuowl to simply stop doing any P-1 (on that GPU). It can't reliably roll back to any trusted point. Assuming that a residue that is != 0 is a correct one, in the situation where the GPU produces res==0 sometimes, is not a good way to go. I would just discard the whole test as corrupted.

kriesel 2020-03-29 12:16

[QUOTE=preda;541235]But Ken, what is the appropriate action to take on error?

Let's say that during P-1 stage1, a residue==0 is detected. This is not the result of inappropriate FFT-size, it indicates a hardware error. But what to do in this situation, given there's no way to check P-1 -- basically what makes sense is for gpuowl to simply stop doing any P-1 (on that GPU). It can't reliably roll back to any trusted point. Assuming that a residue that is != 0 is a correct one, in the situation where the GPU produces res==0 sometimes, is not a good way to go. I would just discard the whole test as corrupted.[/QUOTE]Count the error. Roll back to last save file that did not detect an error. That's the same as CUDALucas does when it detects an error, or prime95 does, or did before addition of Jacobi or Gerbicz checks.
Or suspend effort on that worktodo line and go to the next entry; then the user can decide later whether to resume or abandon the item that had an issue.
There are several res64-based checks possible.
res64=0 at any iteration; check more of the res to see if it's zero too. If it is it's an error, probably a failure to copy the full residue. (There's a tiny chance that res128>0 but res64=0 occurs and is correct.)
res=1 at any iteration; res=3 after the first iteration.
res64 repeating from one iteration to the next.
res64 cycling among a very small list of values. [URL]https://www.mersenneforum.org/showpost.php?p=515641&postcount=10[/URL]
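
The res64-based checks listed above could be sketched roughly as follows. This is an illustrative Python outline, not code from gpuowl, CUDALucas or prime95; the function name, the window size and the "few values" threshold are all assumptions. It works on the 64-bit residue only, so the "check more of the residue" step for res64=0 is left to the caller.

```python
def res64_suspicious(res64, history, cycle_window=8, max_distinct=2):
    """Return a short reason string if res64 looks wrong, else None.

    res64:   current 64-bit residue as an int
    history: earlier res64 values (ints), oldest first
    The window and threshold values are arbitrary choices for this sketch.
    """
    if res64 == 0:
        # A caller could inspect more of the residue here, since a correct
        # residue with zero low 64 bits is possible but extremely unlikely.
        return "zero res64"
    if res64 == 1:
        return "res64 equals 1"
    if history and res64 == history[-1]:
        return "res64 repeated"
    recent = history[-(cycle_window - 1):] + [res64]
    if len(recent) == cycle_window and len(set(recent)) <= max_distinct:
        return "res64 cycling among few values"
    return None

# Values taken from the stage 1 log posted earlier in the thread.
log = [0xc8b2127abc38054b, 0x6f401486d14cb20f, 0x18a926611c75b118, 0x4cfe624d31a00e27]
print(res64_suspicious(log[-1], log[:-1]))   # healthy value
print(res64_suspicious(0, log))              # the zeroed residue
```

A save file written only when this returns None would give the run a known point to roll back to.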

ewmayer 2020-03-29 19:53

[QUOTE=preda;541232]I just committed a change that makes CARRY64 the default for P-1, and CARRY32 the default for PRP on AMD. The rationale being that, if CARRY32 is not appropriate, this fact will be visible for PRP, thus safe; on P-1 we use the safe default (i.e. CARRY64) until we have a better solution there.[/QUOTE]

Awesome - just pulled, built, and switched runs to it.

George, never heard your thoughts on whether checking the relative signs of the signed-int x and the result of the 3*x might be a useful diagnostic here.

ewmayer 2020-03-29 21:17

Just noticed something curious - since I didn't know how long it might be until a fix for the carry issue, yesterday I edited my 2 worktodo files - 10 entries each - and moved all PRPs not preceded by a p-1 of the same exponent to the top. Except that I buggered one such edit, and a PRP got moved into the top slot, while its accompanying p-1 remained below. Caught the 'doh!' with the PRP ~20% done, halted, moved the p-1 to its proper place at top of the file, resumed. The exponent is 103939597. The PRP was using 5632K, the just-started p-1 is using 6144K. Is that expected?

preda 2020-03-29 21:26

[QUOTE=ewmayer;541268]Just noticed something curious - since I didn't know how long it might be until a fix for the carry issue, yesterday I edited my 2 worktodo files - 10 entries each - and moved all PRPs not preceded by a p-1 of the same exponent to the top. Except that I buggered one such edit, and a PRP got moved into the top slot, while its accompanying p-1 remained below. Caught the 'doh!' with the PRP ~20% done, halted, moved the p-1 to its proper place at top of the file, resumed. The exponent is 103939597. The PRP was using 5632K, the just-started p-1 is using 6144K. Is that expected?[/QUOTE]

Yes, the FFT bounds are a bit more conservative for P-1. I think this area (FFT bounds) is under investigation currently. But yes, what you see is expected given the current code.

ewmayer 2020-03-29 21:56

[QUOTE=preda;541269]Yes, the FFT bounds are a bit more conservative for P-1. I think this area (FFT bounds) is under investigation currently. But yes, what you see is expected given the current code.[/QUOTE]

Has the "more conservative threshold for p-1" been changed in the latest commit? Because in my 2 results files I see p as large as 103985003 using 5632 for the p-1 step, using older builds.

ATH 2020-03-29 23:03

I compiled gpuowl on the Colab pro and want to test it on the Tesla P100.
Anyone have a list of all the different options that can be tweaked to find the fastest combination?

preda 2020-03-29 23:04

[QUOTE=ewmayer;541273]Has the "more conservative threshold for p-1" been changed in the latest commit? Because in my 2 results files I see p as large as 103985003 using 5632 for the p-1 step, using older builds.[/QUOTE]

No, the different FFT bounds between PRP and P-1 have been there for about 1 month, I'm not aware of recent changes. So it may be something else.

ewmayer 2020-03-29 23:35

[QUOTE=preda;541281]No, the different FFT bounds between PRP and P-1 have been there for about 1 month, I'm not aware of recent changes. So it may be something else.[/QUOTE]

FYI, the most-recent case I see in my logs of an expo larger than the recent ones using 5632K is 16. Feb, v6.11-142-gf54af2e.

More weirdness, this time hardware related - current pair of runs suffered drastic slowing-down ~30 mins ago, despite temps having been well below the usual 'caution' threshold, no odd fan noises or any other sign of amiss-ness. SMI showed both s-and-m-clocks well below their normal for my default sclk=4 setting. This happened once before, and a quick 'rocm-smi --gpureset -d 1' resolved it. To cover all the bases I first rebooted, verified the slowness persisted, then tried the reset - this time no joy.

preda 2020-03-30 00:20

[QUOTE=ewmayer;541284]FYI, the most-recent case I see in my logs of an expo larger than the recent ones using 5632K is 16. Feb, v6.11-142-gf54af2e.

More weirdness, this time hardware related - current pair of runs suffered drastic slowing-down ~30 mins ago, despite temps having been well below the usual 'caution' threshold, no odd fan noises or any other sign of amiss-ness. SMI showed both s-and-m-clocks well below their normal for my default sclk=4 setting. This happened once before, and a quick 'rocm-smi --gpureset -d 1' resolved it. To cover all the bases I first rebooted, verified the slowness persisted, then tried the reset - this time no joy.[/QUOTE]

No idea, sorry. Did you check dmesg for errors? did you try without setting sclk after reboot, to see the behavior? Do you see the power use, how did that change?

kriesel 2020-03-30 01:00

[QUOTE=ATH;541280]I compiled gpuowl on the Colab pro and want to test it on the Tesla P100.
Anyone have a list of all the different options that can be tweaked to find the fastest combination?[/QUOTE]See [url]https://mersenneforum.org/showpost.php?p=540152&postcount=1968[/url] and the source code.

Prime95 2020-03-30 01:01

[QUOTE=ewmayer;541111]
Question: In the 4th of those 6 lines, both temp and cy can be of either sign, but temp, the rounded-but-not-yet-wordsize-normalized iFFT term, is nearly always going to be much larger in magnitude than the +cy carryin, i.e. we should be able to infer the expected sign of the next line's carryout computation from it, to see whether - in your case - the integer result is overflowing into the sign bit, yes?[/QUOTE]

We could detect the error condition with about 3 or 4 instructions. However, we try to create the fastest code possible and pick default settings that should safely avoid dangerous situations. Sometimes we don't quite succeed -- especially with day-to-day development.

The current code that selects CARRY64 for all P-1 work is overkill. I know how to fix that.

ewmayer 2020-03-30 02:51

[QUOTE=Prime95;541293]We could detect the error condition with about 3 or 4 instructions. However, we try to create the fastest code possible and pick default settings that should safely avoid dangerous situations. Sometimes we don't quite succeed -- especially with day-to-day development.

The current code that selects CARRY64 for all P-1 work is overkill. I know how to fix that.[/QUOTE]

I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.

For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?

Prime95 2020-03-30 03:17

[QUOTE=ewmayer;541297]I forgot to ask if it could simply be a matter of p-1 auto-switching to use CARRY64 for a suitable set of exponent thresholds, based on your analysis.[/quote]

Yes, that is the goal.

[quote]For going-forward debug runs, would it be feasible to wrap the simple "sign after *3 same as sign of input" parity check in a set of preprocessor #ifs so you could build the slower-but-with-error-check code, run a bunch of expos just below the thresholds you set, and see if any such overflow-into-sign-bit errors occur?[/QUOTE]

In the latest code, I set -use DEBUG,CARRY32_LIMIT=0x70000000 to print any iterations where 32-bit carry is getting close to the limit. This is slow code, useful for analysis, not for production runs.
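
As a rough illustration of the two diagnostics discussed here, the sign comparison after the mul-by-3 and a threshold check like CARRY32_LIMIT, the Python sketch below simulates 32-bit wraparound. It is not gpuowl's OpenCL code; the helper names are invented for the example, and only the 0x70000000 limit and the 0x32420000 carry come from the thread.

```python
def wrap_int32(v):
    """Reduce a Python int to a signed 32-bit value, as int32 arithmetic would."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v & 0x80000000 else v

def mul3_sign_flipped(x):
    """True if 3*x, computed in 32 bits, has a different sign than x."""
    y = wrap_int32(3 * x)
    return x != 0 and (x > 0) != (y > 0)

def exceeds_limit(carry, limit=0x70000000):
    """True if the exact (unwrapped) carry magnitude reaches the chosen limit."""
    return abs(carry) >= limit

print(mul3_sign_flipped(0x32420000))   # True: 3x = 0x96C60000 wraps negative
print(mul3_sign_flipped(0x20000000))   # False: 3x = 0x60000000 still fits
print(mul3_sign_flipped(0x60000000))   # False, although 3x has overflowed
print(exceeds_limit(3 * 0x32420000))   # True
```

The third case shows that the sign comparison by itself misses inputs large enough for 3*x to wrap all the way past the negative range, so it is only a complete test while the pre-multiply carry stays below about two thirds of 0x80000000.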

ewmayer 2020-03-30 19:34

Update on my Radeon VII sudden-onset-slowdown yesterday: more weirdness. Haven't yet spotted anything in the dmesg logs of note, but I'm still getting familiar with which AMD-GPU-related messages are normal and which not. Now to the weirdness.

My usual post-reboot procedure is:

1. Fire up gpuOwl job in each of two terminal windows, each job in a separate working dir;
2. Open 3rd window, fiddle settings to sclk=4 and fan=120 (or higher if interior temps warrant it);
3. Fire up LL/PRP job on the CPU;
4. Look at rocm-smi output to check GPU state.

Last night, again rebooted system (just to cover all bases), fired up first gpuOwl job, but then skipped to [4] above - all looked normal, Wattage ~200, temp nearing 70C, SCLK and MCLK at their expected values, fan noise ramping up nicely. Thought "yay! I fixed it!" Fired up second gpuOwl job - within seconds fan noise starts dropping fast, check of rocm-smi shows dreaded "the workers have gone on strike" numbers. Kill second job, things revert back to normal. No clue why the GPU is suddenly balking at running 2 jobs, but it being late figured better quit while I'm ahead - set sclk=5 to help compensate for the throughput hit from 1-job running, went to bed.

Even more weirdness - Just fired up 2nd job to see if the issue is reproducible, now all seems back to normal.

I believe the technical term is "gremlins".

ATH 2020-03-31 00:07

From gpuowl.cl:

[QUOTE]OUT_WG,OUT_SIZEX,OUT_SPACING <AMD default is 256,32,4> <nVidia default is 256,4,1 but needs testing>
IN_WG,IN_SIZEX,IN_SPACING <AMD default is 256,32,1> <nVidia default is 256,4,1 but needs testing>[/QUOTE]

What are the possible values and range to test for these variables?


On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration:
-use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4

Prime95 2020-03-31 03:56

[QUOTE=ATH;541364]From gpuowl.cl:



What are the possible values and range to test for these variables?


On the Tesla P100 on Google Colab pro, it was at 909 µs/iteration at default settings at 5M FFT, now with tuned settings at 817 µs/iteration:
-use ORIG_X2,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,ORIG_SLOWTRIG,OUT_WG=256,OUT_SIZEX=32,OUT_SPACING=4[/QUOTE]

IN/OUT_WG=64,128,256,512
IN/OUT_SIZEX=4,8,16,32,64,128 (gpuowl will whine when the combination does not make sense)
IN/OUT_SPACING=4,8,16,32,64,128
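
The value lists above give 4 x 6 x 6 = 144 triples for each of IN and OUT. A minimal sketch of generating the corresponding -use fragments for a benchmarking script follows; the function name tuningCombos is hypothetical and not part of gpuowl, and some of the generated triples will be rejected by gpuowl as not making sense.

```cpp
#include <string>
#include <vector>

// Builds one "-use" fragment per (WG, SIZEX, SPACING) triple for the given
// prefix ("IN" or "OUT"), using the value lists quoted above.
std::vector<std::string> tuningCombos(const std::string& prefix) {
  const int wg[] = {64, 128, 256, 512};
  const int sizex[] = {4, 8, 16, 32, 64, 128};
  const int spacing[] = {4, 8, 16, 32, 64, 128};

  std::vector<std::string> combos;
  for (int w : wg) {
    for (int x : sizex) {
      for (int s : spacing) {
        combos.push_back(prefix + "_WG=" + std::to_string(w) + "," +
                         prefix + "_SIZEX=" + std::to_string(x) + "," +
                         prefix + "_SPACING=" + std::to_string(s));
      }
    }
  }
  return combos;
}
```

Each returned string can be appended to the other -use options for one timing run.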

You are the first nVidia user to test all these combinations. Alas, previously the colab GPUs showed only minor differences in these settings whereas nVidia consumer GPUs benefitted much more from an optimal setting.

BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.

ATH 2020-03-31 17:10

[QUOTE=Prime95;541384]BTW, are you using today's checked in code? I'm surprised ORIG_SLOWTRIG would be faster than NEW_SLOWTRIG.[/QUOTE]

I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again.

This is the one to use right?
git clone [url]https://github.com/preda/gpuowl[/url]

Prime95 2020-03-31 17:30

[QUOTE=ATH;541410]I compiled it 2 days ago, any important changes since then? I can try to compile it again later today or tomorrow and test again.

This is the one to use right?
git clone [url]https://github.com/preda/gpuowl[/url][/QUOTE]

Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.

Yes, that is the correct source.

ewmayer 2020-03-31 19:56

[QUOTE=Prime95;541411]Since 2 days ago, the trig code changed -- probably a smidge faster and more accurate.
For Ernst, the new FFT boundaries are in place with automated selection of CARRY32 vs. CARRY64.[/QUOTE]

Just grabbed (v. 62a3025) and built. Switched one of my 2 runs to it to give it a spin, and see this on start (PRP of p = 103937143 @5632K):
[i]
Expected maximum carry32: 47840000
[/i]
Aside - before switching that run to the new version, both were getting ~1335 us/iter (total 1498 iter/sec). With 1 run using new version, that run is now @ 1580 us/iter and the other has speeded up to 1168 us/iter (total 1490 iter/sec). With both runs using new version, both are at 1333 us/iter (total 1500 iter/sec). Probably some weird rocm-process-priority thing.
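
The combined throughput figures follow from summing 1e6 / (us per iteration) over the concurrent runs. A small sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <vector>

// Total iterations per second across concurrent runs, given each run's
// microseconds per iteration.
double totalItersPerSec(const std::vector<double>& usPerIter) {
  double total = 0.0;
  for (double us : usPerIter) {
    total += 1e6 / us;
  }
  return total;
}
```

For the three cases above this gives roughly 1498 (1335 + 1335), 1489 (1580 + 1168) and 1500 (1333 + 1333) iterations per second, so the total is essentially unchanged in all three configurations.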

Aside #2: I've been doing near-daily price checks of new XFX Radeon VII cards on Amazon - they fluctuate interestingly. Couple days ago, $580. Yesterday, back to the same $550 I paid for mine in Feb. Just now, $600.

ATH 2020-04-01 00:20

Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.

This version is a bit faster on default settings at 5M FFT: 895µs/iteration

Got down to 832 µs with:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32

I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs:
[QUOTE]-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=64,IN_SIZEX=8,IN_SPACING=2
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32,OUT_WG=64,OUT_SIZEX=8,OUT_SPACING=4,IN_WG=128,IN_SIZEX=16,IN_SPACING=4[/QUOTE]

Many other combinations run at 810-820µs and many more at 820-850 µs, and a few rare bad ones ran at 980-990µs.

Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings, changed the speed from 809µs to 814-815µs, so not a big difference.

preda 2020-04-01 01:15

[QUOTE=ATH;541441]Ok, just compiled it again 1-1.5h ago on the Colab Tesla P100-PCIE-16GB.

This version is a bit faster on default settings at 5M FFT: 895µs/iteration

Got down to 832 µs with:
-use ORIG_X2,ORIG_SLOWTRIG,UNROLL_ALL,NO_T2_SHUFFLE,CARRY32

I did not test all 144 x2 combinations of the 6 variables, but I did test many and found 2 different combinations that both run at 809 µs:


Many other combinations run at 810-820µs and many more at 820-850 µs, and a few rare bad ones ran at 980-990µs.

Switching back from ORIG_SLOWTRIG to NEW_SLOWTRIG at the final settings, changed the speed from 809µs to 814-815µs, so not a big difference.[/QUOTE]

ORIG_X2 and INLINE_X2 do not exist anymore, setting them has no effect whatsoever.

This seems to suggest these changes to Nvidia defaults:
- handle T2_SHUFFLE like on AMD (i.e. default to NO_T2_SHUFFLE)
- handle CARRY like on AMD (i.e. default to CARRY32)

Could you run with -use ROUNDOFF paired in turn with ORIG_SLOWTRIG/NEW_SLOWTRIG and look at the average roundoff error to evaluate their respective accuracy. If ORIG_SLOWTRIG is similarly accurate to NEW_SLOWTRIG we may consider making it the default on Nvidia.

Could other Nvidia users speak up if those proposed Nvidia defaults have adverse performance effects for them (due to different hardware).

kracker 2020-04-01 03:39

Windows compilation:

[code]
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
33 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
Gpu.cpp: In member function 'std::tuple<bool, long long unsigned int, unsigned int> Gpu::isPrimePRP(u32, const Args&, std::atomic<unsigned int>&)':
Gpu.cpp:881:51: warning: left shift count >= width of type [-Wshift-count-overflow]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ^~
Gpu.cpp:881:42: warning: division by zero [-Wdiv-by-zero]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ~~~~^~~~~~~~~~~~
Gpu.cpp:881:48: error: right operand of shift expression '(1 << 32)' is >= than the precision of the left operand [-fpermissive]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ~~~~^~~~~~
make: *** [Makefile:30: Gpu.o] Error 1
[/code]

preda 2020-04-01 04:17

The error should be fixed (the printf warning remains).
Your compiler is strange:
- it has 32-bit long (we talked about this before). This is allowed, but unusual. You can verify this by e.g. checking sizeof(long)
- it seems that std::string is basic_string<wchar_t> (from the warning message). This is, in my understanding of the C++ standard, not allowed.

[QUOTE=kracker;541453]Windows compilation:

[code]
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:33:25: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
33 | log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
| ~^ ~~~~~~~~~~~~
| | |
| char* const value_type* {aka const wchar_t*}
| %hs
Gpu.cpp: In member function 'std::tuple<bool, long long unsigned int, unsigned int> Gpu::isPrimePRP(u32, const Args&, std::atomic<unsigned int>&)':
Gpu.cpp:881:51: warning: left shift count >= width of type [-Wshift-count-overflow]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ^~
Gpu.cpp:881:42: warning: division by zero [-Wdiv-by-zero]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ~~~~^~~~~~~~~~~~
Gpu.cpp:881:48: error: right operand of shift expression '(1 << 32)' is >= than the precision of the left operand [-fpermissive]
881 | constexpr float roundScale = 1.0 / (1L << 32);
| ~~~~^~~~~~
make: *** [Makefile:30: Gpu.o] Error 1
[/code][/QUOTE]
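
One way to sidestep both diagnostics is sketched below, assuming roundScale is meant to be 2^-32 and the log call wants a narrow string. The helper name narrowName is hypothetical, and this is not necessarily the fix that went into gpuowl. On Windows, std::filesystem::path::value_type is wchar_t, which is why name.c_str() yields const wchar_t* there; std::string itself is still a string of char. The sketch needs C++17 for the filesystem header.

```cpp
#include <cstdint>
#include <filesystem>
#include <string>

// 2^-32, built from a 64-bit one so the shift is valid even where long is 32 bits.
constexpr float roundScale = 1.0f / static_cast<float>(std::uint64_t{1} << 32);

// path::c_str() returns const wchar_t* on Windows; path::string() always
// returns a narrow std::string, whose c_str() is safe to pass to "%s".
inline std::string narrowName(const std::filesystem::path& name) {
  return name.string();
}
```

A call such as log("Can't open '%s'\n", narrowName(name).c_str()) then matches the format on both Windows and Linux.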

preda 2020-04-01 04:57

ROCm 3.1 is now the recommended platform for AMD GPUs -- it is the fastest, and also actively tuned for.

kriesel 2020-04-01 07:11

[QUOTE=preda;541454]The error should be fixed (the printf warning remains).
Your compiler is strange:
- it has 32-bit long (we talked about this before). This is allowed, but unusual. You can verify this by e.g. checking sizeof(long)
- it seems that std::string is basic_string<wchar_t> (from the warning message). This is, in my understanding of the C++ standard, not allowed.[/QUOTE]
long is apparently 4 bytes in Visual Studio, and typically also in gcc on Windows:

[url]https://docs.microsoft.com/en-us/cpp/cpp/fundamental-types-cpp?view=vs-2019[/url]
[url]https://stackoverflow.com/questions/22344388/size-of-long-int-and-int-in-c-showing-4-bytes[/url]

preda 2020-04-01 08:28

No, that's [QUOTE]Visual Studio 2008 on a 32-bit architecture[/QUOTE].
On a 64-bit architecture (as is common), I expect even VS has sizeof(long)==8, nowadays.
As I said, the size of long is not mandated by C++. It's just normal for long to be 64-bit (on 64-bit architectures), but if it isn't, that's within the rules.

OK, reading more on the first link, I see that they state that long is 32bit even on 64bit arch. That's a MS-VS idiosyncrasy then. Good to know.

[QUOTE=kriesel;541459]long is apparently 4 byte in Visual Studio or gcc typically on Windows:

[url]https://docs.microsoft.com/en-us/cpp/cpp/fundamental-types-cpp?view=vs-2019[/url]
[url]https://stackoverflow.com/questions/22344388/size-of-long-int-and-int-in-c-showing-4-bytes[/url][/QUOTE]

henryzz 2020-04-01 08:44

[QUOTE=preda;541462]No, that's "Visual Studio 2008 on a 32-bit architecture".
On a 64-bit architecture (as is common), I expect even VS has sizeof(long)==8, nowadays.
As I said, the size of long is not mandated by C++. It's just normal for long to be 64-bit (on 64-bit architectures), but if it isn't, that's within the rules.

OK, reading more on the first link, I see that they state that long is 32bit even on 64bit arch. That's a MS-VS idiosyncrasy then. Good to know.[/QUOTE]

long long is typically how you get the 8 byte version on windows.

kriesel 2020-04-01 13:49

[QUOTE=preda;541462]No, that's "Visual Studio 2008 on a 32-bit architecture".
On a 64-bit architecture (as is common), I expect even VS has sizeof(long)==8, nowadays.
As I said, the size of long is not mandated by C++. It's just normal for long to be 64-bit (on 64-bit architectures), but if it isn't, that's within the rules.

OK, reading more on the first link, I see that they state that long is 32bit even on 64bit arch. That's a MS-VS idiosyncrasy then. Good to know.[/QUOTE]
Also 4-byte on the gcc compile performed on Windows, per the second link.

kriesel 2020-04-01 13:55

robust fail to start PRP
 
I don't know why, but -fft 0 through -fft +5 all hit EE in 800 iterations on this exponent 131500093. Gpuowl v6.11-134-g1e0ce1d chose the initial 7M FFT length on its own. After finding the error reproducible, I successively incremented -fft to seek a reliable run case. Not until it reached the 9M FFT (-fft +6) did it pass the GEC. The resulting speed penalty is considerable: 7.5 msec/iter versus 5.3 on an RX480. From the program's help output:
[CODE]FFT 7M [ 11.01M - 132.46M] 1K-512-7 256-2K-7 512-1K-7 2K-256-7
FFT 8M [ 12.58M - 150.85M] 2K-2K 4K-1K
FFT 9M [ 14.16M - 169.18M] 1K-512-9 256-2K-9 512-1K-9 2K-256-9[/CODE]
[CODE]C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:47:57 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:47:57 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM
2020-04-01 07:47:57 device 0, unique id ''
2020-04-01 07:47:57 condorella/rx480 131500093 FFT 7168K: Width 256x4, Height 64x8, Middle 7; 17.92 bits/word
2020-04-01 07:47:59 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:48:03 condorella/rx480 OpenCL compilation in 3.97 s
2020-04-01 07:48:06 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:48:13 condorella/rx480 131500093 EE 800 0.00%; 5272 us/it; ETA 8d 00:34; 6781adfa7991c92a (check 2.31s)
2020-04-01 07:48:15 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:48:22 condorella/rx480 131500093 EE 800 0.00%; 5309 us/it; ETA 8d 01:56; 6781adfa7991c92a (check 2.31s) 1 errors
2020-04-01 07:48:24 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:48:31 condorella/rx480 131500093 EE 800 0.00%; 5298 us/it; ETA 8d 01:32; 6781adfa7991c92a (check 2.33s) 2 errors
2020-04-01 07:48:31 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:48:31 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:48:31 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:48:50 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:48:50 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +1
2020-04-01 07:48:50 device 0, unique id ''
2020-04-01 07:48:50 condorella/rx480 131500093 FFT 7168K: Width 64x4, Height 256x8, Middle 7; 17.92 bits/word
2020-04-01 07:48:53 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=256u -DSMALL_HEIGHT=2048u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:48:57 condorella/rx480 OpenCL compilation in 4.67 s
2020-04-01 07:49:01 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:49:11 condorella/rx480 131500093 EE 800 0.00%; 7714 us/it; ETA 11d 17:46; 55f854bea6c1cecf (check 3.28s)
2020-04-01 07:49:14 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:49:24 condorella/rx480 131500093 EE 800 0.00%; 7697 us/it; ETA 11d 17:10; 55f854bea6c1cecf (check 3.29s) 1 errors
2020-04-01 07:49:27 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:49:37 condorella/rx480 131500093 EE 800 0.00%; 7687 us/it; ETA 11d 16:46; 55f854bea6c1cecf (check 3.27s) 2 errors
2020-04-01 07:49:37 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:49:37 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:49:37 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:50:25 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:50:25 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +2
2020-04-01 07:50:25 device 0, unique id ''
2020-04-01 07:50:25 condorella/rx480 131500093 FFT 7168K: Width 64x8, Height 256x4, Middle 7; 17.92 bits/word
2020-04-01 07:50:27 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=512u -DSMALL_HEIGHT=1024u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:50:31 condorella/rx480 OpenCL compilation in 3.72 s
2020-04-01 07:50:34 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:50:42 condorella/rx480 131500093 EE 800 0.00%; 6286 us/it; ETA 9d 13:37; 6f8253cbb2fe58e9 (check 2.71s)
2020-04-01 07:50:45 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:50:53 condorella/rx480 131500093 EE 800 0.00%; 6283 us/it; ETA 9d 13:29; 6f8253cbb2fe58e9 (check 2.71s) 1 errors
2020-04-01 07:50:56 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:51:03 condorella/rx480 131500093 EE 800 0.00%; 6299 us/it; ETA 9d 14:05; 6f8253cbb2fe58e9 (check 2.71s) 2 errors
2020-04-01 07:51:03 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:51:03 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:51:03 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:51:29 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:51:29 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +3
2020-04-01 07:51:29 device 0, unique id ''
2020-04-01 07:51:29 condorella/rx480 131500093 FFT 7168K: Width 256x8, Height 64x4, Middle 7; 17.92 bits/word
2020-04-01 07:51:29 condorella/rx480 using long carry kernels
2020-04-01 07:51:32 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=2048u -DSMALL_HEIGHT=256u -DMIDDLE=7u -DWEIGHT_STEP=0x8.7b964bd91a558p-3 -DIWEIGHT_STEP=0xf.16e489ea55fc8p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:51:36 condorella/rx480 OpenCL compilation in 3.97 s
2020-04-01 07:51:39 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:51:46 condorella/rx480 131500093 EE 800 0.00%; 5275 us/it; ETA 8d 00:42; cfbd904e74b67aae (check 2.31s)
2020-04-01 07:51:48 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:51:54 condorella/rx480 131500093 EE 800 0.00%; 5249 us/it; ETA 7d 23:44; cfbd904e74b67aae (check 2.29s)1 errors
2020-04-01 07:51:57 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:03 condorella/rx480 131500093 EE 800 0.00%; 5239 us/it; ETA 7d 23:23; cfbd904e74b67aae (check 2.29s)2 errors
2020-04-01 07:52:03 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:52:03 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:52:03 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:52:07 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:52:07 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +4
2020-04-01 07:52:07 device 0, unique id ''
2020-04-01 07:52:07 condorella/rx480 131500093 FFT 8192K: Width 256x8, Height 256x8; 15.68 bits/word
2020-04-01 07:52:07 condorella/rx480 using long carry kernels
2020-04-01 07:52:10 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=2048u -DSMALL_HEIGHT=2048u -DMIDDLE=1u -DWEIGHT_STEP=0xa.039f00d8f95f8p-3 -DIWEIGHT_STEP=0xc.c82be96a7181p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:52:15 condorella/rx480 OpenCL compilation in 5.16 s
2020-04-01 07:52:18 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:27 condorella/rx480 131500093 EE 800 0.00%; 6583 us/it; ETA 10d 00:28; 05252a7f59574e37 (check 2.85s)
2020-04-01 07:52:30 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:38 condorella/rx480 131500093 EE 800 0.00%; 6587 us/it; ETA 10d 00:36; 05252a7f59574e37 (check 2.85s) 1 errors
2020-04-01 07:52:41 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:52:49 condorella/rx480 131500093 EE 800 0.00%; 6594 us/it; ETA 10d 00:53; 05252a7f59574e37 (check 2.86s) 2 errors
2020-04-01 07:52:49 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:52:49 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:52:49 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:53:21 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:53:21 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +5
2020-04-01 07:53:21 device 0, unique id ''
2020-04-01 07:53:21 condorella/rx480 131500093 FFT 8192K: Width 512x8, Height 256x4; 15.68 bits/word
2020-04-01 07:53:21 condorella/rx480 using long carry kernels
2020-04-01 07:53:23 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=4096u -DSMALL_HEIGHT=1024u -DMIDDLE=1u -DWEIGHT_STEP=0xa.039f00d8f95f8p-3 -DIWEIGHT_STEP=0xc.c82be96a7181p-4 -DWEIGHT_BIGSTEP=0xa.5fed6a9b15138p-3 -DIWEIGHT_BIGSTEP=0xc.5672a115506d8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:53:26 condorella/rx480 OpenCL compilation in 3.53 s
2020-04-01 07:53:30 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:53:39 condorella/rx480 131500093 EE 800 0.00%; 7196 us/it; ETA 10d 22:51; 6df742314b82f841 (check 3.11s)
2020-04-01 07:53:42 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:53:51 condorella/rx480 131500093 EE 800 0.00%; 7219 us/it; ETA 10d 23:43; 6df742314b82f841 (check 3.11s) 1 errors
2020-04-01 07:53:54 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:54:03 condorella/rx480 131500093 EE 800 0.00%; 7190 us/it; ETA 10d 22:38; 6df742314b82f841 (check 3.10s) 2 errors
2020-04-01 07:54:03 condorella/rx480 3 sequential errors, will stop.
2020-04-01 07:54:03 condorella/rx480 Exiting because "too many errors"
2020-04-01 07:54:03 condorella/rx480 Bye
C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>g611

C:\msys64\home\ken\gpuowl-compile\gpuowl-v6.11-134-g1e0ce1d\rx480>gpuowl-win
2020-04-01 07:54:08 gpuowl v6.11-134-g1e0ce1d
2020-04-01 07:54:08 config: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM -fft +6
2020-04-01 07:54:08 device 0, unique id ''
2020-04-01 07:54:08 condorella/rx480 131500093 FFT 9216K: Width 256x4, Height 64x8, Middle 9; 13.93 bits/word
2020-04-01 07:54:08 condorella/rx480 using long carry kernels
2020-04-01 07:54:12 condorella/rx480 OpenCL args "-DEXP=131500093u -DWIDTH=1024u -DSMALL_HEIGHT=512u -DMIDDLE=9u -DWEIGHT_STEP=0x8.5f7e7ead6051p-3 -DIWEIGHT_STEP=0xf.498539ec95fe8p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-04-01 07:54:16 condorella/rx480 OpenCL compilation in 4.11 s
2020-04-01 07:54:20 condorella/rx480 131500093 OK 0 loaded: blockSize 400, 0000000000000003
2020-04-01 07:54:29 condorella/rx480 131500093 OK 800 0.00%; 7461 us/it; ETA 11d 08:32; bbe24bd13cd73020 (check 3.26s)
2020-04-01 08:19:33 condorella/rx480 131500093 OK 200000 0.15%; 7541 us/it; ETA 11d 11:03; 190bb27ff665f83b (check 3.25s)
[/CODE]
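
The bits/word figures in the log lines above are consistent with the exponent divided by the FFT length in words, taking "K" as 1024 words. A small sketch of that arithmetic, with a hypothetical function name that is not taken from gpuowl:

```cpp
// Average number of bits stored per FFT word for a given Mersenne exponent,
// with the FFT length given in K (1K = 1024 words).
constexpr double bitsPerWord(unsigned exponent, unsigned fftK) {
  return static_cast<double>(exponent) / (static_cast<double>(fftK) * 1024.0);
}
```

For exponent 131500093 this gives about 17.92 at 7168K, 15.68 at 8192K and 13.93 at 9216K, matching the values printed in the log.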

Prime95 2020-04-01 16:59

[QUOTE=kriesel;541479]I don't know why, but -fft 0 through -fft +5 all hit EE in 800 iterations on this exponent 131500093.[/QUOTE]

Works for me on Linux rocm 3.1. Maybe a Windows compiler / driver bug?

ewmayer 2020-04-01 20:08

[QUOTE=henryzz;541463]long long is typically how you get the 8 byte version on windows.[/QUOTE]

Lack of unambiguous-length basic data types was one of the great blunders of the C language standard. Here is the snip of preprocessor code I use in my Mlucas types.h file; the 64-bit section includes a small hack from George. I'm sure even this is not completely portable, but it has served me well. 'MSVC' refers to Visual Studio, OS_BITS is an Mlucas predef reflecting the bit-ness (32 or 64) of the OS:
[code]typedef char int8;
typedef char sint8;
typedef unsigned char uint8;

typedef short int16;
typedef short sint16;
typedef unsigned short uint16;

typedef int int32;
typedef int sint32;
typedef unsigned int uint32;

/* 64-bit int: */
/* MSVC doesn't like 'long long', and of course MS has their own
completely non-portable substitute:
*/
#if(defined(OS_TYPE_WINDOWS) && defined(COMPILER_TYPE_MSVC))
typedef signed __int64 int64;
typedef signed __int64 sint64;
typedef unsigned __int64 uint64;
typedef const signed __int64 int64c;
typedef const signed __int64 sint64c;
typedef const unsigned __int64 uint64c;

/* GW: In many cases where the C code is interfacing with the assembly code */
/* we must declare variables that are exactly 32-bits wide. This is the */
/* portable way to do this, as the linux x86-64 C compiler defines the */
/* long data type as 64 bits. We also use portable definitions for */
/* values that can be either an integer or a pointer. */
#if OS_BITS == 64
typedef int64 intptr_t;
typedef uint64 uintptr_t;
#else
typedef int32 intptr_t;
typedef uint32 uintptr_t;
#endif

#else
typedef long long int64;
typedef long long sint64;
typedef unsigned long long uint64;
typedef const long long int64c;
typedef const long long sint64c;
typedef const unsigned long long uint64c;
#endif[/code]
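
For comparison, C99's stdint.h and C++11's cstdint provide exact-width integer types on platforms that have them, which covers the same ground without per-compiler branches. A minimal sketch, assuming a C++11 or later compiler: the alias names mirror the Mlucas typedefs above, but this is not the Mlucas source, and bitsInLong is a hypothetical helper.

```cpp
#include <climits>
#include <cstdint>

// Aliases named after the Mlucas typedefs quoted above; the std:: types are
// exact-width wherever the platform provides them.
using sint8 = std::int8_t;
using uint8 = std::uint8_t;
using sint16 = std::int16_t;
using uint16 = std::uint16_t;
using sint32 = std::int32_t;
using uint32 = std::uint32_t;
using sint64 = std::int64_t;
using uint64 = std::uint64_t;

static_assert(sizeof(uint64) * CHAR_BIT == 64, "uint64 must be exactly 64 bits");
static_assert(sizeof(long long) * CHAR_BIT >= 64, "long long is at least 64 bits");

// Width of long in bits: 32 on 64-bit Windows, 64 on 64-bit Linux.
constexpr unsigned bitsInLong() {
  return static_cast<unsigned>(sizeof(long) * CHAR_BIT);
}
```

One difference worth noting: std::int8_t is always signed, whereas a plain char typedef has implementation-defined signedness.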

kriesel 2020-04-01 20:08

[QUOTE=Prime95;541498]Works for me on Linux rocm 3.1. Maybe a Windows compiler / driver bug?[/QUOTE]
Reproduced here, on gpuowl-win v6.11-134, -fft 0 and -fft +5:
RX550 on same system as previous report; driver 25.20.14007.1000 (Adrenalin 18.10.2) / Win7 64

An RX550 on separate system, driver 26.20.12028.2 (Adrenalin 19.20)/Win10 64
A Radeon VII, on driver 26.20.12028.2 (Adrenalin 19.20)/Win10 64

Not reproduced here, on gpuowl-win v6.11-198, same Radeon VII as above, same system, same driver.
Also not reproduced, on gpuowl-win v6.11-219-ge70ec99, on the same RX480, system, driver combo that led this off with the previous post 2031.

All my gpuowl-win builds are done on the one system where the RX480 resides, with the same step by step build process;
mkdir latest
cd latest
git clone [URL]https://github.com/preda/gpuowl[/URL]
cd gpuowl
make gpuowl-win.exe

preda 2020-04-01 20:55

I'm investigating.

[QUOTE=kriesel;541479]I don't know why, but -fft 0 through -fft +5 all hit EE in 800 iterations on this exponent 131500093. Gpuowl v6.11-134-g1e0ce1d[/QUOTE]

ATH 2020-04-01 22:15

[QUOTE=preda;541444]Could you run with -use ROUNDOFF paired in turn with ORIG_SLOWTRIG/NEW_SLOWTRIG and look at the average roundoff error to evaluate their respective accuracy. If ORIG_SLOWTRIG is similarly accurate to NEW_SLOWTRIG we may consider making it the default on Nvidia.

Could other Nvidia users speak up if those proposed Nvidia defaults have adverse performance effects for them (due to different hardware).[/QUOTE]

I tested the first 100K iterations of 95,000,011:

ORIG_SLOWTRIG: Roundoff: N=10374, max 0.312500, avg 0.212775
NEW_SLOWTRIG: Roundoff: N=10374, max 0.312500, avg 0.214292


I can try and test on my own 2080, if I can compile gpuowl in Windows, or find a new compiled version.

ATH 2020-04-02 00:28

RTX 2080 is so bad at double precision and the timings are very inconsistent.

But NEW_SLOWTRIG is better at 3520µs/ite vs 3680µs/ite for ORIG_SLOWTRIG.
T2_SHUFFLE is slightly better at 3520µs vs 3553µs for NO_T2_SHUFFLE
Otherwise CARRY64 and CARRY32 are about the same.
I'm not going to test all 6 of those variables on this card, since it is very slow and the inconsistencies in the timings are larger than the differences.

Btw UNROLL_NONE, UNROLL_WIDTH and UNROLL_HEIGHT do not work at all on either the Tesla P100 or the RTX 2080.

