mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

Prime95 2020-03-01 20:41

Congrats on getting gpuowl working under rocm 3.1.
Warning to rocm 2.10 users: this version is slower. The occupancy of carryFused went from 7 to 6.

Can you post the register usage for carryFused, tailFused, and two fftMiddle routines in rocm 3.1?

P.S. With your last sin/cos change, you can delete the comments on MORE_ACCURATE and LESS_ACCURATE. You should also be able to change P-1 back to using the newer trig code.

Prime95 2020-03-02 04:46

The MIDDLE=1 FFTs appear to be broken.

Question: Would gpuowl benefit from a MIDDLE=8 step? The whole reason for MIDDLE=1 was to do 4 passes over memory instead of 6. Now that we support MERGED_MIDDLE, both MIDDLE=1 and MIDDLE=8 would do 4 passes over memory.

If you think it would help, I'll write a middle=8 routine for you.

preda 2020-03-02 10:11

[QUOTE=Prime95;538656]The occupancy of carryFused went from 7 to 6.

Can you post the register usage for carryFused, tailFused, and two fftMiddle routines in rocm 3.1?
[/QUOTE]

This is the difference in occupancy/VGPRs between ROCm 2.10 (on the left) and 3.1 (on the right):

[CODE]
fftHout : Occupancy: = 7 | fftHout : Occupancy: = 6
fftMiddleOut : Occupancy: = 3 | fftMiddleOut : Occupancy: = 4
carryA : Occupancy: = 10 | carryA : Occupancy: = 9
carryFused : Occupancy: = 7 | carryFused : Occupancy: = 6
square : Occupancy: = 9 | square : Occupancy: = 10

---------------------

isEqual : NumVgprs: = 7 | isEqual : NumVgprs: = 6
isNotZero : NumVgprs: = 5 | isNotZero : NumVgprs: = 4
fftW : NumVgprs: = 32 | fftW : NumVgprs: = 31
fftHin : NumVgprs: = 35 | fftHin : NumVgprs: = 36
fftHout : NumVgprs: = 36 | fftHout : NumVgprs: = 39
k_fftP : NumVgprs: = 33 | k_fftP : NumVgprs: = 34
fftMiddleIn : NumVgprs: = 63 | fftMiddleIn : NumVgprs: = 62
fftMiddleOut : NumVgprs: = 65 | fftMiddleOut : NumVgprs: = 64
carryA : NumVgprs: = 15 | carryA : NumVgprs: = 26
carryM : NumVgprs: = 15 | carryM : NumVgprs: = 18
carryB : NumVgprs: = 15 | carryB : NumVgprs: = 14
carryFused : NumVgprs: = 36 | carryFused : NumVgprs: = 38
carryFusedMul : NumVgprs: = 39 | carryFusedMul : NumVgprs: = 37
transposeW : NumVgprs: = 105 | transposeW : NumVgprs: = 102
transposeH : NumVgprs: = 103 | transposeH : NumVgprs: = 101
square : NumVgprs: = 27 | square : NumVgprs: = 24
tailFusedMul : NumVgprs: = 120 | tailFusedMul : NumVgprs: = 118
tailFusedMulLow : NumVgprs: = 122 | tailFusedMulLow : NumVgprs: = 120
tailFusedMulDelta: NumVgprs: = 122 | tailFusedMulDelta: NumVgprs: = 120
[/CODE]

As you say, the occupancy of carryFused went one notch down. (it is exactly 1 VGPR over, it may be possible to win that one back). OTOH I see a speedup of about 1.2% on ROCm 3.1 compared to 2.10
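
The occupancy figures in the table are consistent with a simple VGPR-only model of GCN hardware. The sketch below is an illustration, not gpuowl or ROCm code. It assumes 256 VGPRs available per lane, an allocation granularity of 4 registers, and a cap of 10 waves per SIMD, and it ignores SGPR and LDS limits. The function name `gcn_occupancy` is made up for this example.

```python
def gcn_occupancy(num_vgprs, granularity=4, vgpr_budget=256, max_waves=10):
    """Estimate waves per SIMD from the VGPR count alone (assumed GCN model)."""
    # Round the register count up to the allocation granularity.
    allocated = -(-num_vgprs // granularity) * granularity
    # As many waves as fit in the register budget, capped by the hardware limit.
    return min(max_waves, vgpr_budget // allocated)


# (kernel, NumVgprs on 2.10, NumVgprs on 3.1) taken from the table above.
kernels = [
    ("fftHout", 36, 39),
    ("fftMiddleOut", 65, 64),
    ("carryA", 15, 26),
    ("carryFused", 36, 38),
    ("square", 27, 24),
]

for name, old, new in kernels:
    print(f"{name:13s} {gcn_occupancy(old):2d} -> {gcn_occupancy(new):2d}")
```

Under these assumptions the model reproduces every occupancy change listed in the first half of the table, including the drop from 7 to 6 for carryFused.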

preda 2020-03-02 10:26

[QUOTE=Prime95;538685]The MIDDLE=1 FFTs appear to be broken.
[/QUOTE]
Do you know why it's broken? (or which change broke it?)

[QUOTE]
Question: Would gpuowl benefit from a MIDDLE=8 step? The whole reason for MIDDLE=1 was to do 4 passes over memory instead of 6. Now that we support MERGED_MIDDLE, both MIDDLE=1 and MIDDLE=8 would do 4 passes over memory.

If you think it would help, I'll write a middle=8 routine for you.[/QUOTE]

Do I understand correctly that only powers-of-two FFTs would benefit from a middle=8? The wavefront is not on a power of two ATM, but will get there. Would a middle=4 make sense? About how well it would work (compared to middle=1), I guess we have to try it to know.

Prime95 2020-03-02 13:39

[QUOTE=preda;538694]Do you know why it's broken? (or which change broke it?)

Do I understand correctly that only powers-of-two FFTs would benefit from a middle=8? The wavefront is not on a power of two ATM, but will get there. Would a middle=4 make sense? About how well it would work (compared to middle=1), I guess we have to try it to know.[/QUOTE]

I did some digging: MIDDLE=1 requires ORIG_SLOWTRIG. Apparently transpose uses k,n values outside the range 0 to pi/2.

I timed a 3.5M, 4M, 4.5M FFT all with ORIG_SLOWTRIG. Times were 509us, 803us, 651us respectively. So we definitely need to do something!

I agree we need both a MIDDLE=4 and MIDDLE=8. This would let us eliminate the fft_HEIGHT and fft_WIDTH routines that are built on fft8. It has been my experience that fft8 is substantially slower than fft4, probably due to the extra VGPRs required.

We also need MIDDLE=13,14,15 so we have a complete set of MIDDLE values from 4 to 15.
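
For reference, the FFT length that gpuowl reports can be reproduced from the WIDTH, SMALL_HEIGHT and MIDDLE values. The sketch below assumes the relation length = 2 x WIDTH x SMALL_HEIGHT x MIDDLE words, which matches the 5632K log posted later in this thread (WIDTH=1024, SMALL_HEIGHT=256, MIDDLE=11, 16.75 bits/word). The helper names are hypothetical and this is not gpuowl code.

```python
def fft_words(width, small_height, middle):
    """FFT length in words, assuming length = 2 * WIDTH * SMALL_HEIGHT * MIDDLE."""
    return 2 * width * small_height * middle


def is_power_of_two(n):
    """True when n is a positive power of two."""
    return n > 0 and n & (n - 1) == 0


# Values from the 5632K log: exponent 96580489, Width 256x4, Height 64x4, Middle 11.
words = fft_words(1024, 256, 11)
print(words // 1024, "K")            # FFT length in K words
print(round(96580489 / words, 2))    # bits per word

# With power-of-two WIDTH and SMALL_HEIGHT, only these MIDDLE values in 1..15
# give a power-of-two FFT length.
print([m for m in range(1, 16) if is_power_of_two(fft_words(1024, 256, m))])
```

If the assumed relation holds, power-of-two FFT lengths can only come from a power-of-two MIDDLE, which is why MIDDLE=1, 4 and 8 are the values under discussion for those sizes.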

Prime95 2020-03-02 13:43

[QUOTE=preda;538692]As you say, the occupancy of carryFused went one notch down. (it is exactly 1 VGPR over, it may be possible to win that one back). [/QUOTE]

I'll upgrade a machine to 3.1 and work on this. I remember it took a full day fighting the optimizer to save that one VGPR register.

kriesel 2020-03-02 17:03

config.txt
 
On a given GPU model, the gpuowl options that are optimal for one FFT length are not optimal for another, and sometimes they break correct operation at another FFT length. In PRP that is caught by the GEC, which stops progress until the user intervenes; in P-1 stage 1 it sometimes produces a meaningless all-zeroes computation.
Perhaps the config.txt syntax could be extended to support a line that applies regardless of FFT length, optional FFT-length-specific -use lines, and optionally a default safe-but-slower line for FFT lengths that have not been benchmarked and tuned yet? Maybe something like
[CODE]all: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500
4608K: -use NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE
5120K: -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG
default: -use NO_ASM
[/CODE]
worktodo line to reproduce (there are others):
[CODE]B1=1020000,B2=29580000;PFactor=0,1,2,96580489,-1,77,2[/CODE]
The following -use options are the result of optimization runs for 5120K, but they break 5632K P-1:
[CODE]-device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG[/CODE]
This results in an all-zero repeating res64 at the start of stage 1:
[CODE]2020-03-02 10:22:05 device 0, unique id ''
2020-03-02 10:22:05 condorella/rx480 96580489 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.75 bits/word
2020-03-02 10:22:07 condorella/rx480 OpenCL args "-DEXP=96580489u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x9.893b9e4410c28p-3 -DIWEIGHT_STEP=0xd.6c37c4b92b54p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNEW_SLOWTRIG=1 -DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_WIDTH=1 -DUNROLL_HEIGHT=1 -DUNROLL_WIDTH=1 -DWORKINGIN1=1 -DWORKINGOUT1=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-02 10:22:10 condorella/rx480 OpenCL compilation in 3.03 s
2020-03-02 10:22:10 condorella/rx480 96580489 P1 B1=1020000, B2=29580000; 1471504 bits; starting at 0
2020-03-02 10:22:48 condorella/rx480 96580489 P1 10000 0.68%; 3789 us/it; ETA 0d 01:32; 0000000000000000
2020-03-02 10:23:26 condorella/rx480 96580489 P1 20000 1.36%; 3797 us/it; ETA 0d 01:32; 0000000000000000
2020-03-02 10:23:41 condorella/rx480 Stopping, please wait..
2020-03-02 10:23:41 condorella/rx480 Exiting because "stop requested"
2020-03-02 10:23:41 condorella/rx480 Bye[/CODE]
This is repeatable for other exponents.
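
The config.txt extension proposed above could be resolved roughly as follows. This is only a sketch of the proposal, which gpuowl does not implement as far as this thread shows; the `all:`, `<length>:` and `default:` keys come from the example in this post, while the function names and the fallback rule (per-length line first, then `default`) are assumptions.

```python
def parse_config(text):
    """Split 'key: args' lines into a dict, skipping blank lines."""
    sections = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so arguments may contain colons.
        key, _, args = line.partition(":")
        sections[key.strip()] = args.strip()
    return sections


def args_for_fft(sections, fft_label):
    """Combine the 'all' line with the line for fft_label, or with 'default'."""
    common = sections.get("all", "")
    specific = sections.get(fft_label, sections.get("default", ""))
    return " ".join(part for part in (common, specific) if part)


example = """
all: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500
4608K: -use NO_ASM,MERGED_MIDDLE,CARRY32
default: -use NO_ASM
"""

sections = parse_config(example)
print(args_for_fft(sections, "4608K"))  # tuned line for a benchmarked length
print(args_for_fft(sections, "5632K"))  # untuned length falls back to 'default'
```

With a fallback like this, the 5632K P-1 run would have picked up only `-use NO_ASM` instead of the options tuned for 5120K.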

Prime95 2020-03-02 21:27

[QUOTE=Prime95;538727]I'll upgrade a machine to 3.1 and work on this.[/QUOTE]

Giant mistake.

Upgrading did not work. Had to reinstall the OS. Rocm 3.1 doesn't work even on a fresh install (clinfo cannot find any of the GPUs). The entire process has bricked one of the GPUs. Not happy.

The dmesg error on the bricked GPU is "Direct firmware load for amdgpu/vega20_ta.bin failed with error -2".

Six hours wasted, more struggles ahead.


Correction: I get the dmesg error on the two working GPUs. No error message for the bricked GPU.

PhilF 2020-03-02 21:51

Holy moly.

This is a perfect example why it's called "the bleeding edge".

This is especially true anytime AMD drivers are involved. I knew their Windows drivers could be radioactive to the point of having to reinstall the OS, but I had no idea Linux could be lethally irradiated also.

preda 2020-03-03 12:06

[QUOTE=Prime95;538726]I did some digging: MIDDLE=1 requires ORIG_SLOWTRIG. Apparently transpose uses k,n values outside the range 0 to pi/2.

I timed a 3.5M, 4M, 4.5M FFT all with ORIG_SLOWTRIG. Times were 509us, 803us, 651us respectively. So we definitely need to do something!

I agree we need both a MIDDLE=4 and MIDDLE=8. This would let us eliminate the fft_HEIGHT and fft_WIDTH routines that are built on fft8. It has been my experience that fft8 is substantially slower than fft4, probably due to the extra VGPRs required.

We also need MIDDLE=13,14,15 so we have a complete set of MIDDLE values from 4 to 15.[/QUOTE]

Thank you! I just merged the changes. Indeed this is a massive speedup on powers-of-2 FFTs probably bringing them in line with the other sizes. (and thanks for the FFT-size display fix)

I added a few asserts to the .cl code (enabled with -use DEBUG) that check the angle range.

mrh 2020-03-03 17:37

I've been wanting to do some P-1 runs where the B1/B2 (mostly B2) values are too big for u32. I was thinking of turning them all into u64s, unless this is a bad idea or I'm missing something. What do you think?
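
To make the limit concrete: an unsigned 32-bit value tops out at 4294967295, so the B2=29580000 used earlier in this thread fits easily, while a bound in the billions may not. The sketch below only illustrates the range check and the wraparound that a narrowing conversion to 32 bits would produce. It is not gpuowl code, and the helper names are made up.

```python
U32_MAX = 2**32 - 1  # largest value an unsigned 32-bit integer can hold


def fits_u32(value):
    """True when value can be stored in an unsigned 32-bit integer."""
    return 0 <= value <= U32_MAX


def wrap_u32(value):
    """Value that would remain after truncation to the low 32 bits."""
    return value & U32_MAX


for b2 in (29_580_000, 5_000_000_000):
    print(b2, fits_u32(b2), wrap_u32(b2))
```

A bound that does not fit would silently turn into a much smaller number after truncation, so widening the type is safer than relying on the caller to stay in range.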


All times are UTC. The time now is 23:11.
