Congrats on getting gpuowl working under rocm 3.1.
Warning to ROCm 2.10 users: ROCm 3.1 is slower. The occupancy of carryFused went from 7 to 6. Can you post the register usage for carryFused, tailFused, and the two fftMiddle routines in ROCm 3.1? P.S. With your last sin/cos change, you can delete the comments on MORE_ACCURATE and LESS_ACCURATE. You should also be able to change P-1 back to using the newer trig code. |
The MIDDLE=1 FFTs appear to be broken.
Question: Would gpuowl benefit from a MIDDLE=8 step? The whole reason for MIDDLE=1 was to do 4 passes over memory instead of 6. Now that we support MERGED_MIDDLE, both MIDDLE=1 and MIDDLE=8 would do 4 passes over memory. If you think it would help, I'll write a middle=8 routine for you. |
[QUOTE=Prime95;538656]The occupancy of carryFused went from 7 to 6.
Can you post the register usage for carryFused, tailFused, and two fftMiddle routines in rocm 3.1? [/QUOTE] This is the difference in occupancy/VGPRs between 2.10 (on the left) and 3.1 (on the right):
[CODE]
fftHout           : Occupancy: = 7   | fftHout           : Occupancy: = 6
fftMiddleOut      : Occupancy: = 3   | fftMiddleOut      : Occupancy: = 4
carryA            : Occupancy: = 10  | carryA            : Occupancy: = 9
carryFused        : Occupancy: = 7   | carryFused        : Occupancy: = 6
square            : Occupancy: = 9   | square            : Occupancy: = 10
---------------------
isEqual           : NumVgprs: = 7    | isEqual           : NumVgprs: = 6
isNotZero         : NumVgprs: = 5    | isNotZero         : NumVgprs: = 4
fftW              : NumVgprs: = 32   | fftW              : NumVgprs: = 31
fftHin            : NumVgprs: = 35   | fftHin            : NumVgprs: = 36
fftHout           : NumVgprs: = 36   | fftHout           : NumVgprs: = 39
k_fftP            : NumVgprs: = 33   | k_fftP            : NumVgprs: = 34
fftMiddleIn       : NumVgprs: = 63   | fftMiddleIn       : NumVgprs: = 62
fftMiddleOut      : NumVgprs: = 65   | fftMiddleOut      : NumVgprs: = 64
carryA            : NumVgprs: = 15   | carryA            : NumVgprs: = 26
carryM            : NumVgprs: = 15   | carryM            : NumVgprs: = 18
carryB            : NumVgprs: = 15   | carryB            : NumVgprs: = 14
carryFused        : NumVgprs: = 36   | carryFused        : NumVgprs: = 38
carryFusedMul     : NumVgprs: = 39   | carryFusedMul     : NumVgprs: = 37
transposeW        : NumVgprs: = 105  | transposeW        : NumVgprs: = 102
transposeH        : NumVgprs: = 103  | transposeH        : NumVgprs: = 101
square            : NumVgprs: = 27   | square            : NumVgprs: = 24
tailFusedMul      : NumVgprs: = 120  | tailFusedMul      : NumVgprs: = 118
tailFusedMulLow   : NumVgprs: = 122  | tailFusedMulLow   : NumVgprs: = 120
tailFusedMulDelta : NumVgprs: = 122  | tailFusedMulDelta : NumVgprs: = 120
[/CODE]
As you say, the occupancy of carryFused went one notch down (it is exactly 1 VGPR over; it may be possible to win that one back). OTOH I see a speedup of about 1.2% on ROCm 3.1 compared to 2.10. |
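The link between the NumVgprs and Occupancy columns can be sketched in a few lines of Python. This is only a rough model and rests on assumptions that are not stated in the thread: a GCN-style SIMD with 256 VGPRs per lane, VGPRs allocated in blocks of 4, and a cap of 10 waves per SIMD. It ignores SGPR and LDS limits. The function name `occupancy_from_vgprs` is made up for illustration.

```python
def occupancy_from_vgprs(num_vgprs, vgprs_per_lane=256, granularity=4, max_waves=10):
    """Estimate waves per SIMD from VGPR usage alone (assumed GCN-style limits)."""
    # Round the register count up to the assumed allocation granularity.
    allocated = -(-num_vgprs // granularity) * granularity
    # Each wave needs `allocated` registers out of the per-lane budget.
    return min(max_waves, vgprs_per_lane // allocated)


# (kernel, NumVgprs on 2.10, NumVgprs on 3.1), taken from the table above.
kernels = [
    ("fftHout", 36, 39),
    ("fftMiddleOut", 65, 64),
    ("carryA", 15, 26),
    ("carryFused", 36, 38),
    ("square", 27, 24),
]

for name, old, new in kernels:
    print(f"{name:13s} {occupancy_from_vgprs(old):2d} -> {occupancy_from_vgprs(new):2d}")
```

Under these assumptions the model reproduces the five occupancy changes listed in the table, and it puts the budget for occupancy 7 at 36 VGPRs.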
[QUOTE=Prime95;538685]The MIDDLE=1 FFTs appear to be broken.
[/QUOTE] Do you know why it's broken? (or which change broke it?) [QUOTE] Question: Would gpuowl benefit from a MIDDLE=8 step? The whole reason for MIDDLE=1 was to do 4 passes over memory instead of 6. Now that we support MERGED_MIDDLE, both MIDDLE=1 and MIDDLE=8 would do 4 passes over memory. If you think it would help, I'll write a middle=8 routine for you.[/QUOTE] Do I understand correctly that only powers-of-two FFTs would benefit from a middle=8? The wavefront is not on a power of two ATM, but will get there. Would a middle=4 make sense? About how well it would work (compared to middle=1), I guess we have to try it to know. |
[QUOTE=preda;538694]Do you know why it's broken? (or which change broke it?)
Do I understand correctly that only powers-of-two FFTs would benefit from a middle=8? The wavefront is not on a power of two ATM, but will get there. Would a middle=4 make sense? About how well it would work (compared to middle=1), I guess we have to try it to know.[/QUOTE] I did some digging: MIDDLE=1 requires ORIG_SLOWTRIG. Apparently transpose uses k,n values outside the range 0 to pi/2. I timed a 3.5M, 4M, 4.5M FFT all with ORIG_SLOWTRIG. Times were 509us, 803us, 651us respectively. So we definitely need to do something! I agree we need both a MIDDLE=4 and MIDDLE=8. This would let us eliminate the fft_HEIGHT and fft_WIDTH routines that are built on fft8. It has been my experience that fft8 is substantially slower than fft4, probably due to the extra VGPRs required. We also need MIDDLE=13,14,15 so we have a complete set of MIDDLE values from 4 to 15. |
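To see which FFT lengths a full set of MIDDLE values would cover, here is a small Python sketch. It assumes the FFT length in words is 2 × WIDTH × SMALL_HEIGHT × MIDDLE. That relation is inferred from the gpuowl log line quoted in this thread (5632K with WIDTH=1024, SMALL_HEIGHT=256, MIDDLE=11), not taken from the gpuowl source, and the helper name `fft_size_k` is hypothetical.

```python
def fft_size_k(width, small_height, middle):
    """FFT length in K words, assuming length = 2 * WIDTH * SMALL_HEIGHT * MIDDLE."""
    return 2 * width * small_height * middle // 1024


# Assumed geometry: the WIDTH and SMALL_HEIGHT reported in the 5632K log line.
WIDTH, SMALL_HEIGHT = 1024, 256

for middle in range(4, 16):
    print(f"MIDDLE={middle:2d} -> {fft_size_k(WIDTH, SMALL_HEIGHT, middle)}K")
```

With this geometry, MIDDLE=7, 8 and 9 would give 3584K (3.5M), 4096K (4M) and 4608K (4.5M), so a MIDDLE=8 routine would let the 4M power-of-two length share the same WIDTH and HEIGHT kernels as its neighbours.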
[QUOTE=preda;538692]As you say, the occupancy of carryFused went one notch down. (it is exactly 1 VGPR over, it may be possible to win that one back). [/QUOTE]
I'll upgrade a machine to 3.1 and work on this. I remember it took a full day fighting the optimizer to save that one VGPR. |
config.txt
In gpuowl on a given GPU model, the -use options that are optimal for one FFT length are not optimal for another, and sometimes they break correct operation at another FFT length. In PRP the Gerbicz error check (GEC) catches this and stops progress until the user intervenes; in P-1 stage 1 it sometimes produces a meaningless all-zeroes computation.
Perhaps the config.txt syntax could be extended to support a regardless-of-FFT-length line, optional FFT-length-specific -use lines, and an optional default line that is safe but slower, for FFT lengths that have not been benchmarked and tuned yet? Maybe something like:
[CODE]
all: -device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500
4608K: -use NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1,T2_SHUFFLE_WIDTH,T2_SHUFFLE_HEIGHT,UNROLL_MIDDLEMUL2,UNROLL_MIDDLEMUL1,CARRY32,CHEBYSHEV_METHOD_FMA,CHEBYSHEV_MIDDLEMUL2,LESS_ACCURATE
5120K: -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG
default: -use NO_ASM
[/CODE]
Worktodo line to reproduce (there are others):
[CODE]B1=1020000,B2=29580000;PFactor=0,1,2,96580489,-1,77,2[/CODE]
The following -use options are the result of optimization runs for 5120K, but they break P-1 at 5632K:
[CODE]-device 0 -user kriesel -cpu condorella/rx480 -yield -maxAlloc 7500 -use NO_ASM,UNROLL_HEIGHT,UNROLL_WIDTH,MERGED_MIDDLE,WORKINGIN1,WORKINGOUT1,T2_SHUFFLE_HEIGHT,T2_SHUFFLE_WIDTH,CARRY32,MORE_SQUARES_MIDDLEMUL1,CHEBYSHEV_MIDDLEMUL2,NEW_SLOWTRIG[/CODE]
This results in an all-zero, repeating res64 at the start of stage 1:
[CODE]
2020-03-02 10:22:05 device 0, unique id ''
2020-03-02 10:22:05 condorella/rx480 96580489 FFT 5632K: Width 256x4, Height 64x4, Middle 11; 16.75 bits/word
2020-03-02 10:22:07 condorella/rx480 OpenCL args "-DEXP=96580489u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=11u -DWEIGHT_STEP=0x9.893b9e4410c28p-3 -DIWEIGHT_STEP=0xd.6c37c4b92b54p-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DCARRY32=1 -DCHEBYSHEV_MIDDLEMUL2=1 -DMERGED_MIDDLE=1 -DMORE_SQUARES_MIDDLEMUL1=1 -DNEW_SLOWTRIG=1 -DNO_ASM=1 -DT2_SHUFFLE_HEIGHT=1 -DT2_SHUFFLE_WIDTH=1 -DUNROLL_HEIGHT=1 -DUNROLL_WIDTH=1 -DWORKINGIN1=1 -DWORKINGOUT1=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2020-03-02 10:22:10 condorella/rx480 OpenCL compilation in 3.03 s
2020-03-02 10:22:10 condorella/rx480 96580489 P1 B1=1020000, B2=29580000; 1471504 bits; starting at 0
2020-03-02 10:22:48 condorella/rx480 96580489 P1 10000 0.68%; 3789 us/it; ETA 0d 01:32; 0000000000000000
2020-03-02 10:23:26 condorella/rx480 96580489 P1 20000 1.36%; 3797 us/it; ETA 0d 01:32; 0000000000000000
2020-03-02 10:23:41 condorella/rx480 Stopping, please wait..
2020-03-02 10:23:41 condorella/rx480 Exiting because "stop requested"
2020-03-02 10:23:41 condorella/rx480 Bye
[/CODE]
This is repeatable for other exponents. |
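To make the proposed lookup rule concrete, here is a minimal Python sketch of how such a file might be resolved. The `all:` / `<length>K:` / `default:` prefixes are only the syntax suggested in this post, not something gpuowl implements, and `parse_config` and `args_for_fft` are hypothetical helper names. The -use lists are shortened for readability.

```python
def parse_config(text):
    """Split the proposed config syntax into a {prefix: argument string} dict."""
    sections = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank lines and lines without a prefix
        key, _, args = line.partition(":")  # split at the first colon only
        sections[key.strip()] = args.strip()
    return sections


def args_for_fft(sections, fft_key):
    """Combine the 'all' line with the FFT-specific line, else the 'default' line."""
    common = sections.get("all", "")
    specific = sections.get(fft_key, sections.get("default", ""))
    return " ".join(part for part in (common, specific) if part)


# Shortened, illustrative config in the proposed syntax.
sample = """\
all: -device 0 -yield
5120K: -use NO_ASM,MERGED_MIDDLE,NEW_SLOWTRIG
default: -use NO_ASM
"""

sections = parse_config(sample)
print(args_for_fft(sections, "5120K"))  # tuned length: uses its own -use line
print(args_for_fft(sections, "5632K"))  # untuned length: falls back to default
```

With a rule like this, an untuned length such as 5632K would pick up the safe `default:` line instead of inheriting options tuned for 5120K.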
[QUOTE=Prime95;538727]I'll upgrade a machine to 3.1 and work on this..[/QUOTE]
Giant mistake. Upgrading did not work. Had to reinstall the OS. Rocm 3.1 doesn't work even on a fresh install (clinfo cannot find any of the GPUs). The entire process has bricked one of the GPUs. Not happy. The dmesg error on the bricked GPU is "Direct firmware load for amdgpu/vega20_ta.bin failed with error -2". Six hours wasted, more struggles ahead. Correction: I get the dmesg error on the two working GPUs. No error message for the bricked GPU. |
Holy moly.
This is a perfect example of why it's called "the bleeding edge". This is especially true anytime AMD drivers are involved. I knew their Windows drivers could be radioactive to the point of having to reinstall the OS, but I had no idea Linux could be lethally irradiated also. |
[QUOTE=Prime95;538726]I did some digging: MIDDLE=1 requires ORIG_SLOWTRIG. Apparently transpose uses k,n values outside the range 0 to pi/2.
I timed a 3.5M, 4M, 4.5M FFT all with ORIG_SLOWTRIG. Times were 509us, 803us, 651us respectively. So we definitely need to do something! I agree we need both a MIDDLE=4 and MIDDLE=8. This would let us eliminate the fft_HEIGHT and fft_WIDTH routines that are built on fft8. It has been my experience that fft8 is substantially slower than fft4, probably due to the extra VGPRs required. We also need MIDDLE=13,14,15 so we have a complete set of MIDDLE values from 4 to 15.[/QUOTE] Thank you! I just merged the changes. Indeed, this is a massive speedup on power-of-2 FFTs, probably bringing them in line with the other sizes (and thanks for the FFT-size display fix). I added a few assert()s to the .cl (enabled with -use DEBUG) that allow checking the angle range. |
I've been wanting to do some P-1 runs where the B1/B2 (mostly B2) values are too big for u32. I was thinking of turning them all into u64s, unless this is a bad idea or I'm missing something. What do you think?
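For reference, the u32 limit in question is 2^32 - 1 = 4294967295. The short Python sketch below shows what would happen to a bound above that limit if it were stored in a 32-bit unsigned field. The large B2 value is a made-up example, and `fits_u32` and `wrap_u32` are illustrative helpers, not gpuowl functions.

```python
U32_MAX = 2**32 - 1  # largest value a 32-bit unsigned integer can hold


def fits_u32(value):
    """True if the bound can be stored in a u32 without loss."""
    return 0 <= value <= U32_MAX


def wrap_u32(value):
    """Value that would result from truncating to 32 bits (modulo 2**32)."""
    return value & U32_MAX


b2_small = 29_580_000      # B2 from the worktodo line earlier in the thread
b2_large = 5_000_000_000   # hypothetical B2 above the u32 limit

print(fits_u32(b2_small), wrap_u32(b2_small))
print(fits_u32(b2_large), wrap_u32(b2_large))
```

A silently wrapped B2 would run stage 2 to a much smaller bound than requested, which is the argument for widening the fields rather than range-checking them.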
|