[QUOTE=nomead;532541]Yeah, well, whatever the explanation, I now reran those repeatability runs. 20 runs each of 50000 iterations, alternating between no merge (only NO_ASM), IN1A+OUT1A and IN3+OUT5. The baseline (NO_ASM) had a slight anomaly on the first run (3804 µs) but the rest were 3807 or 3808, with the average being 3807,4 µs including that one outlier. It is very tempting to throw away that first measurement result, but then it wouldn't be an accurate representation of reality anymore.
... But that's way too much effort to sink into a quick test like this. Not by my own choice of course, but the win10 box I have at work has autoupdates forced on by group policy (corporate IT).[/QUOTE]
I always throw away the first one. Its only purpose is to bring the program and hardware up to steady state, so that the remaining runs show the sustainable throughput. Not only would the first run have the advantage of cool hardware and a higher clock rate where the clock is not pinned, it is also likely to benefit from memory that is already free and ready to allocate, and from other effects I can't think of right now.
I think you left "quick test" territory a while ago. Wow, that's thorough. I generally run single 10,000-iteration timings.
Re corporate: condolences. Scheduled virus scans and backups can be dodged, but not every source of interference. |
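A minimal sketch of the "throw away the first one" approach, assuming the timings are already collected as a list of microsecond values. The function name `steady_state_mean` and the sample numbers are hypothetical and are not the measurements quoted above.

```python
from statistics import mean


def steady_state_mean(timings_us, warmup_runs=1):
    """Average the timings after dropping the first `warmup_runs` runs."""
    if len(timings_us) <= warmup_runs:
        raise ValueError("need more runs than warm-up runs")
    return mean(timings_us[warmup_runs:])


# Hypothetical timings in microseconds; the first run is the warm-up outlier.
runs = [3804, 3807, 3808, 3807, 3808]
print("all runs:       ", mean(runs))
print("warm-up dropped:", steady_state_mean(runs))
```

Reporting both numbers side by side keeps the outlier visible while still giving a steady-state figure.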
gpuowl-v6.11-79-g0c139c4
Win7 Pro x64, AMD RX550 4GB (fixed 1203 MHz GPU clock by design)
89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
config -device 1 -user kriesel -cpu condorella/rx550

us/sq, followed by the -use options:
15919 NO_ASM (warmup & user interaction)
15915 NO_ASM baseline
20500 NO_ASM,MERGED_MIDDLE,WORKINGIN
20498 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
[B]15585[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN1
15589 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
15751 NO_ASM,MERGED_MIDDLE,WORKINGIN2
15990 NO_ASM,MERGED_MIDDLE,WORKINGIN3
18175 NO_ASM,MERGED_MIDDLE,WORKINGIN4
15568 NO_ASM,MERGED_MIDDLE,WORKINGIN5
16065 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT4
33707 NO_ASM,MERGED_MIDDLE,WORKINGOUT
19353 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
16301 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
16284 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
[B]15945[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT2
16002 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
16484 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
17037 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
15869 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1
15917 NO_ASM
[B]15373[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2

repeatability +-1/20499 = +-0.005%
best 15373 base 15915 ratio 1.0353 |
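The summary figures in that post can be checked with a few lines of arithmetic. This is only a sketch: the timings are copied from the RX550 listing, keyed by the options that follow `NO_ASM,MERGED_MIDDLE`, and the dictionary layout and variable names are my own.

```python
# Timings in us/sq, copied from the RX550 listing (options after NO_ASM,MERGED_MIDDLE).
BASELINE_US = 15915  # plain NO_ASM baseline
timings_us = {
    "WORKINGIN1": 15585, "WORKINGIN1A": 15589, "WORKINGIN2": 15751,
    "WORKINGIN3": 15990, "WORKINGIN4": 18175, "WORKINGIN5": 15568,
    "WORKINGOUT": 33707, "WORKINGOUT0": 19353, "WORKINGOUT1": 16301,
    "WORKINGOUT1A": 16284, "WORKINGOUT2": 15945, "WORKINGOUT3": 16002,
    "WORKINGOUT4": 16484, "WORKINGOUT5": 17037,
    "WORKINGIN5,WORKINGOUT4": 16065, "WORKINGIN5,WORKINGOUT1": 15869,
    "WORKINGIN5,WORKINGOUT2": 15373,
}

best_config = min(timings_us, key=timings_us.get)
ratio = BASELINE_US / timings_us[best_config]

# Repeatability from the two WORKINGIN runs (20500 and 20498 us/sq).
repeat_a, repeat_b = 20500, 20498
midpoint = (repeat_a + repeat_b) / 2
spread_pct = abs(repeat_a - repeat_b) / 2 / midpoint * 100

print(f"best: {best_config} at {timings_us[best_config]} us/sq")
print(f"speedup over baseline: {ratio:.4f}")
print(f"repeatability: +-{spread_pct:.3f}%")
```

The best single WORKINGIN and WORKINGOUT choices in the listing (5 and 2) are also the pair that gives the best combined time.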
Latest git commit is slightly slower on a P100 (754 vs 751 compared to 0c139c4, 836 vs 821 for P1).
By the way... how is P1 currently for gpuowl? |
[QUOTE=nomead;532530]Ah, OK, so it's more like an array of settings, and one of each list needs to be chosen.[/QUOTE]
The WORKINGIN and WORKINGOUT settings are independent. You do not need to test every combination. That is, if you find that WORKINGIN1 is best with the default setting of WORKINGOUT3, then WORKINGIN1 should be the best choice for all the WORKINGOUT settings.
It is interesting that the 2080 and P100 show little difference among the choices. On the Radeon VII, there can be a 100+ us difference (15+%). |
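If the two settings really are independent, tuning needs one pass per setting instead of a full grid. The sketch below assumes a caller-supplied `measure(working_in, working_out)` function that returns a time. The fake measurement and its additive cost numbers are purely hypothetical and exist only to make the example runnable, not to model any real GPU.

```python
IN_OPTIONS = ["WORKINGIN", "WORKINGIN1", "WORKINGIN1A", "WORKINGIN2",
              "WORKINGIN3", "WORKINGIN4", "WORKINGIN5"]
OUT_OPTIONS = ["WORKINGOUT", "WORKINGOUT0", "WORKINGOUT1", "WORKINGOUT1A",
               "WORKINGOUT2", "WORKINGOUT3", "WORKINGOUT4", "WORKINGOUT5"]


def tune_independently(measure, in_options, out_options, default_out):
    """Pick the best WORKINGIN first, then the best WORKINGOUT, one axis at a time."""
    best_in = min(in_options, key=lambda w_in: measure(w_in, default_out))
    best_out = min(out_options, key=lambda w_out: measure(best_in, w_out))
    return best_in, best_out


def make_fake_measure():
    """Hypothetical additive cost model standing in for a real timing run."""
    in_cost = {name: 10 * abs(i - 4) for i, name in enumerate(IN_OPTIONS)}
    out_cost = {name: 7 * abs(i - 2) for i, name in enumerate(OUT_OPTIONS)}
    calls = []

    def measure(w_in, w_out):
        calls.append((w_in, w_out))
        return 1000 + in_cost[w_in] + out_cost[w_out]

    return measure, calls


measure, calls = make_fake_measure()
best = tune_independently(measure, IN_OPTIONS, OUT_OPTIONS, default_out="WORKINGOUT3")
print("best pair:", best)
print("measurements used:", len(calls), "of", len(IN_OPTIONS) * len(OUT_OPTIONS), "in a full grid")
```

With 7 and 8 options this takes 15 timing runs rather than 56. It only finds the true optimum when the independence claim holds for the GPU being tested.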
[QUOTE=kracker;532551]Latest git commit is slightly slower on a P100?[/QUOTE]
Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times. |
[QUOTE=Prime95;532566]Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.[/QUOTE]
Running at 749/750 us/it now...:whee: We may be needing a place where we can lookup/submit the best gpu settings for various GPU's running gpuowl... |
[QUOTE=kracker;532569]Running at 749/750 us/it now...:whee:
We may be needing a place where we can lookup/submit the best gpu settings for various GPU's running gpuowl...[/QUOTE]
Interesting. There are several other places in the code that could shuffle T values (a double) rather than T2 values (2 doubles - a complex number). It would double the amount of local storage required, which could negatively impact occupancy... |
Interesting...
Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.
[CODE]2019-12-10 14:39:00 gpuowl v6.11-82-gdb9ce44-dirty
2019-12-10 14:39:00 Note: no config.txt file found
2019-12-10 14:39:00 config: -device 0 -carry short -nospin -use MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE -block 500
2019-12-10 14:39:00 94204153 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.97 bits/word
2019-12-10 14:39:01 OpenCL args "-DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-10 14:39:01 OpenCL compilation error -11 (args -DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-10 14:39:01 C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:13:9: warning: GpuOwl requires OpenCL 200, found 200
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
        ^
C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
 ^
1 warning and 1 error generated.
error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR
2019-12-10 14:39:01 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-10 14:39:01 Bye[/CODE] |
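In that log, each name passed to `-use` reappears in the OpenCL args as `-DNAME=1`. The sketch below reproduces only that observed mapping. The helper `use_flags_to_defines` is hypothetical and is not gpuowl's own code.

```python
def use_flags_to_defines(use_arg):
    """Turn a comma-separated -use list into -DNAME=1 compiler defines."""
    names = [name.strip() for name in use_arg.split(",") if name.strip()]
    return [f"-D{name}=1" for name in names]


# The -use list from the config line of the log above.
use_arg = "MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE"
print(" ".join(use_flags_to_defines(use_arg)))
```

Comparing the printed defines against the "OpenCL args" line is a quick way to confirm that a `-use` option actually reached the kernel compiler.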
[QUOTE=Prime95;532566]Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.[/QUOTE]
Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide. I had just hacked in the new shuffle. Now I'll go back and code it up proper (with -use switches) so we can turn the feature on and off as needed on different GPUs. Thanks for prompting me to try this! |
[QUOTE=Prime95;532584]Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide.[/QUOTE]
:shock: There is such a wealth of knowledge on these boards, I find myself constantly in awe. |
The OpenCL version check should be fixed now (recent commit)
[QUOTE=xx005fs;532580]Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.
[CODE]2019-12-10 14:39:00 gpuowl v6.11-82-gdb9ce44-dirty
2019-12-10 14:39:00 Note: no config.txt file found
2019-12-10 14:39:00 config: -device 0 -carry short -nospin -use MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE -block 500
2019-12-10 14:39:00 94204153 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.97 bits/word
2019-12-10 14:39:01 OpenCL args "-DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-10 14:39:01 OpenCL compilation error -11 (args -DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-10 14:39:01 C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:13:9: warning: GpuOwl requires OpenCL 200, found 200
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
        ^
C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
 ^
1 warning and 1 error generated.
error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR
2019-12-10 14:39:01 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-10 14:39:01 Bye[/CODE][/QUOTE] |