mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2019-12-10 14:04

[QUOTE=nomead;532541]Yeah, well, whatever the explanation, I now reran those repeatability runs. 20 runs each of 50000 iterations, alternating between no merge (only NO_ASM), IN1A+OUT1A and IN3+OUT5. The baseline (NO_ASM) had a slight anomaly on the first run (3804 µs) but the rest were 3807 or 3808, with the average being 3807,4 µs including that one outlier. It is very tempting to throw away that first measurement result, but then it wouldn't be an accurate representation of reality anymore.

...
But that's way too much effort to sink into a quick test like this.

Not by my own choice of course, but the win10 box I have at work has autoupdates forced on by group policy (corporate IT).[/QUOTE]
I always throw away the first one. It's there just to bring the program and hardware up to steady state, so that the runs after it show the sustainable throughput. Not only would the first run have the advantage of cool hardware and a higher clock rate where the clock is not pinned, it is also likely to have the advantage that memory is already free and ready to allocate, and others I can't think of right now.
I think you left "quick test" territory a while ago. Wow that's thorough. I'm running single 10,000-iter timings generally.
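A minimal sketch of the warm-up convention described above, assuming the per-run timings are collected in a list. The function name `steady_state_mean` and the sample values are hypothetical and are not taken from any run in this thread.

```python
from statistics import mean

def steady_state_mean(timings_us, warmup_runs=1):
    """Average the timings after dropping the first warmup_runs entries."""
    if len(timings_us) <= warmup_runs:
        raise ValueError("need at least one run after the warm-up")
    return mean(timings_us[warmup_runs:])

# Hypothetical timings in microseconds; the first run is the warm-up.
runs = [3804, 3807, 3808, 3807, 3808]
print("all runs:    ", mean(runs))
print("steady state:", steady_state_mean(runs))
```

Reporting both figures side by side keeps the warm-up run visible without letting it skew the sustained number.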

Re corporate, condolences. Scheduled virus scans and backups may be dodged, but not all aspects.

kriesel 2019-12-10 15:13

gpuowl-v6.11-79-g0c139c4
Win7 Pro x64, AMD RX550 4GB (fixed 1203 MHz gpu clock by design)
89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
config -device 1 -user kriesel -cpu condorella/rx550

15919 NO_ASM (all timings in us/sq; this run: warmup & user interaction)
15915 NO_ASM baseline
20500 NO_ASM,MERGED_MIDDLE,WORKINGIN
20498 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
[B]15585[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN1
15589 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
15751 NO_ASM,MERGED_MIDDLE,WORKINGIN2
15990 NO_ASM,MERGED_MIDDLE,WORKINGIN3
18175 NO_ASM,MERGED_MIDDLE,WORKINGIN4
15568 NO_ASM,MERGED_MIDDLE,WORKINGIN5
16065 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT4
33707 NO_ASM,MERGED_MIDDLE,WORKINGOUT
19353 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
16301 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
16284 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
[B]15945[/B] NO_ASM,MERGED_MIDDLE,WORKINGOUT2
16002 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
16484 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
17037 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
15869 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1
15917 NO_ASM

[B]15373[/B] NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2
repeatability +-1/20499 = +-0.005%
best 15373
base 15915
ratio 1.0353
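
The summary figures can be rechecked from the table with a few lines of arithmetic. This is only a sketch of one way to compute them: the variable names are made up here, and the repeatability is taken as half the spread of the two WORKINGIN runs divided by their midpoint.

```python
base_us = 15915                      # NO_ASM baseline
best_us = 15373                      # NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2
repeat_a, repeat_b = 20500, 20498    # the two WORKINGIN repeatability runs

speedup = base_us / best_us
half_spread = abs(repeat_a - repeat_b) / 2
midpoint = (repeat_a + repeat_b) / 2
repeatability_pct = 100 * half_spread / midpoint

print(f"ratio {speedup:.4f}")
print(f"repeatability +-{repeatability_pct:.3f}%")
```

The gain of roughly 3.5% is far larger than the run-to-run scatter, so it is unlikely to be measurement noise.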

kracker 2019-12-10 15:19

Latest git commit is slightly slower on a P100 (754 vs 751 compared to 0c139c4, 836 vs 821 for P1).

By the way... how is P1 currently for gpuowl?

Prime95 2019-12-10 19:29

[QUOTE=nomead;532530]Ah, OK, so it's more like an array of settings, and one of each list needs to be chosen.[/QUOTE]

The WORKINGIN and WORKINGOUT settings are independent. You do not need to test every combination. That is, if you find that WORKINGIN1 is best with the default setting of WORKINGOUT3, then WORKINGIN1 should be the best choice for all the WORKINGOUT settings.
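
Because the two settings are independent, the search can be done one setting at a time. The sketch below assumes a caller-supplied `measure(in_opt, out_opt)` function that returns us/iteration. `tune_independently`, `fake_measure` and the cost numbers are hypothetical stand-ins, not gpuowl code or real timings.

```python
def tune_independently(measure, in_options, out_options, default_out):
    """Pick the best IN with the default OUT, then the best OUT with that IN.

    Takes about len(in_options) + len(out_options) measurements
    instead of len(in_options) * len(out_options).
    """
    best_in = min(in_options, key=lambda i: measure(i, default_out))
    best_out = min(out_options, key=lambda o: measure(best_in, o))
    return best_in, best_out

# Made-up cost model in which the two settings contribute independently.
IN_COST = {"WORKINGIN1": 30, "WORKINGIN3": 45, "WORKINGIN5": 20}
OUT_COST = {"WORKINGOUT2": 10, "WORKINGOUT3": 25, "WORKINGOUT5": 40}

def fake_measure(in_opt, out_opt):
    return 1000 + IN_COST[in_opt] + OUT_COST[out_opt]

print(tune_independently(fake_measure, list(IN_COST), list(OUT_COST), "WORKINGOUT3"))
```

If the settings interacted on some GPU, this shortcut could miss the true optimum, so a final confirmation run of the chosen pair is cheap insurance.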

It is interesting that the 2080 and P100 show little difference among the choices. On the Radeon VII, there can be 100+us difference (15+%).

Prime95 2019-12-10 19:45

[QUOTE=kracker;532551]Latest git commit is slightly slower on a P100?[/QUOTE]

Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.

kracker 2019-12-10 20:42

[QUOTE=Prime95;532566]Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.[/QUOTE]

Running at 749/750 us/it now...:whee:
We may be needing a place where we can look up/submit the best gpu settings for various GPUs running gpuowl...

Prime95 2019-12-10 22:26

[QUOTE=kracker;532569]Running at 749/750 us/it now...:whee:
We may be needing a place where we can look up/submit the best gpu settings for various GPUs running gpuowl...[/QUOTE]

Interesting. There are several other places in the code that could shuffle T values (a double) rather than T2 values (2 doubles - a complex number). It would double the amount of local storage required, which could negatively impact occupancy....
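
To put a number on the doubling: a double is 8 bytes and a complex pair of doubles is 16, so a shuffle buffer with the same element count is twice as large for T2. The sketch below assumes a hypothetical buffer of 256 elements and a hypothetical 64 KiB of local memory. Real limits depend on the GPU, and occupancy is also bounded by registers and other resources that are not modelled here.

```python
DOUBLE_BYTES = 8            # one T value (a double)
COMPLEX_BYTES = 2 * 8       # one T2 value (two doubles)

def shuffle_buffer_bytes(elements, element_bytes):
    """Local memory needed to hold `elements` values of the given size."""
    return elements * element_bytes

def workgroups_fitting(local_mem_bytes, buffer_bytes):
    """How many such buffers fit into the local memory budget."""
    return local_mem_bytes // buffer_bytes

elements = 256              # hypothetical buffer length
local_mem = 64 * 1024       # hypothetical local memory budget in bytes

for name, size in (("T", DOUBLE_BYTES), ("T2", COMPLEX_BYTES)):
    buf = shuffle_buffer_bytes(elements, size)
    print(name, buf, "bytes per workgroup,",
          workgroups_fitting(local_mem, buf), "workgroups fit")
```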

xx005fs 2019-12-10 22:40

Interesting...
 
Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.
[CODE]2019-12-10 14:39:00 gpuowl v6.11-82-gdb9ce44-dirty
2019-12-10 14:39:00 Note: no config.txt file found
2019-12-10 14:39:00 config: -device 0 -carry short -nospin -use MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE -block 500
2019-12-10 14:39:00 94204153 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.97 bits/word
2019-12-10 14:39:01 OpenCL args "-DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-10 14:39:01 OpenCL compilation error -11 (args -DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-10 14:39:01 C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:13:9: warning: GpuOwl requires OpenCL 200, found 200
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
^
C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
^
1 warning and 1 error generated.

error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR

2019-12-10 14:39:01 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-10 14:39:01 Bye[/CODE]
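
The log shows each name passed to -use turning up in the OpenCL args as a -DNAME=1 define. A small sketch of that mapping, for illustration only: `use_flags_to_defines` is a hypothetical helper, not gpuowl's actual implementation.

```python
def use_flags_to_defines(use_arg):
    """Turn 'A,B,C' into ['-DA=1', '-DB=1', '-DC=1'], skipping empty entries."""
    names = [n.strip() for n in use_arg.split(",") if n.strip()]
    return [f"-D{n}=1" for n in names]

use_arg = "MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE"
print(" ".join(use_flags_to_defines(use_arg)))
```

Comparing the printed defines against the "OpenCL args" line is a quick way to confirm that a -use option was spelled correctly and actually reached the compiler.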

Prime95 2019-12-11 00:12

[QUOTE=Prime95;532566]Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.[/QUOTE]

Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide.

I had just hacked in the new shuffle. Now I'll go back and code it up proper (with -use switches) so we can turn the feature on and off as needed on different GPUs.

Thanks for prompting me to try this!

CRGreathouse 2019-12-11 01:38

[QUOTE=Prime95;532584]Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide.[/QUOTE]

:shock:

There is such a wealth of knowledge on these boards, I find myself constantly in awe.

preda 2019-12-11 03:53

The OpenCL version check should be fixed now (recent commit).

[QUOTE=xx005fs;532580]Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.[/QUOTE]


All times are UTC.