mersenneforum.org  

2019-12-10, 14:04   #1552
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

Quote:
Originally Posted by nomead View Post
Yeah, well, whatever the explanation, I now reran those repeatability runs. 20 runs each of 50000 iterations, alternating between no merge (only NO_ASM), IN1A+OUT1A and IN3+OUT5. The baseline (NO_ASM) had a slight anomaly on the first run (3804 µs) but the rest were 3807 or 3808, with the average being 3807,4 µs including that one outlier. It is very tempting to throw away that first measurement result, but then it wouldn't be an accurate representation of reality anymore.

...
But that's way too much effort to sink into a quick test like this.

Not by my own choice of course, but the win10 box I have at work has autoupdates forced on by group policy (corporate IT).
I always throw away the first one. It's there just to bring the program and hardware up to steady state, so that the runs that follow show the sustainable throughput. Not only would the first run have the advantage of cool hardware and a higher clock rate where the clock is not pinned, it is also likely to have the advantage that memory is already free and ready to allocate, and others I can't think of right now.
I think you left "quick test" territory a while ago. Wow that's thorough. I'm running single 10,000-iter timings generally.

Re corporate, condolences. Scheduled virus scans and backups may be dodged, but not all aspects.
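
To make the "throw away the first one" approach concrete, here is a minimal Python sketch. The function name `steady_state_mean` and the run values are assumptions for illustration only: the list is merely consistent with the figures quoted above (one run at 3804 µs, the rest at 3807 or 3808 µs) and is not nomead's actual data.

```python
from statistics import mean


def steady_state_mean(timings_us, warmup_runs=1):
    """Average the timings after dropping the first `warmup_runs` runs."""
    if len(timings_us) <= warmup_runs:
        raise ValueError("need more runs than warm-up runs")
    return mean(timings_us[warmup_runs:])


# Illustrative values only (hypothetical, not measured data):
# one low first run, then 19 runs of 3807 or 3808 us.
runs = [3804] + [3807] * 8 + [3808] * 11

print(f"all runs:        {mean(runs):.2f} us")
print(f"without warm-up: {steady_state_mean(runs):.2f} us")
```

With values this tight the two averages differ by well under 1 µs, so the choice matters mostly when the first run is a clear outlier.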

Last fiddled with by kriesel on 2019-12-10 at 14:06

2019-12-10, 15:13   #1553
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

gpuowl-v6.11-79-g0c139c4
Win7 Pro x64, AMD RX550 4GB (fixed 1203 MHz GPU clock by design)
89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
config -device 1 -user kriesel -cpu condorella/rx550

us/sq -use options
15919 NO_ASM (warmup & user interaction)
15915 NO_ASM (baseline)
20500 NO_ASM,MERGED_MIDDLE,WORKINGIN
20498 NO_ASM,MERGED_MIDDLE,WORKINGIN (repeatability)
15585 NO_ASM,MERGED_MIDDLE,WORKINGIN1
15589 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
15751 NO_ASM,MERGED_MIDDLE,WORKINGIN2
15990 NO_ASM,MERGED_MIDDLE,WORKINGIN3
18175 NO_ASM,MERGED_MIDDLE,WORKINGIN4
15568 NO_ASM,MERGED_MIDDLE,WORKINGIN5
16065 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT4
33707 NO_ASM,MERGED_MIDDLE,WORKINGOUT
19353 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
16301 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
16284 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
15945 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
16002 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
16484 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
17037 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
15869 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1
15917 NO_ASM

15373 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2
repeatability ±1/20499 = ±0.005%
best 15373
base 15915
ratio 1.0353
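
The best/base/ratio summary can be recomputed from the timings with a short script. This is only a sketch: the numbers are copied from the list above (warm-up run left out), while the `parse` helper and the variable names are my own.

```python
# Timings in us/sq copied from the list above; the warm-up run is omitted.
RESULTS = """\
15915 NO_ASM
20500 NO_ASM,MERGED_MIDDLE,WORKINGIN
20498 NO_ASM,MERGED_MIDDLE,WORKINGIN
15585 NO_ASM,MERGED_MIDDLE,WORKINGIN1
15589 NO_ASM,MERGED_MIDDLE,WORKINGIN1A
15751 NO_ASM,MERGED_MIDDLE,WORKINGIN2
15990 NO_ASM,MERGED_MIDDLE,WORKINGIN3
18175 NO_ASM,MERGED_MIDDLE,WORKINGIN4
15568 NO_ASM,MERGED_MIDDLE,WORKINGIN5
16065 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT4
33707 NO_ASM,MERGED_MIDDLE,WORKINGOUT
19353 NO_ASM,MERGED_MIDDLE,WORKINGOUT0
16301 NO_ASM,MERGED_MIDDLE,WORKINGOUT1
16284 NO_ASM,MERGED_MIDDLE,WORKINGOUT1A
15945 NO_ASM,MERGED_MIDDLE,WORKINGOUT2
16002 NO_ASM,MERGED_MIDDLE,WORKINGOUT3
16484 NO_ASM,MERGED_MIDDLE,WORKINGOUT4
17037 NO_ASM,MERGED_MIDDLE,WORKINGOUT5
15869 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT1
15917 NO_ASM
15373 NO_ASM,MERGED_MIDDLE,WORKINGIN5,WORKINGOUT2
"""


def parse(text):
    """Turn 'time flags' lines into a list of (microseconds, flags) tuples."""
    rows = []
    for line in text.splitlines():
        us, flags = line.split(maxsplit=1)
        rows.append((int(us), flags))
    return rows


rows = parse(RESULTS)
# The first plain NO_ASM entry is the baseline run.
base = next(us for us, flags in rows if flags == "NO_ASM")
best_us, best_flags = min(rows)
ratio = base / best_us

print(f"best  {best_us}  ({best_flags})")
print(f"base  {base}")
print(f"ratio {ratio:.4f}")
```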

2019-12-10, 15:19   #1554
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

100001111000₂ Posts

Latest git commit is slightly slower on a P100 (754 vs 751 us/it compared to 0c139c4, 836 vs 821 for P1).

By the way... how is P1 currently for gpuowl?

2019-12-10, 19:29   #1555
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

19·397 Posts

Quote:
Originally Posted by nomead View Post
Ah, OK, so it's more like an array of settings, and one of each list needs to be chosen.
The WORKINGIN and WORKINGOUT settings are independent. You do not need to test every combination. That is, if you find that WORKINGIN1 is best with the default setting of WORKINGOUT3, then WORKINGIN1 should be the best choice for all the WORKINGOUT settings.

It is interesting that the 2080 and P100 show little difference among the choices. On the Radeon VII, there can be 100+us difference (15+%).
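
The independence point can be sketched with kriesel's RX550 numbers from earlier in the thread. This is an illustration, not a tuning tool: it assumes each sweep in that table was run with the other setting left at its default, and the option lists are only the ones that appear in that table.

```python
# us/sq from kriesel's RX550 table earlier in the thread.
IN_SWEEP = {
    "WORKINGIN": 20500, "WORKINGIN1": 15585, "WORKINGIN1A": 15589,
    "WORKINGIN2": 15751, "WORKINGIN3": 15990, "WORKINGIN4": 18175,
    "WORKINGIN5": 15568,
}
OUT_SWEEP = {
    "WORKINGOUT": 33707, "WORKINGOUT0": 19353, "WORKINGOUT1": 16301,
    "WORKINGOUT1A": 16284, "WORKINGOUT2": 15945, "WORKINGOUT3": 16002,
    "WORKINGOUT4": 16484, "WORKINGOUT5": 17037,
}


def best(sweep):
    """Return the option name with the lowest time in a one-dimensional sweep."""
    return min(sweep, key=sweep.get)


best_in = best(IN_SWEEP)
best_out = best(OUT_SWEEP)

# Two separate sweeps versus testing every IN/OUT pair.
separate_runs = len(IN_SWEEP) + len(OUT_SWEEP)
grid_runs = len(IN_SWEEP) * len(OUT_SWEEP)

print(f"pick {best_in} + {best_out}")
print(f"{separate_runs} timing runs instead of {grid_runs}")
```

In that table, the combination of the two separately chosen winners (WORKINGIN5,WORKINGOUT2 at 15373 us/sq) was also the fastest entry measured, which is what independence predicts.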

2019-12-10, 19:45   #1556
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

16567₈ Posts

Quote:
Originally Posted by kracker View Post
Latest git commit is slightly slower on a P100?
Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.

2019-12-10, 20:42   #1557
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

878₁₆ Posts

Quote:
Originally Posted by Prime95 View Post
Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.
Running at 749/750 us/it now...
We may need a place where we can look up and submit the best settings for the various GPUs running gpuowl...

2019-12-10, 22:26   #1558
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

19×397 Posts

Quote:
Originally Posted by kracker View Post
Running at 749/750 us/it now...
We may need a place where we can look up and submit the best settings for the various GPUs running gpuowl...
Interesting. There are several other places in the code that could shuffle T values (a double) rather than T2 values (2 doubles - a complex number). It would double the amount of local storage required, which could negatively impact occupancy....
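
A rough back-of-the-envelope for the storage/occupancy trade-off, as a Python sketch. The element count per workgroup and the 64 KiB of local memory per compute unit are assumed figures for illustration, not values taken from gpuowl or from any particular GPU's documentation, and real occupancy also depends on registers and other limits.

```python
BYTES_PER_DOUBLE = 8


def shuffle_bytes(elements, doubles_per_element):
    """Local memory needed to shuffle `elements` values through a buffer."""
    return elements * doubles_per_element * BYTES_PER_DOUBLE


def max_workgroups_by_local_mem(local_mem_bytes, bytes_per_workgroup):
    """Upper bound on resident workgroups, considering local memory only."""
    return local_mem_bytes // bytes_per_workgroup


ELEMENTS = 1024          # hypothetical values shuffled per workgroup
LOCAL_MEM = 64 * 1024    # assumed local memory per compute unit, in bytes

t_bytes = shuffle_bytes(ELEMENTS, 1)    # T: one double per element
t2_bytes = shuffle_bytes(ELEMENTS, 2)   # T2: two doubles (complex) per element

print(f"T  shuffle: {t_bytes} bytes, "
      f"up to {max_workgroups_by_local_mem(LOCAL_MEM, t_bytes)} workgroups")
print(f"T2 shuffle: {t2_bytes} bytes, "
      f"up to {max_workgroups_by_local_mem(LOCAL_MEM, t2_bytes)} workgroups")
```

Under these assumptions the T2 buffer is twice as large, so the local-memory bound on resident workgroups is halved.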

2019-12-10, 22:40   #1559
xx005fs
"Eric"
Jan 2018
USA

11010100₂ Posts
Interesting...

Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.
Code:
2019-12-10 14:39:00 gpuowl v6.11-82-gdb9ce44-dirty
2019-12-10 14:39:00 Note: no config.txt file found
2019-12-10 14:39:00 config: -device 0 -carry short -nospin -use MERGED_MIDDLE,ORIG_X2,WORKINGIN5,WORKINGOUT2,T2_SHUFFLE -block 500
2019-12-10 14:39:00 94204153 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.97 bits/word
2019-12-10 14:39:01 OpenCL args "-DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-10 14:39:01 OpenCL compilation error -11 (args -DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0x8.2de8f968e724p-3 -DIWEIGHT_STEP=0xf.a6316e77270fp-4 -DWEIGHT_BIGSTEP=0xd.744fccad69d68p-3 -DIWEIGHT_BIGSTEP=0x9.837f0518db8a8p-4 -DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 -DWORKINGOUT2=1 -DT2_SHUFFLE=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-10 14:39:01 C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:13:9: warning: GpuOwl requires OpenCL 200, found 200
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
        ^
C:\Users\Admin\AppData\Local\Temp\\OCL6712T0.cl:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
 ^
1 warning and 1 error generated.

error: Clang front-end compilation failed!
Frontend phase failed compilation.
Error: Compiling CL to IR

2019-12-10 14:39:01 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-10 14:39:01 Bye
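
When comparing a failing build against a working one, it can help to pull the macro definitions and other compiler options out of the "OpenCL args" line. A minimal Python sketch follows; `parse_opencl_args` is a hypothetical helper, not part of gpuowl, and the argument string is copied from the log above with the four hex-float weight constants left out for brevity.

```python
def parse_opencl_args(args):
    """Split a compiler argument string into -D macros and other options."""
    defines, others = {}, []
    for token in args.split():
        if token.startswith("-D"):
            name, _, value = token[2:].partition("=")
            defines[name] = value
        else:
            others.append(token)
    return defines, others


# Copied from the log above; the *WEIGHT_STEP hex-float defines are omitted.
ARGS = ("-DEXP=94204153u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u "
        "-DAMDGPU=1 -DMERGED_MIDDLE=1 -DORIG_X2=1 -DWORKINGIN5=1 "
        "-DWORKINGOUT2=1 -DT2_SHUFFLE=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0")

defines, others = parse_opencl_args(ARGS)
for name, value in defines.items():
    print(f"{name} = {value}")
print("other options:", " ".join(others))
```

This shows at a glance that the build asked for `-cl-std=CL2.0` and which `-use` switches ended up as macros.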

Last fiddled with by xx005fs on 2019-12-10 at 22:45

2019-12-11, 00:12   #1560
Prime95
P90 years forever!
Aug 2002
Yeehaw, FL

19·397 Posts

Quote:
Originally Posted by Prime95 View Post
Try -use T2_SHUFFLE. AFAICT that is the most likely culprit for any slowdown from the last commit. The other possibility is a denser packing of a bit array. It does not seem likely that reducing the amount of memory read would increase iteration times.
Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide.

I had just hacked in the new shuffle. Now I'll go back and code it up proper (with -use switches) so we can turn the feature on and off as needed on different GPUs.

Thanks for prompting me to try this!

2019-12-11, 01:38   #1561
CRGreathouse
Aug 2006

3·1,993 Posts

Quote:
Originally Posted by Prime95 View Post
Holy crap. I just coded up a T2 shuffle for the critical fft_WIDTH and fft_HEIGHT routines and it was 2.5% faster on the Radeon VII. This directly contradicts the advice in AMD's OpenCL optimization guide.


There is such a wealth of knowledge on these boards, I find myself constantly in awe.

2019-12-11, 03:53   #1562
preda
"Mihai Preda"
Apr 2015

3·457 Posts

The OpenCL version check should be fixed now (recent commit)

Quote:
Originally Posted by xx005fs View Post
Got the following error with the newest commit, despite having OpenCL 2.0 on my Vega. Works fine with Nvidia driver though.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.