mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Old 2019-12-09, 20:35   #1541
R. Gerbicz
 
"Robert Gerbicz"
Oct 2005
Hungary

2×743 Posts

Quote:
Originally Posted by kriesel View Post
Come to think of it, that res64 check could save some lost PRP time too when errors occur. Less incentive there though, since the excellent GEC safety net catches the errors eventually.
Of course, if res64=0 then you need to check the full residue to see whether res=0 really holds. For much larger p>2^64 you could see (multiple) interim res64=0 values during a PRP test.
Old 2019-12-09, 21:26   #1542
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19·397 Posts

Notes on the new MERGED_MIDDLE code. There are many implementations buried in the code. The fastest implementation depends on the memory bus width and bandwidth and GPU architecture and maybe the cache architecture.

The benefits of MERGED_MIDDLE really kick in for FFTs with a WIDTH >= 256 and SMALL_HEIGHT >= 256.

To find the best implementation for your GPU, benchmark using each of these options:
WORKINGIN,WORKINGIN1,WORKINGIN1A,WORKINGIN2,WORKINGIN3,WORKINGIN4,WORKINGIN5. Then benchmark again using each of these options: WORKINGOUT,WORKINGOUT0,WORKINGOUT1,WORKINGOUT1A,WORKINGOUT2,WORKINGOUT3,WORKINGOUT4,WORKINGOUT5

Once you've determined the best implementations you can add the best WORKINGIN and WORKINGOUT options to your production config.txt file.

The default is WORKINGIN3 and WORKINGOUT3.

If we can obtain some consistent data, we can select different default values for non-AMD GPUs. So let us know your GPU and your timings. Thanks.
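A minimal Python sketch of how such a sweep could be enumerated. It only builds the candidate -use values and launches nothing. The option names are copied from the lists above; the base options NO_ASM,MERGED_MIDDLE are an assumption taken from a command line quoted later in this thread, and the helper name use_strings is hypothetical.

```python
# Option names copied from the post above; everything else is illustrative.
WORKING_IN = ["WORKINGIN", "WORKINGIN1", "WORKINGIN1A", "WORKINGIN2",
              "WORKINGIN3", "WORKINGIN4", "WORKINGIN5"]
WORKING_OUT = ["WORKINGOUT", "WORKINGOUT0", "WORKINGOUT1", "WORKINGOUT1A",
               "WORKINGOUT2", "WORKINGOUT3", "WORKINGOUT4", "WORKINGOUT5"]


def use_strings(options, base=("NO_ASM", "MERGED_MIDDLE")):
    """Return one comma-separated -use value per option (hypothetical helper).

    The default base options are an assumption; adjust them for your GPU.
    """
    return [",".join(tuple(base) + (opt,)) for opt in options]


if __name__ == "__main__":
    # First pass: one benchmark per WORKINGIN option; second pass: per WORKINGOUT option.
    for value in use_strings(WORKING_IN) + use_strings(WORKING_OUT):
        print("-use", value)
```

Each printed value would be one benchmark run; the remaining arguments (exponent, iteration count and so on) are left to whatever you normally use.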
Old 2019-12-09, 21:32   #1543
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

Quote:
Originally Posted by R. Gerbicz View Post
Of course, if res64=0 then you need to check the full residue to see whether res=0 really holds. For much larger p>2^64 you could see (multiple) interim res64=0 values during a PRP test.
If res64 == 0x00 then if full res == 0 then panic, retreat
Yes, there's a very slight chance that a zero res64 is correct for a nonzero full residue. One place it shows up is in penultimate residues.
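The two-step check described above can be sketched in Python as follows. This assumes res64 means the low 64 bits of the full residue and that the residue is held as an ordinary integer; the function names and return labels are hypothetical, and this is not GpuOwl's code.

```python
MASK64 = (1 << 64) - 1


def res64(residue: int) -> int:
    """Low 64 bits of the full residue (assumed meaning of res64)."""
    return residue & MASK64


def classify(residue: int) -> str:
    """Hypothetical check: only a zero *full* residue is treated as an error."""
    if res64(residue) != 0:
        return "ok"
    if residue == 0:
        return "error"    # full residue is zero: "panic, retreat"
    return "ok-rare"      # low 64 bits are zero, but some higher bit is set


if __name__ == "__main__":
    print(classify(0x1234))    # ordinary residue
    print(classify(0))         # zero full residue
    print(classify(1 << 64))   # res64 is zero, full residue is not
```

The cheap 64-bit comparison runs every time; the full-residue comparison is only needed in the rare case where the first one finds a zero.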

It's also true that eventually we will reach a point where early residues will correctly have values that are currently treated as errors. This occurs within the 2^32 capability of Mlucas. (See the attachment at https://www.mersenneforum.org/showpo...72&postcount=9) Before the probability of zero or 2 res64 becomes high, the project is likely to switch to a longer residue for such checks, say res128.

With all due respect, Dr. Gerbicz, none of us will need to worry about residues for p>2^64, or likely 2^48. TF is feasible with the right software up to a point, but P-1 or primality testing of exponents of order 2^64 is quite out of reach and will be for more than my lifetime and many others'. In GIMPS we're dealing with p<2^32 and generally <2^30 (the mersenne.org exponent limit for PRP, LL, or P-1 results acceptance is 10^9), with most current activity other than my limits testing or the 100Mdigit attempts occurring at the wavefront, <2^26.6.
A single 2^30 exponent PRP takes several months on the fastest available GPUs. P-1 factoring to feasible limits imposed by memory and software takes weeks on most hardware, if not all. The scaling for primality testing and P-1 is roughly p^2.1, so p~2^32 takes years and p~2^33 decades (longer than hardware lifetime), and would require FFT lengths longer than those available in gpuowl or CUDALucas.
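As a rough worked example of that p^2.1 scaling, here is a small Python sketch. The 3-month figure for a 2^30 exponent is a made-up stand-in for "several months", not a measurement, and relative_cost is a hypothetical helper name.

```python
def relative_cost(p: float, p_ref: float, exponent: float = 2.1) -> float:
    """Run time of exponent p relative to p_ref, assuming time ~ p**exponent."""
    return (p / p_ref) ** exponent


if __name__ == "__main__":
    base_months = 3.0  # hypothetical stand-in for "several months" at p ~ 2^30
    for bits in (30, 32, 33):
        factor = relative_cost(2.0 ** bits, 2.0 ** 30)
        years = base_months * factor / 12.0
        print(f"p ~ 2^{bits}: x{factor:.1f} the 2^30 run time, about {years:.1f} years")
```

Under this assumption the factor is about 18 for 2^32 and about 79 for 2^33, which lines up with the "years" and "decades" estimates.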

Last fiddled with by kriesel on 2019-12-09 at 21:42
Old 2019-12-09, 21:45   #1544
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts
Compiling Gpuowl

A "Compiling Gpuowl" post (https://www.mersenneforum.org/showpo...4&postcount=21) has been added to the reference content. It probably has errors or omissions; I'll fix them as they are identified.
Old 2019-12-10, 03:20   #1545
xx005fs
 
"Eric"
Jan 2018
USA

324₈ Posts

Getting this error when trying to use -nospin as argument:
Code:
2019-12-09 19:19:27 gpuowl v6.11-71-g7e02b07
2019-12-09 19:19:27 Argument '-nospin' '' not understood
2019-12-09 19:19:27 Exiting because "args"
2019-12-09 19:19:27 Bye
Also, -yield no longer seems to reduce CPU usage on Windows.
Old 2019-12-10, 04:40   #1546
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

Quote:
Originally Posted by preda View Post
Ken, I'm not confident that I can do the OpenCL version test reliably. For example, until recently, ROCm OpenCL was self-reporting as OpenCL 1.2 although it compiled 2.0 code fine. I'm worried that adding this check would mean not even attempting to compile in such a situation.

That said, I added an OpenCL 2.0 version check, please try it out on the old cards.
Quite a shower of warnings, but it did build.
Code:
$ make gpuowl-win.exe
cat head.txt gpuowl.cl tail.txt > gpuowl-wrap.cpp
echo \"`git describe --long --dirty --always`\" > version.new
diff -q -N version.new version.inc >/dev/null || mv version.new version.inc
echo Version: `cat version.inc`
Version: "v6.11-79-g0c139c4"
g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17   -c -o Pm1Plan.o Pm1Plan.cpp
g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17   -c -o GmpUtil.o GmpUtil.cpp
g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17   -c -o Worktodo.o Worktodo.cpp
In file included from Worktodo.cpp:6:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT common.o -MMD -MP -MF .d/common.Td -Wall -O2 -std=c++17   -c -o common.o common.cpp
In file included from common.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT main.o -MMD -MP -MF .d/main.Td -Wall -O2 -std=c++17   -c -o main.o main.cpp
In file included from main.cpp:8:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17   -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
                 from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT clwrap.o -MMD -MP -MF .d/clwrap.Td -Wall -O2 -std=c++17   -c -o clwrap.o clwrap.cpp
In file included from clwrap.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT Task.o -MMD -MP -MF .d/Task.Td -Wall -O2 -std=c++17   -c -o Task.o Task.cpp
In file included from Task.cpp:7:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT checkpoint.o -MMD -MP -MF .d/checkpoint.Td -Wall -O2 -std=c++17   -c -o checkpoint.o checkpoint.cpp
In file included from checkpoint.h:5,
                 from checkpoint.cpp:3:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT timeutil.o -MMD -MP -MF .d/timeutil.Td -Wall -O2 -std=c++17   -c -o timeutil.o timeutil.cpp
g++ -MT Args.o -MMD -MP -MF .d/Args.Td -Wall -O2 -std=c++17   -c -o Args.o Args.cpp
In file included from Args.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
       log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT state.o -MMD -MP -MF .d/state.Td -Wall -O2 -std=c++17   -c -o state.o state.cpp
g++ -MT Signal.o -MMD -MP -MF .d/Signal.Td -Wall -O2 -std=c++17   -c -o Signal.o Signal.cpp
g++ -MT FFTConfig.o -MMD -MP -MF .d/FFTConfig.Td -Wall -O2 -std=c++17   -c -o FFTConfig.o FFTConfig.cpp
g++ -MT AllocTrac.o -MMD -MP -MF .d/AllocTrac.Td -Wall -O2 -std=c++17   -c -o AllocTrac.o AllocTrac.cpp
g++ -MT gpuowl-wrap.o -MMD -MP -MF .d/gpuowl-wrap.Td -Wall -O2 -std=c++17   -c -o gpuowl-wrap.o gpuowl-wrap.cpp
g++ -o gpuowl-win.exe Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L. -static
strip gpuowl-win.exe
It launched ok on Win7 on an AMD RX550 and is running some comparative timings now.
Following is a test of the OpenCL version check on a Quadro 2000, which indicates 1.1/1.2 in gpu-z.
Code:
c:\Users\Ken\Documents\gpuowl\v6.11-79-g0c139c4>gpuowl-win -time -iters 10000 -use NO_ASM
2019-12-09 22:33:39 gpuowl v6.11-79-g0c139c4
2019-12-09 22:33:39 config.txt: -device 1 -user kriesel -cpu condorette/q2000
2019-12-09 22:33:39 condorette/q2000 config: -time -iters 10000 -use NO_ASM
2019-12-09 22:33:39 condorette/q2000 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-09 22:33:40 condorette/q2000 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-09 22:33:40 condorette/q2000 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-09 22:33:40 condorette/q2000 <kernel>:13:9: warning: GpuOwl requires OpenCL 200, found 110
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
        ^
<kernel>:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
 ^
<kernel>:2777:66: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE, memory_scope_device);
                                                                 ^
<kernel>:2786:66: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE, memory_scope_device);
                                                                 ^
<kernel>:2845:12: warning: implicit declaration of function 'atomic_load' is invalid in C99
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
           ^
<kernel>:2845:25: error: use of undeclared identifier 'atomic_uint'
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                        ^
<kernel>:2845:38: error: expected expression
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                                     ^
<kernel>:2846:5: warning: implicit declaration of function 'atomic_store' is invalid in C99
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
    ^
<kernel>:2846:19: error: use of undeclared identifier 'atomic_uint'
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
                  ^
<kernel>:2846:32: error: expected expression
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
                               ^
<kernel>:2919:25: error: use of undeclared identifier 'atomic_uint'
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                        ^
<kernel>:2919:38: error: expected expression
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                                     ^
<kernel>:2920:19: error: use of undeclared identifier 'atomic_uint'
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
                  ^
<kernel>:2920:32: error: expected expression
2019-12-09 22:33:40 condorette/q2000 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-09 22:33:40 condorette/q2000 Bye
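The kernel-side check in the log above compares __OPENCL_VERSION__ (110 on this card) against 200. A host-side analogue would have to rely on the version string the driver reports, which is exactly the self-reporting preda is wary of. A minimal Python sketch, assuming the string follows the "OpenCL <major>.<minor> <vendor info>" layout of CL_DEVICE_VERSION; the function names are hypothetical and this is not GpuOwl's code.

```python
import re


def opencl_version_number(version_string: str) -> int:
    """Map e.g. 'OpenCL 1.1 ...' to 110, the same scale as __OPENCL_VERSION__."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return major * 100 + minor * 10


def meets_requirement(version_string: str, required: int = 200) -> bool:
    """True if the self-reported version is at least `required` (200 = OpenCL 2.0)."""
    return opencl_version_number(version_string) >= required


if __name__ == "__main__":
    # Example strings are illustrative, not copied from any driver.
    for reported in ("OpenCL 1.1 vendor-info", "OpenCL 2.0 vendor-info"):
        print(reported, "->", opencl_version_number(reported), meets_requirement(reported))
```

A driver that under-reports its version would fail this test even though the 2.0 kernels compile, so a check like this is only as trustworthy as the string it parses.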
Old 2019-12-10, 05:40   #1547
nomead
 
"Sam Laur"
Dec 2018
Turku, Finland

475₈ Posts

Quote:
Originally Posted by Prime95 View Post
To find the best implementation for your GPU, benchmark using each of these options:
WORKINGIN,WORKINGIN1,WORKINGIN1A,WORKINGIN2,WORKINGIN3,WORKINGIN4,WORKINGIN5. Then benchmark again using each of these options: WORKINGOUT,WORKINGOUT0,WORKINGOUT1,WORKINGOUT1A,WORKINGOUT2,WORKINGOUT3,WORKINGOUT4,WORKINGOUT5
Ah, OK, so it's more like an array of settings, and one of each list needs to be chosen.

RTX 2080, clock pinned to 1920 MHz, Linux. Command line options: -yield -log 10000 -prp 89796247 -fft +2 -iters 50000 -use NO_ASM,MERGED_MIDDLE, plus one IN and one OUT setting, except for the two baseline timings (3807 and 3808 µs), which were run without MERGED_MIDDLE. For whatever reason, the differences were really small on this card: 0.35% between the highest and lowest values, and if the one outlier (IN1A and OUT1A chosen) is taken out, the rest are within 0.19%.

None of the WORKINGOUT0 tests would run; an error occurred:
2019-12-10 04:14:53 Exception gpu_error: OUT_OF_RESOURCES carryA at clwrap.cpp:304 run

The smallest value was 3680 µs, which was reached with several different combinations. I have attached the full array of timings to this message.
Attachment: rtx2080-gpuowl-merged-working.png (full array of timings, 11.3 KB)
Old 2019-12-10, 06:36   #1548
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

Quote:
Originally Posted by nomead View Post
The smallest value was 3680 µs, which was reached with several different combinations. I have attached the full array of timings to this message.
Those values are suspiciously similar, closer than my repeatability runs.
Lots of differences, of course: GPU, OS, pinned clock, exponent.
Try some repeatability runs.

Also, I think it's a rectangular array with one more row and column than you allowed for.
George gave a list of ins and a list of outs, but there's also the null entry for in (baseline in) and for out (baseline out).
And it appears from my recent test that minimum in, and minimum out, don't necessarily mean even better in combination.

Last fiddled with by kriesel on 2019-12-10 at 06:50
Old 2019-12-10, 07:18   #1549
nomead
 
"Sam Laur"
Dec 2018
Turku, Finland

317 Posts

Quote:
Originally Posted by kriesel View Post
Try some repeatability runs.
Already did, but of course not enough (five each of "without merge", IN1A+OUT1A, and IN3+OUT5). At least that time, the results varied max. 2µs from run to run. The advantage of benchmarking on Linux is that the results are more predictable; it's less likely that the OS starts indexing, going through updates, or scanning for viruses in the background.
Quote:
Originally Posted by kriesel View Post
Also, I think it's a rectangular array with one more row and column than you allowed for.
George gave a list of ins and a list of outs, but there's also the null entry for in (baseline in) and for out (baseline out).
And it appears from my recent test that minimum in, and minimum out, don't necessarily mean even better in combination.
As George said in his message, the default is IN3 and OUT3, so those are chosen anyway, if nothing else is specified. And yeah, that is exactly the reason why I benchmarked the whole array of combinations, to see whether that way of searching for the optimum spot really works. (first test the IN values, then using the optimum IN value, search through the OUT values) And in my case it works, but then, there are so many "correct" spots to land on that it makes it easier than it should be.
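The two-stage search described above (best IN first, then best OUT for that IN) can be compared against a full grid search with a short Python sketch. The timings below are made up to illustrate the caveat that the two stages need not find the overall minimum; they are NOT measurements from this thread, and the function names are hypothetical.

```python
from itertools import product


def two_stage_search(timings, ins, outs, default_out):
    """Pick the best IN at a fixed OUT, then the best OUT for that IN."""
    best_in = min(ins, key=lambda i: timings[(i, default_out)])
    best_out = min(outs, key=lambda o: timings[(best_in, o)])
    return best_in, best_out


def full_grid_search(timings, ins, outs):
    """Try every (IN, OUT) combination and return the fastest pair."""
    return min(product(ins, outs), key=lambda pair: timings[pair])


# Made-up per-iteration timings in µs, for illustration only.
ins, outs = ["IN1", "IN3"], ["OUT3", "OUT5"]
timings = {("IN1", "OUT3"): 3690, ("IN1", "OUT5"): 3679,
           ("IN3", "OUT3"): 3685, ("IN3", "OUT5"): 3680}

if __name__ == "__main__":
    print("two-stage:", two_stage_search(timings, ins, outs, "OUT3"))
    print("full grid:", full_grid_search(timings, ins, outs))
```

With these invented numbers the two-stage search settles on IN3+OUT5 while the grid finds IN1+OUT5. When many combinations tie, as on the RTX 2080, the cheaper two-stage search loses little or nothing.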
Old 2019-12-10, 10:04   #1550
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,437 Posts

Quote:
Originally Posted by nomead View Post
Already did, but of course not enough (five each of "without merge", IN1A+OUT1A, and IN3+OUT5). At least that time, the results varied max. 2µs from run to run.
For a rock-steady constant signal, wouldn't there be ±1 LSB of digitization noise, in this case ±1 µs?
I guess George's post means that if MERGED_MIDDLE is given, the default is the 3's; otherwise it's the baseline with NO_ASM only: the middle is not merged, so it's the prior code with no in or out, and so no 3's.

(Who leaves indexing and autoupdates turned on?)

Last fiddled with by kriesel on 2019-12-10 at 10:16
Old 2019-12-10, 11:23   #1551
nomead
 
"Sam Laur"
Dec 2018
Turku, Finland

317₁₀ Posts

Quote:
Originally Posted by kriesel View Post
For a rock-steady constant signal, wouldn't there be ±1 LSB of digitization noise, in this case ±1 µs?
I guess George's post means that if MERGED_MIDDLE is given, the default is the 3's; otherwise it's the baseline with NO_ASM only: the middle is not merged, so it's the prior code with no in or out, and so no 3's.
Yeah, well, whatever the explanation, I now reran those repeatability runs. 20 runs each of 50000 iterations, alternating between no merge (only NO_ASM), IN1A+OUT1A and IN3+OUT5. The baseline (NO_ASM) had a slight anomaly on the first run (3804 µs) but the rest were 3807 or 3808, with the average being 3807.4 µs including that one outlier. It is very tempting to throw away that first measurement result, but then it wouldn't be an accurate representation of reality anymore. 1A+1A was 3689 or 3690 µs, average 3689.8 µs. 3+5 was 3680 µs every single time.
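For scale, the reported averages can be turned into relative gains with a few lines of Python. The three averages are the ones stated above; the helper name percent_faster is hypothetical.

```python
def percent_faster(baseline_us: float, candidate_us: float) -> float:
    """Reduction in per-iteration time relative to the baseline, in percent."""
    return 100.0 * (baseline_us - candidate_us) / baseline_us


# Average per-iteration timings in µs, as reported in the post above.
averages_us = {"NO_ASM only": 3807.4, "IN1A+OUT1A": 3689.8, "IN3+OUT5": 3680.0}

if __name__ == "__main__":
    baseline = averages_us["NO_ASM only"]
    for name, value in averages_us.items():
        gain = percent_faster(baseline, value)
        print(f"{name}: {value:.1f} µs, {gain:.2f}% faster than baseline")
```

On these numbers MERGED_MIDDLE is worth a little over 3% on this card, while the choice between IN1A+OUT1A and IN3+OUT5 changes the result by only a few tenths of a percent.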

Don't get me started on quantization noise...

I'm used to getting reliable and repeatable results when timing other programs, mostly mfaktc, but I have to admit these are exceptionally steady, about one digit more than I'm used to getting. Maybe I should start doubting the method, and use some sort of external timer as well, instead of blindly trusting the internal timer within the program. But that's way too much effort to sink into a quick test like this.
Quote:
Originally Posted by kriesel View Post
(Who leaves indexing and autoupdates turned on?)
Not by my own choice of course, but the win10 box I have at work has autoupdates forced on by group policy (corporate IT). Not sure about search though. And yeah, likewise the antivirus software (F-Secure) is forced always on. I still manage to run prime95 on it, but there the iteration timings are anything but stable.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.