[QUOTE=kriesel;532471]
Come to think of it, that res64 check could save some lost PRP time too when errors occur. Less incentive there though, since the excellent GEC safety net catches the errors eventually.[/QUOTE] Of course, if res64=0 then you need to check the full residue to see if it is really true that res=0. For much larger p>2^64 you could see (multiple) interim res64=0 during a PRP test. |
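The two-step check described above (look at res64 first, and only inspect the full residue when res64 is zero) could be sketched roughly like this. This is a minimal illustration in Python, not code from gpuowl or any other GIMPS client; the function name `classify_residue`, the return labels, and the idea of holding the residue as a plain Python integer are all assumptions made for the example.

```python
MASK64 = (1 << 64) - 1  # selects the low 64 bits (res64) of a residue


def classify_residue(residue: int) -> str:
    """Hypothetical helper: cheap res64 test first, full-residue test second."""
    res64 = residue & MASK64
    if res64 != 0:
        return "ok"               # res64 is nonzero, nothing suspicious
    if residue == 0:
        return "error"            # the full residue really is zero
    return "res64-zero-only"      # low 64 bits are zero, full residue is not


if __name__ == "__main__":
    print(classify_residue(12345))    # nonzero res64
    print(classify_residue(0))        # full residue is zero
    print(classify_residue(1 << 64))  # res64 is zero, full residue is not
```

The full comparison only runs in the rare case where res64 is zero, so the extra cost is negligible.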
Notes on the new MERGED_MIDDLE code. There are many implementations buried in the code. The fastest implementation depends on the memory bus width and bandwidth, the GPU architecture, and maybe the cache architecture.
The benefits of MERGED_MIDDLE really kick in for FFTs with WIDTH >= 256 and SMALL_HEIGHT >= 256. To find the best implementation for your GPU, benchmark using each of these options: WORKINGIN,WORKINGIN1,WORKINGIN1A,WORKINGIN2,WORKINGIN3,WORKINGIN4,WORKINGIN5. Then benchmark again using each of these options: WORKINGOUT,WORKINGOUT0,WORKINGOUT1,WORKINGOUT1A,WORKINGOUT2,WORKINGOUT3,WORKINGOUT4,WORKINGOUT5.
Once you've determined the best implementations, you can add the best WORKINGIN and WORKINGOUT options to your production config.txt file. The default is WORKINGIN3 and WORKINGOUT3. If we can obtain some consistent data, we can select different default values for non-AMD GPUs. So let us know your GPU and your timings. Thanks. |
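To keep the two benchmarking passes organized, the option strings can be generated rather than typed by hand. The Python sketch below only builds strings and does not run anything. It assumes the options are passed as a comma-separated list after `-use` (the form `-use NO_ASM,MERGED_MIDDLE` appears in a command line later in this thread); the helper names `use_arg`, `first_pass` and `second_pass` are made up for the example.

```python
WORKING_IN = ["WORKINGIN", "WORKINGIN1", "WORKINGIN1A", "WORKINGIN2",
              "WORKINGIN3", "WORKINGIN4", "WORKINGIN5"]
WORKING_OUT = ["WORKINGOUT", "WORKINGOUT0", "WORKINGOUT1", "WORKINGOUT1A",
               "WORKINGOUT2", "WORKINGOUT3", "WORKINGOUT4", "WORKINGOUT5"]


def use_arg(*options: str) -> str:
    """Build a '-use' argument; the comma-separated form is an assumption."""
    return "-use " + ",".join(("MERGED_MIDDLE",) + options)


def first_pass() -> list:
    """One run per WORKINGIN option, leaving the OUT side at its default."""
    return [use_arg(opt) for opt in WORKING_IN]


def second_pass(best_in: str) -> list:
    """One run per WORKINGOUT option, combined with the best IN option found."""
    return [use_arg(best_in, opt) for opt in WORKING_OUT]


if __name__ == "__main__":
    for arg in first_pass():
        print(arg)
    for arg in second_pass("WORKINGIN3"):
        print(arg)
```

That is 7 runs for the first pass and 8 for the second, instead of 56 for every combination.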
[QUOTE=R. Gerbicz;532482]Of course, if res64=0 then you need to check the full residue to see if it is really true that res=0. For much larger p>2^64 you could see (multiple) interim res64=0 during a PRP test.[/QUOTE]
If res64 == 0x00, then if full res == 0, then panic, retreat.
Yes, there's a very slight chance that a res64 of zero is correct for a nonzero full residue. One place it shows up is in penultimate residues. It's also true that eventually we will reach a point where early residues will correctly have values that are currently treated as errors. This occurs within the 2[SUP]32[/SUP] capability of Mlucas. (See the attachment at [URL]https://www.mersenneforum.org/showpost.php?p=515172&postcount=9[/URL].) Before the probability of a zero or 2 res64 becomes high, the project is likely to switch to a longer residue for such checks, say res128.
With all due respect, Dr. Gerbicz, none of us will need to worry about residues for p>2[SUP]64[/SUP], or likely 2[SUP]48[/SUP]. TF is feasible with the right software up to a point, but P-1 or primality testing of exponents of order 2[SUP]64[/SUP] is quite out of reach and will be for more than my lifetime and many others'. In GIMPS we're dealing with p<2[SUP]32[/SUP] and generally <2[SUP]30[/SUP] (the mersenne.org exponent limit for PRP, LL, or P-1 results acceptance is 10[SUP]9[/SUP]), with most current activity other than my limits testing or the 100Mdigit attempts occurring at the wavefront <2[SUP]26.6[/SUP]. A single 2[SUP]30[/SUP] exponent PRP takes several months on the fastest available GPUs. P-1 factoring to feasible limits imposed by memory and software takes weeks on most hardware if not all. The scaling for primality testing and P-1 is roughly p[SUP]2.1[/SUP]: p~2[SUP]32[/SUP] takes years, p~2[SUP]33[/SUP] decades (longer than hardware lifetime), and would require FFT lengths longer than those available in gpuowl or CUDALucas. |
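The rough p[SUP]2.1[/SUP] scaling quoted above can be turned into a quick back-of-the-envelope calculator. This is only a sketch in Python: the exponent 2.1 is the approximate figure from the post, and the function name `relative_cost` is made up for the example.

```python
def relative_cost(p_new: float, p_ref: float, power: float = 2.1) -> float:
    """Run-time ratio between two exponents, assuming time scales as p**power."""
    return (p_new / p_ref) ** power


if __name__ == "__main__":
    ref = 2 ** 30
    for bits in (31, 32, 33):
        ratio = relative_cost(2 ** bits, ref)
        print(f"p ~ 2^{bits}: about {ratio:.1f}x the time of p ~ 2^30")
```

Under that assumption each doubling of p costs about 4.3 times as much, so 2[SUP]32[/SUP] is roughly 18 times and 2[SUP]33[/SUP] roughly 79 times the cost of a 2[SUP]30[/SUP] exponent. Starting from "several months", that lines up with the "years" and "decades" estimates.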
Compiling Gpuowl [URL="https://www.mersenneforum.org/showpost.php?p=532454&postcount=21"]https://www.mersenneforum.org/showpo...4&postcount=21[/URL]
has been added to the reference content. It probably has errors or omissions; I'll fix them as they are identified. |
I'm getting this error when trying to use -nospin as an argument:
[CODE]2019-12-09 19:19:27 gpuowl v6.11-71-g7e02b07
2019-12-09 19:19:27 Argument '-nospin' '' not understood
2019-12-09 19:19:27 Exiting because "args"
2019-12-09 19:19:27 Bye[/CODE]
Also, -yield doesn't seem to reduce CPU usage anymore on Windows. |
[QUOTE=preda;532469]Ken, I'm not confident that I can do the OpenCL version test reliably. For example, until recently, ROCm OpenCL was self-reporting as being OpenCL 1.2 although it was compiling fine 2.0. I'm worried that adding this check would not even attempt to compile in such a situation.
That said, I added an OpenCL 2.0 version check, please try it out on the old cards.[/QUOTE]
Quite a shower of warnings, but it did build.
[CODE]$ make gpuowl-win.exe
cat head.txt gpuowl.cl tail.txt > gpuowl-wrap.cpp
echo \"`git describe --long --dirty --always`\" > version.new
diff -q -N version.new version.inc >/dev/null || mv version.new version.inc
echo Version: `cat version.inc`
Version: "v6.11-79-g0c139c4"
g++ -MT Pm1Plan.o -MMD -MP -MF .d/Pm1Plan.Td -Wall -O2 -std=c++17 -c -o Pm1Plan.o Pm1Plan.cpp
g++ -MT GmpUtil.o -MMD -MP -MF .d/GmpUtil.Td -Wall -O2 -std=c++17 -c -o GmpUtil.o GmpUtil.cpp
g++ -MT Worktodo.o -MMD -MP -MF .d/Worktodo.Td -Wall -O2 -std=c++17 -c -o Worktodo.o Worktodo.cpp
In file included from Worktodo.cpp:6:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT common.o -MMD -MP -MF .d/common.Td -Wall -O2 -std=c++17 -c -o common.o common.cpp
In file included from common.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT main.o -MMD -MP -MF .d/main.Td -Wall -O2 -std=c++17 -c -o main.o main.cpp
In file included from main.cpp:8:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT Gpu.o -MMD -MP -MF .d/Gpu.Td -Wall -O2 -std=c++17 -c -o Gpu.o Gpu.cpp
In file included from ProofSet.h:6,
                 from Gpu.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT clwrap.o -MMD -MP -MF .d/clwrap.Td -Wall -O2 -std=c++17 -c -o clwrap.o clwrap.cpp
In file included from clwrap.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT Task.o -MMD -MP -MF .d/Task.Td -Wall -O2 -std=c++17 -c -o Task.o Task.cpp
In file included from Task.cpp:7:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT checkpoint.o -MMD -MP -MF .d/checkpoint.Td -Wall -O2 -std=c++17 -c -o checkpoint.o checkpoint.cpp
In file included from checkpoint.h:5,
                 from checkpoint.cpp:3:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT timeutil.o -MMD -MP -MF .d/timeutil.Td -Wall -O2 -std=c++17 -c -o timeutil.o timeutil.cpp
g++ -MT Args.o -MMD -MP -MF .d/Args.Td -Wall -O2 -std=c++17 -c -o Args.o Args.cpp
In file included from Args.cpp:4:
File.h: In static member function 'static File File::open(const std::filesystem::__cxx11::path&, const char*, bool)':
File.h:31:11: warning: format '%s' expects argument of type 'char*', but argument 2 has type 'const value_type*' {aka 'const wchar_t*'} [-Wformat=]
      log("Can't open '%s' (mode '%s')\n", name.c_str(), mode);
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~  ~~~~~~~~~~~~
g++ -MT state.o -MMD -MP -MF .d/state.Td -Wall -O2 -std=c++17 -c -o state.o state.cpp
g++ -MT Signal.o -MMD -MP -MF .d/Signal.Td -Wall -O2 -std=c++17 -c -o Signal.o Signal.cpp
g++ -MT FFTConfig.o -MMD -MP -MF .d/FFTConfig.Td -Wall -O2 -std=c++17 -c -o FFTConfig.o FFTConfig.cpp
g++ -MT AllocTrac.o -MMD -MP -MF .d/AllocTrac.Td -Wall -O2 -std=c++17 -c -o AllocTrac.o AllocTrac.cpp
g++ -MT gpuowl-wrap.o -MMD -MP -MF .d/gpuowl-wrap.Td -Wall -O2 -std=c++17 -c -o gpuowl-wrap.o gpuowl-wrap.cpp
g++ -o gpuowl-win.exe Pm1Plan.o GmpUtil.o Worktodo.o common.o main.o Gpu.o clwrap.o Task.o checkpoint.o timeutil.o Args.o state.o Signal.o FFTConfig.o AllocTrac.o gpuowl-wrap.o -lstdc++fs -lOpenCL -lgmp -pthread -L/opt/rocm/opencl/lib/x86_64 -L/opt/amdgpu-pro/lib/x86_64-linux-gnu -L/c/Windows/System32 -L. -static
strip gpuowl-win.exe
[/CODE]
It launched ok on Win7 on an AMD RX550 and is running some comparative timings now. Following is a test of the OpenCL version check on a Quadro 2000, which indicates 1.1/1.2 in gpu-z.
[CODE]c:\Users\Ken\Documents\gpuowl\v6.11-79-g0c139c4>gpuowl-win -time -iters 10000 -use NO_ASM
2019-12-09 22:33:39 gpuowl v6.11-79-g0c139c4
2019-12-09 22:33:39 config.txt: -device 1 -user kriesel -cpu condorette/q2000
2019-12-09 22:33:39 condorette/q2000 config: -time -iters 10000 -use NO_ASM
2019-12-09 22:33:39 condorette/q2000 89796247 FFT 5120K: Width 256x4, Height 64x4, Middle 10; 17.13 bits/word
2019-12-09 22:33:40 condorette/q2000 OpenCL args "-DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-12-09 22:33:40 condorette/q2000 OpenCL compilation error -11 (args -DEXP=89796247u -DWIDTH=1024u -DSMALL_HEIGHT=256u -DMIDDLE=10u -DWEIGHT_STEP=0xe.a6216bdf4fcdp-3 -DIWEIGHT_STEP=0x8.bce25ec56bc2p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DNO_ASM=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-12-09 22:33:40 condorette/q2000 <kernel>:13:9: warning: GpuOwl requires OpenCL 200, found 110
#pragma message "GpuOwl requires OpenCL 200, found " STR(__OPENCL_VERSION__)
        ^
<kernel>:14:2: error: OpenCL >= 2.0 required
#error OpenCL >= 2.0 required
 ^
<kernel>:2777:66: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE, memory_scope_device);
                                                                 ^
<kernel>:2786:66: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE, memory_scope_device);
                                                                 ^
<kernel>:2845:12: warning: implicit declaration of function 'atomic_load' is invalid in C99
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
           ^
<kernel>:2845:25: error: use of undeclared identifier 'atomic_uint'
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                        ^
<kernel>:2845:38: error: expected expression
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                                     ^
<kernel>:2846:5: warning: implicit declaration of function 'atomic_store' is invalid in C99
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
    ^
<kernel>:2846:19: error: use of undeclared identifier 'atomic_uint'
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
                  ^
<kernel>:2846:32: error: expected expression
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
                               ^
<kernel>:2919:25: error: use of undeclared identifier 'atomic_uint'
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                        ^
<kernel>:2919:38: error: expected expression
    while(!atomic_load((atomic_uint *) &ready[gr - 1]));
                                     ^
<kernel>:2920:19: error: use of undeclared identifier 'atomic_uint'
    atomic_store((atomic_uint *) &ready[gr - 1], 0);
                  ^
<kernel>:2920:32: error: expected expression
2019-12-09 22:33:40 condorette/q2000 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:234 build
2019-12-09 22:33:40 condorette/q2000 Bye[/CODE] |
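For anyone puzzled by "requires OpenCL 200, found 110" in the log above: the `__OPENCL_VERSION__` macro packs the version as major*100 + minor*10, so 110 is OpenCL 1.1 and 200 is OpenCL 2.0. A small Python sketch of that decoding follows. The helper names are invented for the example, and this is not how gpuowl itself performs the check (that happens in the kernel preprocessor, as the `#error` line shows).

```python
def decode_opencl_version(value: int) -> tuple:
    """Split an __OPENCL_VERSION__ style integer (e.g. 110, 120, 200)."""
    major, rest = divmod(value, 100)
    return major, rest // 10


def meets_requirement(found: int, required: int = 200) -> bool:
    """True if the reported version is at least the required one."""
    return found >= required


if __name__ == "__main__":
    for v in (110, 120, 200):
        major, minor = decode_opencl_version(v)
        print(f"{v} -> OpenCL {major}.{minor}, OK for 2.0: {meets_requirement(v)}")
```

So the Quadro 2000 reports OpenCL 1.1 to the compiler, and the build stops at the version check before the later errors (missing `memory_scope_device`, `atomic_uint`) would otherwise have been the first sign of trouble.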
[QUOTE=Prime95;532488]To find the best implementation for your GPU, benchmark using each of these options:
WORKINGIN,WORKINGIN1,WORKINGIN1A,WORKINGIN2,WORKINGIN3,WORKINGIN4,WORKINGIN5. Then benchmark again using each of these options: WORKINGOUT,WORKINGOUT0,WORKINGOUT1,WORKINGOUT1A,WORKINGOUT2,WORKINGOUT3,WORKINGOUT4,WORKINGOUT5[/QUOTE]
Ah, OK, so it's more like an array of settings, and one from each list needs to be chosen.
RTX 2080, clock pinned to 1920 MHz, Linux. Command line options were [C]-yield -log 10000 -prp 89796247 -fft +2 -iters 50000 -use NO_ASM,MERGED_MIDDLE[/C], except for the two baseline timings (3807 and 3808 µs), which were run without MERGED_MIDDLE. Then one IN and one OUT setting were added for each run.
For whatever reason, the differences were really small on this card: 0.35% between the highest and lowest value, and if the one outlier (IN1A and OUT1A chosen) is taken out, the rest are within 0.19%.
None of the WORKINGOUT0 tests would run; an error occurred: [C]2019-12-10 04:14:53 Exception gpu_error: OUT_OF_RESOURCES carryA at clwrap.cpp:304 run[/C]
The smallest value was 3680 µs, which was reached with several different combinations. I have attached the full array of timings to this message. |
[QUOTE=nomead;532530]The smallest value was 3680 µs, which was reached with several different combinations. I have attached the full array of timings to this message.[/QUOTE]Those values are suspiciously similar, closer than my repeatability runs.
Lots of differences of course; gpu, OS, pinning clock, exponent. Try some repeatability runs. Also, I think it's a rectangular array with one more row and column than you allowed for. George gave a list of ins and a list of outs, but there's also the null entry for in (baseline in) and for out (baseline out). And it appears from my recent test that minimum in, and minimum out, don't necessarily mean even better in combination. |
[QUOTE=kriesel;532533] Try some repeatability runs.[/QUOTE]
Already did, but of course not enough (five each of "without merge", IN1A+OUT1A, and IN3+OUT5). At least that time, the results varied max. 2µs from run to run. The advantage of benchmarking on Linux is that the results are more predictable; it's less likely that the OS starts indexing, going through updates, or scanning for viruses in the background. [QUOTE=kriesel;532533]Also, I think it's a rectangular array with one more row and column than you allowed for. George gave a list of ins and a list of outs, but there's also the null entry for in (baseline in) and for out (baseline out). And it appears from my recent test that minimum in, and minimum out, don't necessarily mean even better in combination.[/QUOTE] As George said in his message, the default is IN3 and OUT3, so those are chosen anyway if nothing else is specified. And yeah, that is exactly the reason why I benchmarked the whole array of combinations: to see whether that way of searching for the optimum spot really works (first test the IN values, then, using the optimum IN value, search through the OUT values). In my case it works, but then there are so many "correct" spots to land on that it makes it easier than it should be. |
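The difference between the two search strategies discussed here (IN first and then OUT, versus timing every combination) can be sketched in Python as below. The timing table is entirely made up to show a case where the two strategies disagree; it is not measured data from any GPU, and the names `two_stage_search`, `grid_search` and the `IN_a`/`OUT_x` labels are hypothetical.

```python
from itertools import product


def two_stage_search(ins, outs, time_us, default_out):
    """Pick the best IN with OUT held at a default, then the best OUT for it."""
    best_in = min(ins, key=lambda i: time_us(i, default_out))
    best_out = min(outs, key=lambda o: time_us(best_in, o))
    return best_in, best_out


def grid_search(ins, outs, time_us):
    """Time every (IN, OUT) combination and return the fastest pair."""
    return min(product(ins, outs), key=lambda pair: time_us(*pair))


# Synthetic timings in microseconds, invented purely for illustration.
FAKE_TIMINGS = {
    ("IN_a", "OUT_x"): 10.0, ("IN_a", "OUT_y"): 9.5,
    ("IN_b", "OUT_x"): 11.0, ("IN_b", "OUT_y"): 9.0,
}


def fake_time(i, o):
    return FAKE_TIMINGS[(i, o)]


if __name__ == "__main__":
    ins, outs = ["IN_a", "IN_b"], ["OUT_x", "OUT_y"]
    print(two_stage_search(ins, outs, fake_time, default_out="OUT_x"))
    print(grid_search(ins, outs, fake_time))
```

With this made-up table the two-stage search settles on IN_a + OUT_y, while the full grid finds the faster IN_b + OUT_y. That is the failure mode to watch for when the IN and OUT choices interact; on a card where the timings are nearly flat, both approaches land on an equally good spot.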
[QUOTE=nomead;532534]Already did, but of course not enough (five each of "without merge", IN1A+OUT1A, and IN3+OUT5). At least that time, the results varied max. 2µs from run to run.[/QUOTE]For a rock steady constant signal, wouldn't there be +-1 lsb of digitization noise, in this case +-1us?
I guess George's post means that if there's MERGED_MIDDLE, the default is 3's; else baseline NO_ASM only, middle is not merged so prior code, no in or out, so no 3's. (Who leaves indexing and autoupdates turned on?) |
[QUOTE=kriesel;532537]For a rock steady constant signal, wouldn't there be +-1 lsb of digitization noise, in this case +-1us?
I guess George's post means that if there's MERGED_MIDDLE, the default is 3's; else baseline NO_ASM only, middle is not merged so prior code, no in or out, so no 3's. [/QUOTE]
Yeah, well, whatever the explanation, I have now rerun those repeatability runs: 20 runs each of 50000 iterations, alternating between no merge (only NO_ASM), IN1A+OUT1A and IN3+OUT5.
The baseline (NO_ASM) had a slight anomaly on the first run (3804 µs) but the rest were 3807 or 3808, with the average being 3807.4 µs including that one outlier. It is very tempting to throw away that first measurement result, but then it wouldn't be an accurate representation of reality anymore. 1A+1A was 3689 or 3690 µs, average 3689.8 µs. 3+5 was 3680 µs every single time.
Don't get me started on quantization noise...:cmd: I'm used to getting reliable and repeatable results when timing other programs, mostly mfaktc, but I have to admit these are exceptionally steady, about one digit more than I'm used to getting. Maybe I should start doubting the method and use some sort of external timer as well, instead of blindly trusting the internal timer within the program. But that's way too much effort to sink into a quick test like this.
[QUOTE=kriesel;532537] (Who leaves indexing and autoupdates turned on?)[/QUOTE]
Not by my own choice of course, but the Win10 box I have at work has autoupdates forced on by group policy (corporate IT). Not sure about search though. And yeah, likewise the antivirus software (F-Secure) is forced always on. I still manage to run prime95 on it, but there the iteration timings are anything but stable. |
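To put the averages reported above into relative terms, here is a tiny Python helper. The three input numbers are the averages from the post (baseline, IN1A+OUT1A, IN3+OUT5); the function name `percent_faster` is made up for the example.

```python
def percent_faster(baseline_us: float, candidate_us: float) -> float:
    """Reduction in per-iteration time relative to the baseline, in percent."""
    return 100.0 * (baseline_us - candidate_us) / baseline_us


if __name__ == "__main__":
    baseline = 3807.4  # NO_ASM only, average of 20 runs
    print(f"IN1A+OUT1A: {percent_faster(baseline, 3689.8):.2f}% faster")
    print(f"IN3+OUT5:   {percent_faster(baseline, 3680.0):.2f}% faster")
```

That works out to roughly 3.1% for IN1A+OUT1A and roughly 3.3% for IN3+OUT5. So on this card the gain from MERGED_MIDDLE itself is far larger than the spread between the individual IN/OUT choices.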