![]() |
![]() |
#1 |
"Mihai Preda"
Apr 2015
2×727 Posts |
![]()
Dear Mersenners,
it is my pleasure to [pre]announce a new GPU LL checker, gpuOwL. It is implemented in OpenCL, targeting mainly AMD GPUs (this is the hardware that I use / test on). The name gpuOwL brings together the "O" from OpenCL, the L from Lucas-Lehmer, and the obvious "gpu", evoking a long-thinking being that stays active at night (that would be my desktop :) Now on to the technical details. I implemented this from scratch, based mainly on the excellent article "Discrete Weighted Transforms and Large Integer Arithmetic" by Crandall & Fagin, http://www.faginfamily.net/barry/Pap...Transforms.pdf . The main motivation while writing this program was exploring how fast such a computation (LL) can be done on AMD GPUs -- I wanted to get a "fastest" implementation. The secondary goal was getting an elegant, beautiful implementation -- simple/straightforward and small code (the code is indeed quite small, right now totaling ~1K LOC). The code is also self-contained, without dependencies on external libraries (such as an FFT library). The only requirements are a C++ compiler and an OpenCL implementation. (I developed and use the program on Linux, with AMDGPU-Pro for OpenCL). The source-code is available on github: https://github.com/preda/gpuowl The program is still under development. Probably it still has some bugs, especially in the recent code, that are being worked out. OTOH the basic FFT and LL logic seems to be sound. Usage: - compile with either make (Makefile) or with the "build.sh" script, or even straight command-line: c++ -O2 gpuowl.cpp -ogpuowl -lOpenCL -L/path/to/lib/opencl e.g. "-L/opt/amdgpu-pro/lib/x86_64-linux-gnu" - get LL assignments for exponents around 67-77M and put them into worktodo.txt - the program prints progress report (every 20k iterations), and after a longer time like a couple of days outputs to results.txt - it automatically saves progress to a save-exponent.bin file, and can be stopped at any time, will resume on restart. The pluses: + it is fast. For exponents in the vicinity of 77M, 2.1 ms/iteration on FuryX, 2.4 ms/iter on R9-390x + the code is supposedly easy to understand (or will be so when I finish an accompanying write-up explaining the tricky bits and the maths). In any case, it is quite small. + it works on AMD GPUs :) + prints residue and error info on every log step. The minuses: - gpuOwL in the present incarnation supports a single POT FFT size, 4M. This makes it able to handle exponents "up to 78M", and probably optimally above 70M. That means, for smaller exponents a smaller FFT would be faster; while for larger exponents the floating-point errors are getting too large and require a larger FFT size. - does not use shiftcount (may be added in the future) - may have rough edges (is still in development) The tricks: "transposed convolution": getting "under the hood" of the FFT allowed some performance gains. A traditional convolution based on the "matrix FFT algorithm" (as in AMD's clFFT) proceeds with these steps: 1. transpose 2. sub-FFT 3. transpose 4. sub-FFT 5. transpose 6. Squaring 7. transpose 8. sub-FFT 9. transpose 10. sub-FFT 11. transpose (1-5 is forward-FFT, 7-11 is inverse-FFT) OTOH in gpuOwL I store the data in a transposed representation, which gets rid of step 1 (and 11) above. The squaring can be done just as well over the transposed form, thus steps 5 and 7 (the transpose before and after the squaring) aren't needed. Thus the steps remaining in gpuOwL are: 1. sub-FFT 2. transpose 3. sub-FFT 4. Squaring 5. sub-FFT 6. transpose 7. sub-FFT Another trick (which AFAIK is used by CUDALucas and clLucas as well) is mapping a real-FFT into a complex-FFT of half size, which halves the FFT size. Another small trick is using a probabilistic argument for limiting the carry propagation, which simplifies a bit the carry code. Another small trick is computing the various pre-computed constants (such as the FFT coefficients and the A, A^-1 vectors for the DWT) in quad-precision ("long double") before conversion to double precision, which pushes the error bound a tiny bit further. I'll be looking for problems to fix. I hope you'll enjoy! cheers, Mihai Last fiddled with by Uncwilly on 2020-12-01 at 06:40 |
![]() |
![]() |
![]() |
#2 |
Romulan Interpreter
"name field"
Jun 2011
Thailand
2×5,179 Posts |
![]()
Brava băiatu', felicitări și la mai multe de astea.
Întrebări: 1. Speed comparison with clLucas? 2. Any windows build? Mersic. |
![]() |
![]() |
![]() |
#3 |
"Victor de Hollander"
Aug 2011
the Netherlands
32·131 Posts |
![]()
First of all nice job!
Code:
Victor@PCVICTOR MINGW64 /home/gpuowl-master $ make g++ -O2 -std=c++11 gpuowl.cpp -ogpuowl -L/home/OpenCL -lOpenCL In file included from gpuowl.cpp:4:0: clwrap.h: In member function 'void Program::compile(Context&, const char*, const char*)': clwrap.h:56:22: error: too many arguments to function 'int mkdir(const char*)' mkdir("isa", 0777); ^ In file included from C:/msys64/mingw64/x86_64-w64-mingw32/include/sys/stat.h:14:0, from clwrap.h:7, from gpuowl.cpp:4: C:/msys64/mingw64/x86_64-w64-mingw32/include/io.h:271:15: note: declared here int __cdecl mkdir (const char *) __MINGW_ATTRIB_DEPRECATED_MSVC2005; ^ gpuowl.cpp: In function 'double* genBigTrig(int, int)': gpuowl.cpp:51:17: error: 'M_PIl' was not declared in this scope auto base = - M_PIl / (W * H / 2); ^ gpuowl.cpp: In function 'double* genSin(int, int)': gpuowl.cpp:70:17: error: 'M_PIl' was not declared in this scope auto base = - M_PIl / (W * H); ^ gpuowl.cpp: In function 'double* smallTrigBlock(int, int, double*)': gpuowl.cpp:86:22: error: 'M_PIl' was not declared in this scope auto angle = - M_PIl * line * col / (W * H / 2); ^ gpuowl.cpp: In function 'bool checkPrime(int, bool*, u64*)': gpuowl.cpp:369:9: error: 'uint' does not name a type const uint zero = 0; ^ gpuowl.cpp:370:75: error: 'zero' was not declared in this scope Buf bufErr (c, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(int), &zero); ^ gpuowl.cpp:402:5: error: 'uint' was not declared in this scope uint rawErr = 0; ^ gpuowl.cpp:403:43: error: 'rawErr' was not declared in this scope q.read( false, bufErr, sizeof(uint), &rawErr); ^ gpuowl.cpp: In function 'int main(int, char**)': gpuowl.cpp:422:30: error: 'setlinebuf' was not declared in this scope if (logf) { setlinebuf(logf); } ^ Makefile:2: recept voor doel 'gpuowl' is mislukt make: *** [gpuowl] Fout 1 ![]() C:\msys64\home\OpenCL has a copy of the files from C:\Program Files (x86)\AMD APP SDK\2.9\lib\x86_64 which has the libOpenCL.a file the errors are just abracadabra to me ![]() |
![]() |
![]() |
![]() |
#4 |
"Mihai Preda"
Apr 2015
5AE16 Posts |
![]()
Salut :)
Sorry, I don't have a windows build myself -- no access to windows :). But maybe somebody else could prepare one. For similar reasons I don't have a straight speed comparison with clLucas -- I couldn't build it yet -- it depends on SDKApplication.hpp which are part of AMDAPP SDK (that I'm not using), and also openclsdkdefs.mk and openclsdkrules.mk. I do think these build problems can be overcome, but I didn't put the effort in yet. OTOH as both programs print time-per-iteration, such comparison should be easy to do by anybody who gets to build both programs. I would bet mine is faster :), but by how much..? |
![]() |
![]() |
![]() |
#5 |
"Mihai Preda"
Apr 2015
2·727 Posts |
![]()
I just fixed some of those errors for mingw. Would you try again?
|
![]() |
![]() |
![]() |
#6 |
Jul 2009
Tokyo
2×5×61 Posts |
![]()
Great!!!
I hope release Open Source License. |
![]() |
![]() |
![]() |
#7 |
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
320910 Posts |
![]()
AirSquirrels had run some cLucas benchmarks.
A Fury X oc'ed to 1100,500: M75002911 gave 3.9026 ms/iter A Fury X with no oc: M75002911 gave 4.0754 ms/iter And a R9 390X: M75002911 gave 4.3873 ms/iter Last fiddled with by Mark Rose on 2017-04-19 at 21:15 |
![]() |
![]() |
![]() |
#8 |
"Mihai Preda"
Apr 2015
2×727 Posts |
![]()
Yep, licensed under GPL v3 (added License on github).
|
![]() |
![]() |
![]() |
#9 |
"Mihai Preda"
Apr 2015
5AE16 Posts |
![]()
Nice. Let's say 1.8 times faster than clLucas then :)
|
![]() |
![]() |
![]() |
#10 |
"David"
Jul 2015
Ohio
11×47 Posts |
![]()
I will pull this down and run some benchmark comparisons on the same system.
|
![]() |
![]() |
![]() |
#11 |
"David"
Jul 2015
Ohio
11×47 Posts |
![]()
Pretty fast for hand rolled code (clFFT has had a lot of resources put into it by AMD), but I'm definitely not seeing the performance levels indicated above. Still slower than clLucas 1.04. Anything I should check? This is a FuryX
Good news is, the residues match for the first chunk of iterations. Code:
gpuOwL v0.1 GPU Lucas-Lehmer primality checker LL of 75002911 at iteration 0 FFT 1024*2048 (4M words, 17.88 bits per word) OpenCL compile: 1106 ms setup: 1638 ms 00020000 / 75002911 [0.03%], ms/iter: 5.413, ETA: 4d 16:45; 777b6635d6b78b75 error 0.125 (max 0.125) 00040000 / 75002911 [0.05%], ms/iter: 5.422, ETA: 4d 16:54; b9fc5678347cad9f error 0.140625 (max 0.140625) 00060000 / 75002911 [0.08%], ms/iter: 5.413, ETA: 4d 16:40; e7fab5c1f11d0f39 error 0.125 (max 0.140625) 00080000 / 75002911 [0.11%], ms/iter: 5.423, ETA: 4d 16:52; 76a6fb920dd95b71 error 0.140625 (max 0.140625) Code:
Continuing work from a partial result of M75002911 fft length = 4096K iteration = 1001 Iteration 10000 M( 75002911 )C, 0xc9a6d6ecad1fb00c, n = 4096K, clLucas v1.04 err = 0.1875 (0:36 real, 3.5486 ms/iter, ETA 73:55:06) Iteration 20000 M( 75002911 )C, 0x777b6635d6b78b75, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9315 ms/iter, ETA 81:53:04) Iteration 30000 M( 75002911 )C, 0x0f0c343e5174fa89, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9205 ms/iter, ETA 81:38:41) Iteration 40000 M( 75002911 )C, 0xb9fc5678347cad9f, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9280 ms/iter, ETA 81:47:22) Iteration 50000 M( 75002911 )C, 0x992a088a20504a90, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9407 ms/iter, ETA 82:02:32) Iteration 60000 M( 75002911 )C, 0xe7fab5c1f11d0f39, n = 4096K, clLucas v1.04 err = 0.1875 (0:40 real, 3.9230 ms/iter, ETA 81:39:51) Iteration 70000 M( 75002911 )C, 0x89386b82336fc06d, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9320 ms/iter, ETA 81:50:26) Iteration 80000 M( 75002911 )C, 0x76a6fb920dd95b71, n = 4096K, clLucas v1.04 err = 0.1875 (0:39 real, 3.9236 ms/iter, ETA 81:39:18) |
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
mfakto: an OpenCL program for Mersenne prefactoring | Bdot | GPU Computing | 1724 | 2023-06-04 23:31 |
GPUOWL AMD Windows OpenCL issues | xx005fs | GpuOwl | 0 | 2019-07-26 21:37 |
Testing an expression for primality | 1260 | Software | 17 | 2015-08-28 01:35 |
Testing Mersenne cofactors for primality? | CRGreathouse | Computer Science & Computational Number Theory | 18 | 2013-06-08 19:12 |
Primality-testing program with multiple types of moduli (PFGW-related) | Unregistered | Information & Answers | 4 | 2006-10-04 22:38 |