mersenneforum.org gpuOwL: an OpenCL program for Mersenne primality testing

2017-04-23, 16:07   #23
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2×587 Posts

Quote:
 Originally Posted by kladner If I may say, as a spectator, and a non coder, it amazes me to watch this birth process. The cooperation and involvement by several parties is impressive. Seeing this play out is one of the big pay-offs for hanging out on this forum.
Preda deserves all credit for the coding. I'm just trying to compile it for Win64 and reporting the errors I get.

2017-04-23, 17:49   #24

"Kieren"
Jul 2011
In My Own Galaxy!

2^6×157 Posts

Quote:
 Originally Posted by VictordeHolland Preda deserves all credit for the coding. I'm just trying to compile it for Win64 and reporting the errors I get.
The coding is the "prime" accomplishment <G>, but others taking an interest is like peer review. Both are needed, at least in most cases.

2017-04-23, 19:53   #25
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2·587 Posts

Success here too with compiling it with MINGW64 for Windows. In my limited testing so far, gpuOwL is faster and shows slightly lower error rates on my AMD HD7950.

gpuOwL v0.1
Quote:
 gpuOwL v0.1 GPU Lucas-Lehmer primality checker
 Tahiti - OpenCL 1.2 AMD-APP (2079.5)
 LL FFT 4096K (1024*2048*2) of 76000021 (18.12 bits/word) at iteration 0
 OpenCL setup: 656 ms
 00020000 / 76000021 [0.03%], ms/iter: 5.618, ETA: 4d 22:35; 000000009c13c393 error 0.185102 (max 0.185102)
 00040000 / 76000021 [0.05%], ms/iter: 5.621, ETA: 4d 22:37; 00000000f08b90ef error 0.178388 (max 0.185102)
 00060000 / 76000021 [0.08%], ms/iter: 5.627, ETA: 4d 22:42; 00000000d68504f2 error 0.186264 (max 0.186264)
 00080000 / 76000021 [0.11%], ms/iter: 5.621, ETA: 4d 22:32; 00000000e93a873a error 0.191096 (max 0.191096)
 00100000 / 76000021 [0.13%], ms/iter: 5.609, ETA: 4d 22:15; 0000000035a87d3f error 0.198382 (max 0.198382)
clLucas v1.04
Quote:
 C:\clLucas_x64_1.04>clLucas_x64
 Platform 0 : Advanced Micro Devices, Inc.
 Warning: Couldn't parse ini file option SixStepFFT; using default: off
 Platform :Advanced Micro Devices, Inc.
 Device 0 : Tahiti
 Build Options are : -D KHR_DP_EXTENSION
 CL_DEVICE_NAME Tahiti
 CL_DEVICE_VENDOR Advanced Micro Devices, Inc.
 CL_DEVICE_VERSION OpenCL 1.2 AMD-APP (2079.5)
 CL_DRIVER_VERSION 2079.5 (VM)
 CL_DEVICE_MAX_COMPUTE_UNITS 28
 CL_DEVICE_MAX_CLOCK_FREQUENCY 900
 CL_DEVICE_GLOBAL_MEM_SIZE 3221225472
 CL_DEVICE_MAX_WORK_GROUP_SIZE 256
 CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE 1
 Starting M76000021 fft length = 4096K
 Running careful round off test for 1000 iterations. If average error >= 0.25, the test will restart with a larger FFT length.
 Iteration 100, average error = 0.13540, max error = 0.18750
 Iteration 200, average error = 0.16145, max error = 0.18750
 Iteration 300, average error = 0.17013, max error = 0.18750
 Iteration 400, average error = 0.17448, max error = 0.18750
 Iteration 500, average error = 0.17724, max error = 0.20313
 Iteration 600, average error = 0.18155, max error = 0.20313
 Iteration 700, average error = 0.18463, max error = 0.20313
 Iteration 800, average error = 0.18876, max error = 0.21875
 Iteration 900, average error = 0.19209, max error = 0.21875
 Iteration 1000, average error = 0.19474 < 0.25 (max error = 0.21875), continuing test.
 Iteration 20000 M( 76000021 )C, 0x27d2e8539c13c393, n = 4096K, clLucas v1.04 err = 0.2188 (2:01 real, 6.0576 ms/iter, ETA 127:50:52)
 Iteration 40000 M( 76000021 )C, 0xbdeced3ff08b90ef, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0910 ms/iter, ETA 128:31:14)
 Iteration 60000 M( 76000021 )C, 0x7c3f23c5d68504f2, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0879 ms/iter, ETA 128:25:15)
 Iteration 80000 M( 76000021 )C, 0x02b75b4ce93a873a, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0903 ms/iter, ETA 128:26:11)
 Iteration 100000 M( 76000021 )C, 0xa04bf1ff35a87d3f, n = 4096K, clLucas v1.04 err = 0.2188 (2:02 real, 6.0895 ms/iter, ETA 128:23:10)
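As a quick sanity check on the figures in the two logs above, the bits-per-word value and both ETAs follow directly from the exponent, the FFT size, and the reported ms/iter (a minimal sketch; the numbers are taken straight from the logs):

```python
# Cross-check figures from the two logs above: bits per FFT word, and
# the ETA implied by the reported ms/iter at the 20000-iteration mark.
fft_words = 4096 * 1024           # the "4096K" FFT length
exponent = 76000021

bits_per_word = exponent / fft_words
print(f"{bits_per_word:.2f} bits/word")        # matches the 18.12 in the gpuOwL log

def eta_hours(ms_per_iter, done, total):
    """Remaining wall-clock hours, assuming constant per-iteration cost."""
    return (total - done) * ms_per_iter / 1000 / 3600

print(f"gpuOwL : {eta_hours(5.618, 20000, exponent):.2f} h")    # ~4d 22:35
print(f"clLucas: {eta_hours(6.0576, 20000, exponent):.2f} h")   # ~127:50
```

The ~8% gap in ms/iter (5.618 vs 6.058) accounts for the roughly nine-hour difference in the two ETAs.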

2017-04-23, 19:58   #26
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2·587 Posts

'Guide' for compiling it on Windows with msys64+MINGW64

1. This assumes you have installed msys64 with MINGW64; you'll also need a text editor (I prefer Notepad++).
2. Download and install AMD SDK 3.0 from http://developer.amd.com/tools-and-s...ssing-app-sdk/
3. Download the gpuowl code from https://github.com/preda/gpuowl (easiest is to download the entire folder as a .zip).
4. Extract the .zip, preferably to somewhere you can navigate to easily with msys64, for instance msys64\home\gpuowl.
5. Open the Makefile (in the gpuowl folder) with your text editor (Notepad++).
6. Edit the path behind the "-L" argument to the OpenCL SDK install path. The standard path is C:\Program Files (x86)\AMD APP SDK\3.0\lib\x86_64. I copied the folder contents to msys64\home\OpenCL\SDK3 for easier referencing, since I always get confused by paths with spaces and brackets. In my case the Makefile contains:
Code:
gpuowl: gpuowl.cpp clwrap.h tinycl.h
	g++ -O2 -std=c++14 gpuowl.cpp -ogpuowl -L\C\msys64\home\OpenCL\SDK3 -lOpenCL
Don't forget to save the changes ;).
6.1 [OPTIONAL] If you don't have an OpenCL 2.0 device, edit the clwrap.h file and on line 89 change "-cl-std=CL2.0" to "-cl-std=CL1.2".
7. Start the msys64/MINGW64 shell and navigate to the gpuowl folder. If you put the folder in the /home directory of msys64 you can get there easily by typing "cd .." to reach the home directory, and then "cd yourgpuowlfoldername".
8. Run 'make'. It should look something like this so far:
Code:
MINGW64 ~
$ cd ..

MINGW64 /home
$ cd gpuOwlv0.1/

MINGW64 /home/gpuOwlv0.1
$ make
g++ -O2 -std=c++14 gpuowl.cpp -ogpuowl -L\C\msys64\home\OpenCL\SDK3 -lOpenCL

MINGW64 /home/gpuOwlv0.1
$
9. Assuming you didn't get any errors, you should now have a gpuowl.exe in the gpuowl directory.
9.1 [OPTIONAL] If you wish to move the gpuowl directory somewhere else, you probably need to copy these 3 .dlls into the directory:
libgcc_s_seh-1.dll
libstdc++-6.dll
libwinpthread-1.dll
10. Create a worktodo.txt and start testing. Please note gpuOwL is not production ready.
2017-04-25, 04:41   #27
LaurV
Romulan Interpreter

Jun 2011
Thailand

2^2·7·11·29 Posts

Can someone post or PM me a windoze exe? [edit: x64]. We still have trouble compiling it (the same troubles as above, most probably related to our ignorance of the tools). We are going to give it a test too; we own an old "XFX HD7970 GHz" here.

P.S. We fully understand that it is not "production ready" yet, but if it is faster than clLucas and gives out the same DC residue, for sure we will report it to PrimeNet and get some fast credits!

Last fiddled with by LaurV on 2017-04-25 at 04:47
2017-04-25, 11:42   #28
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

1174₁₀ Posts

Quote:
 Originally Posted by LaurV Can someone post or PM me a windoze exe? [edit: x64]. We still have trouble compiling it (the same troubles as above - but the troubles are most probably related to our ignorance with the tools). We are going to give it a test too - we own an old "XFX HD7970 GHz" here. P.S. we fully understand that it is not "production ready" yet, but if it is faster than clLucas and it is giving out the same DC residue, for sure we will report it to PrimeNet and get some fast credits!
I've got an HD7950, so my executable should hopefully work for you too.
I'll send it when I get home tonight.

2017-04-25, 15:32   #29
LaurV
Romulan Interpreter

Jun 2011
Thailand

2^2×7×11×29 Posts

Quote:
The CIA hacked stuff, huh?

(I know why I always used pn2... haha)

Last fiddled with by LaurV on 2017-04-25 at 15:36 Reason: link

2017-04-25, 16:48   #30
VictordeHolland

"Victor de Hollander"
Aug 2011
the Netherlands

2·587 Posts

Quote:
 Originally Posted by LaurV The CIA hacked stuff, huh? (I know why I always used pn2... haha)
Windows itself probably has multiple (dozens?) of unpatched zero-days known only to the three-letter agencies, so I wouldn't lose any sleep over it.

2017-04-26, 01:51   #31
preda

"Mihai Preda"
Apr 2015

1328₁₀ Posts

Thanks for the MinGW compilation, and the screenshots! The screenshots show there's an error in printing the residue (leading digits being 0) -- that's hopefully fixed now (not a big deal, the problem was just cosmetic).

A small update on where gpuOwL is, and what I'm planning next. I was really worried by the results of some of my own testing -- the LL was failing on known primes (24036583). Thus I decided to do some more serious testing to find the cause of the error. But after all this investigation, my conclusion was that it's not a software bug, but the GPU producing an erroneous result very rarely. This is disconcerting, and I'd really like to have a way to detect such problems.

The LL involves two distinct computations. One is FFT-Square-IFFT; the second is "round-to-int + carry-propagation". An error can occur in either of these, and these are the detection mechanisms that I know of:

- Evaluating the "max rounding error" that occurs when rounding to int, after the IFFT and before the carry-propagation. This is cheap to compute on the GPU, thus is always on (and printed at every logstep). This rounding error brings two pieces of information: 1. whether the FFT size is big enough for the chosen exponent, and 2. whether something went completely wrong with the FFT/IFFT. I plan to add provisions in the code for detecting a sudden jump in the rounding error (which may indicate an FFT error), and re-run the last batch in that situation to check for consistent results.
- Evaluating the SUMINP / SUMOUT of the FFT (which is done by the CPU prime95). This is not implemented, because it seems (to me) expensive to do on the GPU. This check would have provided very good detection for FFT/IFFT errors, but no protection against rounding & carry-propagation errors.
- Using "offset". This changes the values fed to the FFT/IFFT, and again protects against FFT errors (but without detecting them "in real time").

As is, there is no check that I know of that covers the carry-propagation. If an error takes place in that part, it would not be detected by either the max-rounding-error check or the SUMINP/SUMOUT check. I would be interested in finding out about a GPU-cheap way to check that the integer digits of the modulo-convolution done by LL are not completely haywire.

Development plan:
- Implement "offset", and measure the performance impact. If the impact is small, leave it always on.
- Check for sudden jumps in the rounding error, and automatically re-try in that situation. May help detect too-overclocked GPUs (but not always).
- Add a simple self-test, which would run on known primes and compare residues against a pre-saved residue list, to detect obvious errors. (To detect more subtle errors, a good but expensive way is to run to completion on known primes (and check for a 0 residue), or double-check validated results.)

Still missing:
- Ability to select a specific GPU in a multi-GPU system (right now, it simply uses the first GPU).
- Get some ISA dumps (produced with "-cl -save-temps") and analyze them to investigate the low performance reported.
- Add the ability to dump the compiled binary (for OpenCLs that do not offer "-save-temps").
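The two stages described here -- FFT-square-IFFT, then round-to-int plus carry-propagation -- and the max-rounding-error check can be sketched in miniature. This is a toy NumPy model on small base-10 digit vectors, not gpuOwL's actual balanced-digit, weighted-FFT kernel; all names are illustrative:

```python
import numpy as np

def fft_square_with_error(digits, base):
    """Square a little-endian digit vector via FFT convolution and report
    the max round-off error seen when rounding back to integers, mirroring
    the per-iteration check described above."""
    n = len(digits)
    # Stage 1: FFT-Square-IFFT. Zero-pad to 2n so the cyclic convolution
    # equals the full product (no modular wraparound in this toy version).
    f = np.fft.rfft(np.asarray(digits, dtype=np.float64), 2 * n)
    raw = np.fft.irfft(f * f, 2 * n)

    # Stage 2a: round-to-int, measuring how far the floats drifted.
    rounded = np.round(raw)
    max_err = float(np.max(np.abs(raw - rounded)))

    # Stage 2b: carry-propagation, bringing every digit back below `base`.
    out, carry = [], 0
    for v in rounded.astype(np.int64):
        v += carry
        out.append(int(v % base))
        carry = int(v // base)
    while carry:
        out.append(carry % base)
        carry //= base
    return out, max_err

def digits_to_int(digits, base):
    """Recombine a little-endian digit vector into an integer."""
    return sum(int(d) * base**i for i, d in enumerate(digits))
```

Squaring 123 (digits [3, 2, 1] in base 10) yields the digits of 15129 with a tiny max_err; as the digits per word grow toward the FFT's precision limit, max_err creeps toward 0.5, at which point rounding picks the wrong integer and the result silently corrupts -- which is why the 0.25-ish thresholds appear in the logs above.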
2017-04-26, 02:06   #32
science_man_88

"Forget I exist"
Jul 2009
Dumbassville

8,369 Posts

Quote:
 Originally Posted by preda Thanks for the MinGW compilation, and the screenshots! The screenshots show there's an error in printing the residue (leading digits being 0) -- that's hopefully fixed now. [snip]
I'm not even really that involved in GIMPS, but is it returning 0xFFFFF... (hex for all 1's in binary)? That was my first thought; if so, it may be returning the Mersenne number itself (or a multiple) instead of 0. I believe something similar happened in Prime95 at one point (unless my memory of what I read is foggy).
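The arithmetic behind this guess is quick to illustrate: 0 and Mp are the same residue class mod Mp = 2^p - 1, so a missed final reduction would print Mp's low 64 bits, which are all ones (a minimal sketch, using the exponent from the post above):

```python
# Mod Mp = 2^p - 1, the values 0 and Mp are congruent, so a result left
# unreduced would print an all-ones low 64-bit word instead of 0.
p = 24036583
Mp = (1 << p) - 1

assert Mp % Mp == 0                                   # fully reduced: prints as 0
assert Mp & 0xFFFFFFFFFFFFFFFF == 0xFFFFFFFFFFFFFFFF  # unreduced: all-ones low word
```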

Last fiddled with by science_man_88 on 2017-04-26 at 02:09

2017-04-26, 03:07   #33
preda

"Mihai Preda"
Apr 2015

2^4·83 Posts

No, it's not all-1s. I ran 24036583 twice; the second time the result was correct (0). I tracked the difference between the two runs by comparing the residues, and at some point around 13% of the way through, the residues diverged. That means an error occurred at that point in the first run. Given that the software is supposed to be deterministic (producing identical bits every time), this could be explained by the hardware behaving funny.
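The run-comparison described here -- two supposedly deterministic runs whose checkpoint residues pinpoint where the error struck -- can be sketched as follows (the checkpoint values are made up for illustration):

```python
def first_divergence(run_a, run_b):
    """Return the first checkpoint iteration at which two residue logs
    disagree, or None if they match throughout. Each log is a list of
    (iteration, residue) pairs from a supposedly deterministic run."""
    for (it_a, res_a), (it_b, res_b) in zip(run_a, run_b):
        assert it_a == it_b, "logs must share the same checkpoints"
        if res_a != res_b:
            return it_a
    return None

# Hypothetical checkpoint logs: the runs agree until iteration 60000.
run_a = [(20000, 0x9C13C393), (40000, 0xF08B90EF), (60000, 0xD68504F2)]
run_b = [(20000, 0x9C13C393), (40000, 0xF08B90EF), (60000, 0x0BADC0DE)]
print(first_divergence(run_a, run_b))   # → 60000
```

A divergence localizes the error to the interval since the last matching checkpoint, which is what makes the "re-run the last batch" idea in the development plan above workable.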

