mersenneforum.org  

2017-05-10, 22:13 #133
preda
"Mihai Preda"
Apr 2015

55B₁₆ Posts

Quote:
Originally Posted by lalera View Post
hi,
done
M( 42424699 )C, 0x32c1a90e903fa63c, offset = 0, n = 4096K, gpuowl v0.1, AID: 42424699
This is the correct residue, so the error on this exponent is not repeatable.

2017-05-21, 23:14 #134
preda
"Mihai Preda"
Apr 2015

3·457 Posts

Quote:
Originally Posted by airsquirrels View Post
So far I've run at least 28 exponents completely through gpuOwl: 2 known bad runs, 2 that need a triple-check (TC) (anyone?), and 24 successes, for an 85.7%-92.8% success rate. At least one of the bad runs came from a system power issue (42396859), but 42424699 failed on an otherwise rock-solid system.

Liquid Cooled Fury X System:
73001809 - MATCH
73001989 - MATCH
73001603 - MATCH
73002113 - MATCH
73002211 - MATCH
73001413 - MATCH
73002079 - MATCH
73002169 - MATCH
73001801 - MATCH
73002341 - MATCH
73001441 - MATCH
73001849 - MisMatch (Needs TC)
42446867 - MATCH
42495623 - MATCH
42852191 - MATCH
42424699 - MisMatch BAD
70000631 - MATCH
70000717 - MATCH
70000549 - MATCH
70000589 - MATCH
70000613 - MATCH
70000481 - MisMatch (Needs TC)
42397447 - Match
42396859 - Match (After failed run 00..02)
42397951 - Match
42397837 - Match
70000739 - Match
70000811 - Match
David, so is your impression that there is a software bug involved?

2017-05-22, 00:03 #135
preda
"Mihai Preda"
Apr 2015

1371₁₀ Posts
Update

I updated gpuOwL on GitHub ( https://github.com/preda/gpuowl ), bumping the version to 0.2. Here is a summary of the changes:

1. "Amalgamation kernel". I merged 4 previously distinct kernels into one big kernel. This saves about 3 global-memory round-trips, but it does not change the amount of double-precision work. Because the previous kernels were already close to being double-precision bound (and close to memory-bound too), the performance gain from the amalgamation is a modest 5%-10%.

2. Added option -legacy which forces the old behavior (i.e. not using the "amalgamation kernel").

3. Added the define -D NO_ERR, which disables computation of the max-error. This gains about 1% performance, but I don't recommend it, because the max-error is useful information.
Similarly, added the define -D LOW_LDS, which selects a variant of the amalgamation kernel with low LDS (local data share) usage.

These defines are passed on the command line like this, for example:
./gpuowl -logstep 10000 -cl "-DNO_ERR -DLOW_LDS"

(Note that there is a single argument after -cl, enclosed in quotes if it contains spaces.)

4. Changed the carry propagation to stay in double precision. Previously it went through an intermediate integer phase, but the double-to-long conversion is expensive on GCN.

5. The carry propagation length is much shorter, only 3 words now. This raises the lower bound on the exponent to about 12 bits per word for a given FFT size (-legacy is not affected).
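To make items 3 to 5 more concrete, here is a minimal C++ sketch of the normalization step that follows an inverse FFT: round each word to the nearest integer, track the largest rounding distance (the max-error that -D NO_ERR skips), and split each word into a digit and a carry without leaving double precision. This is an illustration under stated assumptions, not gpuOwL's kernel: the names are hypothetical, digits are kept non-negative, the carry is propagated serially across all words rather than over the short 3-word chain described above, and the final carry is returned instead of being wrapped around to word 0 as a Mersenne-modular transform would require.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type for the sketch.
struct CarryResult {
    double maxErr;    // largest |x - rint(x)| seen (the "max-error")
    double carryOut;  // carry left after the last word (not wrapped around)
};

// Round, measure round-off, and propagate carries, all in double precision.
// widths[j] is the number of bits word j holds.
CarryResult roundAndCarry(std::vector<double>& words, const std::vector<int>& widths) {
    assert(words.size() == widths.size());
    double maxErr = 0.0;
    double carry = 0.0;
    for (std::size_t j = 0; j < words.size(); ++j) {
        const double r = std::rint(words[j]);                // nearest integer
        maxErr = std::fmax(maxErr, std::fabs(words[j] - r));  // the part -D NO_ERR would skip
        const double x = r + carry;
        const double base = std::ldexp(1.0, widths[j]);      // 2^widths[j], exact
        carry = std::floor(x / base);                         // exact: division by a power of 2
        words[j] = x - carry * base;                          // digit in [0, base)
    }
    return {maxErr, carry};
}
```

The split is exact as long as every value stays below 2^53 in magnitude: dividing by a power of two only changes the exponent, so `floor(x / base)` and `x - carry * base` involve no rounding, and no double-to-integer conversion is needed.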

The checkpoint (save) format is not changed.

As usual, I recommend doing a -selftest (~30 minutes) and one successful double-check LL before starting first-time LL tests.

This amalgamation kernel is big. For performance it must fit within 128 VGPRs, but AMD's LLVM-based OpenCL compiler is very poor at optimizing VGPR allocation, and I had to fight it to fit under 128 VGPRs. If some compiler does not meet the 128-VGPR limit, the amalgamation kernel takes a serious performance hit.
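The 128-VGPR threshold comes from how GCN allocates vector registers. A minimal sketch, assuming the commonly documented GCN figures (a budget of 256 VGPRs per lane on each SIMD, allocation in granules of 4 registers, and at most 10 waves per SIMD), computes how many wavefronts a SIMD can hold when VGPRs are the only limit. LDS and SGPR limits are ignored, and the function name is hypothetical; this is a general GCN property, not something measured from gpuOwL.

```cpp
#include <algorithm>
#include <cassert>

// Sketch: waves per GCN SIMD when VGPR usage is the only limiter.
// Assumed figures: 256 VGPRs per lane per SIMD, allocation granule of 4,
// hardware cap of 10 waves per SIMD. LDS and SGPR limits are not modeled.
int wavesPerSimdFromVgprs(int vgprsPerWorkItem) {
    const int kVgprBudget = 256;
    const int kGranule = 4;
    const int kMaxWaves = 10;
    assert(vgprsPerWorkItem >= 1 && vgprsPerWorkItem <= kVgprBudget);
    // Round the request up to the allocation granule.
    const int allocated = (vgprsPerWorkItem + kGranule - 1) / kGranule * kGranule;
    return std::min(kMaxWaves, kVgprBudget / allocated);
}
```

Under these assumptions a kernel at 128 VGPRs still gets 2 waves per SIMD, while 129 VGPRs drops it to 1, which leaves far less latency hiding while a wave waits on memory. That is consistent with the serious performance hit described above.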

Below, on a Fury X, the 4M FFT runs at just over 2 ms/iteration:

Quote:
gpuOwL v0.2 GPU Lucas-Lehmer primality checker; Mon May 22 09:08:52 2017
Config: -logstep 20000 -savestep 10000000 -cl "-DNO_ERR"
64x1050MHz Fiji; OpenCL 1.2 AMD-APP (2348.3)
Compile : 807 ms
General setup : 492 ms
Exponent setup: 1145 ms
LL FFT 4096K (1024*2048*2) of 60000659 (14.31 bits/word) at iteration 160000
00180000 / 60000659 [0.30%], ms/iter: 2.005, ETA: 1d 09:19; e0f296f45afb1364 error 0.000671387 (max 0.000671387)
00200000 / 60000659 [0.33%], ms/iter: 2.005, ETA: 1d 09:18; b852f730f1679489 error 0.000732422 (max 0.000732422)
00220000 / 60000659 [0.37%], ms/iter: 2.004, ETA: 1d 09:17; 49365f7e5cd33dcc error 0.000610352 (max 0.000732422)
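As a rough cross-check of the ETA column, remaining iterations times ms/iter reproduces most of the ETAs in these logs. This C++ sketch assumes the remaining count is simply E - k, as the k / E progress column suggests (strictly, an LL test of M_p takes p - 2 iterations, which makes no visible difference here), and that the ETA is truncated to whole minutes; the function name is hypothetical. Because the printed ms/iter is rounded, the result can differ by a minute from what gpuOwL prints.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Sketch: reconstruct an "ETA: Xd HH:MM" field from the exponent E, the
// current iteration k and the measured ms per iteration.
// Assumes remaining iterations = E - k and truncation to whole minutes.
std::string formatEta(long long exponent, long long iteration, double msPerIter) {
    assert(iteration <= exponent && msPerIter >= 0.0);
    const long long secs =
        static_cast<long long>((exponent - iteration) * msPerIter / 1000.0);
    const long long days = secs / 86400;
    const long long hours = (secs % 86400) / 3600;
    const long long mins = (secs % 3600) / 60;
    char buf[48];
    std::snprintf(buf, sizeof buf, "%lldd %02lld:%02lld", days, hours, mins);
    return std::string(buf);
}
```

For example, `formatEta(60000659, 180000, 2.005)` returns "1d 09:19", matching the first log line above.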

2017-05-22, 00:17 #136
preda
"Mihai Preda"
Apr 2015

3·457 Posts

A couple more timings:
Fury Nano, with max-error disabled, 2.24 ms/it
Quote:
gpuOwL v0.2 GPU Lucas-Lehmer primality checker; Mon May 22 09:59:05 2017
Config: -logstep 20000 -savestep 10000000 -cl "-D NO_ERR"
64x1000MHz Fiji; OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 74000929 (17.64 bits/word) at iteration 13800000
13820000 / 74000929 [18.68%], ms/iter: 2.229, ETA: 1d 13:16; d9b94cd49b28a4ca error 0.078125 (max 0.078125)
13840000 / 74000929 [18.70%], ms/iter: 2.243, ETA: 1d 13:29; 69ca66f135da2a3c error 0.0664062 (max 0.078125)
390X with max-error enabled, 2.355 ms/it.
Quote:
gpuOwL v0.2 GPU Lucas-Lehmer primality checker; Mon May 22 09:09:57 2017
Config: -logstep 20000 -savestep 10000000 -cl ""
44x1080MHz Hawaii; OpenCL 1.2 AMD-APP (2348.3)
LL FFT 4096K (1024*2048*2) of 60000113 (14.31 bits/word) at iteration 180000
00200000 / 60000113 [0.33%], ms/iter: 2.355, ETA: 1d 15:07; fe5da1028c941f41 error 0.000976562 (max 0.000976562)
00220000 / 60000113 [0.37%], ms/iter: 2.357, ETA: 1d 15:08; 4b288ab18a8abb00 error 0.000976562 (max 0.000976562)

2017-05-22, 02:04 #137
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

2³·271 Posts

Got this error while trying to compile:
Code:
$ g++ -c gpuowl.cpp
gpuowl.cpp: In function 'int main(int, char**)':
gpuowl.cpp:702:5: error: 'uint' was not declared in this scope
     uint baseBitlen = (int) floorl(E / (long double) N);
     ^~~~
gpuowl.cpp:714:25: error: 'baseBitlen' was not declared in this scope
     mega1K.setArgs     (baseBitlen, buf1, bufCarry, bufReady, bufErr, bufA, bufI, bufTrig1K);
                         ^~~~~~~~~~

2017-05-22, 02:43 #138
preda
"Mihai Preda"
Apr 2015

10101011011₂ Posts

Quote:
Originally Posted by kracker View Post
Got this error while trying to compile:
Code:
$ g++ -c gpuowl.cpp
gpuowl.cpp: In function 'int main(int, char**)':
gpuowl.cpp:702:5: error: 'uint' was not declared in this scope
     uint baseBitlen = (int) floorl(E / (long double) N);
     ^~~~
gpuowl.cpp:714:25: error: 'baseBitlen' was not declared in this scope
     mega1K.setArgs     (baseBitlen, buf1, bufCarry, bufReady, bufErr, bufA, bufI, bufTrig1K);
                         ^~~~~~~~~~
Thanks, fixed.

2017-05-22, 03:00 #139
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts

Quote:
Originally Posted by preda View Post
Thanks, fixed.
Thanks!

I'm getting this now, though, when I start it:
Code:
...
Compile : 2160 ms
General setup : 476 ms
Assertion failed!

Program: C:\Users\Back\Desktop\gpuowl\gpuowl.exe
File: gpuowl.cpp, Line 100

Expression: bits == baseBits || bits == baseBits +1

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.

2017-05-22, 03:38 #140
preda
"Mihai Preda"
Apr 2015

3×457 Posts

Quote:
Originally Posted by kracker View Post
Thanks!

I'm getting this now though when i start it...
Code:
Assertion failed!

Program: C:\Users\Back\Desktop\gpuowl\gpuowl.exe
File: gpuowl.cpp, Line 100

Expression: bits == baseBits || bits == baseBits +1
Could you please tell me which C++ compiler and which exponent you used, and which platform (MinGW?)? I'd like to reproduce this (probably the exponent is all I need to repro).

Last fiddled with by preda on 2017-05-22 at 03:40
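For context on the failed assertion: each FFT word holds either baseBits or baseBits + 1 bits of the residue. The sketch below assumes gpuOwL uses the standard Crandall-Fagin irrational-base (IBDWT) allocation, where word j of N gets ceil(E(j+1)/N) - ceil(Ej/N) bits; the helper names are hypothetical, and it does not diagnose the MinGW failure. It computes baseBits with integer division and fixed-width types from <cstdint>, one portable alternative to the non-standard `uint` in the earlier compile error (not necessarily the fix that was committed).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helpers, not gpuOwL's code.
// ceil(a / b) for unsigned integers, b > 0 (fine while E * N stays well below 2^64).
std::uint64_t ceilDiv(std::uint64_t a, std::uint64_t b) { return (a + b - 1) / b; }

// floor(E / N) via integer division: no long double, no `uint` typedef.
std::uint32_t baseBits(std::uint64_t E, std::uint64_t N) {
    return static_cast<std::uint32_t>(E / N);
}

// Crandall-Fagin allocation: word j holds ceil(E*(j+1)/N) - ceil(E*j/N) bits.
std::vector<std::uint32_t> wordBits(std::uint64_t E, std::uint64_t N) {
    const std::uint32_t base = baseBits(E, N);
    std::vector<std::uint32_t> bits(N);
    for (std::uint64_t j = 0; j < N; ++j) {
        bits[j] = static_cast<std::uint32_t>(ceilDiv(E * (j + 1), N) - ceilDiv(E * j, N));
        // The invariant the failed assertion checks.
        assert(bits[j] == base || bits[j] == base + 1);
    }
    return bits;
}
```

The widths telescope, so they always sum to exactly E; for the 60000659 / 4096K run above, every word holds 14 or 15 bits.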

2017-05-22, 13:19 #141
airsquirrels
"David"
Jul 2015
Ohio

517₁₀ Posts

Quote:
Originally Posted by preda View Post
I updated gpuOwL on github ( https://github.com/preda/gpuowl ), bumping version to 0.2. [...]
This is great! Thank you so much for all your hard work on this. I will pull the results of my gpuowl stress testing soon, and give this version some exercise on known reliable hardware.

Just from looking over the high-level numbers, it does look like something about gpuowl produces occasional bad results where clLucas had been reliable. It is hard to know whether this is simply from stressing the cards closer to their limit by extracting all the performance, or some other factor. So far I haven't found an error that repeats for any one exponent, so I don't believe there are math/logic bugs, unless they are timing-related.

It is worth noting that gpuowl should abort if the residue reaches 00...0002, or all zeros, partway through the calculation. One of the most interesting results I have had lately hit the 00...02 residue at one point. What made it interesting is that this was on a FirePro W8100, which should be more resilient than a typical card thanks to ECC memory and better binning.
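The reasoning behind this suggestion can be seen on a toy Lucas-Lehmer test. With s_0 = 4 and s_{k+1} = s_k^2 - 2 mod M_p, M_p (p an odd prime) is prime exactly when s_{p-2} = 0. If the residue becomes 0 early, the next values are M_p - 2 and then 2, and 2^2 - 2 = 2 keeps it at 2 for every remaining iteration, so the run can no longer end in 0. The C++ sketch below uses a 64-bit residue (odd p up to 31) rather than gpuOwL's multi-million-bit one, and `residueLooksStuck` is a hypothetical guard, not an existing gpuOwL option.

```cpp
#include <cassert>
#include <cstdint>

// Toy Lucas-Lehmer arithmetic on a 64-bit residue (odd p, 3 <= p <= 31,
// so s * s + m stays below 2^63).
std::uint64_t llStep(std::uint64_t s, std::uint64_t m) { return (s * s + m - 2) % m; }

// Returns s_{p-2}; for an odd prime p, M_p = 2^p - 1 is prime iff this is 0.
std::uint64_t llResidue(unsigned p) {
    assert(p >= 3 && p <= 31 && p % 2 == 1);
    const std::uint64_t m = (std::uint64_t{1} << p) - 1;
    std::uint64_t s = 4;
    for (unsigned k = 0; k + 2 < p; ++k) s = llStep(s, m);  // p - 2 iterations
    return s;
}

// Hypothetical guard: after `done` of `total` iterations, a residue of 0 or 2
// before the end means the run can no longer finish at 0.
bool residueLooksStuck(std::uint64_t s, unsigned done, unsigned total) {
    return done < total && (s == 0 || s == 2);
}
```

For M31 (m = 2^31 - 1), `llStep(0, m)` gives m - 2, `llStep(m - 2, m)` gives 2, and `llStep(2, m)` stays at 2.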

2017-05-22, 14:56 #142
kracker
"Mr. Meeseeks"
Jan 2012
California, USA

2³×271 Posts

Quote:
Originally Posted by preda View Post
Could you please tell me, which c++ compiler, and which exponent? also, which platform?(mingw?). I'd like to reproduce this. (probably the exponent is all I need to repro...)
Using GCC 6.3.0 under MSYS2, on M76100027.
I'll tinker with it some more later, though.

2017-05-22, 15:42 #143
airsquirrels
"David"
Jul 2015
Ohio

11·47 Posts

181 successful LL tests with gpuOwl.

7 bad and 4 good from one particular Fury X, vs. a perfect record on clLucas. I've dialed the core clock back to 975 MHz and will continue to monitor it.

1 bad on a Fury X in a system with a bad power supply.

1 bad, possibly from a power outage, on one core of a 295X2 (the other core matched).

2 bad (out of 4 completed) on my FirePro W9100/W8100 system. I'm monitoring this; it may be a driver problem or some other issue. This system is rock solid on clLucas, but I updated the drivers around the same time I switched it to gpuOwl.

I also discovered that another of my Titan Blacks seems to have started failing, which was clouding my list of bad results. I feel bad for blaming gpuOwl for those red marks.

Overall, v0.1 seems very reliable apart from the cases listed above. Even including them, I see 181 good / 11 bad, a 94.27% success rate. The new version 0.2 is definitely running ~12.5% faster for me on most cards. I'll report back in a week or so if I see any errors from it.

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.