mersenneforum.org  

2018-07-10, 14:33   #463
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts

Quote:
Originally Posted by SELROC View Post
I am computing 85M-86M so 4M would fail (?), while 16M should slow down.

Looking forward to 5M (before automatic work fetch).

Or, while you wait, run v2.0, which has only the 5000K FFT length. It should work for exponents up to around 93M. It runs at around 5.25 ms/iter on an RX 480.

Code:
gpuOwL v2.0- GPU Mersenne primality checker
Ellesmere-36x1266-@28:0.0 Radeon (TM) RX 480 Graphics
Note: using long carry and fused tail kernels
OpenCL compilation in 1544 ms, with " -DEXP=83871259u  -I. -cl-fast-relaxed-math -cl-kernel-arg-info "
PRP-3: FFT 5000K (625 * 4096 * 2) of 83871259 (16.38 bits/word) [2018-07-05 16:49:02 Central Daylight Time]
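
For reference, the "16.38 bits/word" figure in the log line above can be reproduced by dividing the exponent by the number of FFT words. This is a minimal Python sketch of that arithmetic only; the function name is illustrative and not part of gpuOwl.

```python
def bits_per_word(exponent: int, fft_words: int) -> float:
    """Average number of bits of the exponent carried by each FFT word."""
    return exponent / fft_words


# Values taken from the log line: FFT 5000K (625 * 4096 * 2) of 83871259
fft_words = 625 * 4096 * 2  # 5,120,000 words, i.e. 5000K

print(f"{bits_per_word(83871259, fft_words):.2f} bits/word")  # prints "16.38 bits/word"
```

By the same arithmetic, an exponent around 93M would sit at roughly 18.2 bits/word on this FFT length.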
2018-07-10, 15:08   #464
SELROC
2·4,391 Posts

Quote:
Originally Posted by kriesel View Post
Or, while you wait, run v2.0, which has only the 5000K FFT length. It should work for exponents up to around 93M. It runs at around 5.25 ms/iter on an RX 480.

I was tempted to try this, but I think the old checkpoint format is incompatible with the new one...
2018-07-10, 19:58   #465
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,437 Posts

Quote:
Originally Posted by SELROC View Post
I was tempted to try this, but I think the old checkpoint format is incompatible with the new one...
Just thinking about some possibilities:
a) save a backup copy of the checkpoint file, then try the current exponent on v2.0
b) run the current exponent to completion on 8M in 3.x
c) switch to a different exponent that can run on v2.0 from the start at the 5000K FFT length
d) hold the current exponent until Preda provides a 5M length in 3.x

Last fiddled with by kriesel on 2018-07-10 at 19:59
2018-07-10, 20:12   #466
SELROC
3×389 Posts

Quote:
Originally Posted by kriesel View Post
Just thinking about some possibilities:
a) save a backup copy of the checkpoint file, then try the current exponent on v2.0
b) run the current exponent to completion on 8M in 3.x
c) switch to a different exponent that can run on v2.0 from the start at the 5000K FFT length
d) hold the current exponent until Preda provides a 5M length in 3.x
Currently doing option b :-)
2018-07-11, 07:55   #467
SELROC
2×3×967 Posts

Quote:
Originally Posted by SELROC View Post
Currently doing option b :-)

gpuOwl memory consumption (smemstat -m):

 PID   Swap     USS     PSS      RSS  D  User  Command
1230  0.000  57.246  69.139  116.152  ↑  sel   ./openowl
1301  0.000  55.887  67.527  113.723  ↑  sel   ./openowl
1278  0.000  52.070  64.017  111.398  ↑  sel   ./openowl
1253  0.000  51.504  63.448  110.934  ↑  sel   ./openowl
1326  0.000  43.871  55.805  103.395  ↑  sel   ./openowl
4078  0.000   0.246   0.445    2.848  ↑  root  smemstat -m

Note: Memory reported in units of megabytes.
2018-07-13, 09:37   #468
preda
"Mihai Preda"
Apr 2015
3·457 Posts
openOwl news v3.3

Hi, I'm happy to bring some exciting news: I have upgraded openOwl's FFT framework to make it easier to incorporate some NPOT (non-power-of-two) sizes, and added a new factor-5 "middle step". Thus, openOwl should now support these FFT sizes: 4M, 5M, 8M, 10M, 16M, 20M.

Two of these sizes are particularly useful: 5M for wavefront PRP (80M to 96M), and 20M for "100 million digits" PRP.

The speeds (on my Vega64, stock, air, 1400MHz, 150W) are roughly 2.5ms/it for 5M FFT, and 9.77ms/it for 20M FFT.

The FFT size is chosen automatically by default, based on the exponent, but it can also be "forced" on the command line with -fft, e.g.:
"-fft 8M"
"-fft +1" or "-fft -1" (use the next higher/lower size, relative to the auto-selected size).
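
To illustrate how the three forms of -fft relate to each other, here is a small Python sketch. It is not openOwl's actual selection code: the bits-per-word ceiling, the reading of "M" as 2**20 words, and the function names are all assumptions made for the example.

```python
from typing import Optional

M = 1 << 20  # assumption: "M" means 2**20 FFT words

# FFT sizes listed in the post, in ascending order
FFT_SIZES = [4 * M, 5 * M, 8 * M, 10 * M, 16 * M, 20 * M]

# Assumed ceiling on bits per FFT word; illustrative only, not openOwl's real limit
MAX_BITS_PER_WORD = 18.3


def auto_fft(exponent: int) -> int:
    """Return the smallest listed FFT size that stays within the assumed ceiling."""
    for size in FFT_SIZES:
        if exponent / size <= MAX_BITS_PER_WORD:
            return size
    raise ValueError("exponent too large for the listed FFT sizes")


def pick_fft(exponent: int, spec: Optional[str] = None) -> int:
    """Resolve a -fft style value: None (auto), "+1"/"-1" (relative) or "8M" (absolute)."""
    if spec is None:
        return auto_fft(exponent)
    if spec[0] in "+-":
        # Step up or down from the auto-selected size
        index = FFT_SIZES.index(auto_fft(exponent)) + int(spec)
        if not 0 <= index < len(FFT_SIZES):
            raise ValueError("no FFT size in that direction")
        return FFT_SIZES[index]
    size = int(spec.rstrip("M")) * M
    if size not in FFT_SIZES:
        raise ValueError(f"unsupported FFT size: {spec}")
    return size


print(pick_fft(85_000_000) // M)        # 5
print(pick_fft(85_000_000, "+1") // M)  # 8
print(pick_fft(85_000_000, "8M") // M)  # 8
```

With the assumed ceiling, an 85M exponent lands on 5M, "+1" moves it to 8M and "-1" to 4M.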

Another piece of "news" is that openOwl uses a "rolling offset", which means that it dynamically changes the offset when an error is encountered. This trick allows an exponent to continue at the very upper edge of a given FFT size, where numerical errors are present. In my observations the benefit is small, allowing an exponent-size extension of less than 0.5%.

The "-block" command line argument sets the "block size" of the GEC ("Gerbicz Error Checking"). The values accepted now are 100, 200, 400.
An error check is done every block^2 iterations (thus, every 10K iterations for -block 100, and every 160K iterations for -block 400). So a smaller block detects errors earlier because it checks more often. The drawback is the cost: block 100 has an overhead of roughly 3%, while block 400 has an overhead of roughly 0.75%. The default is block 200, with an overhead of 1.5% and a check every 40K iterations. The block size can only be set when starting a new exponent; it is fixed afterwards for that exponent.
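
The numbers above can be tabulated with a few lines of Python. The check interval is simply block squared, as stated; the overhead function is only a fit to the three quoted figures (3%, 1.5%, 0.75%), which happen to follow 300/block, and should not be read as openOwl's own formula. Both function names are made up for this sketch.

```python
def check_interval(block: int) -> int:
    """Iterations between two error checks: block squared."""
    return block * block


def approx_overhead_percent(block: int) -> float:
    """Rough overhead, fitted to the figures quoted in the post (an assumption)."""
    return 300.0 / block


for block in (100, 200, 400):
    print(f"-block {block}: check every {check_interval(block)} iterations, "
          f"roughly {approx_overhead_percent(block)}% overhead")
```

Doubling the block size halves the approximate overhead but quadruples the number of iterations that may need to be redone when a check fails.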

Bugs are expected.

Last fiddled with by preda on 2018-07-13 at 09:38
2018-07-13, 11:05   #469
axn
Jun 2003
5,087 Posts

Quote:
Originally Posted by preda View Post
I have upgraded openOwl's FFT framework to make incorporating some NPOT sizes easier; and added a new factor-5 "middle step". Thus, openOwl should now support these FFT sizes: 4M, 5M, 8M, 10M, 16M, 20M.
Is it difficult to do a factor 3 FFT? That would give you 3M & 6M (although 3M wouldn't be useful for PRP tests as there are no candidates available for it).
2018-07-13, 11:24   #470
henryzz
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·3³·109 Posts

Quote:
Originally Posted by preda View Post
The speeds (on my Vega64, stock, air, 1400MHz, 150W) are roughly 2.5ms/it for 5M FFT, and 9.77ms/it for 20M FFT.

How do those timings compare to the power of 2 ffts?
2018-07-13, 11:58   #471
preda
"Mihai Preda"
Apr 2015
1371₁₀ Posts

Quote:
Originally Posted by axn View Post
Is it difficult to do a factor 3 FFT? That would give you 3M & 6M (although 3M wouldn't be useful for PRP tests as there are no candidates available for it).

It *should* be easy to replace the "5" with either 3 or 3*3. But a 6M FFT is not really "hot" yet. (It may become useful later, when the wavefront moves past the 5M FFT, but that may take years.)

OTOH 9 would allow an 18M FFT, which might be fastest for 100M-digits.
2018-07-13, 12:02   #472
SELROC
7·557 Posts

Quote:
Originally Posted by preda View Post
Hi, I'm happy to bring some exciting news: I have upgraded openOwl's FFT framework to make incorporating some NPOT sizes easier; and added a new factor-5 "middle step". Thus, openOwl should now support these FFT sizes: 4M, 5M, 8M, 10M, 16M, 20M.

Currently testing the latest version.
It selected the 5M FFT for my current exponents (around 85M). The timing is 4-5 ms/it.
Waiting for completion.
2018-07-13, 12:03   #473
preda
"Mihai Preda"
Apr 2015
3×457 Posts

Quote:
Originally Posted by henryzz View Post
How do those timings compare to the power of 2 ffts?

Roughly:

16M FFT: 7.8 ms/it
20M FFT: 9.8 ms/it

So the time scales mostly linearly with the FFT size, which is about the best I could hope for. In fact, under 10 ms/it for a 100M-digit PRP is not a bad baseline.
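
As a quick check of the "mostly linear" observation, the following Python sketch scales a measured time by the ratio of FFT sizes and compares it with the reported figure. The helper name is made up for the example, and pure linear scaling is only an approximation.

```python
def scale_linearly(ms_per_iter: float, from_size_m: int, to_size_m: int) -> float:
    """Predict ms/iteration at another FFT size, assuming time is proportional to size."""
    return ms_per_iter * to_size_m / from_size_m


# Timings quoted in this thread for the Vega64
from_16m = scale_linearly(7.8, 16, 20)  # from the 16M figure
from_5m = scale_linearly(2.5, 5, 20)    # from the 5M figure

print(f"20M predicted from 16M: {from_16m:.2f} ms/it (reported: about 9.8)")
print(f"20M predicted from 5M:  {from_5m:.2f} ms/it (reported: about 9.8)")
```

Both predictions land within a few percent of the reported 20M timing.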

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.