mersenneforum.org  

mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Old 2019-09-08, 18:05   #1354
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5419₁₀ Posts
relative performance of Gpuowl P-1 and CUDAPm1

On the same GTX1080Ti,

CUDAPm1 V0.20, exponent 300M: 1.2 days to complete both stages. (Includes stage 2 GCD time. Empirical fit is time proportional to exponent to the 1.986 power.)

Gpuowl V6.7-4, exponent 298M: 1.068 days to complete both stages. (Not charged for stage 2 GCD time, because that occurs in parallel with another exponent's stage 1 run if more work is queued and running from worktodo.txt. Empirical fit is time proportional to exponent to the 1.83 power.)
(Fits are likely to bend upward to ~2.1 power at higher exponent as more data are acquired at higher exponents. That's how it went with LL and PRP applications.)
Extrapolating the gpuowl result to 300M (scaling run time as the square of the exponent) would give ~1.082 days, 9.8% faster than CUDAPm1 on the same hardware and exponent.
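The arithmetic behind that extrapolation is just a power-law scaling of the measured run time; a minimal sketch (the 1.068-day and 1.2-day timings are the measurements above; the exponent 2.0 is the assumed scaling power):

```python
# Scale a measured P-1 run time from one exponent to another, assuming
# run time is proportional to exponent**power (~2 per the empirical fits).
def scale_time(days, p_from, p_to, power=2.0):
    return days * (p_to / p_from) ** power

# gpuowl's 298M timing extrapolated to 300M, compared to CUDAPm1's 1.2 days:
gpuowl_at_300m = scale_time(1.068, 298_000_000, 300_000_000)
savings = (1.2 - gpuowl_at_300m) / 1.2
print(round(gpuowl_at_300m, 3), round(100 * savings, 1))  # 1.082 9.8
```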

Gpuowl may be able to run higher P-1 2-stage exponents than CUDAPm1 on the same hardware. That is under test now.

Gpuowl will not run on a test gpu Quadro 2000 (compute capability 2.1, OpenCL 1.1/1.2), producing a shower of OpenCL compile errors relating to atomics. I think it requires at least OpenCL 1.2 and therefore a CUDA compute capability above 2.x. An explicit test for OpenCL version by gpuowl might be a good thing. ("Gpuowl requires OpenCL 1.2 support for atomics, which this gpu does not support. Exiting now." or some such helpful message.)
Old 2019-09-08, 20:15   #1355
kriesel
 

Gpuowl v6.7 on a 4GB GTX1050Ti with -maxAlloc 3072 bailed on stage 2 of a 223M exponent P-1 run, indicating not enough memory. https://www.mersenne.org/M223000051. CUDAPm1 can run up to 384M on that same gpu.
Old 2019-09-09, 12:47   #1356
preda
 
"Mihai Preda"
Apr 2015

55B₁₆ Posts

Quote:
Originally Posted by kriesel View Post
What is a block?
What is a round?
What determines how many rounds are required for stage 2?
What Brent Suyama extension exponent is used? Does it vary?

What determines the maximum exponent that gpuowl can complete in P-1 stage 1 or stage 2, other than run time or available fft lengths? (No doubt available gpu ram is a constraint, but without more info, that does not enable computing or predicting max exponent per gpu model based on gpu specifications. "Just try it" is an unsatisfying answer when run times may be weeks or longer, depending on gpu model and exponent)
I'll try to answer some of those questions, but I realize this is a bit obscure.

In stage1 we compute Base = 3^powerSmooth(B1).
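(For readers following along: powerSmooth(B1) is, as I understand it, the product of all maximal prime powers not exceeding B1 — equivalently lcm(1..B1) — so stage 1 catches any factor f for which every prime-power factor of f-1 is at most B1. An illustrative sketch, not GpuOwl's actual code:)

```python
def primes_up_to(n):
    # simple Eratosthenes sieve
    sieve = bytearray([1]) * (n + 1)
    sieve[0:2] = b"\x00\x00"
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = bytearray(len(sieve[i * i :: i]))
    return [i for i in range(2, n + 1) if sieve[i]]

def power_smooth(b1):
    # product over primes p <= b1 of the largest p**k <= b1
    result = 1
    for p in primes_up_to(b1):
        pk = p
        while pk * p <= b1:
            pk *= p
        result *= pk
    return result

print(power_smooth(20))  # 232792560 == lcm(1, 2, ..., 20)
```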

the "Brent-Suyama exponent" is E=2, meaning:

In stage2 we multiply with
Base^(a^2) - Base^(b^2) == Base^(b^2) * (Base^(a^2 - b^2) - 1), and
Base^(a^2 - b^2) == Base^((a - b)*(a + b))

We try to cover all the primes in the range [B1, B2] with numbers of the form a-b or a+b as above, where "a" is a multiple of D: a==k*D.

In GpuOwl, D is a primorial D = 2*3*5*7*11*13 == 30030.
When D has such a form, it turns out that all the primes can be covered with values (a+b) or (a-b) where b < D/2 and b is relatively prime to D (i.e. to all the prime factors of D), so the number of possible values of b needed to cover any prime is:
J = 1*2*4*6*10*12 / 2 == 2880

The abstract idea is to precompute all the 2880 values of Base^(b^2), and next iterate k with a=k*D over regions of size D. The range of "k" is given by the need to cover the range [B1,B2] with intervals of size D. Such intervals of size D are called "Blocks", and thus the number of Blocks is roughly equal to (B2 - B1)/30030.
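The covering claim can be checked numerically; a small sketch (my own verification, not GpuOwl code). Every prime p is within D/2 of some multiple a = k*D, and the offset b = |p - a| is automatically coprime to D because p is:

```python
from math import gcd

D = 2 * 3 * 5 * 7 * 11 * 13  # 30030, the primorial used by GpuOwl

# J: count of admissible offsets b < D/2 with gcd(b, D) == 1
J = sum(1 for b in range(1, D // 2) if gcd(b, D) == 1)
print(J)  # 2880 == phi(30030) / 2 == 1*2*4*6*10*12 / 2

def is_prime(n):
    if n < 2:
        return False
    return all(n % q for q in range(2, int(n ** 0.5) + 1))

# Coverage check: each prime p in a sample range equals k*D + b or k*D - b
# with b < D/2 and gcd(b, D) == 1.
for p in range(100_000, 160_000):
    if is_prime(p):
        b = p % D
        b = min(b, D - b)  # distance to the nearest multiple of D
        assert b < D / 2 and gcd(b, D) == 1
```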

Now, to precompute the 2880 values we'd need 2880 "buffers" (in the PRP sense), which are about 40MB in size each. 2880 * 40MB is a bit too much for the GPU memory, so we divide 2880 into a number of *rounds*.

We compute a number of buffers N that fit in GPU RAM, and N divides 2880. Now Rounds == 2880/N.

So:
- Blocks is (roughly) given by (B2 - B1)/D, (D==30030)
- Rounds is given by: 2880 / (nb. of buffers that fit in RAM)
- E=2 (Brent-Suyama exponent)

The above design works well with large amounts of RAM, where the RAM can fit 200-400 buffers (8GB-16GB with 40MB buffers) or more, such that the number of rounds stays small. If there's too little RAM, let's say enough to fit 4 buffers, then the number of rounds 2880/4 would be too large and a lot of time would be spent just switching from one round to the next. GpuOwl refuses to run stage2 if at least 24 buffers can't be fit (120 rounds).
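The resulting memory/rounds trade-off can be sketched as follows (illustrative numbers only: 40 MB per buffer as in the example above, buffer count restricted to divisors of 2880 as in this version's design, and the 24-buffer / 120-round floor; the exact selection logic in GpuOwl may differ):

```python
TOTAL = 2880        # admissible b values for D = 30030
MIN_BUFFERS = 24    # below this, stage 2 is refused (would be 120+ rounds)

def plan_stage2(ram_mb, buffer_mb=40):
    # largest divisor of TOTAL whose buffers fit in the available RAM
    fit = ram_mb // buffer_mb
    n = max((d for d in range(1, TOTAL + 1)
             if TOTAL % d == 0 and d <= fit), default=0)
    if n < MIN_BUFFERS:
        raise ValueError("too little RAM: stage 2 would need too many rounds")
    return n, TOTAL // n  # (buffers, rounds)

print(plan_stage2(8000))  # (192, 15): 8 GB fits 192 buffers -> 15 rounds
print(plan_stage2(1000))  # (24, 120): right at the refusal threshold
```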

Last fiddled with by preda on 2019-09-09 at 12:48
Old 2019-09-09, 12:57   #1357
preda
 

Quote:
Originally Posted by kriesel View Post
Gpuowl v 6.7 on 4GB GTX1050Ti with -maxAlloc 3072 bailed on stage 2 of a 223M exponent P-1 run, indicating not enough memory. https://www.mersenne.org/M223000051. CUDAPm1 can run up to 384M on that same gpu.
I think this is because of GpuOwl's protection to have at least 24 buffers in stage2; otherwise the number of rounds would be large and wasteful. You could increase the -maxAlloc to e.g. 3900 or 4000, and test with a very low B1=1000 (to not waste time before stage2 starts), maybe it would start stage2. You can estimate the buffer size from the FFT size used (FFT-size * 8).
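That rule of thumb is easy to put into numbers (a sketch; note this estimates only the per-buffer size, not gpuowl's other allocations). For example, an 18874368-point (18M) FFT gives 18874368 × 8 bytes = exactly 144.0 MB per buffer:

```python
# Estimate stage-2 buffer size and how many buffers fit under -maxAlloc,
# using the rule of thumb: buffer bytes ~= FFT length * 8.
def buffer_mb(fft_len):
    return fft_len * 8 / 2**20

def buffers_that_fit(max_alloc_mb, fft_len):
    return int(max_alloc_mb // buffer_mb(fft_len))

print(buffer_mb(18_874_368))               # 144.0
print(buffers_that_fit(3072, 18_874_368))  # 21
```

With an 18M FFT, a -maxAlloc of 3072 fits only 21 such buffers, below the 24-buffer floor; raising -maxAlloc toward the card's full RAM is what gets stage 2 over the threshold.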
Old 2019-09-09, 14:27   #1358
kriesel
 

Quote:
Originally Posted by preda View Post
I'll try to answer some of those questions, but I realize this is a bit obscure.
...
The above design works well with large amounts of RAM, where the RAM can fit 200-400 buffers (8GB-16GB with 40MB buffers) or more, such that the number of rounds stays small. If there's too little RAM, let's say enough to fit 4 buffers, then the number of rounds 2880/4 would be too large and a lot of time would be spent just switching from one round to the next. GpuOwl refuses to run stage2 if at least 24 buffers can't be fit (120 rounds).
Thanks for the detailed explanation.
So, # of relative primes/round ~ 2880/rounds. I've seen at least the following numbers of rounds in stage 2 in initial testing of gpuowl 6.6 & 6.7 P-1: 9, 13, 18, 40, 41, 45, 72, 90, 92, 144, 261. Of these, 13, 41, 92, and 261 don't evenly divide 2880.

On an 11GB GTX1080Ti in gpuowl V6.7-4-278407a I got the following:
Code:
...
2019-09-06 16:50:52 298000033     3400000 98.20%; 12685 us/sq; ETA 0d 00:13; ce82eeba45fe1874
2019-09-06 16:52:59 298000033     3410000 98.49%; 12680 us/sq; ETA 0d 00:11; 15dcd7efd36a1a41
2019-09-06 16:55:06 298000033     3420000 98.78%; 12678 us/sq; ETA 0d 00:09; 62e9338608a7eaba
2019-09-06 16:57:12 298000033     3430000 99.06%; 12667 us/sq; ETA 0d 00:07; da9050a0970d90ae
2019-09-06 16:59:19 298000033     3440000 99.35%; 12673 us/sq; ETA 0d 00:05; fefb68a0f511ba43
2019-09-06 17:01:26 298000033     3450000 99.64%; 12683 us/sq; ETA 0d 00:03; 9361be5349669b91
2019-09-06 17:03:33 298000033     3460000 99.93%; 12693 us/sq; ETA 0d 00:01; 57dd39e62c29b825
2019-09-06 17:04:04 P-1 stage2 using 11 buffers of 144.0 MB each
2019-09-06 17:04:05 P-1 (B1=2400000, B2=55200000, D=30030): primes 3117147, expanded 3208391, doubles 496457 (left 2156356), singles 2124233, total 2620690 (84%)
2019-09-06 17:04:05 298000033 P-1 stage2: 1759 blocks starting at block 80 (2620690 selected)
2019-09-06 17:07:09 Round 1 of 261: init 1.19 s; 18.34 ms/mul; 9981 muls
2019-09-06 17:10:14 Round 2 of 261: init 1.33 s; 18.34 ms/mul; 10017 muls
2019-09-06 17:13:21 Round 3 of 261: init 1.56 s; 18.28 ms/mul; 10111 muls
2019-09-06 17:16:28 Round 4 of 261: init 1.64 s; 18.23 ms/mul; 10171 muls
 ...
2019-09-07 06:27:14 Round 260 of 261: init 1.67 s; 18.59 ms/mul; 10046 muls
2019-09-07 06:30:22 Round 261 of 261: init 1.93 s; 18.59 ms/mul; 10008 muls
2019-09-07 06:30:24 406000003 FFT 24576K: Width 256x4, Height 256x4, Middle 12; 16.13 bits/word
2019-09-07 06:30:24 using short carry kernels
2019-09-07 06:30:24 OpenCL args "-DEXP=406000003u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=12u -DWEIGHT_STEP=0xe.974d6f9929278p-3 -DIWEIGHT_STEP=0x8.c5c3982c20ae8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-07 06:30:28 
2019-09-07 06:30:28 OpenCL compilation in 3666 ms
2019-09-07 06:30:34 406000003 P-1 starting stage1
2019-09-07 06:33:24 406000003       10000  0.21%; 16863 us/sq; ETA 0d 21:47; 4b5360aace601e72
2019-09-07 06:36:13 406000003       20000  0.43%; 16894 us/sq; ETA 0d 21:47; 6d8775dbd7676169
2019-09-07 06:39:02 406000003       30000  0.64%; 16902 us/sq; ETA 0d 21:45; 0339219ee80c94e3
2019-09-07 06:40:39 298000033 P-1 final GCD: no factor
2019-09-07 06:40:39 {"exponent":"298000033", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.7-4-g278407a"}, "timestamp":"2019-09-07 11:40:39 UTC", "user":"kriesel", "computer":"dodo-gtx1080ti", "aid":"0", "fft-length":18874368, "B1":2400000, "B2":55200000}
2019-09-07 06:41:51 406000003       40000  0.86%; 16904 us/sq; ETA 0d 21:42; e23d26ca2c9c1976
At what version was the 120 round limit instituted?
This example above is a considerable exception to the 120 rounds limit described. It's also seemingly not using much of the gpu ram or the 10240MB maxAlloc. 11 x 144 = 1584MB.
11 buffers x 261 rounds = 2871 not 2880. I'm familiar from running lots of CUDAPm1 with the last round being a smaller "runt" round at times, when # of buffers does not exactly divide the total count. How is that handled in gpuowl?
On a gpuowl V6.7-4-278407a RX550 (4GB; -maxAlloc 3072 I think) run on 100002769 it used 144 rounds.
I have no issue with higher number of rounds, having run CUDAPm1 to 480 or 960 (NRP down to 1); just thought you would want to know.
(Such runs in CUDAPm1 are not normally recommended, because its selection of bounds usually is too low when memory is that limiting. Bounds too low don't retire the primenet task.)

I have gpus with as little as 1GB, but they are not an issue, since they're OpenCL 1.1/1.2 and won't run gpuowl at all due to errors on atomic_uint during OpenCL compile at launch. But I also have 1.5, 2, 2.5, 3, 4, & 5.25 GB gpus. One of the 2GB gpus is an RX550 which can't run CUDAPm1, while many of the other low-ram gpus can.

Last fiddled with by kriesel on 2019-09-09 at 14:55
Old 2019-09-10, 11:36   #1359
preda
 

Quote:
Originally Posted by kriesel View Post
Thanks for the detailed explanation.
So, # of relative primes/round ~ 2880/rounds. I've seen at least the following numbers of rounds in stage 2 in initial testing of gpuowl 6.6 & 6.7 P-1: 9 13 18 40 41 45 72 90 92 144 261. 13 41 92 261 don't evenly divide 2880.
That was a bug I introduced recently. Fixed now. Thanks for reporting it!
The number of buffers, and the number of rounds, should both divide 2880.

BTW, there is a set of known-factors P-1 tasks in the test-pm1/ folder. All the factors there should be found [with a couple of exceptions*]. If even one is not detected, that's a bug to be addressed. The tasks there are designed to be very small (fast).

In the future I would like to integrate this self-test in GpuOwl to make it easier to run.

*the exceptions are of the form: the factor f has f-1 B1-smooth, meaning all prime factors of f-1 are <= B1, but f-1 does not divide powerSmooth(B1), i.e. there is a prime factor of f-1 with a multiplicity that pushes the prime power above B1. I should clear these from the list.

Last fiddled with by preda on 2019-09-10 at 11:39
Old 2019-09-11, 21:30   #1360
kriesel
 
v6.6 to 6.9

I've managed to run P-1 (both stages) to ~500M in v6.6 on an RX480 and v6.7-4 on a GTX1080Ti. Well, almost: the RX480 run has completed stage 2 of a 500M exponent, while the GTX1080Ti run is in stage 2 at round 264 of 288 on a 502M exponent.
Gpuowl V6.8-2-g0f3059b runs on a 4GB RX550, but fails on an OpenCL 1.2-capable Quadro K4000 with NVIDIA driver 388.13 on Win 10 x64, as follows:
Code:
2019-09-11 13:09:03 config: -device 0 -user kriesel -cpu roa/quadro-k4000 -use ORIG_X2 -maxAlloc 3000 -pm1 24000577 -B1 220000 -B2 3960000 
2019-09-11 13:09:03 24000577 FFT 1280K: Width 8x8, Height 256x4, Middle 10; 18.31 bits/word
2019-09-11 13:09:03 using short carry kernels
2019-09-11 13:09:04 OpenCL args "-DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660afp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-11 13:09:04 OpenCL compilation error -11 (args -DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660afp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1  -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-09-11 13:09:04 <kernel>:1375:44: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
                                           ^
<kernel>:1384:44: error: use of undeclared identifier 'memory_scope_device'
  work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
                                           ^
<kernel>:1428:5: warning: implicit declaration of function 'atomic_store_explicit' is invalid in C99
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device); 
    ^
<kernel>:1428:28: error: use of undeclared identifier 'atomic_uint'
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device); 
                           ^
<kernel>:1428:41: error: expected expression
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device); 
                                        ^
<kernel>:1437:12: warning: implicit declaration of function 'atomic_load_explicit' is invalid in C99
    while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
           ^
<kernel>:1437:34: error: use of undeclared identifier 'atomic_uint'
    while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
                                 ^
<kernel>:1437:47: error: expected expression
    while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
                                              ^
<kernel>:1496:28: error: use of undeclared identifier 'atomic_uint'
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device); 
                           ^
<kernel>:1496:41: error: expected expression
    atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device); 
                                        ^
<kernel>:1504:34: error: use of undeclared identifier2019-09-11 13:09:04 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:216 build
2019-09-11 13:09:04 Bye
(results with gpuowl-6.7-4-278407a are similar)

On an NVIDIA GTX1060 3GB, with -maxAlloc 3000, gpuowl v6.9-0-c137007 loads and runs, but even for a 24M exponent, claims there is not enough memory for stage 2.
This makes gpuowl P-1 inappropriate for use on the GTX1060 3GB or any smaller-memory gpu for current or future wavefront exponents.
Code:
2019-09-11 14:58:19 24000577      290000 91.32%; 3748 us/sq; ETA 0d 00:02; 740645b16c6380fe
2019-09-11 14:58:57 24000577      300000 94.47%; 3746 us/sq; ETA 0d 00:01; 50ed1a7837d59607
2019-09-11 14:59:34 24000577      310000 97.62%; 3745 us/sq; ETA 0d 00:00; 21a38a3fd1fa6582
2019-09-11 15:00:02 Not enough GPU memory, will skip stage2. Please wait for stage1 GCD
2019-09-11 15:00:17 24000577 P-1 stage1 GCD: no factor
It therefore misses the known factor for this exponent.
Worktodo entry was
Quote:
B1=220000,B2=3960000;PFactor=0,1,2,24000577,-1,77,2
It requires B1 higher than GPU72 or PrimeNet target B1 to find the factor. https://www.mersenne.ca/exponent/24000577
It will also miss the known factor for 50001781 for the same reason. https://www.mersenne.ca/exponent/50001781
This identical gpu can complete both stages in CUDAPm1 v0.20 for exponents up to approximately 432.5M.

It would be useful, especially when a 2-stage (B2 specified) P-1 command line or worktodo entry is present, for the program to generate a very early warning that it will not run stage 2.

V6.9-0 seems to have a bug relating to allowing stage 2, since after stage 1 completes, it also refuses to run stage 2 on an 8GB gtx1080 with -maxAlloc 8000 for even the mere 24M test exponent above.

Last fiddled with by kriesel on 2019-09-11 at 21:35
Old 2019-09-11, 21:58   #1361
preda
 

Quote:
Originally Posted by kriesel View Post
V6.9-0 seems to have a bug relating to allowing stage 2, since after stage 1 completes, it also refuses to run stage 2 on an 8GB gtx1080 with -maxAlloc 8000 for even the mere 24M test exponent above.
Yes, pushed a fix. Should run now on lower memory, but I think there is another problem still, need to investigate/validate a bit more.
Old 2019-09-13, 22:37   #1362
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

55B16 Posts
Default

Quote:
Originally Posted by preda View Post
Yes, pushed a fix. Should run now on lower memory, but I think there is another problem still, need to investigate/validate a bit more.
More changes to P-1 stage2:
- now the number of buffers does not have to divide 2880 anymore,
- small fixes to logging format

Looking for bug reports. P-1 savefile not implemented yet.
Old 2019-09-13, 22:41   #1363
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

5×11×137 Posts

Bug report, but not P-1: Gerbicz error count not reported in JSON result.

It used to, see this commit:

https://github.com/preda/gpuowl/comm...37df13563a0f0f
Old 2019-09-13, 23:11   #1364
Prime95

I'd like a volunteer with a non-Radeon VII to test a gpuowl version for me. A couple months ago I tried merging the transpose and fftMiddle steps into one kernel. It worked but was slower on my Radeon VII, so I shelved the idea. I'm now wondering if it would be faster on a GPU without HBM memory. If it is faster I'll go back and finish the implementation with a #define to easily turn it on and off.

I think the code works. I'd like timings on some random 5M FFT exponent -- both the current version and this test version. Also, runs with the -time command line argument would be nice.

Zip file is suitable for Linux gpuowl build.
Attached Files
File Type: zip gpuowl5.zip (93.3 KB, 82 views)

Last fiddled with by Prime95 on 2019-09-13 at 23:27