#1354

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5419₁₀ Posts
On the same GTX1080Ti:

CUDAPm1 V0.20, exponent 300M: 1.2 days to complete both stages (includes stage 2 GCD time). Empirical fit: time proportional to exponent to the 1.986 power.

Gpuowl V6.7-4, exponent 298M: 1.068 days to complete both stages. (Not charged for stage 2 GCD time, because that occurs in parallel with another exponent's stage 1 run if more work is queued and running from worktodo.txt.) Empirical fit: time proportional to exponent to the 1.83 power. (Both fits are likely to bend upward to ~2.1 power as more data are acquired at higher exponents; that's how it went with the LL and PRP applications.)

Extrapolating the gpuowl result to 300M at the ~p^2 rate would give ~1.082 days, 9.8% faster than CUDAPm1 on the same hardware and exponent. Gpuowl may also be able to run 2-stage P-1 at higher exponents than CUDAPm1 can on the same hardware; that is under test now.

Gpuowl will not run on a test gpu, a Quadro 2000 (compute capability 2.1, OpenCL 1.1/1.2), producing a shower of CL compile errors relating to atomics. I think it requires the OpenCL 2.0 atomics (the kernels are built with -cl-std=CL2.0), and therefore a CUDA compute capability above 2.x. An explicit test for the OpenCL version by gpuowl might be a good thing. ("Gpuowl requires OpenCL 2.0 atomics support, which this gpu does not provide. Exiting now." or some such helpful message.)
#1355

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
1010100101011₂ Posts
Gpuowl v6.7 on a 4GB GTX1050Ti with -maxAlloc 3072 bailed on stage 2 of a 223M-exponent P-1 run, indicating not enough memory. https://www.mersenne.org/M223000051. CUDAPm1 can run up to 384M on that same gpu.
#1356

"Mihai Preda"
Apr 2015
55B₁₆ Posts
Quote:
In stage 1 we compute Base = 3^powerSmooth(B1). The "Brent-Suyama exponent" is E=2, meaning: in stage 2 we multiply with Base^(a^2) - Base^(b^2) == Base^(b^2) * (Base^(a^2 - b^2) - 1), and Base^(a^2 - b^2) == Base^((a - b)*(a + b)).

We try to cover all the primes in the range [B1, B2] with numbers of the form a-b or a+b as above, where "a" is a multiple of D: a == k*D. In GpuOwl, D is a primorial, D = 2*3*5*7*11*13 == 30030. When D has such a form, it turns out that all the primes can be covered with values (a+b) or (a-b) where b < D/2 and b is relatively prime to all the prime factors of D, i.e. the number of possible values of b needed to cover any prime is: J = 1*2*4*6*10*12 / 2 == 2880.

The abstract idea is to precompute all the 2880 values of Base^(b^2), and next iterate k with a = k*D over regions of size D. The range of "k" is given by the need to cover the range [B1, B2] with intervals of size D. Such intervals of size D are called "Blocks", and thus the number of Blocks is roughly equal to (B2 - B1)/30030.

Now, to precompute the 2880 values we'd need 2880 "buffers" (in the PRP sense), which are about 40MB in size each. 2880 * 40MB is a bit too much for the GPU memory, so we divide 2880 into a number of *rounds*: we compute a number of buffers N that fit in GPU RAM, where N divides 2880, and then Rounds == 2880/N.

So:
- Blocks is (roughly) given by (B2 - B1)/D, (D == 30030)
- Rounds is given by: 2880 / (nb. of buffers that fit in RAM)
- E=2 (Brent-Suyama exponent)

The above design works well with large amounts of RAM, where the RAM can fit 200-400 buffers (8GB-16GB with 40MB buffers) or more, such that the number of rounds stays small. If there's too little RAM, let's say enough to fit only 4 buffers, then the number of rounds, 2880/4, would be too large, and a lot of time would be spent just switching from one round to the next. GpuOwl refuses to run stage 2 if at least 24 buffers can't be fit (120 rounds).

Last fiddled with by preda on 2019-09-09 at 12:48
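The constants above can be cross-checked in a few lines. This sketch uses the B1/B2 bounds from the GTX1080Ti log later in the thread and the 24-buffer minimum stated at the end of the post, purely as illustrative inputs:

```python
# Sketch of the stage-2 bookkeeping described above. D and J are as in the
# post; B1/B2 are the bounds from a log elsewhere in this thread.
from math import gcd

D = 2 * 3 * 5 * 7 * 11 * 13           # primorial, == 30030
J = (1 * 2 * 4 * 6 * 10 * 12) // 2    # values of b needed, == 2880

# Cross-check J by direct count: b in [1, D/2), b relatively prime to D.
assert J == sum(1 for b in range(1, D // 2) if gcd(b, D) == 1)

B1, B2 = 2_400_000, 55_200_000
blocks = (B2 - B1) // D               # intervals of size D covering [B1, B2]

buffers = 24                          # smallest buffer count GpuOwl accepts
rounds = J // buffers                 # == 120
print(D, J, blocks, rounds)
```

For these bounds the floor division gives 1758 blocks; the stage-2 log later in the thread reports 1759, consistent up to rounding of the interval count.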
#1357

"Mihai Preda"
Apr 2015
3×457 Posts
Quote:
#1358

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
5,419 Posts
Quote:
So, # of relative primes per round ~ 2880/rounds. I've seen at least the following numbers of rounds in stage 2 in initial testing of gpuowl 6.6 & 6.7 P-1: 9, 13, 18, 40, 41, 45, 72, 90, 92, 144, 261. Of these, 13, 41, 92, and 261 don't evenly divide 2880.

On an 11GB GTX1080Ti in gpuowl V6.7-4-278407a I got the following: Code:
...
2019-09-06 16:50:52 298000033 3400000 98.20%; 12685 us/sq; ETA 0d 00:13; ce82eeba45fe1874
2019-09-06 16:52:59 298000033 3410000 98.49%; 12680 us/sq; ETA 0d 00:11; 15dcd7efd36a1a41
2019-09-06 16:55:06 298000033 3420000 98.78%; 12678 us/sq; ETA 0d 00:09; 62e9338608a7eaba
2019-09-06 16:57:12 298000033 3430000 99.06%; 12667 us/sq; ETA 0d 00:07; da9050a0970d90ae
2019-09-06 16:59:19 298000033 3440000 99.35%; 12673 us/sq; ETA 0d 00:05; fefb68a0f511ba43
2019-09-06 17:01:26 298000033 3450000 99.64%; 12683 us/sq; ETA 0d 00:03; 9361be5349669b91
2019-09-06 17:03:33 298000033 3460000 99.93%; 12693 us/sq; ETA 0d 00:01; 57dd39e62c29b825
2019-09-06 17:04:04 P-1 stage2 using 11 buffers of 144.0 MB each
2019-09-06 17:04:05 P-1 (B1=2400000, B2=55200000, D=30030): primes 3117147, expanded 3208391, doubles 496457 (left 2156356), singles 2124233, total 2620690 (84%)
2019-09-06 17:04:05 298000033 P-1 stage2: 1759 blocks starting at block 80 (2620690 selected)
2019-09-06 17:07:09 Round 1 of 261: init 1.19 s; 18.34 ms/mul; 9981 muls
2019-09-06 17:10:14 Round 2 of 261: init 1.33 s; 18.34 ms/mul; 10017 muls
2019-09-06 17:13:21 Round 3 of 261: init 1.56 s; 18.28 ms/mul; 10111 muls
2019-09-06 17:16:28 Round 4 of 261: init 1.64 s; 18.23 ms/mul; 10171 muls
...
2019-09-07 06:27:14 Round 260 of 261: init 1.67 s; 18.59 ms/mul; 10046 muls
2019-09-07 06:30:22 Round 261 of 261: init 1.93 s; 18.59 ms/mul; 10008 muls
2019-09-07 06:30:24 406000003 FFT 24576K: Width 256x4, Height 256x4, Middle 12; 16.13 bits/word
2019-09-07 06:30:24 using short carry kernels
2019-09-07 06:30:24 OpenCL args "-DEXP=406000003u -DWIDTH=1024u -DSMALL_HEIGHT=1024u -DMIDDLE=12u -DWEIGHT_STEP=0xe.974d6f9929278p-3 -DIWEIGHT_STEP=0x8.c5c3982c20ae8p-4 -DWEIGHT_BIGSTEP=0x9.837f0518db8a8p-3 -DIWEIGHT_BIGSTEP=0xd.744fccad69d68p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-07 06:30:28
2019-09-07 06:30:28 OpenCL compilation in 3666 ms
2019-09-07 06:30:34 406000003 P-1 starting stage1
2019-09-07 06:33:24 406000003 10000 0.21%; 16863 us/sq; ETA 0d 21:47; 4b5360aace601e72
2019-09-07 06:36:13 406000003 20000 0.43%; 16894 us/sq; ETA 0d 21:47; 6d8775dbd7676169
2019-09-07 06:39:02 406000003 30000 0.64%; 16902 us/sq; ETA 0d 21:45; 0339219ee80c94e3
2019-09-07 06:40:39 298000033 P-1 final GCD: no factor
2019-09-07 06:40:39 {"exponent":"298000033", "worktype":"PM1", "status":"NF", "program":{"name":"gpuowl", "version":"v6.7-4-g278407a"}, "timestamp":"2019-09-07 11:40:39 UTC", "user":"kriesel", "computer":"dodo-gtx1080ti", "aid":"0", "fft-length":18874368, "B1":2400000, "B2":55200000}
2019-09-07 06:41:51 406000003 40000 0.86%; 16904 us/sq; ETA 0d 21:42; e23d26ca2c9c1976
This example above is a considerable exception to the 120-rounds limit described. It's also seemingly not using much of the gpu RAM or the 10240MB maxAlloc: 11 x 144 = 1584MB. And 11 buffers x 261 rounds = 2871, not 2880. I'm familiar from running lots of CUDAPm1 with the last round sometimes being a smaller "runt" round, when the number of buffers does not exactly divide the total count. How is that handled in gpuowl?

On a gpuowl V6.7-4-278407a RX550 (4GB; -maxAlloc 3072 I think) run on 100002769, it used 144 rounds.

I have no issue with higher numbers of rounds, having run CUDAPm1 at 480 or 960 rounds (NRP down to 1); I just thought you would want to know. (Such runs in CUDAPm1 are not normally recommended, because its selection of bounds usually comes out too low when memory is that limiting, and bounds that are too low don't retire the primenet task.)

I have gpus with as little as 1GB, but they are not an issue, since they're OpenCL 1.1/1.2 and won't run gpuowl at all, due to errors on atomic_uint during the OpenCL compile at launch. But I also have 1.5, 2, 2.5, 3, 4, & 5.25 GB gpus. One of the 2GB gpus is an RX550, which can't run CUDAPm1, while many of the other low-RAM gpus can.

Last fiddled with by kriesel on 2019-09-09 at 14:55
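The divisibility observation above is easy to verify; a one-liner check on the round counts quoted in this post:

```python
# Check which of the observed stage-2 round counts divide 2880 evenly.
# The list of counts is taken directly from the post above.
observed = [9, 13, 18, 40, 41, 45, 72, 90, 92, 144, 261]
non_divisors = [r for r in observed if 2880 % r != 0]
print(non_divisors)   # [13, 41, 92, 261]
```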
#1359

"Mihai Preda"
Apr 2015
2533₈ Posts
Quote:
The number of buffers, and the number of rounds, should both divide 2880.

BTW, there is a set of known-factors P-1 tasks in the test-pm1/ folder. All the factors there should be found (with a couple of exceptions*). If even one is not detected, that's a bug to be addressed. The tasks there are designed to be very small (fast). In the future I would like to integrate this self-test into GpuOwl to make it easier to run.

*The exceptions are of the form: the factor f is B1-smooth, meaning all prime factors of f-1 are <= B1, but f-1 does not divide powerSmooth(B1), i.e. there is a prime factor of f-1 with a multiplicity that pushes its prime power above B1. I should clear these from the list.

Last fiddled with by preda on 2019-09-10 at 11:39
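The B1-smooth vs. powerSmooth distinction Preda describes can be sketched concretely. The helper and the example numbers below are illustrative, not GpuOwl's test factors: n divides powerSmooth(B1) = ∏ q^floor(log_q B1) exactly when every prime-power component of n is itself <= B1.

```python
# Sketch: "n | powerSmooth(B1)" holds iff each prime power q^e dividing n
# satisfies q^e <= B1. Merely having every prime factor <= B1 (B1-smooth)
# is weaker, which is the exception case described above.
def prime_powers_ok(n: int, B1: int) -> bool:
    """True if every prime-power component of n is <= B1."""
    q = 2
    while q * q <= n:
        if n % q == 0:
            pk = 1
            while n % q == 0:
                n //= q
                pk *= q
            if pk > B1:       # q^multiplicity exceeds B1
                return False
        q += 1
    return n <= B1            # leftover factor is 1 or a single prime

# Illustrative: 384 = 2^7 * 3 is 100-smooth, but 2^7 = 128 > 100,
# so 384 does not divide powerSmooth(100); 96 = 2^5 * 3 does.
print(prime_powers_ok(384, 100))   # False
print(prime_powers_ok(96, 100))    # True
```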
#1360

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
12453₈ Posts
I've managed to run ~500M P-1 through both stages in v6.6 on an RX480 and v6.7-4 on a GTX1080Ti -- well, almost: the RX480 run has completed stage 2 of a 500M exponent, while the GTX1080Ti run is in stage 2, at round 264 of 288, on a 502M exponent.

Gpuowl V6.8-2-g0f3059b runs on a 4GB RX550, but fails on an OpenCL-1.2-capable Quadro K4000 with NVIDIA driver 388.13 on Win 10 x64, as follows: Code:
2019-09-11 13:09:03 config: -device 0 -user kriesel -cpu roa/quadro-k4000 -use ORIG_X2 -maxAlloc 3000 -pm1 24000577 -B1 220000 -B2 3960000
2019-09-11 13:09:03 24000577 FFT 1280K: Width 8x8, Height 256x4, Middle 10; 18.31 bits/word
2019-09-11 13:09:03 using short carry kernels
2019-09-11 13:09:04 OpenCL args "-DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660afp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0"
2019-09-11 13:09:04 OpenCL compilation error -11 (args -DEXP=24000577u -DWIDTH=64u -DSMALL_HEIGHT=1024u -DMIDDLE=10u -DWEIGHT_STEP=0xc.e5beac96a0b88p-3 -DIWEIGHT_STEP=0x9.eca8ba4660afp-4 -DWEIGHT_BIGSTEP=0xe.ac0c6e7dd2438p-3 -DIWEIGHT_BIGSTEP=0x8.b95c1e3ea8bd8p-4 -DORIG_X2=1 -I. -cl-fast-relaxed-math -cl-std=CL2.0)
2019-09-11 13:09:04 <kernel>:1375:44: error: use of undeclared identifier 'memory_scope_device'
work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
^
<kernel>:1384:44: error: use of undeclared identifier 'memory_scope_device'
work_group_barrier(CLK_GLOBAL_MEM_FENCE, memory_scope_device);
^
<kernel>:1428:5: warning: implicit declaration of function 'atomic_store_explicit' is invalid in C99
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1428:28: error: use of undeclared identifier 'atomic_uint'
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1428:41: error: expected expression
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1437:12: warning: implicit declaration of function 'atomic_load_explicit' is invalid in C99
while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
^
<kernel>:1437:34: error: use of undeclared identifier 'atomic_uint'
while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
^
<kernel>:1437:47: error: expected expression
while(!atomic_load_explicit((atomic_uint *) &ready[gr - 1], memory_order_acquire, memory_scope_device));
^
<kernel>:1496:28: error: use of undeclared identifier 'atomic_uint'
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1496:41: error: expected expression
atomic_store_explicit((atomic_uint *) &ready[gr], 1, memory_order_release, memory_scope_device);
^
<kernel>:1504:34: error: use of undeclared identifier
2019-09-11 13:09:04 Exception gpu_error: BUILD_PROGRAM_FAILURE clBuildProgram at clwrap.cpp:216 build
2019-09-11 13:09:04 Bye
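The failing identifiers in the log above (atomic_store_explicit, memory_scope_device) are OpenCL 2.0 features, which is why an OpenCL 1.2 device cannot build kernels compiled with -cl-std=CL2.0. The early bail-out suggested earlier in the thread could key off the device version string; a sketch, where supports_cl2 is a hypothetical helper (not GpuOwl code) parsing the "OpenCL <major>.<minor> <vendor info>" format that CL_DEVICE_VERSION uses:

```python
# Sketch of an early OpenCL-version check: parse the device version string
# and refuse to build CL2.0 kernels (atomics, memory scopes) on older devices.
import re

def supports_cl2(device_version: str) -> bool:
    """True if a CL_DEVICE_VERSION-style string reports OpenCL >= 2.0."""
    m = re.match(r"OpenCL (\d+)\.(\d+)", device_version)
    if not m:
        return False
    return (int(m.group(1)), int(m.group(2))) >= (2, 0)

print(supports_cl2("OpenCL 1.2 CUDA"))               # False -> exit with message
print(supports_cl2("OpenCL 2.0 AMD-APP (2580.6)"))   # True  -> proceed
```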
On an NVIDIA GTX1060 3GB, with -maxAlloc 3000, gpuowl v6.9-0-c137007 loads and runs, but even for a 24M exponent, claims there is not enough memory for stage 2. This makes gpuowl P-1 inappropriate for use on the GTX1060 3GB or any smaller-memory gpu for current or future wavefront exponents. Code:
2019-09-11 14:58:19 24000577 290000 91.32%; 3748 us/sq; ETA 0d 00:02; 740645b16c6380fe
2019-09-11 14:58:57 24000577 300000 94.47%; 3746 us/sq; ETA 0d 00:01; 50ed1a7837d59607
2019-09-11 14:59:34 24000577 310000 97.62%; 3745 us/sq; ETA 0d 00:00; 21a38a3fd1fa6582
2019-09-11 15:00:02 Not enough GPU memory, will skip stage2. Please wait for stage1 GCD
2019-09-11 15:00:17 24000577 P-1 stage1 GCD: no factor

Worktodo entry was Quote:
It will also miss the known factor for 50001781 for the same reason. https://www.mersenne.ca/exponent/50001781

This identical gpu can complete both stages in CUDAPm1 v0.20 for exponents up to approximately 432.5M. It would be useful, especially if a 2-stage (B2-specified) P-1 command line or worktodo entry is present, for the program to generate a very early warning that it will not run stage 2.

V6.9-0 seems to have a bug relating to allowing stage 2, since after stage 1 completes, it also refuses to run stage 2 on an 8GB GTX1080 with -maxAlloc 8000, for even the mere 24M test exponent above.

Last fiddled with by kriesel on 2019-09-11 at 21:35
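A back-of-envelope stage-2 memory floor, inferred from numbers in this thread: one buffer appears to be one FFT-length array of doubles (18874368 × 8 bytes is exactly the logged "144.0 MB"), and Preda's earlier post gives 24 buffers as the v6.7 minimum. The formula is an inference from those logs, not GpuOwl source:

```python
# Rough stage-2 memory floor: buffer = FFT length * 8 bytes (double words),
# and at least 24 buffers must fit (per the v6.7 behavior described earlier).
def buffer_mib(fft_length: int) -> float:
    return fft_length * 8 / 2**20

def min_stage2_mib(fft_length: int, min_buffers: int = 24) -> float:
    return min_buffers * buffer_mib(fft_length)

print(buffer_mib(18_874_368))       # 144.0, matching the 298M-exponent log
print(min_stage2_mib(18_874_368))   # 3456.0 MiB floor at the 18M FFT
print(min_stage2_mib(1_310_720))    # 240.0 MiB at the 1280K FFT (24M exponent)
```

By this estimate a 24M exponent needs only ~240 MiB for 24 buffers, so a 3GB GTX1060 refusing stage 2 looks like a program bug rather than a genuine memory shortfall, consistent with the fix Preda reports in the following post.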
#1361

"Mihai Preda"
Apr 2015
2533₈ Posts
Yes, pushed a fix. It should run now on lower memory, but I think there is still another problem; I need to investigate/validate a bit more.
#1362

"Mihai Preda"
Apr 2015
55B₁₆ Posts
Quote:
- now the number of buffers does not have to divide 2880 anymore
- small fixes to logging format

Looking for bug reports. P-1 savefile not implemented yet.
#1363

P90 years forever!
Aug 2002
Yeehaw, FL
5×11×137 Posts
Bug report, but not P-1: the Gerbicz error count is not reported in the JSON result.
It used to be; see this commit: https://github.com/preda/gpuowl/comm...37df13563a0f0f
#1364

P90 years forever!
Aug 2002
Yeehaw, FL
5·11·137 Posts
I'd like a volunteer with a non-Radeon VII to test a gpuowl version for me. A couple months ago I tried merging the transpose and fftMiddle steps into one kernel. It worked but was slower on my Radeon VII, so I shelved the idea. I'm now wondering if it would be faster on a GPU without HBM memory. If it is faster I'll go back and finish the implementation with a #define to easily turn it on and off.
I think the code works. I'd like timings on some random 5M-FFT exponent -- both the current version and this test version. Also, runs with the -time command line argument would be nice. The zip file is suitable for a Linux gpuowl build.

Last fiddled with by Prime95 on 2019-09-13 at 23:27