mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

kriesel 2018-01-28 17:30

readme again
 
[QUOTE=preda;478205]Related to "-dump folder": this relies on the non-standard -save-temps OpenCL option, which works on ROCm and AMDGPU-pro, but as seen does not work on Nvidia. ...[/QUOTE]

Thanks for the info re -dump.

Please note in the readme when a feature or option is known to have been tested on only a subset of the available hardware, or is known not to work on a subset, along with any other known considerations.

The supported syntax for the folder argument of the -dump option is unclear. (Does it take a string and pass it unchanged to the OS, prepended to the names of the files to be saved, so that any OS-supported path form should work?)

kriesel 2018-01-28 18:27

[QUOTE=preda;474511]Sorry I don't know why this happens. The error code -5 is CL_OUT_OF_RESOURCES, but why get that on clEnqueueReadBuffer I don't know.

[/QUOTE]

Note that something similar sometimes occurs with mfakto. I have seen one occurrence so far in about 3 days of running here, on Windows 10, Intel HD620, mfakto 0.15pre6-Win (64-bit build)
(output redirected to log file):
got assignment: exp=165536087 bit_min=71 bit_max=76 (89.56 GHz-days)
Starting trial factoring M165536087 from 2^71 to 2^72 (2.89GHz-days)
Using GPU kernel "cl_barrett32_76_gs_2"
Date Time | class Pct | time ETA | GHz-d/day Sieve Wait
Jan 25 17:07 | 0 0.1% | 14.666 3h54m | 17.73 81206 0.00%
...
Jan 25 20:20 | 3877 84.2% | 14.084 35m41s | 18.46 81206 0.00%

Error -5 (Out of resources): clEnqueueReadBuffer RES failed.
ERROR from tf_class.

console displayed these messages:
Error -5 (Out of resources): Enqueuing kernel (clEnqueueNDRangeKernel) SegSieve
Error -5 (Out of resources): Enqueuing kernel(clEnqueueNDRangeKernel) CalcBitToClear

It looks like this sort of thing has stumped others as well. A few things to try are mentioned in these threads:
[URL]https://forums.khronos.org/showthread.php/6072-out-of-resources-when-clEnqueueReadBuffer[/URL]
[URL]https://stackoverflow.com/questions/17633727/opencl-ridiculous-cl-out-of-resources[/URL]
[URL]https://streamhpc.com/blog/2013-10-15/basic-concepts-resources-clenqueuereadbuffer/[/URL]
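
When reading logs like the one above, it can help to map the numeric code back to its symbolic name. A minimal sketch in Python, covering only a handful of status codes from the OpenCL header CL/cl.h; the helper names `cl_error_name` and `parse_error_line` are made up for illustration and are not part of mfakto or GpuOwl:

```python
import re

# A few OpenCL status codes as defined in CL/cl.h (not the full list).
CL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

def cl_error_name(code):
    """Return the symbolic name for a status code, or a placeholder."""
    return CL_ERRORS.get(code, f"unknown OpenCL error {code}")

def parse_error_line(line):
    """Pull the numeric code out of a line shaped like 'Error -5 (...): ...'.

    The line shape is taken from the mfakto output quoted above; other
    programs may format their errors differently.
    """
    m = re.match(r"Error (-?\d+)", line)
    return int(m.group(1)) if m else None

code = parse_error_line("Error -5 (Out of resources): clEnqueueReadBuffer RES failed.")
print(code, cl_error_name(code))
```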

preda 2018-01-31 08:54

FFT size
 
GpuOwl is using by default 4M FFT, and the current exponent wavefront is really pushing the limit of 4M-FFT. Right now I'm testing an exponent==78.12M and it's still OK, but I expect it is near the limit. I'll report here when I see the limit breached.

Explicit FFT size can be specified: "-size 4M" or "-size 8M". 8M is slow. I'm looking into providing an intermediary FFT size, but not done yet.

preda 2018-01-31 12:16

[QUOTE=kriesel;478634]It happens. I try to be accurate and clear and sometimes succeed, sometimes not.

I remember reading a post by Preda listing the steps and stating a transform each way were avoided in PRP (7 steps) relative to LL (11). Didn't find it when I searched yesterday. Perhaps it's for a version before the Gerbicz check was added, and no longer relevant.
(I don't think I just dreamed it ....)[/QUOTE]

The "trick" used by GpuOwl is to store the data in a transposed representation, which better fits the initial and final FFT steps of the "matrix FFT algorithm" (and, it turns out, the carry propagation can be done well on this transposed representation). But this technique applies equally well to LL and to PRP; the two are quite similar, and have similar cost.
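
For readers unfamiliar with the "matrix FFT algorithm", here is a rough sketch of the general idea in plain Python: a transform of length n1*n2 is split into short transforms along one axis, a twiddle multiplication, and short transforms along the other axis. This is only the textbook decomposition under my own naming (`dft`, `matrix_fft`); it uses a naive DFT for the short transforms and says nothing about how GpuOwl actually lays out or transposes its buffers.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; used for the short sub-transforms and as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]

def matrix_fft(x, n1, n2):
    """DFT of length n1*n2 as short DFTs, twiddle factors, then short DFTs."""
    n = n1 * n2
    assert len(x) == n
    # Step 1: for each j1, transform the n2 strided elements x[j1 + n1*j2].
    # The strided access is why a transposed storage layout can pay off.
    a = [dft([x[j1 + n1 * j2] for j2 in range(n2)]) for j1 in range(n1)]
    # Step 2: multiply a[j1][k2] by the twiddle factor w_n^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            a[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / n)
    # Step 3: for each k2, transform over j1; output index is n2*k1 + k2.
    out = [0j] * n
    for k2 in range(n2):
        col = dft([a[j1][k2] for j1 in range(n1)])
        for k1 in range(n1):
            out[n2 * k1 + k2] = col[k1]
    return out
```

Comparing `matrix_fft(x, 3, 4)` against `dft(x)` for a length-12 input is a quick sanity check that the decomposition is equivalent to the direct transform.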

PRP, though it doesn't need the "-2", still needs to do the carry propagation on each iteration (like LL does).
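
To make the similarity concrete, here is a small sketch of both iterations using plain Python integers instead of FFT multiplication. It assumes the standard Lucas-Lehmer recurrence and a base-3 Fermat test written as repeated squaring; the function names are mine, and the exact residue convention used by GpuOwl may differ from what is shown.

```python
def lucas_lehmer(p):
    """LL test for M_p = 2^p - 1 (p an odd prime): s -> s^2 - 2, p-2 times."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m      # one squaring plus the "-2"
    return s == 0

def prp3(p):
    """Base-3 PRP test: square 3 p times; a prime M_p gives 3^(2^p) = 9 mod M_p."""
    m = (1 << p) - 1
    x = 3
    for _ in range(p):
        x = x * x % m            # one squaring, no "-2"
    return x == 9 % m

for p in (3, 5, 7, 11, 13):
    print(p, lucas_lehmer(p), prp3(p))
```

Both loops do one modular squaring per iteration, which is where essentially all the time goes for large exponents; the "-2" is the only difference in the loop body.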

kriesel 2018-01-31 15:28

[QUOTE=preda;478869]GpuOwl is using by default 4M FFT, and the current exponent wavefront is really pushing the limit of 4M-FFT. Right now I'm testing an exponent==78.12M and it's still OK, but I expect it is near the limit. I'll report here when I see the limit breached.

Explicit FFT size can be specified: "-size 4M" or "-size 8M". 8M is slow. I'm looking into providing an intermediary FFT size, but not done yet.[/QUOTE]

Just finished a GpuOwL V1.9 PRP on [URL]https://www.mersenne.org/report_exponent/?exp_lo=76812401&full=1[/URL] on an RX550. The exponent is also under way in LLtest in Prime95 on a slow cpu as originally assigned via primenet.

I ran some fft benchmarks in CUDALucas on a GTX1070 and in ClLucas on an RX550 to see where a good intermediate length might lie in the 4M-8M interval. These were extensive, checking every 7-smooth multiple of 1K from 1 to 2^16. The programs are very different in their response to changing fft length. I'm skeptical of the ClLucas results, since they vary significantly between two runs, particularly at lengths shorter than 4M (first run set done in batches in the program via start, end and increment values; second run set done one length per program launch). Timings are not comparable between ClLucas and CUDALucas, because the GPUs on which they were run have very different speeds, and because ClLucas benchmarks a single fft transform, not a full iteration. (See [URL]http://mersenneforum.org/showpost.php?p=417820&postcount=349[/URL] and 350.) Significant timings on ClLucas are:
4096K 7.332 msec
4608K 14.196 msec
8192K 14.976 msec
A couple of other fft lengths (4116K, 4536K) produced 14.82 msec in ClLucas; all other ClLucas timings obtained in the interval were >= the 8M timing.

ClLucas declined to run some lengths, with threads=256 in the ini file. The fft lengths ClLucas did not run in the 4M-8M interval were:
[CODE]Not support FFT length = 4480000
Not support FFT length = 4838400
Not support FFT length = 5225472
Not support FFT length = 5268480
Not support FFT length = 5760000
Not support FFT length = 6220800
Not support FFT length = 6272000
Not support FFT length = 6718464
Not support FFT length = 6773760
Not support FFT length = 7375872
Not support FFT length = 8064000[/CODE](4375K, 4725K, 5103K, 5145K, 5625K, 6075K, 6125K, 6561K, 6615K, 7203K, 7875K, respectively)
CUDALucas produced many useful "stairstep" values in the interval; 4608K, 5184K, 5600K, 5832K, 6144K, 6272K, 6480K, 6561K, 7168K, 7200K.
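
For anyone wanting to reproduce the candidate list, the 7-smooth lengths in the 4M-8M interval can be enumerated with a few lines of Python. This is only a sketch of the enumeration (the function names are mine); it says nothing about which lengths a given FFT library actually supports or runs fast.

```python
def is_7_smooth(n):
    """True if n has no prime factor larger than 7."""
    for q in (2, 3, 5, 7):
        while n % q == 0:
            n //= q
    return n == 1

def smooth_lengths_k(lo_k, hi_k):
    """All 7-smooth fft lengths, in units of 1K, in [lo_k, hi_k]."""
    return [k for k in range(lo_k, hi_k + 1) if is_7_smooth(k)]

candidates = smooth_lengths_k(4096, 8192)
print(len(candidates), candidates[:8])
```

The lengths mentioned above (4116K, 4536K, 4608K, and the ones ClLucas declined, such as 4375K) all appear in this list.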

ClLucas benchmarks run in batches:[CODE]
clFFT bench start = 3145728 end = 5242880 distance = 524288
clFFT size= 3145728 time= 7.332010 msec
clFFT size= 3670016 time= 7.644010 msec
[B]clFFT size= 4194304 time= 7.176020 msec[/B]
clFFT size= 4718592 time= 14.352020 msec
[B]clFFT size= 5242880 time= 16.380030 msec[/B]

clFFT bench start = 6291456 end = 10485760 distance = 1048576
[B]clFFT size= 6291456 time= 16.224030 msec
clFFT size= 7340032 time= 21.528040 msec
clFFT size= 8388608 time= 14.976030 msec[/B]
clFFT size= 9437184 time= 27.300050 msec
clFFT size= 10485760 time= 31.668050 msec
[/CODE]ClLucas fft benchmarks run individually via a big batch script:
[CODE]clFFT bench start = 4194304 end = 4194304 distance = 1024
clFFT size= 4194304 time= 7.332010 msec

clFFT bench start = 5242880 end = 5242880 distance = 1024
clFFT size= 5242880 time= 15.912030 msec

clFFT bench start = 6291456 end = 6291456 distance = 1024
clFFT size= 6291456 time= 16.536030 msec

clFFT bench start = 7340032 end = 7340032 distance = 1024
clFFT size= 7340032 time= 20.904030 msec

clFFT bench start = 8388608 end = 8388608 distance = 1024
clFFT size= 8388608 time= 14.976020 msec
[/CODE]I'm not sure why ClLucas shows this relative lack of fast intermediate fft lengths. It's using clfft.h per [URL]http://mersenneforum.org/showpost.php?p=353757&postcount=151[/URL] and 160 (but whose?)

For comparison, CUDALucas provides many intermediate fft lengths from which to choose for speed or error level. An excerpt of the output fft file shows maximum exponent and relative timings on the 4M-8M interval on a GTX1070 (all fft lengths in units of 1K, that is 4096 below means 4096K or 4M):
[CODE]Device GeForce GTX 1070
Compatibility 6.1
clockRate (MHz) 1708
memClockRate (MHz) 4004

fft max exp ms/iter
4096 75846319 4.4638
4608 85111207 5.3085
4800 88579669 5.9685
5184 95507747 5.9741
5600 103000823 6.8427
5832 107174381 7.0113
6144 112781477 7.3980
6272 115080019 7.5701
6480 118813021 8.0600
6912 126558077 8.2183
7168 131142761 8.5248
7200 131715607 8.8264
8192 149447533 9.0781[/CODE]I'll have a second RX550 on which to test shortly.
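
A table like the CUDALucas excerpt above lends itself to a simple lookup: pick the shortest fft length whose maximum exponent covers the exponent to be tested. A minimal sketch, with the (length, max exponent) pairs copied from the excerpt and a helper name (`smallest_fft_k`) of my own choosing; since the excerpt only covers 4M-8M, exponents below the 4096K limit simply map to 4096K here even though a shorter length would do.

```python
from bisect import bisect_left

# (fft length in K, max exponent), copied from the CUDALucas excerpt above.
FFT_TABLE = [
    (4096, 75846319), (4608, 85111207), (4800, 88579669), (5184, 95507747),
    (5600, 103000823), (5832, 107174381), (6144, 112781477), (6272, 115080019),
    (6480, 118813021), (6912, 126558077), (7168, 131142761), (7200, 131715607),
    (8192, 149447533),
]

def smallest_fft_k(exponent):
    """Shortest tabulated fft length (in K) whose max exponent is >= exponent."""
    limits = [max_exp for _, max_exp in FFT_TABLE]
    i = bisect_left(limits, exponent)
    if i == len(FFT_TABLE):
        raise ValueError("exponent exceeds the largest length in this excerpt")
    return FFT_TABLE[i][0]

print(smallest_fft_k(76812401))   # the exponent from the PRP run mentioned above
```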

GpuOwL 1.9-74f1a38 iteration timings, msec/iter on RX550:
2M 6.04
4M 12.04
8M 27.22

(end)

kriesel 2018-01-31 16:58

8M errors
 
Sailing along, great. A few hours in, uh oh, frequent errors.
Persists after an application stop and restart. And after an entire system restart.

Also, oddly, for some reason -verbosity 2 is no longer having an effect.

I wonder if the checkpoint files are any good.
01/31/2018 10:00 AM 38,500,038 154000001-prev.owl
01/31/2018 10:52 AM 38,500,038 154000001.owl

OpenCL compilation in 2574 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=154000001u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u -DFP_DP=1 -save-temps=df/DP_8M"
PRP-3: FFT 8M (2048 * 2048 * 2) of 154000001 (18.36 bits/word) [2018-01-31 05:16:34 Central Standard Time]
Starting at iteration 10000
OK 10000 / 154000001 [ 0.01%], 0.00 ms/it; ETA 0d 00:00; 31fbcefd1c96edad [05:16:51]
OK 11000 / 154000001 [ 0.01%], 27.19 ms/it [27.05, 27.33] CV 0.7%, check 16.33s; ETA 48d 11:03; 7b7e444c5717a4eb [05:17:35]
OK 15000 / 154000001 [ 0.01%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.32s; ETA 48d 04:31; 3d0be4fd236ac3c1 [05:19:39]
OK 20000 / 154000001 [ 0.01%], 27.00 ms/it [26.99, 27.05] CV 0.1%, check 16.66s; ETA 48d 02:44; 7b373c0a0617023f [05:22:11]
OK 30000 / 154000001 [ 0.02%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.30s; ETA 48d 04:30; 572924f2fa2a37d2 [05:26:58]
OK 40000 / 154000001 [ 0.03%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.27s; ETA 48d 04:22; 92200114f8b895e3 [05:31:44]
OK 60000 / 154000001 [ 0.04%], 27.03 ms/it [27.02, 27.08] CV 0.1%, check 16.32s; ETA 48d 04:01; 9995741f85279033 [05:41:01]
OK 80000 / 154000001 [ 0.05%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.60s; ETA 48d 03:56; ef5f93484d4e7c7e [05:50:19]
OK 100000 / 154000001 [ 0.06%], 26.99 ms/it [26.97, 27.05] CV 0.0%, check 16.74s; ETA 48d 01:59; a3578b41a4a46818 [05:59:35]
OK 120000 / 154000001 [ 0.08%], 27.04 ms/it [27.02, 27.30] CV 0.2%, check 16.30s; ETA 48d 03:52; 6abed72e586d3852 [06:08:53]
OK 150000 / 154000001 [ 0.10%], 27.03 ms/it [27.02, 27.27] CV 0.1%, check 16.49s; ETA 48d 03:20; 5813a71d3ffa36e8 [06:22:40]
OK 200000 / 154000001 [ 0.13%], 27.03 ms/it [26.99, 27.30] CV 0.1%, check 16.66s; ETA 48d 02:47; 2a8485d925439d56 [06:45:28]
OK 250000 / 154000001 [ 0.16%], 27.03 ms/it [26.99, 27.27] CV 0.1%, check 16.44s; ETA 48d 02:26; 82ef7ca5b97d86d0 [07:08:16]
OK 300000 / 154000001 [ 0.19%], 27.03 ms/it [27.02, 27.30] CV 0.1%, check 16.50s; ETA 48d 02:14; 27458918ebc4797f [07:31:05]
EE 350000 / 154000001 [ 0.23%], 27.00 ms/it [26.99, 27.42] CV 0.2%, check 16.54s; ETA 48d 00:23; 0ea3c0d18b7bb6a5 [07:53:51]
OK 320000 / 154000001 [ 0.21%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.49s; ETA 48d 02:14; 1b0eb65bd9e0c4bf [08:03:08] (1 errors)
EE 340000 / 154000001 [ 0.22%], 27.01 ms/it [26.96, 27.27] CV 0.3%, check 16.21s; ETA 48d 00:48; 220e3b25ea454b27 [08:12:25] (1 errors)
OK 330000 / 154000001 [ 0.21%], 27.03 ms/it [27.02, 27.08] CV 0.1%, check 16.50s; ETA 48d 01:59; da6fa4d784199dcc [08:17:12] (2 errors)
EE 340000 / 154000001 [ 0.22%], 27.02 ms/it [27.02, 27.08] CV 0.1%, check 16.44s; ETA 48d 01:26; 220e3b25ea454b27 [08:21:58] (2 errors)
EE 340000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.58s; ETA 48d 02:06; 220e3b25ea454b27 [08:26:45] (3 errors)
EE 340000 / 154000001 [ 0.22%], 27.06 ms/it [27.02, 27.52] CV 0.4%, check 16.60s; ETA 48d 03:02; 220e3b25ea454b27 [08:31:32] (4 errors)
EE 340000 / 154000001 [ 0.22%], 27.03 ms/it [27.02, 27.11] CV 0.1%, check 16.36s; ETA 48d 01:50; 220e3b25ea454b27 [08:36:19] (5 errors)
EE 340000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.60s; ETA 48d 01:58; 220e3b25ea454b27 [08:41:06] (6 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.33s; ETA 48d 02:04; 5b972f21b3814a0b [08:43:38] (7 errors)
EE 335000 / 154000001 [ 0.22%], 27.05 ms/it [27.02, 27.11] CV 0.1%, check 16.36s; ETA 48d 02:29; 5b972f21b3814a0b [08:46:09] (8 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.27s; ETA 48d 02:04; 5b972f21b3814a0b [08:48:41] (9 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.25s; ETA 48d 02:13; 5b972f21b3814a0b [08:51:12] (10 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.25s; ETA 48d 02:21; 5b972f21b3814a0b [08:53:44] (11 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.43s; ETA 48d 02:04; 5b972f21b3814a0b [08:56:15] (12 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.46s; ETA 48d 02:13; 5b972f21b3814a0b [08:58:47] (13 errors)
EE 335000 / 154000001 [ 0.22%], 27.07 ms/it [27.02, 27.27] CV 0.3%, check 16.58s; ETA 48d 03:17; 5b972f21b3814a0b [09:01:19] (14 errors)
EE 335000 / 154000001 [ 0.22%], 27.00 ms/it [26.96, 27.08] CV 0.1%, check 16.65s; ETA 48d 00:22; 5b972f21b3814a0b [09:03:50] (15 errors)
EE 335000 / 154000001 [ 0.22%], 27.00 ms/it [26.99, 27.08] CV 0.1%, check 16.40s; ETA 48d 00:30; 5b972f21b3814a0b [09:06:22] (16 errors)
EE 335000 / 154000001 [ 0.22%], 27.03 ms/it [27.02, 27.08] CV 0.1%, check 16.25s; ETA 48d 01:48; 5b972f21b3814a0b [09:08:53] (17 errors)
EE 335000 / 154000001 [ 0.22%], 27.03 ms/it [27.02, 27.08] CV 0.1%, check 16.60s; ETA 48d 01:48; 5b972f21b3814a0b [09:11:25] (18 errors)
EE 335000 / 154000001 [ 0.22%], 27.07 ms/it [27.02, 27.30] CV 0.3%, check 16.22s; ETA 48d 03:17; 5b972f21b3814a0b [09:13:57] (19 errors)
EE 335000 / 154000001 [ 0.22%], 26.99 ms/it [26.96, 27.02] CV 0.1%, check 16.36s; ETA 47d 23:58; 5b972f21b3814a0b [09:16:28] (20 errors)
EE 335000 / 154000001 [ 0.22%], 27.00 ms/it [26.99, 27.05] CV 0.1%, check 16.29s; ETA 48d 00:23; 5b972f21b3814a0b [09:18:59] (21 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.47s; ETA 48d 02:13; 5b972f21b3814a0b [09:21:31] (22 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.44s; ETA 48d 02:04; 5b972f21b3814a0b [09:24:02] (23 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.72s; ETA 48d 02:04; 5b972f21b3814a0b [09:26:34] (24 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.27s; ETA 48d 02:04; 5b972f21b3814a0b [09:29:06] (25 errors)
EE 335000 / 154000001 [ 0.22%], 26.99 ms/it [26.99, 27.05] CV 0.1%, check 16.41s; ETA 48d 00:15; 5b972f21b3814a0b [09:31:37] (26 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.47s; ETA 48d 02:21; 5b972f21b3814a0b [09:34:09] (27 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.65s; ETA 48d 02:13; 5b972f21b3814a0b [09:36:41] (28 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.36s; ETA 48d 02:21; 5b972f21b3814a0b [09:39:12] (29 errors)
EE 335000 / 154000001 [ 0.22%], 27.00 ms/it [26.99, 27.08] CV 0.1%, check 16.58s; ETA 48d 00:30; 5b972f21b3814a0b [09:41:44] (30 errors)
EE 335000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.11] CV 0.1%, check 16.33s; ETA 48d 02:13; 5b972f21b3814a0b [09:44:15] (31 errors)
OK 332000 / 154000001 [ 0.22%], 27.02 ms/it [26.99, 27.08] CV 0.2%, check 16.49s; ETA 48d 01:18; 59bcd6a5dccc40c7 [09:45:26] (32 errors)
EE 335000 / 154000001 [ 0.22%], 27.01 ms/it [26.99, 27.05] CV 0.1%, check 16.65s; ETA 48d 00:51; 5b972f21b3814a0b [09:47:04] (32 errors)
EE 334000 / 154000001 [ 0.22%], 27.06 ms/it [27.02, 27.11] CV 0.1%, check 16.38s; ETA 48d 02:57; d6030fba97b08960 [09:48:14] (33 errors)
EE 334000 / 154000001 [ 0.22%], 27.05 ms/it [27.02, 27.11] CV 0.2%, check 16.19s; ETA 48d 02:37; d6030fba97b08960 [09:49:24] (34 errors)
EE 334000 / 154000001 [ 0.22%], 27.05 ms/it [27.02, 27.11] CV 0.2%, check 16.54s; ETA 48d 02:37; d6030fba97b08960 [09:50:35] (35 errors)
EE 334000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.18s; ETA 48d 02:16; d6030fba97b08960 [09:51:45] (36 errors)
EE 334000 / 154000001 [ 0.22%], 27.05 ms/it [27.02, 27.11] CV 0.2%, check 16.54s; ETA 48d 02:37; d6030fba97b08960 [09:52:56] (37 errors)
EE 334000 / 154000001 [ 0.22%], 27.05 ms/it [27.02, 27.11] CV 0.2%, check 16.66s; ETA 48d 02:37; d6030fba97b08960 [09:54:07] (38 errors)
EE 334000 / 154000001 [ 0.22%], 27.04 ms/it [27.02, 27.08] CV 0.1%, check 16.58s; ETA 48d 02:16; d6030fba97b08960 [09:55:17] (39 errors)
500 / 2000, 27.08 ms/it
Stopping, please wait..
OK 332500 / 154000001 [ 0.22%], 27.08 ms/it; ETA 48d 03:56; e8cb32be9a443cbe [09:55:47] (40 errors)

Bye

\gpuowl-1.9>echo exiting gpuowl at Wed 01/31/2018 9:55:53.16 1>>gpuowlrun.txt

\gpuowl-1.9>gpo

\gpuowl-1.9>echo starting gpuowl at Wed 01/31/2018 10:00:06.10 1>>gpuowlrun.txt

\gpuowl-1.9>gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df
gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz

OpenCL compilation in 2531 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=154000001u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u -DFP_DP=1 -save-temps=df/DP_8M"
PRP-3: FFT 8M (2048 * 2048 * 2) of 154000001 (18.36 bits/word) [2018-01-31 10:00:10 Central Standard Time]
Starting at iteration 332500
OK 332500 / 154000001 [ 0.22%], 0.00 ms/it; ETA 0d 00:00; e8cb32be9a443cbe [10:00:28] (40 errors)
EE 333000 / 154000001 [ 0.22%], 27.30 ms/it; ETA 48d 13:24; c7b42ff699a5423f [10:00:59] (40 errors)
EE 333000 / 154000001 [ 0.22%], 27.26 ms/it; ETA 48d 11:37; c7b42ff699a5423f [10:01:30] (41 errors)
EE 333000 / 154000001 [ 0.22%], 27.32 ms/it; ETA 48d 14:00; c7b42ff699a5423f [10:02:00] (42 errors)
EE 333000 / 154000001 [ 0.22%], 27.31 ms/it; ETA 48d 13:50; c7b42ff699a5423f [10:02:31] (43 errors)
EE 333000 / 154000001 [ 0.22%], 27.29 ms/it; ETA 48d 12:58; c7b42ff699a5423f [10:03:02] (44 errors)
EE 333000 / 154000001 [ 0.22%], 27.31 ms/it; ETA 48d 13:39; c7b42ff699a5423f [10:03:32] (45 errors)

Bye

(run mfakto a while, then system restart)

\gpuowl-1.9>gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df
gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz

OpenCL compilation in 2587 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=154000001u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u
-DFP_DP=1 -save-temps=df/DP_8M"
PRP-3: FFT 8M (2048 * 2048 * 2) of 154000001 (18.36 bits/word) [2018-01-31 10:51:44 Central Standard Time]
Starting at iteration 332500
OK 332500 / 154000001 [ 0.22%], 0.00 ms/it; ETA 0d 00:00; e8cb32be9a443cbe [10:52:03] (40 errors)
EE 333000 / 154000001 [ 0.22%], 27.44 ms/it; ETA 48d 19:07; c7b42ff699a5423f [10:52:33] (40 errors)
EE 333000 / 154000001 [ 0.22%], 27.44 ms/it; ETA 48d 19:23; c7b42ff699a5423f [10:53:04] (41 errors)
EE 333000 / 154000001 [ 0.22%], 27.42 ms/it; ETA 48d 18:37; c7b42ff699a5423f [10:53:35] (42 errors)
EE 333000 / 154000001 [ 0.22%], 27.44 ms/it; ETA 48d 19:12; c7b42ff699a5423f [10:54:05] (43 errors)

Bye
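
Long logs like these are easier to judge when summarized. A rough sketch of a tally script in Python, assuming only the line shape visible above ("OK" or "EE", then the iteration number, then "/" and the exponent); the function name `summarize` is made up, and other GpuOwl versions may print differently:

```python
import re

# Matches the start of status lines such as
# "EE 340000 / 154000001 [ 0.22%], ..." as they appear in the log above.
LINE_RE = re.compile(r"^(OK|EE)\s+(\d+) / (\d+)")

def summarize(lines):
    """Return (ok_count, ee_count, iteration of the last OK line or None)."""
    ok = ee = 0
    last_ok = None
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue                  # progress lines, banners, "Bye", ...
        if m.group(1) == "OK":
            ok += 1
            last_ok = int(m.group(2))
        else:
            ee += 1
    return ok, ee, last_ok

sample = [
    "OK 300000 / 154000001 [ 0.19%], 27.03 ms/it [27.02, 27.30] CV 0.1%",
    "EE 350000 / 154000001 [ 0.23%], 27.00 ms/it [26.99, 27.42] CV 0.2%",
    "OK 320000 / 154000001 [ 0.21%], 27.04 ms/it [27.02, 27.08] CV 0.1%",
    "500 / 2000, 27.08 ms/it",
]
print(summarize(sample))
```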

preda 2018-01-31 20:22

[QUOTE=kriesel;478894]EE 340000 / 154000001 [ 0.22%], 27.02 ms/it [27.02, 27.08] CV 0.1%, check 16.44s; ETA 48d 01:26; 220e3b25ea454b27 [08:21:58] (2 errors)[/QUOTE]

This exponent is too large for 8M FFT (it has 18.35 bits/word).
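
The bits/word figure is just the exponent divided by the number of FFT words, here 2048 * 2048 * 2 = 2^23 as shown on the "PRP-3: FFT 8M" line of the log. A small sketch for checking an exponent before starting a run (the function name is mine; where exactly the safe limit lies is not something this calculation decides):

```python
def bits_per_word(exponent, width, height):
    """Average bits stored per FFT word for a WIDTH * HEIGHT * 2 transform."""
    words = width * height * 2
    return exponent / words

# The exponent that produced the errors, and the CUDALucas 8M limit
# quoted earlier in the thread, both at 8M (2048 * 2048 * 2).
print(round(bits_per_word(154000001, 2048, 2048), 2))
print(round(bits_per_word(149447533, 2048, 2048), 2))
```

These come out at 18.36 and 17.82, matching the values the GpuOwl log prints for the two exponents.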

Madpoo 2018-01-31 22:03

[QUOTE=preda;478869]...I'm looking into providing an intermediary FFT size, but not done yet.[/QUOTE]

The source for Prime95 might be of some help, but even better you might hit up George for any insights on particular tips or tricks.

Having a good selection of FFT sizes is pretty cool because Prime95 does that whole thing of doing a test on exponents near a boundary to see if the smaller FFT is doing okay or not after however many iterations and switching to the next larger one up if not.

Although, I have a sneaking suspicion there may be more bad results on exponents around those FFT boundaries compared to the ratio of bad results smack dab in the middle of a range. Just a hunch though, I haven't crunched the #'s and with Prime95 results it can be hard to squeeze out the FFT size it used for the test.

kriesel 2018-01-31 22:54

[QUOTE=preda;478922]This exponent is too large for 8M FFT (it has 18.35 bits/word).[/QUOTE]

(Just a moment while I wash the egg off my face. There, that's better.)

I suggest adding a check for exponent limits and a message about it, to protect us inattentive users from ourselves a bit, and our hapless gpus from having their hard work wasted. Oops, no, there seems to already be a check.

Oh, and maybe update guidance for the program's limits, compared to [URL]http://www.mersenneforum.org/showpost.php?p=468932&postcount=223[/URL]
"About FFT size selection:
by default, FFT size is automatically selected based on exponent size, with these cutoffs:
- under 40'000'000, use FFT 2M,
- under 78'000'000, use FFT 4M,
- under 155'000'000, use FFT 8M"
(which I had taken to mean it would be able to handle 154000001 with almost a million to spare)
Granted, that was for V1.5 and I ran V1.9.

It's impressive it was able to run hundreds of thousands of iterations before hitting errors.

Dropped down to CUDALucas' 8M limit, verbosity 2 is working again, and CV is nicely low.
[CODE]\gpuowl-1.9>gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df
gpuOwL v1.9- GPU Mersenne primality checker
Radeon 500 Series 8 @f:0.0, gfx804 1203MHz

OpenCL compilation in 2714 ms, with "-I. -cl-fast-relaxed-math -cl-std=CL2.0 -DEXP=149447533u -DWIDTH=2048u -DHEIGHT=2048u -DLOG_NWORDS=23u
-DFP_DP=1 -save-temps=df/DP_8M"
PRP-3: FFT 8M (2048 * 2048 * 2) of 149447533 (17.82 bits/word) [2018-01-31 16:23:13 Central Standard Time]
Starting at iteration 0
OK 0 / 149447533 [ 0.00%], 0.00 ms/it; ETA 0d 00:00; 0000000000000003 [16:23:29]
OK 1000 / 149447533 [ 0.00%], 27.36 ms/it [27.33, 27.39] CV 0.2%, check 16.40s; ETA 47d 07:50; a8d4ca87631cca02 [16:24:13]
OK 5000 / 149447533 [ 0.00%], 27.37 ms/it [27.36, 27.42] CV 0.1%, check 16.66s; ETA 47d 08:10; 4fa6dcbb24aeed07 [16:26:19]
OK 10000 / 149447533 [ 0.01%], 27.34 ms/it [27.33, 27.39] CV 0.1%, check 16.60s; ETA 47d 06:44; 91ee6149098cfb93 [16:28:52]
OK 20000 / 149447533 [ 0.01%], 27.29 ms/it [27.27, 27.33] CV 0.1%, check 16.41s; ETA 47d 04:46; f7eb1de0aaf58d67 [16:33:42]
OK 40000 / 149447533 [ 0.03%], 27.33 ms/it [27.30, 27.39] CV 0.1%, check 16.55s; ETA 47d 06:08; 5b94f9430a7f1f13 [16:43:05]
OK 60000 / 149447533 [ 0.04%], 27.34 ms/it [27.33, 27.80] CV 0.3%, check 16.83s; ETA 47d 06:35; f7b192b9d3301ace [16:52:29]
OK 80000 / 149447533 [ 0.05%], 27.33 ms/it [27.30, 27.36] CV 0.0%, check 16.40s; ETA 47d 05:57; fc68e9f4c1e60f2e [17:01:52]
16000 / 20000, 27.36 ms/it[/CODE]

ewmayer 2018-02-01 02:35

[QUOTE=kriesel;478949]"About FFT size selection:
by default, FFT size is automatically selected based on exponent size, with these cutoffs:
- under 40'000'000, use FFT 2M,
- under 78'000'000, use FFT 4M,
- under 155'000'000, use FFT 8M"
(which I had taken to mean it would be able to handle 154000001 with almost a million to spare)
Granted, that was for V1.5 and I ran V1.9.

It's impressive it was able to run hundreds of thousands of iterations before hitting errors.[/QUOTE]

Just for grins I ran 10K iterations of 154000001 @8192K with my Mlucas SSE2 build on my Core2 macbook; the bolded warning gives an idea of where this p is relative to the default maxp for this FFT length:
[code]
[b]INFO: Maximum recommended exponent for this runlength = 152816052; p[ = 154000001]/pmax_rec = 1.007748.[/b]
specified FFT length 8192 K is less than recommended 9216 K for this p.
M154000001: using FFT length 8192K = 8388608 8-byte floats.
this gives an average 18.358230710029602 bits per digit
Using complex FFT radices 256 16 32 32
mers_mod_square: Init threadpool of 2 threads
radix16_dif_dit_pass pfetch_dist = 32
radix16_wrapper_square: pfetch_dist = 1024
Using 2 threads in carry step
10000 iterations of M154000001 with FFT length 8388608 = 8192 K
Res64: B66E8C57794E7814. AvgMaxErr = 0.286465026. [b]MaxErr = 0.406250000[/b]. Program: E18.0
[/code]
Given that 10K iterations is less than 1/100th of 1% of 154M, that MaxErr value is in fact quite worrisome, as in a full-length run it will surely be accompanied by somewhat larger ones, and we are already very close to the fatal flip-a-coin-to-decide-in-which-direction-to-round-this-convolution-output 0.5 level.

All other things being equal, power-of-2 FFT lengths tend to have slightly lower ROE levels than non-powers-of-2, but as the exponents increase one has proportionally more opportunities (in the form of iterations) for outlier ROEs to occur. If one found - using your numbers above - pmax = 40M to be a good limit @2M and pmax = 78M @4M, then it makes sense that the relative jump in pmax should be similar going from 4M to 8M as in going from 2M to 4M. (78/40)*78M = 152.1M, so (assuming 40M and 78M were well-chosen) instead setting 154M is clearly highly optimistic. (The ROE trend is somewhat more subtle than this simple ratio-test, but the ratio captures most of the trend). I've long since automated these settings in my own code, by starting with an algorithmic (random-walk-based) model for convolution outputs and ROEs, comparing those numbers to observed trends over a large range of FFT lengths to check whether the model appears to be a good one, and then implementing it in code. The only fiddles to that I might entertain at this point are ones based on slightly modifying the tunable constant in the model to reflect the aforementioned power-of-2-or-not effects, and doing similarly for FMA-math-or-not.
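
The ratio argument above is easy to replay numerically. A sketch using only the cutoffs quoted from the readme (40M @2M, 78M @4M); the function name is mine, and this is the simple ratio test only, not the random-walk-based model described in the post:

```python
def extrapolate_pmax(pmax_small, pmax_mid):
    """Assume the relative jump in pmax repeats when the FFT length doubles."""
    return pmax_mid * (pmax_mid / pmax_small)

est_8m = extrapolate_pmax(40e6, 78e6)
print(f"estimated pmax @8M: {est_8m / 1e6:.1f}M")

# Bits per word implied by each limit; the value shrinks as the length grows.
for pmax, words in ((40e6, 2 ** 21), (78e6, 2 ** 22), (est_8m, 2 ** 23)):
    print(f"{pmax / 1e6:6.1f}M over {words} words: {pmax / words:.2f} bits/word")
```

The estimate comes out at 152.1M, so 154000001 sits above it.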

kriesel 2018-02-01 20:04

GpuOwL 1.9 8M short tests probing for an upper limit
 
All the following are from gpuOwL v1.9-74f1a38 and run with 8M fft length, on a Radeon RX550 not driving a display.
Gerbicz check should catch round off errors, right?
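
As far as I understand it, the check validates the chain of modular squarings itself, so a residue corrupted by round-off (or by hardware) inside a checked block should generally show up as a mismatch; the EE lines in the 154000001 log earlier in the thread are that mismatch being reported. A minimal sketch with plain Python integers, assuming the usual formulation (a running product d of the residues at block boundaries, verified with d_new == u_0 * d_old^(2^L)); the function name, the block size and the injected error are made up for illustration, and GpuOwl's actual implementation is not shown in this thread:

```python
def gerbicz_run(p, block, n_blocks, corrupt_at=None):
    """Run n_blocks * block squarings of 3 mod 2^p - 1, checking every block.

    Returns True if every check passes, False at the first mismatch.
    corrupt_at (hypothetical): iteration number after which the residue is
    deliberately perturbed, to simulate a round-off or hardware error.
    """
    n = (1 << p) - 1
    u0 = 3
    u = u0
    d = u0                                  # product of residues at block boundaries
    for t in range(n_blocks):
        for i in range(block):
            u = u * u % n
            if t * block + i + 1 == corrupt_at:
                u = (u + 1) % n             # injected error
        d_next = d * u % n                  # cheap update: one multiplication
        expect = u0 * pow(d, 1 << block, n) % n   # independent path: block squarings of d
        if d_next != expect:
            return False                    # would trigger a rollback
        d = d_next
    return True

print(gerbicz_run(31, 8, 4))                  # clean run
print(gerbicz_run(31, 8, 4, corrupt_at=13))   # corrupted run
```

In this sketch the check runs after every block; a real program would check far less often because the independent path costs as much as a block of iterations.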

(CUDALucas 8M fft length upper limit exponent)
gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df
PRP-3: FFT 8M (2048 * 2048 * 2) of 149447533 (17.82 bits/word) 400000 iterations, max cv 0.3%, no errors indicated
stopped and restarted with -legacy option added, iteration time dropped from 27.37 to 21.36 ms/iter;
no errors when iteration 900,000 reached; max cv 1.5% until 900000 reported 11.6%
-legacy is faster and has consistently higher cv (1.0-1.5% typical) in this run, for the same exponent on the same (new) hardware

gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 152000239 (18.12 bits/word)
1,452,000 iterations, max cv 2.1%, no errors indicated, 21.43 ms/iter

gpuowl -user kriesel -cpu condorella-rx550 -device 0 -verbosity 2 -dump df -legacy
PRP-3: FFT 8M (2048 * 2048 * 2) of 152500021 (18.18 bits/word)
200,000 iterations, max cv 2.2% until 11.8% at 200,000, 16.5% following, then settled down to 1.2% or less for the rest of 1,056,000 iterations; no errors indicated, 21.44 ms/iter
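
To put the two iteration times in perspective, a full test takes roughly one iteration per bit of the exponent, so the run length can be estimated directly. A back-of-the-envelope sketch (the function name is mine; it ignores the periodic check time shown in the logs and any restarts):

```python
def full_run_days(exponent, ms_per_iter):
    """Approximate wall-clock days for `exponent` iterations at a fixed rate."""
    return exponent * ms_per_iter / 1000 / 86400

default_days = full_run_days(149447533, 27.37)
legacy_days = full_run_days(149447533, 21.36)
print(f"default: {default_days:.1f} d, -legacy: {legacy_days:.1f} d, "
      f"saving {100 * (1 - legacy_days / default_days):.0f}%")
```

For 149447533 this gives a bit over 47 days without -legacy, in line with the ETA GpuOwl printed at the start of that run, and about 37 days with it.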

