gpuOwL: an OpenCL program for Mersenne primality testing (https://www.mersenneforum.org/showthread.php?t=22204)

ewmayer 2020-06-14 18:51

Last night's before-going-to-bed run check showed that one of the 2 jobs on card2 of my 3-R7 system had aborted due to repeated errors:
[code]2020-06-13 17:46:50 412688e172fd62d9 105965933 OK 87200000 82.29%; 1390 us/it; ETA 0d 07:15; 3bd02b5e45382bcc (check 0.84s)
2020-06-13 17:51:28 412688e172fd62d9 105965933 EE 87400000 82.48%; 1386 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.79s)
2020-06-13 17:51:29 412688e172fd62d9 105965933 OK 87200000 loaded: blockSize 400, 3bd02b5e45382bcc
2020-06-13 17:53:49 412688e172fd62d9 105965933 OK 87300000 82.38%; 1389 us/it; ETA 0d 07:12; 2761f6451aecc6dc (check 0.79s) 1 errors
2020-06-13 17:56:08 412688e172fd62d9 105965933 EE 87400000 82.48%; 1386 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.74s) 1 errors
2020-06-13 17:56:09 412688e172fd62d9 105965933 OK 87300000 loaded: blockSize 400, 2761f6451aecc6dc
2020-06-13 17:57:19 412688e172fd62d9 105965933 OK 87350000 82.43%; 1388 us/it; ETA 0d 07:11; 4d1b42bc1dbcf1e1 (check 0.75s) 2 errors
2020-06-13 17:58:29 412688e172fd62d9 105965933 EE 87400000 82.48%; 1392 us/it; ETA 0d 07:11; 083b13bb609c0724 (check 0.79s) 2 errors
2020-06-13 17:58:30 412688e172fd62d9 105965933 OK 87350000 loaded: blockSize 400, 4d1b42bc1dbcf1e1
2020-06-13 17:59:40 412688e172fd62d9 105965933 EE 87400000 82.48%; 1389 us/it; ETA 0d 07:10; 083b13bb609c0724 (check 0.77s) 3 errors
2020-06-13 17:59:41 412688e172fd62d9 105965933 OK 87350000 loaded: blockSize 400, 4d1b42bc1dbcf1e1
2020-06-13 18:00:51 412688e172fd62d9 105965933 EE 87400000 82.48%; 1385 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.81s) 4 errors
2020-06-13 18:00:51 412688e172fd62d9 3 sequential errors, will stop.
2020-06-13 18:00:51 412688e172fd62d9 Exiting because "too many errors"
2020-06-13 18:00:51 412688e172fd62d9 Bye[/code]
Attempted restart of run hit same issue ... possibly the GEC residue got corrupted somehow? Card has been running at [linux, ROCm] sclk=3, mclk=1150MHz very stably for last month, temps were fine, and other job running on same card suffered no issues. By way of temporary workaround moved the above assignment to bottom of worktodo file and restarted, no issues with the next assignment.

Should I try copying 105965933-old.owl to 105965933.owl and restarting?
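
One thing the log above does show: every EE line at iteration 87400000 carries the same residue, 083b13bb609c0724. A minimal sketch of checking that mechanically, assuming the progress-line layout shown above (the regex and the function names are illustrative, not part of gpuowl):

```python
import re
from collections import defaultdict

# Matches progress lines of the form
#   "<date> <time> <id> <exponent> OK|EE <iteration> <pct>%; ...; <res64> (check ...)"
# The layout is inferred from the log excerpt above; treat it as an assumption.
LINE_RE = re.compile(
    r"(?P<expo>\d+) (?P<status>OK|EE)\s+(?P<iter>\d+) [\d.]+%;.*; "
    r"(?P<res>[0-9a-f]{16}) \(check"
)

def failing_residues(log_text):
    """Map (exponent, iteration) -> list of residues reported on EE lines."""
    seen = defaultdict(list)
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m and m.group("status") == "EE":
            key = (int(m.group("expo")), int(m.group("iter")))
            seen[key].append(m.group("res"))
    return dict(seen)

def is_repeatable(residues):
    """True if a check failed more than once and always with the same residue."""
    return len(residues) > 1 and len(set(residues)) == 1
```

Identical residues across retries suggest the failure is reproducible rather than random. That would be consistent with a deterministic computation problem or with a bad saved state; it does not by itself identify the cause.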

Prime95 2020-06-14 19:59

[QUOTE=ewmayer;547992]Last night's before-going-to-bed run check showed that one of the 2 jobs on card2 of my 3-R7 system had aborted due to repeated errors:

Should I try copying 105965933-old.owl to 105965933.owl and restarting?[/QUOTE]

Please try building from the latest source. A week or so ago I checked in changes that target a 0.5% chance of roundoff failure during the very top end of the FFT where all accuracy options are turned on. It also targets a 0.1% probability of error if only some or none of the accuracy options are turned on.

The previous version was not as good at keeping the failure probability that low.

P.S. We've talked about adding a command line option that lets you be more aggressive or more conservative. Not yet implemented.

ewmayer 2020-06-14 20:57

[QUOTE=Prime95;547997]Please try building from the latest source. A week or so ago I checked in changes that target a 0.5% chance of roundoff failure during the very top end of the FFT where all accuracy options are turned on.[/QUOTE]

Ah yes, I see the expo in question is very near the limit of what we can expect to be able to run at 5.5M FFT - I had run multiple PRPs of expos in that range, some even larger, without issue, thus hearing it was likely due to ROE was unexpected. Of course the vagueness of that 'EE' error code emitted by the program does not help in that regard.

Retry with new build (and worktodo refiddled to restore problematic assignment to top) looks good, but on the "mysteries of the ROCm code management engine" front, previously the 2 jobs were getting ~1365 us/iter each, but now job1 (still using older build) is running at 1710 us/iter whereas new-build-using job2 is at 1160 us/iter. Total throughput more or less same, though.

Oh, here is the list of all the expos I've run in the 105M+ range on that system - the problematic one is starred, you can see there are multiple larger ones which caused no problem using the same build:
[code]105615283
105712007
105809183
105809189
105809299
105809351
105810461
105810857
105810857
105810979
105813713
105813853
105813859
105813941
105815543
105815581
105815627
105840467
105843821
105844853
105846053
105892097
105892109
105892159
105892211
105892307
105892399
105892459
105892693
105893233
105893321
105894211
105894319
105894329
105894947
105895001
105895157
105895297
105897229
105897839
105900511
105900539
105900703
105900797
105904387
105904529
105904739
105949913
105950011
105956441
105959179
105963503
105964751
105965887
105965933*
105969967
105973331
105980201
105981719
105981937
105984089
105984521
105987407
105987709
105989249
105991979
105992911
105995317
105997069
105997247
105998617[/code]
This would appear to confirm my long-running stance that it is unreasonable to expect ROEs to behave in perfectly monotone fashion with exponent at a given FFT size, so we should set our breakover points as aggressively as reasonably possible, but build in mechanisms to gracefully handle such high-ROE cases which (as determined by the GEC) are not due to data corruption. Various internal flags to ratchet up the convolution floating-point accuracy at a given FFT length are the obvious first line of defense, but a simple "if said flags are already at their most-accurate settings, and we still hit a dangerously high ROE, just switch to the next-larger FFT length and complete the current run at that" is the obvious last line of defense in my view.

(I do get the "using such dodges disincentivizes us coders from working to our utmost to rein in ROEs" moral-hazard aspect of the issue, though.)
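
The first-line/last-line-of-defense ordering described above can be sketched as a small escalation function. This is only an illustration under stated assumptions: the `Config` type, the abstract accuracy levels, `MAX_ACCURACY` and the `FFT_LADDER` list are all hypothetical, and none of this is gpuowl's actual code.

```python
from dataclasses import dataclass

# Hypothetical ladder of FFT lengths in units of 512K words (5M, 5.5M, 6M).
FFT_LADDER = [10, 11, 12]
# Assumed number of accuracy steps available above the default setting.
MAX_ACCURACY = 2

@dataclass(frozen=True)
class Config:
    fft: int       # FFT length, in 512K units
    accuracy: int  # 0 = fastest, MAX_ACCURACY = most accurate

def escalate(cfg):
    """Next config to try after a reproducible high-ROE failure, or None."""
    # First line of defense: ratchet up accuracy at the same FFT length.
    if cfg.accuracy < MAX_ACCURACY:
        return Config(cfg.fft, cfg.accuracy + 1)
    # Last line of defense: move to the next-larger FFT length.
    # Resetting accuracy to 0 there is an assumption of this sketch.
    i = FFT_LADDER.index(cfg.fft)
    if i + 1 < len(FFT_LADDER):
        return Config(FFT_LADDER[i + 1], 0)
    return None  # nothing left to try; abort and report
```

For example, `escalate(Config(fft=11, accuracy=2))` returns `Config(fft=12, accuracy=0)`, i.e. a run already at its most accurate 5.5M settings would continue at 6M.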

[b]Edit:[/b] Oh - you mention an as-yet-unimplemented CL flag ... you know what would be really useful? Enhanced error reporting that informs the user whether that 'EE' was due to simple GEC failure or dangerous-ROE-detected. Say I still hit an ROE-related failure using the current build, and there is no new build promising more accuracy. If I could discern from the log that ROEs were behind the abort, I could simply restart and force a slightly higher FFT length, either for the rest of the run or just the next few checkpoints. In the case I hit, it got through ~80% of the run before hitting the error, and based on the above all-exponents-in-this-range data said error was clearly an outlier, so simply running for a few minutes at ~6M FFT, then reverting to default, would have worked around the problem. I'll probably make that my SOP going forward since I have a lot of familiarity with max-expo/FFT-length numbers, but your average user will have no clue.

kriesel 2020-06-14 21:22

[QUOTE=ewmayer;547999]This would appear to confirm my long-running stance that it is unreasonable to expect ROEs to behave in perfectly monotone fashion with exponent at a given FFT size, so we should set our breakover points as aggressively as reasonably possible, but build in mechanisms to gracefully handle such high-ROE cases which (as determined by the GEC) are not due to data corruption. Various internal flags to ratchet up the convolution floating-point accuracy at a given FFT length are the obvious first line of defense, but a simple "if said flags are already at their most-accurate settings, and we still hit a dangerously high ROE, just switch to the next-larger FFT length and complete the current run at that" is the obvious last line of defense in my view.

(I do get the "using such dodges disincentivizes us coders from working to our utmost to rein in ROEs" moral-hazard aspect of the issue, though.)[/QUOTE]
The probability of an excessive roundoff in p iterations or less is what matters. If the probability of an ROE is low in 10p our chances of completion are pretty good. I did a lengthy study long ago, on gpuowl v1.9's -fft M61 -size 4M, of what the exponent limit could be. See the attachment and description at [URL]https://www.mersenneforum.org/showpost.php?p=498231&postcount=8[/URL].
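
To make the per-run view concrete: if each iteration is assumed to fail independently with the same small probability q (a simplifying assumption, not something measured here), the chance of at least one excessive roundoff in n iterations is 1 - (1 - q)^n. A minimal sketch, with hypothetical function names:

```python
import math

def run_failure_probability(q, n):
    """P(at least one excessive roundoff in n iterations), per-iteration chance q.

    Assumes independent, identically distributed iterations.
    """
    # log1p/expm1 keep precision when q is tiny and n is around 1e8.
    return -math.expm1(n * math.log1p(-q))

def per_iteration_budget(target, n):
    """Largest per-iteration chance q that keeps the whole-run chance at target."""
    return -math.expm1(math.log1p(-target) / n)
```

Under that model, holding a whole run of p = 105965933 iterations to the 0.5% target mentioned earlier in the thread means `per_iteration_budget(0.005, 105965933)`, a per-iteration chance on the order of 5e-11.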

If you'd like some end user/ tester participation in finding or testing the limits for various fft lengths, please share.

CUDALucas reports any fft length changes it makes due to RO values in its console output, either increases due to high RO values, or decreases after a stretch of low RO values on increased fft length. Gpuowl could report its RO values somewhere too. Then that could be gathered and fed back into program updates.

We users trust you to push the exponent limit per fft length within reason. No miracles expected.

Prime95 2020-06-15 01:41

[QUOTE=ewmayer;547999]you know what would be really useful? Enhanced error reporting that informs the user whether that 'EE' was due to simple GEC failure or dangerous-ROE-detected.[/QUOTE]

gpuowl does not know if the error is ROE related -- all it knows is that GEC failed.
You can try running with "-use STATS" to see what the ROE is once an EE occurs. Unlike CPUs, determining ROE seems a bit expensive so gpuowl never turns that on by default.

I do not know if you can switch to a 6M length from a 5.5M savefile and vice versa. Mihai needs to weigh in.

ewmayer 2020-06-15 02:38

@George:

pretty sure savefiles are FFT-length-independent ... I played with some force-lower-than-default-FFT-length tries a couple months ago for some expos just above the then-5M/5.5M threshold. At least one made it to iter 1M before hitting repeatable roundoff errors, was able to resume at the default 5.5M FFT length w/o issues.

Re. -use STATS, useful to know, but again all the onus is on the user. How about the following simple scheme? If a run hits a GEC error, program automatically enables -use STATS for the retry-from-last-good-GEC-checkpoint interval. Then, 1 of 2 things happens:

1. Retry fails repeatably: in this case program aborts, but user has some hopefully-useful ROE data to go on, to see if resuming at the next-larger FFT length is likely to be useful.

2. Retry succeeds: in this case program automatically shuts off -use STATS, and all that has been lost is a tiny bit of runtime running the retry interval in slower ROE-data-gathering mode.
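
The two-outcome scheme above can be sketched as a small driver loop. Everything here is hypothetical: `run_interval` stands in for "run one GEC interval from the last good checkpoint", and its return shape is an assumption of this sketch, not gpuowl's interface.

```python
def run_with_stats_on_retry(run_interval, max_retries=3):
    """Run one check interval, enabling ROE stats only while retrying.

    run_interval(collect_stats) is a hypothetical callback returning
    (gec_ok, max_roe), where max_roe is None when stats are off.
    Returns (outcome, list of max ROE values seen during retries).
    """
    gec_ok, _ = run_interval(False)  # normal fast path, stats off
    if gec_ok:
        return "ok", []
    roe_seen = []
    for _ in range(max_retries):
        # Retry from the last good checkpoint in slower stats-gathering mode.
        gec_ok, max_roe = run_interval(True)
        roe_seen.append(max_roe)
        if gec_ok:
            # Case 2: retry succeeded, so stats go back off for later intervals.
            return "ok-after-retry", roe_seen
    # Case 1: failure repeats; abort, but hand the user the ROE data.
    return "abort", roe_seen
```

The only cost in the success case is the retry interval itself running in the slower mode; the abort case returns the collected ROE values so the user can judge whether a larger FFT length is likely to help.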

Xyzzy 2020-06-15 11:30

[QUOTE=Prime95;547997]P.S. We've talked about adding a command line option that lets you be more aggressive or more conservative.[/QUOTE]Please add this!

:mike:

preda 2020-06-15 23:03

[QUOTE=Prime95;548009]
I do not know if you can switch to a 6M length from a 5.5M savefile and vice versa. Mihai needs to weigh in.[/QUOTE]

The savefiles are FFT-length agnostic. They only store the "compacted" residue which is not affected by FFT-len in any way. That's why one can restart the savefile with any FFT-length, changing the FFT size midway.
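
A toy illustration of why a compacted residue is independent of FFT length: the same integer can be split into any number of variable-size words and packed back unchanged. This is a simplified sketch using unsigned words and the usual ceil-based word-size rule; the function names are made up here, and it is not gpuowl's actual savefile code (which, among other things, works with balanced digits).

```python
def ceil_div(a, b):
    """Ceiling of a / b for non-negative integers."""
    return -(-a // b)

def word_bits(p, n):
    """Bits per word when a p-bit residue is split into n words.

    Word j gets ceil(p*(j+1)/n) - ceil(p*j/n) bits, so the sizes sum to p.
    """
    return [ceil_div(p * (j + 1), n) - ceil_div(p * j, n) for j in range(n)]

def expand(residue, p, n):
    """Split an integer residue into n words, least significant first."""
    words = []
    for bits in word_bits(p, n):
        words.append(residue & ((1 << bits) - 1))
        residue >>= bits
    return words

def compact(words, p):
    """Inverse of expand: pack the words back into one integer."""
    residue, shift = 0, 0
    for w, bits in zip(words, word_bits(p, len(words))):
        residue |= w << shift
        shift += bits
    return residue
```

Expanding with one word count, compacting, and re-expanding with another word count round-trips the same integer, which is the property that lets a run resume at a different FFT size.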

ATH 2020-06-16 00:00

[QUOTE=preda;548114]The savefiles are FFT-length agnostic. [/QUOTE]

So the savefiles believe that the existence of an FFT-length is unknowable :wink:

ewmayer 2020-06-16 00:15

Using the current release, I did a bit of ROE-stats-collecting this afternoon for 2 nearby expos being run side-by-side on one of my R7s. Running with '-use STATS' increased the per-iter time (at sclk=4, mclk=1150) from 1260 us/iter to 1390 us/iter. Here are the ROE data snips and the resulting mean-of-means values:
[code]105821119: iters 103.4-105.8M:
Roundoff: N=200900, mean 0.226482, SD 0.012404, CV 0.054769, max 0.327033, z 22.1 (pErr 0.003102%) 105821119 OK 103600000 97.90%; 1399 us/it; ETA 0d 00:52; 3b0fbc4edac12157 (check 0.68s)
Roundoff: N=200900, mean 0.226494, SD 0.012430, CV 0.054879, max 0.353670, z 22.0 (pErr 0.003292%) 105821119 OK 103800000 98.09%; 1392 us/it; ETA 0d 00:47; fda5538c36e14974 (check 0.68s)
Roundoff: N=200900, mean 0.226526, SD 0.012407, CV 0.054769, max 0.339914, z 22.0 (pErr 0.003133%) 105821119 OK 104000000 98.28%; 1392 us/it; ETA 0d 00:42; dac0fe9bf026a7e4 (check 0.68s)
Roundoff: N=200900, mean 0.226531, SD 0.012409, CV 0.054777, max 0.328995, z 22.0 (pErr 0.003150%) 105821119 OK 104200000 98.47%; 1390 us/it; ETA 0d 00:38; e104c00665ca400e (check 0.68s)
Roundoff: N=200900, mean 0.226523, SD 0.012406, CV 0.054769, max 0.346469, z 22.0 (pErr 0.003131%) 105821119 OK 104400000 98.66%; 1390 us/it; ETA 0d 00:33; c8aeb5d821646c3c (check 0.68s)
Roundoff: N=200900, mean 0.226497, SD 0.012425, CV 0.054857, max 0.328152, z 22.0 (pErr 0.003257%) 105821119 OK 104600000 98.85%; 1389 us/it; ETA 0d 00:28; dbe95ac9ddcd67a8 (check 0.69s)
Roundoff: N=200900, mean 0.226491, SD 0.012420, CV 0.054836, max 0.327089, z 22.0 (pErr 0.003218%) 105821119 OK 104800000 99.03%; 1389 us/it; ETA 0d 00:24; ac78b9f17a02af19 (check 0.69s)
Roundoff: N=200900, mean 0.226514, SD 0.012401, CV 0.054749, max 0.331021, z 22.1 (pErr 0.003092%) 105821119 OK 105000000 99.22%; 1390 us/it; ETA 0d 00:19; c4dc4c3e9c8edc6e (check 0.68s)
Roundoff: N=200900, mean 0.226506, SD 0.012407, CV 0.054775, max 0.325616, z 22.0 (pErr 0.003129%) 105821119 OK 105200000 99.41%; 1389 us/it; ETA 0d 00:14; e6e00ed490f107dd (check 0.68s)
Roundoff: N=200900, mean 0.226507, SD 0.012424, CV 0.054848, max 0.325021, z 22.0 (pErr 0.003250%) 105821119 OK 105400000 99.60%; 1391 us/it; ETA 0d 00:10; b00e6b65601c296f (check 0.69s)
Roundoff: N=200900, mean 0.226472, SD 0.012379, CV 0.054662, max 0.332776, z 22.1 (pErr 0.002929%) 105821119 OK 105600000 99.79%; 1392 us/it; ETA 0d 00:05; bb8ac76d5ae921d6 (check 0.69s)
Roundoff: N=200900, mean 0.226470, SD 0.012425, CV 0.054863, max 0.345496, z 22.0 (pErr 0.003248%) 105821119 OK 105800000 99.98%; 1392 us/it; ETA 0d 00:00; b56dc80afad59a5b (check 0.73s)[/code]
mean-of-mean 0.22650108, mean-of-max 0.3342710, max-of-max 0.353670
[code]105821173: iters 4.6-7.2M:
Roundoff: N=200900, mean 0.228696, SD 0.012700, CV 0.055533, max 0.320450, z 21.4 (pErr 0.007500%) 105821173 OK 4800000 4.54%; 1392 us/it; ETA 1d 15:03; 4e9cef0e4d6ffc3d (check 0.68s)
Roundoff: N=200900, mean 0.228754, SD 0.012721, CV 0.055610, max 0.335303, z 21.3 (pErr 0.007888%) 105821173 OK 5000000 4.72%; 1393 us/it; ETA 1d 15:00; 847d6470eadf87ae (check 0.72s)
Roundoff: N=200900, mean 0.228813, SD 0.012756, CV 0.055750, max 0.339807, z 21.3 (pErr 0.008560%) 105821173 OK 5200000 4.91%; 1391 us/it; ETA 1d 14:53; de2ab5cc0bce1465 (check 0.72s)
Roundoff: N=200900, mean 0.228745, SD 0.012685, CV 0.055456, max 0.334091, z 21.4 (pErr 0.007297%) 105821173 OK 5400000 5.10%; 1390 us/it; ETA 1d 14:47; 518bc5da2be5034a (check 0.71s)
Roundoff: N=200900, mean 0.228759, SD 0.012711, CV 0.055563, max 0.350123, z 21.3 (pErr 0.007720%) 105821173 OK 5600000 5.29%; 1390 us/it; ETA 1d 14:42; 6ddc2a8aae1cb6c1 (check 0.69s)
Roundoff: N=200900, mean 0.228756, SD 0.012724, CV 0.055620, max 0.356895, z 21.3 (pErr 0.007934%) 105821173 OK 5800000 5.48%; 1389 us/it; ETA 1d 14:35; f7ecbf6f4fdf0ded (check 0.68s)
Roundoff: N=200900, mean 0.228820, SD 0.012775, CV 0.055831, max 0.346250, z 21.2 (pErr 0.008922%) 105821173 OK 6000000 5.67%; 1389 us/it; ETA 1d 14:30; 121eb8b46de12e8f (check 0.69s)
Roundoff: N=200900, mean 0.228775, SD 0.012750, CV 0.055734, max 0.325508, z 21.3 (pErr 0.008422%) 105821173 OK 6200000 5.86%; 1390 us/it; ETA 1d 14:28; 78780bbb15a42bf2 (check 0.68s)
Roundoff: N=200900, mean 0.228750, SD 0.012727, CV 0.055637, max 0.337813, z 21.3 (pErr 0.007986%) 105821173 OK 6400000 6.05%; 1389 us/it; ETA 1d 14:22; a3aea29cb5719574 (check 0.67s)
Roundoff: N=200900, mean 0.228768, SD 0.012714, CV 0.055577, max 0.342355, z 21.3 (pErr 0.007786%) 105821173 OK 6600000 6.24%; 1390 us/it; ETA 1d 14:19; cfb31fd255b74d50 (check 0.68s)
Roundoff: N=200900, mean 0.228778, SD 0.012723, CV 0.055612, max 0.325987, z 21.3 (pErr 0.007940%) 105821173 OK 6800000 6.43%; 1392 us/it; ETA 1d 14:17; 4d3d54f47b50144c (check 0.68s)
Roundoff: N=200900, mean 0.228789, SD 0.012749, CV 0.055724, max 0.330346, z 21.3 (pErr 0.008406%) 105821173 OK 7000000 6.61%; 1380 us/it; ETA 1d 13:52; 01c96f0b51a616fe (check 0.70s)
Roundoff: N=200900, mean 0.228712, SD 0.012674, CV 0.055415, max 0.345660, z 21.4 (pErr 0.007100%) 105821173 OK 7200000 6.80%; 1433 us/it; ETA 1d 15:15; 332230a88f05c8e4 (check 0.70s)[/code]
mean-of-mean 0.22876269, mean-of-max 0.3377375, max-of-max 0.356895
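
The summary values above can be reproduced from the raw lines with a few lines of script. A minimal sketch, assuming the "Roundoff:" line layout shown in the blocks above (the regex and function name are illustrative):

```python
import re

# Captures the "mean" and "max" fields of each "Roundoff:" line.
ROUNDOFF_RE = re.compile(r"Roundoff: N=\d+, mean ([\d.]+), .*?max ([\d.]+),")

def summarize(log_text):
    """Return mean-of-mean, mean-of-max and max-of-max over all Roundoff lines."""
    rows = [(float(m.group(1)), float(m.group(2)))
            for m in ROUNDOFF_RE.finditer(log_text)]
    if not rows:
        raise ValueError("no Roundoff lines found")
    means = [mean for mean, _ in rows]
    maxes = [mx for _, mx in rows]
    return {
        "mean_of_mean": sum(means) / len(means),
        "mean_of_max": sum(maxes) / len(maxes),
        "max_of_max": max(maxes),
    }
```

Feeding it the text of one [code] block at a time gives that exponent's three summary numbers.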

For these 2 particular expos and iteration intervals the mean-of-means and max-of-max values behave as expected. George, Mihai, do you guys do larger-scale such tests for each ROE-affecting code change?

The slowdown involved was sufficiently modest that I would like to again suggest a code tweak to turn on the stats-gathering for every interval-retry resulting from a run hitting a GEC mismatch, and switching it back off should the ensuing retry result cure the GEC error.

Prime95 2020-06-16 00:34

[QUOTE=ewmayer;548118]George, Mihai, do you guys do larger-scale such tests for each ROE-affecting code change?

The slowdown involved was sufficiently modest that I would like to again suggest a code tweak to turn on the stats-gathering for every interval-retry resulting from a run hitting a GEC mismatch, and switching it back off should the ensuing retry result cure the GEC error.[/QUOTE]

In the past, we have not done large-scale ROE testing. It's only been in the last 2 or 3 months that we've put in the hard work to gather stats, come up with different code paths to increase accuracy, and come up with sensible FFT crossovers. Note that we should do ROE sanity checks for major code changes as well as ROCm releases. No attempt has been made to compare ROE on Windows to ROE on Linux. No attempt has been made to look at nVidia ROE.

In the last release, I only really studied ROE for FFT lengths from 3*512K to 15*512K. Those testing 100Mdigit numbers with an 18M FFT might experience some less than optimal crossovers.

BTW, the latest version introduced MIDDLE=13,14,15 so that we should have optimized code for first time tests for quite a while.

Yes, the 3 errors code path could be better. It could retry once, then switch to next higher accuracy level, then try STATS, etc. I'll leave that to Mihai who is up to his eyeballs in PRP proofs. Don't expect anything soon!


All times are UTC.