mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing > GpuOwl

Old 2020-06-14, 18:51   #2322
ewmayer
2ω=0
 
Sep 2002
República de California

10110001111010₂ Posts

Last night's before-going-to-bed run check showed that one of the 2 jobs on card2 of my 3-R7 system had aborted due to repeated errors:
Code:
2020-06-13 17:46:50 412688e172fd62d9 105965933 OK 87200000  82.29%; 1390 us/it; ETA 0d 07:15; 3bd02b5e45382bcc (check 0.84s)
2020-06-13 17:51:28 412688e172fd62d9 105965933 EE 87400000  82.48%; 1386 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.79s)
2020-06-13 17:51:29 412688e172fd62d9 105965933 OK 87200000 loaded: blockSize 400, 3bd02b5e45382bcc
2020-06-13 17:53:49 412688e172fd62d9 105965933 OK 87300000  82.38%; 1389 us/it; ETA 0d 07:12; 2761f6451aecc6dc (check 0.79s) 1 errors
2020-06-13 17:56:08 412688e172fd62d9 105965933 EE 87400000  82.48%; 1386 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.74s) 1 errors
2020-06-13 17:56:09 412688e172fd62d9 105965933 OK 87300000 loaded: blockSize 400, 2761f6451aecc6dc
2020-06-13 17:57:19 412688e172fd62d9 105965933 OK 87350000  82.43%; 1388 us/it; ETA 0d 07:11; 4d1b42bc1dbcf1e1 (check 0.75s) 2 errors
2020-06-13 17:58:29 412688e172fd62d9 105965933 EE 87400000  82.48%; 1392 us/it; ETA 0d 07:11; 083b13bb609c0724 (check 0.79s) 2 errors
2020-06-13 17:58:30 412688e172fd62d9 105965933 OK 87350000 loaded: blockSize 400, 4d1b42bc1dbcf1e1
2020-06-13 17:59:40 412688e172fd62d9 105965933 EE 87400000  82.48%; 1389 us/it; ETA 0d 07:10; 083b13bb609c0724 (check 0.77s) 3 errors
2020-06-13 17:59:41 412688e172fd62d9 105965933 OK 87350000 loaded: blockSize 400, 4d1b42bc1dbcf1e1
2020-06-13 18:00:51 412688e172fd62d9 105965933 EE 87400000  82.48%; 1385 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.81s) 4 errors
2020-06-13 18:00:51 412688e172fd62d9 3 sequential errors, will stop.
2020-06-13 18:00:51 412688e172fd62d9 Exiting because "too many errors"
2020-06-13 18:00:51 412688e172fd62d9 Bye
Attempted restart of run hit same issue ... possibly the GEC residue got corrupted somehow? Card has been running at [linux, ROCm] sclk=3, mclk=1150MHz very stably for last month, temps were fine, and other job running on same card suffered no issues. By way of temporary workaround moved the above assignment to bottom of worktodo file and restarted, no issues with the next assignment.
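
The pattern in the log above (the same EE residue at iteration 87400000 on every retry) can be checked mechanically. A minimal Python sketch, assuming the log line layout shown above; `summarize_checks` is a made-up helper name and not part of gpuowl:

```python
import re
from collections import defaultdict

# Matches check lines such as:
# 2020-06-13 17:51:28 412688e172fd62d9 105965933 EE 87400000  82.48%; ...; 083b13bb609c0724 (check 0.79s)
CHECK_RE = re.compile(
    r"^\S+ \S+ \S+ (?P<expo>\d+) (?P<status>OK|EE)\s+(?P<iter>\d+)\s+[\d.]+%;"
    r".*; (?P<res>[0-9a-f]{16}) \(check"
)

def summarize_checks(lines):
    """Group check results by (status, iteration); collect residues and counts."""
    seen = defaultdict(set)
    counts = defaultdict(int)
    for line in lines:
        m = CHECK_RE.match(line)
        if m is None:
            continue  # "loaded:", "will stop", "Bye" etc. are skipped
        key = (m.group("status"), int(m.group("iter")))
        seen[key].add(m.group("res"))
        counts[key] += 1
    return seen, counts
```

Run over the log above, this reports a single residue (083b13bb609c0724) for every EE at 87400000. An identical wrong residue on each retry points to something reproducible rather than a random hardware glitch, which fits the roundoff explanation given later in the thread.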

Should I try copying 105965933-old.owl to 105965933.owl and restarting?

Last fiddled with by ewmayer on 2020-06-14 at 18:52
Old 2020-06-14, 19:59   #2323
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2×5×701 Posts

Quote:
Originally Posted by ewmayer
Last night's before-going-to-bed run check showed that one of the 2 jobs on card2 of my 3-R7 system had aborted due to repeated errors:

Should I try copying 105965933-old.owl to 105965933.owl and restarting?
Please try building from the latest source. A week or so ago I checked in changes that target a 0.5% chance of roundoff failure during the very top end of the FFT where all accuracy options are turned on. It also targets a 0.1% probability of error if only some or none of the accuracy options are turned on.

The previous version was not as good at keeping the failure probability that low.

P.S. We've talked about adding a command line option that lets you be more aggressive or more conservative. Not yet implemented.

Last fiddled with by Prime95 on 2020-06-14 at 20:00
Old 2020-06-14, 20:57   #2324
ewmayer
2ω=0
 
Sep 2002
República de California

26172₈ Posts

Quote:
Originally Posted by Prime95
Please try building from the latest source. A week or so ago I checked in changes that target a 0.5% chance of roundoff failure during the very top end of the FFT where all accuracy options are turned on.
Ah yes, I see the expo in question is very near the upper limit of what we can expect to be able to run at 5.5M FFT - I had run multiple PRPs of expos in that range, some even larger, without issue, thus hearing it was likely due to ROE was unexpected. Of course the vagueness of that 'EE' error code emitted by the program does not help in that regard.

Retry with new build (and worktodo refiddled to restore problematic assignment to top) looks good, but on the "mysteries of the ROCm code management engine" front, previously the 2 jobs were getting ~1365 us/iter each, but now job1 (still using older build) is running at 1710 us/iter whereas new-build-using job2 is at 1160 us/iter. Total throughput more or less same, though.

Oh, here is the list of all the expos I've run in the 105M+ range on that system - the problematic one is starred, you can see there are multiple larger ones which caused no problem using the same build:
Code:
105615283
105712007
105809183
105809189
105809299
105809351
105810461
105810857
105810857
105810979
105813713
105813853
105813859
105813941
105815543
105815581
105815627
105840467
105843821
105844853
105846053
105892097
105892109
105892159
105892211
105892307
105892399
105892459
105892693
105893233
105893321
105894211
105894319
105894329
105894947
105895001
105895157
105895297
105897229
105897839
105900511
105900539
105900703
105900797
105904387
105904529
105904739
105949913
105950011
105956441
105959179
105963503
105964751
105965887
105965933*
105969967
105973331
105980201
105981719
105981937
105984089
105984521
105987407
105987709
105989249
105991979
105992911
105995317
105997069
105997247
105998617
This would appear to confirm my long-running stance that it is unreasonable to expect ROEs to behave in perfectly monotone fashion with exponent at a given FFT size, so we should set our breakover points as aggressively as reasonably possible, but build in mechanisms to gracefully handle such high-ROE cases which (as determined by the GEC) are not due to data corruption. Various internal flags to ratchet up the convolution floating-point accuracy at a given FFT length are the obvious first line of defense, but a simple "if said flags are already at their most-accurate settings, and we still hit a dangerously high ROE, just switch to the next-larger FFT length and complete the current run at that" is the obvious last line of defense in my view.

(I do get the "using such dodges disincentivizes us coders from working to our utmost to rein in ROEs" moral-hazard aspect of the issue, though.)

Edit: Oh - you mention an as-yet-unimplemented CL flag ... you know what would be really useful? Enhanced error reporting that informs the user whether that 'EE' was due to simple GEC failure or dangerous-ROE-detected. Say I still hit an ROE-related failure using the current build, and there is no new build promising more accuracy. If I could discern from the log that ROEs were behind the abort, I could simply restart and force a slightly higher FFT length, either for the rest of the run or just the next few checkpoints. In the case I hit, it got through ~80% of the run before hitting the error; based on the above all-exponents-in-this-range data, said error was clearly an outlier, so simply running for a few minutes at 6M FFT, then reverting to default would have worked around the problem. I'll probably make that my SOP going forward since I have a lot of familiarity with max-expo/FFT-length numbers, but your average user will have no clue.
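
For a feel of how close 105965933 sits to the 5.5M limit, the average number of bits carried per FFT word is simply exponent / FFT length. A small Python sketch, assuming "M" here means 2^20 words (consistent with the 512K multiples mentioned elsewhere in this thread); the helper name is illustrative only:

```python
def bits_per_word(exponent: int, fft_m: float) -> float:
    """Average bits per FFT word, assuming fft_m is in units of 2**20 words."""
    words = int(fft_m * 2**20)
    return exponent / words

for fft_m in (5.5, 6.0):
    print(f"{fft_m}M: {bits_per_word(105965933, fft_m):.3f} bits/word")
```

This gives about 18.37 bits/word at 5.5M versus about 16.84 at 6M, so a short detour to 6M buys roughly 1.5 bits/word of headroom.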

Last fiddled with by ewmayer on 2020-06-14 at 21:06
Old 2020-06-14, 21:22   #2325
kriesel
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1062₁₆ Posts

Quote:
Originally Posted by ewmayer
This would appear to confirm my long-running stance that it is unreasonable to expect ROEs to behave in perfectly monotone fashion with exponent at a given FFT size, so we should set our breakover points as aggressively as reasonably possible, but build in mechanisms to gracefully handle such high-ROE cases which (as determined by the GEC) are not due to data corruption. Various internal flags to ratchet up the convolution floating-point accuracy at a given FFT length are the obvious first line of defense, but a simple "if said flags are already at their most-accurate settings, and we still hit a dangerously high ROE, just switch to the next-larger FFT length and complete the current run at that" is the obvious last line of defense in my view.

(I do get the "using such dodges disincentivizes us coders from working to our utmost to rein in ROEs" moral-hazard aspect of the issue, though.)
The probability of an excessive roundoff in p iterations or less is what matters. If the probability of an ROE is low in 10p our chances of completion are pretty good. I did a lengthy study long ago, on gpuowl v1.9's -fft M61 -size 4M, of what the exponent limit could be. See the attachment and description at https://www.mersenneforum.org/showpo...31&postcount=8.
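
To put rough numbers on "probability of an excessive roundoff in p iterations": if each iteration independently has probability q of a fatal roundoff (independence is an assumption made for this sketch), the chance of at least one in n iterations is 1 - (1 - q)^n. In Python, with illustrative function names:

```python
import math

def run_failure_probability(q: float, n: int) -> float:
    """P(at least one failure in n independent iterations), each with probability q."""
    # log1p/expm1 keep precision when q is tiny and n is huge
    return -math.expm1(n * math.log1p(-q))

def per_iteration_budget(target: float, n: int) -> float:
    """Per-iteration probability q that yields a whole-run failure probability of target."""
    return -math.expm1(math.log1p(-target) / n)
```

For an exponent near 106M, a whole-run target of 0.5% (if the figure quoted earlier in the thread is read as per test) corresponds to a per-iteration budget on the order of 5e-11.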

If you'd like some end user/ tester participation in finding or testing the limits for various fft lengths, please share.

CUDALucas reports any fft length changes it makes due to RO values in its console output, either increases due to high RO values, or decreases after a stretch of low RO values on increased fft length. Gpuowl could report its own somewhere too. Then that could be gathered and fed back into program updates.

We users trust you to push the exponent limit per fft length within reason. No miracles expected.

Last fiddled with by kriesel on 2020-06-14 at 21:41
Old 2020-06-15, 01:41   #2326
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2·5·701 Posts

Quote:
Originally Posted by ewmayer
you know what would be really useful? Enhanced error reporting that informs the user whether that 'EE' was due to simple GEC failure or dangerous-ROE-detected.
gpuowl does not know if the error is ROE related -- all it knows is that GEC failed.
You can try running with "-use STATS" to see what the ROE is once an EE occurs. Unlike CPUs, determining ROE seems a bit expensive so gpuowl never turns that on by default.

I do not know if you can switch to a 6M length from a 5.5M savefile and vice versa. Mihai needs to weigh in.
Old 2020-06-15, 02:38   #2327
ewmayer
2ω=0
 
Sep 2002
República de California

2×5,693 Posts

@George:

pretty sure savefiles are FFT-length-independent ... I played with some force-lower-than-default-FFT-length tries a couple months ago for some expos just above the then-5M/5.5M threshold. At least one made it to iter 1M before hitting repeatable roundoff errors, and I was able to resume at the default 5.5M FFT length w/o issues.

Re. -use STATS, useful to know, but again all the onus is on the user. How about the following simple scheme? If a run hits a GEC error, program automatically enables -use STATS for the retry-from-last-good-GEC-checkpoint interval. Then, 1 of 2 things happens:

1. Retry fails repeatably: in this case program aborts, but user has some hopefully-useful ROE data to go on, to see if resuming at the next-larger FFT length is likely to be useful.

2. Retry succeeds: in this case program automatically shuts off -use STATS, and all that has been lost is a tiny bit of runtime running the retry interval in slower ROE-data-gathering mode.
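
The two-outcome scheme above could look roughly like this. It is a Python sketch of the control flow only, not gpuowl code: `run_interval` is a hypothetical callback standing in for "run from the last good checkpoint to the next GEC check", and the retry limit of 3 is borrowed from the "3 sequential errors" message in the log earlier in the thread.

```python
from typing import Callable, Optional, Tuple

# run_interval(stats_enabled) -> (gec_ok, max_roundoff or None if stats were off)
RunInterval = Callable[[bool], Tuple[bool, Optional[float]]]

def retry_with_stats(run_interval: RunInterval, max_retries: int = 3):
    """After a GEC failure, retry the interval with stats gathering on.

    Returns ("resume", roundoffs) if a retry passes the check, in which case
    the caller carries on with stats off again, or ("abort", roundoffs) if
    every retry fails. The collected roundoff maxima are returned either way.
    """
    roundoffs = []
    for _ in range(max_retries):
        ok, max_roe = run_interval(True)  # stats on only for the retry
        if max_roe is not None:
            roundoffs.append(max_roe)
        if ok:
            return "resume", roundoffs
    return "abort", roundoffs
```

In the abort case the user at least has ROE numbers in hand when deciding whether to resume at the next-larger FFT length.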

Last fiddled with by ewmayer on 2020-06-15 at 02:39
Old 2020-06-15, 11:30   #2328
Xyzzy
 
"Mike"
Aug 2002

7,573 Posts

Quote:
Originally Posted by Prime95
P.S. We've talked about adding a command line option that lets you be more aggressive or more conservative.
Please add this!

Old 2020-06-15, 23:03   #2329
preda
 
"Mihai Preda"
Apr 2015

47A₁₆ Posts

Quote:
Originally Posted by Prime95
I do not know if you can switch to a 6M length from a 5.5M savefile and vice versa. Mihai needs to weigh in.
The savefiles are FFT-length agnostic. They only store the "compacted" residue which is not affected by FFT-len in any way. That's why one can restart the savefile with any FFT-length, changing the FFT size midway.
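
One way to picture why this works: the residue is just an integer mod 2^p - 1, and the FFT length only decides how that integer is chopped into words during the computation. A Python sketch using the standard variable-word-size split from the irrational-base DWT (word i gets ceil(p(i+1)/N) - ceil(p*i/N) bits). The function names are illustrative, tiny N is used for readability, and gpuowl's actual savefile layout is not shown here.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)

def unpack(residue: int, p: int, n: int) -> list:
    """Split residue (0 <= residue < 2**p) into n little-endian words."""
    words = []
    for i in range(n):
        bits = ceil_div(p * (i + 1), n) - ceil_div(p * i, n)
        words.append(residue & ((1 << bits) - 1))
        residue >>= bits
    return words

def pack(words: list, p: int) -> int:
    """Inverse of unpack: word i starts at bit position ceil(p*i/n)."""
    n = len(words)
    residue = 0
    for i, w in enumerate(words):
        residue |= w << ceil_div(p * i, n)
    return residue
```

Unpacking the same residue with n = 8 and with n = 11 gives different word lists, but both pack back to the same integer. That is the sense in which a file holding only the packed residue does not care which FFT length produced it.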

Last fiddled with by preda on 2020-06-15 at 23:04
Old 2020-06-16, 00:00   #2330
ATH
Einyen
 
Dec 2003
Denmark

2,879 Posts

Quote:
Originally Posted by preda
The savefiles are FFT-length agnostic.
So the savefiles believe that the existence of an FFT-length is unknowable
Old 2020-06-16, 00:15   #2331
ewmayer
2ω=0
 
Sep 2002
República de California

2×5,693 Posts

Using the current release, I did a bit of ROE-stats-collecting this afternoon for 2 nearby expos being run side-by-side on one of my R7s. Running with '-use STATS' increased the per-iter time (at sclk=4, mclk=1150) from 1260 us/iter to 1390 us/iter. Here are the ROE data snips and the resulting mean-of-means values:
Code:
105821119: iters 103.4-105.8M:
Roundoff: N=200900, mean 0.226482, SD 0.012404, CV 0.054769, max 0.327033, z 22.1 (pErr 0.003102%)	105821119 OK 103600000  97.90%; 1399 us/it; ETA 0d 00:52; 3b0fbc4edac12157 (check 0.68s)
Roundoff: N=200900, mean 0.226494, SD 0.012430, CV 0.054879, max 0.353670, z 22.0 (pErr 0.003292%)	105821119 OK 103800000  98.09%; 1392 us/it; ETA 0d 00:47; fda5538c36e14974 (check 0.68s)
Roundoff: N=200900, mean 0.226526, SD 0.012407, CV 0.054769, max 0.339914, z 22.0 (pErr 0.003133%)	105821119 OK 104000000  98.28%; 1392 us/it; ETA 0d 00:42; dac0fe9bf026a7e4 (check 0.68s)
Roundoff: N=200900, mean 0.226531, SD 0.012409, CV 0.054777, max 0.328995, z 22.0 (pErr 0.003150%)	105821119 OK 104200000  98.47%; 1390 us/it; ETA 0d 00:38; e104c00665ca400e (check 0.68s)
Roundoff: N=200900, mean 0.226523, SD 0.012406, CV 0.054769, max 0.346469, z 22.0 (pErr 0.003131%)	105821119 OK 104400000  98.66%; 1390 us/it; ETA 0d 00:33; c8aeb5d821646c3c (check 0.68s)
Roundoff: N=200900, mean 0.226497, SD 0.012425, CV 0.054857, max 0.328152, z 22.0 (pErr 0.003257%)	105821119 OK 104600000  98.85%; 1389 us/it; ETA 0d 00:28; dbe95ac9ddcd67a8 (check 0.69s)
Roundoff: N=200900, mean 0.226491, SD 0.012420, CV 0.054836, max 0.327089, z 22.0 (pErr 0.003218%)	105821119 OK 104800000  99.03%; 1389 us/it; ETA 0d 00:24; ac78b9f17a02af19 (check 0.69s)
Roundoff: N=200900, mean 0.226514, SD 0.012401, CV 0.054749, max 0.331021, z 22.1 (pErr 0.003092%)	105821119 OK 105000000  99.22%; 1390 us/it; ETA 0d 00:19; c4dc4c3e9c8edc6e (check 0.68s)
Roundoff: N=200900, mean 0.226506, SD 0.012407, CV 0.054775, max 0.325616, z 22.0 (pErr 0.003129%)	105821119 OK 105200000  99.41%; 1389 us/it; ETA 0d 00:14; e6e00ed490f107dd (check 0.68s)
Roundoff: N=200900, mean 0.226507, SD 0.012424, CV 0.054848, max 0.325021, z 22.0 (pErr 0.003250%)	105821119 OK 105400000  99.60%; 1391 us/it; ETA 0d 00:10; b00e6b65601c296f (check 0.69s)
Roundoff: N=200900, mean 0.226472, SD 0.012379, CV 0.054662, max 0.332776, z 22.1 (pErr 0.002929%)	105821119 OK 105600000  99.79%; 1392 us/it; ETA 0d 00:05; bb8ac76d5ae921d6 (check 0.69s)
Roundoff: N=200900, mean 0.226470, SD 0.012425, CV 0.054863, max 0.345496, z 22.0 (pErr 0.003248%)	105821119 OK 105800000  99.98%; 1392 us/it; ETA 0d 00:00; b56dc80afad59a5b (check 0.73s)
mean-of-mean 0.22650108, mean-of-max 0.3342710, max-of-max 0.353670
Code:
105821173: iters 4.6-7.2M:
Roundoff: N=200900, mean 0.228696, SD 0.012700, CV 0.055533, max 0.320450, z 21.4 (pErr 0.007500%)	105821173 OK  4800000   4.54%; 1392 us/it; ETA 1d 15:03; 4e9cef0e4d6ffc3d (check 0.68s)
Roundoff: N=200900, mean 0.228754, SD 0.012721, CV 0.055610, max 0.335303, z 21.3 (pErr 0.007888%)	105821173 OK  5000000   4.72%; 1393 us/it; ETA 1d 15:00; 847d6470eadf87ae (check 0.72s)
Roundoff: N=200900, mean 0.228813, SD 0.012756, CV 0.055750, max 0.339807, z 21.3 (pErr 0.008560%)	105821173 OK  5200000   4.91%; 1391 us/it; ETA 1d 14:53; de2ab5cc0bce1465 (check 0.72s)
Roundoff: N=200900, mean 0.228745, SD 0.012685, CV 0.055456, max 0.334091, z 21.4 (pErr 0.007297%)	105821173 OK  5400000   5.10%; 1390 us/it; ETA 1d 14:47; 518bc5da2be5034a (check 0.71s)
Roundoff: N=200900, mean 0.228759, SD 0.012711, CV 0.055563, max 0.350123, z 21.3 (pErr 0.007720%)	105821173 OK  5600000   5.29%; 1390 us/it; ETA 1d 14:42; 6ddc2a8aae1cb6c1 (check 0.69s)
Roundoff: N=200900, mean 0.228756, SD 0.012724, CV 0.055620, max 0.356895, z 21.3 (pErr 0.007934%)	105821173 OK  5800000   5.48%; 1389 us/it; ETA 1d 14:35; f7ecbf6f4fdf0ded (check 0.68s)
Roundoff: N=200900, mean 0.228820, SD 0.012775, CV 0.055831, max 0.346250, z 21.2 (pErr 0.008922%)	105821173 OK  6000000   5.67%; 1389 us/it; ETA 1d 14:30; 121eb8b46de12e8f (check 0.69s)
Roundoff: N=200900, mean 0.228775, SD 0.012750, CV 0.055734, max 0.325508, z 21.3 (pErr 0.008422%)	105821173 OK  6200000   5.86%; 1390 us/it; ETA 1d 14:28; 78780bbb15a42bf2 (check 0.68s)
Roundoff: N=200900, mean 0.228750, SD 0.012727, CV 0.055637, max 0.337813, z 21.3 (pErr 0.007986%)	105821173 OK  6400000   6.05%; 1389 us/it; ETA 1d 14:22; a3aea29cb5719574 (check 0.67s)
Roundoff: N=200900, mean 0.228768, SD 0.012714, CV 0.055577, max 0.342355, z 21.3 (pErr 0.007786%)	105821173 OK  6600000   6.24%; 1390 us/it; ETA 1d 14:19; cfb31fd255b74d50 (check 0.68s)
Roundoff: N=200900, mean 0.228778, SD 0.012723, CV 0.055612, max 0.325987, z 21.3 (pErr 0.007940%)	105821173 OK  6800000   6.43%; 1392 us/it; ETA 1d 14:17; 4d3d54f47b50144c (check 0.68s)
Roundoff: N=200900, mean 0.228789, SD 0.012749, CV 0.055724, max 0.330346, z 21.3 (pErr 0.008406%)	105821173 OK  7000000   6.61%; 1380 us/it; ETA 1d 13:52; 01c96f0b51a616fe (check 0.70s)
Roundoff: N=200900, mean 0.228712, SD 0.012674, CV 0.055415, max 0.345660, z 21.4 (pErr 0.007100%)	105821173 OK  7200000   6.80%; 1433 us/it; ETA 1d 15:15; 332230a88f05c8e4 (check 0.70s)
mean-of-mean 0.22876269, mean-of-max 0.3377375, max-of-max 0.356895
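
The summary lines above can be reproduced from the raw "Roundoff:" output. A Python sketch, assuming the line layout shown in these snips; `summarize_roundoff` is a made-up helper name. The CV and z columns look consistent with CV = SD/mean and z = (0.5 - mean)/SD (for the first line, 0.012404/0.226482 ≈ 0.0548 and (0.5 - 0.226482)/0.012404 ≈ 22.05), but that is inferred from the numbers, not taken from the gpuowl source.

```python
import re

ROUNDOFF_RE = re.compile(
    r"Roundoff: N=\d+, mean ([\d.]+), SD ([\d.]+), CV [\d.]+, max ([\d.]+)"
)

def summarize_roundoff(lines):
    """Return (mean-of-mean, mean-of-max, max-of-max) over all Roundoff lines."""
    means, maxes = [], []
    for line in lines:
        m = ROUNDOFF_RE.search(line)
        if m:
            means.append(float(m.group(1)))
            maxes.append(float(m.group(3)))
    if not means:
        raise ValueError("no Roundoff lines found")
    return sum(means) / len(means), sum(maxes) / len(maxes), max(maxes)
```

Feeding it the saved log lines for one exponent (for example `summarize_roundoff(open(path))`, with `path` pointing at wherever the log was captured) yields the three figures in the same order as the summary lines.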

For these 2 particular expos and iteration intervals the mean-of-means and max-of-max values behave as expected. George, Mihai, do you guys do larger-scale such tests for each ROE-affecting code change?

The slowdown involved was sufficiently modest that I would like to again suggest a code tweak to turn on the stats-gathering for every interval-retry resulting from a run hitting a GEC mismatch, and switching it back off should the ensuing retry cure the GEC error.
Old 2020-06-16, 00:34   #2332
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

2×5×701 Posts

Quote:
Originally Posted by ewmayer
George, Mihai, do you guys do larger-scale such tests for each ROE-affecting code change?

The slowdown involved was sufficiently modest that I would like to again suggest a code tweak to turn on the stats-gathering for every interval-retry resulting from a run hitting a GEC mismatch, and switching it back off should the ensuing retry cure the GEC error.
In the past, we have not done large-scale ROE testing. It's only been in the last 2 or 3 months that we've put in the hard work to gather stats, come up with different code paths to increase accuracy, and come up with sensible FFT crossovers. Note that we should do ROE sanity checks for major code changes as well as ROCm releases. No attempt has been made to compare ROE on Windows to ROE on Linux. No attempt has been made to look at nVidia ROE.

In the last release, I only really studied ROE for FFT lengths from 3*512K to 15*512K. Those testing 100Mdigit numbers with an 18M FFT might experience some less than optimal crossovers.

BTW, the latest version introduced MIDDLE=13,14,15 so that we should have optimized code for first time tests for quite a while.

Yes, the 3 errors code path could be better. It could retry once, then switch to next higher accuracy level, then try STATS, etc. I'll leave that to Mihai who is up to his eyeballs in PRP proofs. Don't expect anything soon!

Last fiddled with by Prime95 on 2020-06-16 at 00:37
All times are UTC. The time now is 13:56.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.