Last night's before-going-to-bed run check showed that one of the 2 jobs on card2 of my 3-R7 system had aborted due to repeated errors:
[code]2020-06-13 17:46:50 412688e172fd62d9 105965933 OK 87200000 82.29%; 1390 us/it; ETA 0d 07:15; 3bd02b5e45382bcc (check 0.84s)
2020-06-13 17:51:28 412688e172fd62d9 105965933 EE 87400000 82.48%; 1386 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.79s)
2020-06-13 17:51:29 412688e172fd62d9 105965933 OK 87200000 loaded: blockSize 400, 3bd02b5e45382bcc
2020-06-13 17:53:49 412688e172fd62d9 105965933 OK 87300000 82.38%; 1389 us/it; ETA 0d 07:12; 2761f6451aecc6dc (check 0.79s) 1 errors
2020-06-13 17:56:08 412688e172fd62d9 105965933 EE 87400000 82.48%; 1386 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.74s) 1 errors
2020-06-13 17:56:09 412688e172fd62d9 105965933 OK 87300000 loaded: blockSize 400, 2761f6451aecc6dc
2020-06-13 17:57:19 412688e172fd62d9 105965933 OK 87350000 82.43%; 1388 us/it; ETA 0d 07:11; 4d1b42bc1dbcf1e1 (check 0.75s) 2 errors
2020-06-13 17:58:29 412688e172fd62d9 105965933 EE 87400000 82.48%; 1392 us/it; ETA 0d 07:11; 083b13bb609c0724 (check 0.79s) 2 errors
2020-06-13 17:58:30 412688e172fd62d9 105965933 OK 87350000 loaded: blockSize 400, 4d1b42bc1dbcf1e1
2020-06-13 17:59:40 412688e172fd62d9 105965933 EE 87400000 82.48%; 1389 us/it; ETA 0d 07:10; 083b13bb609c0724 (check 0.77s) 3 errors
2020-06-13 17:59:41 412688e172fd62d9 105965933 OK 87350000 loaded: blockSize 400, 4d1b42bc1dbcf1e1
2020-06-13 18:00:51 412688e172fd62d9 105965933 EE 87400000 82.48%; 1385 us/it; ETA 0d 07:09; 083b13bb609c0724 (check 0.81s) 4 errors
2020-06-13 18:00:51 412688e172fd62d9 3 sequential errors, will stop.
2020-06-13 18:00:51 412688e172fd62d9 Exiting because "too many errors"
2020-06-13 18:00:51 412688e172fd62d9 Bye[/code] An attempted restart of the run hit the same issue ... possibly the GEC residue got corrupted somehow? The card has been running at [linux, ROCm] sclk=3, mclk=1150MHz very stably for the last month, temps were fine, and the other job running on the same card suffered no issues.
By way of a temporary workaround I moved the above assignment to the bottom of the worktodo file and restarted; the next assignment ran with no issues. Should I try copying 105965933-old.owl to 105965933.owl and restarting?
[QUOTE=ewmayer;547992]Last night's before-going-to-bed run check showed that one of the 2 jobs on card2 of my 3-R7 system had aborted due to repeated errors:
Should I try copying 105965933-old.owl to 105965933.owl and restarting?[/QUOTE] Please try building from the latest source. A week or so ago I checked in changes that target a 0.5% chance of roundoff failure at the very top end of an FFT length's range when all accuracy options are turned on, and a 0.1% probability of error when only some or none of the accuracy options are turned on. The previous version was not as good at keeping the failure probability that low. P.S. We've talked about adding a command-line option that lets you be more aggressive or more conservative. Not yet implemented.
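As a sanity check on numbers like these, the per-test failure chance implied by a per-iteration error probability can be sketched in a few lines (an illustration only; the independence assumption and the 1e-9 example rate are mine, not gpuowl's actual model):

```python
# Sketch: chance of at least one roundoff error over a full PRP test,
# given an (assumed independent) per-iteration error probability.
# Illustrative only -- not gpuowl's actual accuracy model.

def run_failure_probability(p_err_per_iter: float, iterations: int) -> float:
    """P(at least one ROE in `iterations` squarings)."""
    return 1.0 - (1.0 - p_err_per_iter) ** iterations

# Example: a hypothetical 1e-9 per-iteration error rate over a
# ~106M-iteration PRP test works out to roughly a 10% chance of
# hitting at least one ROE somewhere in the run.
p = run_failure_probability(1e-9, 105_965_933)
print(f"{p:.3%}")
```

The point of the exercise: even a tiny per-iteration probability compounds over a hundred million squarings, which is why the targets above are stated per-run rather than per-iteration.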
[QUOTE=Prime95;547997]Please try building from the latest source. A week or so ago I checked in changes that target a 0.5% chance of roundoff failure during the very top end of the FFT where all accuracy options are turned on.[/QUOTE]
Ah yes, I see the expo in question is very near the top of the range we can expect to be able to run at 5.5M FFT - I had run multiple PRPs of expos in that range, some even larger, without issue, thus hearing it was likely due to ROE was unexpected. Of course the vagueness of that 'EE' error code emitted by the program does not help in that regard. A retry with the new build (and the worktodo file refiddled to restore the problematic assignment to the top) looks good. But on the "mysteries of the ROCm code management engine" front, previously the 2 jobs were getting ~1365 us/iter each, but now job1 (still using the older build) is running at 1710 us/iter whereas the new-build-using job2 is at 1160 us/iter. Total throughput is more or less the same, though. Oh, here is the list of all the expos I've run in the 105M+ range on that system - the problematic one is starred; you can see there are multiple larger ones which caused no problem using the same build: [code]105615283 105712007 105809183 105809189 105809299 105809351 105810461 105810857 105810857 105810979 105813713 105813853 105813859 105813941 105815543 105815581 105815627 105840467 105843821 105844853 105846053 105892097 105892109 105892159 105892211 105892307 105892399 105892459 105892693 105893233 105893321 105894211 105894319 105894329 105894947 105895001 105895157 105895297 105897229 105897839 105900511 105900539 105900703 105900797 105904387 105904529 105904739 105949913 105950011 105956441 105959179 105963503 105964751 105965887 105965933* 105969967 105973331 105980201 105981719 105981937 105984089 105984521 105987407 105987709 105989249 105991979 105992911 105995317 105997069 105997247 105998617[/code] This would appear to confirm my long-running stance that it is unreasonable to expect ROEs to behave in perfectly monotone fashion with exponent at a given FFT size, so we should set our breakover points as aggressively as reasonably possible, but build in mechanisms to gracefully handle such high-ROE cases which (as determined by the GEC) are not due to data
corruption. Various internal flags to ratchet up the convolution floating-point accuracy at a given FFT length are the obvious first line of defense, but a simple "if said flags are already at their most-accurate settings, and we still hit a dangerously high ROE, just switch to the next-larger FFT length and complete the current run at that" is the obvious last line of defense in my view. (I do get the "using such dodges disincentivizes us coders from working to our utmost to rein in ROEs" moral-hazard aspect of the issue, though.)

[b]Edit:[/b] Oh - you mention an as-yet-unimplemented CL flag ... you know what would be really useful? Enhanced error reporting that informs the user whether that 'EE' was due to a simple GEC failure or to a dangerously high ROE being detected. Say I still hit an ROE-related failure using the current build, and there is no new build promising more accuracy. If I could discern from the log that ROEs were behind the abort, I could simply restart and force a slightly higher FFT length, either for the rest of the run or just the next few checkpoints. In the case I hit, the run got through ~80% of the way before hitting the error; based on the above all-exponents-in-this-range data, said error was clearly an outlier, so simply running for a few minutes at ~6M FFT, then reverting to the default, would have worked around the problem. I'll probably make that my SOP going forward since I have a lot of familiarity with max-expo/FFT-length numbers, but your average user will have no clue.
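The escalation ladder described in that post might look something like the following sketch (the accuracy-level names and FFT-length table are hypothetical illustrations, not gpuowl internals):

```python
# Sketch of the proposed error-handling ladder: escalate the per-FFT-length
# accuracy options first, and fall back to the next-larger FFT length only
# when those are exhausted. All names here are hypothetical.

ACCURACY_LEVELS = ["default", "careful", "max"]   # assumed accuracy ladder
FFT_LENGTHS_M = [5.0, 5.5, 6.0, 6.5]              # FFT sizes, in M

def next_config(accuracy: str, fft_m: float):
    """Return the (accuracy, fft) to retry with after a suspected ROE."""
    i = ACCURACY_LEVELS.index(accuracy)
    if i + 1 < len(ACCURACY_LEVELS):
        return ACCURACY_LEVELS[i + 1], fft_m       # first line of defense
    j = FFT_LENGTHS_M.index(fft_m)
    if j + 1 < len(FFT_LENGTHS_M):
        return "default", FFT_LENGTHS_M[j + 1]     # last line of defense
    raise RuntimeError("no larger FFT length available")

print(next_config("default", 5.5))  # ('careful', 5.5)
print(next_config("max", 5.5))      # ('default', 6.0)
```

Because savefiles turn out to be FFT-length agnostic (see later in the thread), the "switch FFT length and resume" branch would not require any savefile conversion.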
[QUOTE=ewmayer;547999]This would appear to confirm my long-running stance that it is unreasonable to expect ROEs to behave in perfectly monotone fashion with exponent at a given FFT size, so we should set our breakover points as aggressively as reasonably possible, but build in mechanisms to gracefully handle such high-ROE cases which (as determined by the GEC) are not due to data corruption. Various internal flags to ratchet up the convolution floating-point accuracy at a given FFT length are the obvious first line of defense, but a simple "if said flags are already at their most-accurate settings, and we still hit a dangerously high ROE, just switch to the next-larger FFT length and complete the current run at that" is the obvious last line of defense in my view.
(I do get the "using such dodges disincentivizes us coders from working to our utmost to rein in ROEs" moral-hazard aspect of the issue, though.)[/QUOTE] The probability of an excessive roundoff in p iterations or less is what matters. If the probability of an ROE in 10p iterations is low, our chances of completion are pretty good. I did a lengthy study long ago, on gpuowl v1.9's -fft M61 -size 4M, of what the exponent limit could be. See the attachment and description at [URL]https://www.mersenneforum.org/showpost.php?p=498231&postcount=8[/URL]. If you'd like some end-user/tester participation in finding or testing the limits for various fft lengths, please share. CUDALucas reports in its console output any fft length changes it makes due to RO values, either increases due to high RO values, or decreases after a stretch of low RO values at an increased fft length. Gpuowl could report this somewhere too. Then that could be gathered and fed back into program updates. We users trust you to push exponent limits per fft length within reason. No miracles expected.
[QUOTE=ewmayer;547999]you know what would be really useful? Enhanced error reporting that informs the user whether that 'EE' was due to simple GEC failure or dangerous-ROE-detected.[/QUOTE]
gpuowl does not know if the error is ROE-related -- all it knows is that the GEC failed. You can try running with "-use STATS" to see what the ROE is once an EE occurs. Unlike on CPUs, determining the ROE seems a bit expensive, so gpuowl never turns that on by default. I do not know if you can switch to a 6M length from a 5.5M savefile and vice versa. Mihai needs to weigh in.
@George:
pretty sure savefiles are FFT-length-independent ... I played with some force-lower-than-default-FFT-length tries a couple of months ago for some expos just above the then-5M/5.5M threshold. At least one made it to iter 1M before hitting repeatable roundoff errors, and I was able to resume at the default 5.5M FFT length w/o issues. Re. -use STATS, useful to know, but again all the onus is on the user. How about the following simple scheme? If a run hits a GEC error, the program automatically enables -use STATS for the retry-from-last-good-GEC-checkpoint interval. Then 1 of 2 things happens:
1. The retry fails repeatably: in this case the program aborts, but the user has some hopefully-useful ROE data to go on, to see whether resuming at the next-larger FFT length is likely to help.
2. The retry succeeds: in this case the program automatically shuts off -use STATS, and all that has been lost is a tiny bit of runtime running the retry interval in the slower ROE-data-gathering mode.
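The two-outcome scheme above can be sketched as a small driver loop (hypothetical; `run_interval` is a stand-in for one GEC block, not a gpuowl function):

```python
# Sketch of the proposed auto '-use STATS' retry scheme. `run_interval`
# is a hypothetical callable standing in for one GEC-checked interval;
# it returns True on a clean check, False on a GEC mismatch.

def retry_with_stats(run_interval, max_retries=3):
    """On a GEC failure, re-run the interval with ROE stats enabled;
    stats turn back off as soon as a retry succeeds."""
    if run_interval(stats=False):
        return "ok"                    # normal path: no stats overhead
    for _ in range(max_retries):
        if run_interval(stats=True):   # retry in ROE-data-gathering mode
            return "recovered"         # outcome 2: stats shut back off
    return "aborted-with-roe-data"     # outcome 1: user has ROE stats
```

The only cost on the happy path is nothing at all; on the unhappy path, the user ends the run holding exactly the ROE data needed to decide whether a larger FFT length would help.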
[QUOTE=Prime95;547997]P.S. We've talked about adding a command line option that lets you be more aggressive or more conservative.[/QUOTE]Please add this!
:mike:
[QUOTE=Prime95;548009]
I do not know if you can switch to a 6M length from a 5.5M savefile and vice versa. Mihai needs to weigh in.[/QUOTE] The savefiles are FFT-length agnostic. They only store the "compacted" residue, which is not affected by FFT-len in any way. That's why one can restart a savefile with any FFT-length, changing the FFT size midway.
[QUOTE=preda;548114]The savefiles are FFT-length agnostic. [/QUOTE]
So the savefile believes that the existence of an FFT length is unknowable :wink:
Using the current release, I did a bit of ROE-stats-collecting this afternoon for 2 nearby expos being run side-by-side on one of my R7s. Running with '-use STATS' increased the per-iter time (at sclk=4, mclk=1150) from 1260 us/iter to 1390 us/iter. Here are the ROE data snips and the resulting mean-of-means values:
[code]105821119: iters 103.4-105.8M:
Roundoff: N=200900, mean 0.226482, SD 0.012404, CV 0.054769, max 0.327033, z 22.1 (pErr 0.003102%)
105821119 OK 103600000 97.90%; 1399 us/it; ETA 0d 00:52; 3b0fbc4edac12157 (check 0.68s)
Roundoff: N=200900, mean 0.226494, SD 0.012430, CV 0.054879, max 0.353670, z 22.0 (pErr 0.003292%)
105821119 OK 103800000 98.09%; 1392 us/it; ETA 0d 00:47; fda5538c36e14974 (check 0.68s)
Roundoff: N=200900, mean 0.226526, SD 0.012407, CV 0.054769, max 0.339914, z 22.0 (pErr 0.003133%)
105821119 OK 104000000 98.28%; 1392 us/it; ETA 0d 00:42; dac0fe9bf026a7e4 (check 0.68s)
Roundoff: N=200900, mean 0.226531, SD 0.012409, CV 0.054777, max 0.328995, z 22.0 (pErr 0.003150%)
105821119 OK 104200000 98.47%; 1390 us/it; ETA 0d 00:38; e104c00665ca400e (check 0.68s)
Roundoff: N=200900, mean 0.226523, SD 0.012406, CV 0.054769, max 0.346469, z 22.0 (pErr 0.003131%)
105821119 OK 104400000 98.66%; 1390 us/it; ETA 0d 00:33; c8aeb5d821646c3c (check 0.68s)
Roundoff: N=200900, mean 0.226497, SD 0.012425, CV 0.054857, max 0.328152, z 22.0 (pErr 0.003257%)
105821119 OK 104600000 98.85%; 1389 us/it; ETA 0d 00:28; dbe95ac9ddcd67a8 (check 0.69s)
Roundoff: N=200900, mean 0.226491, SD 0.012420, CV 0.054836, max 0.327089, z 22.0 (pErr 0.003218%)
105821119 OK 104800000 99.03%; 1389 us/it; ETA 0d 00:24; ac78b9f17a02af19 (check 0.69s)
Roundoff: N=200900, mean 0.226514, SD 0.012401, CV 0.054749, max 0.331021, z 22.1 (pErr 0.003092%)
105821119 OK 105000000 99.22%; 1390 us/it; ETA 0d 00:19; c4dc4c3e9c8edc6e (check 0.68s)
Roundoff: N=200900, mean 0.226506, SD 0.012407, CV 0.054775, max 0.325616, z 22.0 (pErr 0.003129%)
105821119 OK 105200000 99.41%; 1389 us/it; ETA 0d 00:14; e6e00ed490f107dd (check 0.68s)
Roundoff: N=200900, mean 0.226507, SD 0.012424, CV 0.054848, max 0.325021, z 22.0 (pErr 0.003250%)
105821119 OK 105400000 99.60%; 1391 us/it; ETA 0d 00:10; b00e6b65601c296f (check 0.69s)
Roundoff: N=200900, mean 0.226472, SD 0.012379, CV 0.054662, max 0.332776, z 22.1 (pErr 0.002929%)
105821119 OK 105600000 99.79%; 1392 us/it; ETA 0d 00:05; bb8ac76d5ae921d6 (check 0.69s)
Roundoff: N=200900, mean 0.226470, SD 0.012425, CV 0.054863, max 0.345496, z 22.0 (pErr 0.003248%)
105821119 OK 105800000 99.98%; 1392 us/it; ETA 0d 00:00; b56dc80afad59a5b (check 0.73s)[/code] mean-of-mean 0.22650108, mean-of-max 0.3342710, max-of-max 0.353670
[code]105821173: iters 4.6-7.2M:
Roundoff: N=200900, mean 0.228696, SD 0.012700, CV 0.055533, max 0.320450, z 21.4 (pErr 0.007500%)
105821173 OK 4800000 4.54%; 1392 us/it; ETA 1d 15:03; 4e9cef0e4d6ffc3d (check 0.68s)
Roundoff: N=200900, mean 0.228754, SD 0.012721, CV 0.055610, max 0.335303, z 21.3 (pErr 0.007888%)
105821173 OK 5000000 4.72%; 1393 us/it; ETA 1d 15:00; 847d6470eadf87ae (check 0.72s)
Roundoff: N=200900, mean 0.228813, SD 0.012756, CV 0.055750, max 0.339807, z 21.3 (pErr 0.008560%)
105821173 OK 5200000 4.91%; 1391 us/it; ETA 1d 14:53; de2ab5cc0bce1465 (check 0.72s)
Roundoff: N=200900, mean 0.228745, SD 0.012685, CV 0.055456, max 0.334091, z 21.4 (pErr 0.007297%)
105821173 OK 5400000 5.10%; 1390 us/it; ETA 1d 14:47; 518bc5da2be5034a (check 0.71s)
Roundoff: N=200900, mean 0.228759, SD 0.012711, CV 0.055563, max 0.350123, z 21.3 (pErr 0.007720%)
105821173 OK 5600000 5.29%; 1390 us/it; ETA 1d 14:42; 6ddc2a8aae1cb6c1 (check 0.69s)
Roundoff: N=200900, mean 0.228756, SD 0.012724, CV 0.055620, max 0.356895, z 21.3 (pErr 0.007934%)
105821173 OK 5800000 5.48%; 1389 us/it; ETA 1d 14:35; f7ecbf6f4fdf0ded (check 0.68s)
Roundoff: N=200900, mean 0.228820, SD 0.012775, CV 0.055831, max 0.346250, z 21.2 (pErr 0.008922%)
105821173 OK 6000000 5.67%; 1389 us/it; ETA 1d 14:30; 121eb8b46de12e8f (check 0.69s)
Roundoff: N=200900, mean 0.228775, SD 0.012750, CV 0.055734, max 0.325508, z 21.3 (pErr 0.008422%)
105821173 OK 6200000 5.86%; 1390 us/it; ETA 1d 14:28; 78780bbb15a42bf2 (check 0.68s)
Roundoff: N=200900, mean 0.228750, SD 0.012727, CV 0.055637, max 0.337813, z 21.3 (pErr 0.007986%)
105821173 OK 6400000 6.05%; 1389 us/it; ETA 1d 14:22; a3aea29cb5719574 (check 0.67s)
Roundoff: N=200900, mean 0.228768, SD 0.012714, CV 0.055577, max 0.342355, z 21.3 (pErr 0.007786%)
105821173 OK 6600000 6.24%; 1390 us/it; ETA 1d 14:19; cfb31fd255b74d50 (check 0.68s)
Roundoff: N=200900, mean 0.228778, SD 0.012723, CV 0.055612, max 0.325987, z 21.3 (pErr 0.007940%)
105821173 OK 6800000 6.43%; 1392 us/it; ETA 1d 14:17; 4d3d54f47b50144c (check 0.68s)
Roundoff: N=200900, mean 0.228789, SD 0.012749, CV 0.055724, max 0.330346, z 21.3 (pErr 0.008406%)
105821173 OK 7000000 6.61%; 1380 us/it; ETA 1d 13:52; 01c96f0b51a616fe (check 0.70s)
Roundoff: N=200900, mean 0.228712, SD 0.012674, CV 0.055415, max 0.345660, z 21.4 (pErr 0.007100%)
105821173 OK 7200000 6.80%; 1433 us/it; ETA 1d 15:15; 332230a88f05c8e4 (check 0.70s)[/code] mean-of-mean 0.22876269, mean-of-max 0.3377375, max-of-max 0.356895
For these 2 particular expos and iteration intervals, the means-of-means and max-of-max values behave as expected. George, Mihai, do you guys do larger-scale such tests for each ROE-affecting code change? The slowdown involved was sufficiently modest that I would like to again suggest a code tweak to turn on the stats-gathering for every interval retry resulting from a run hitting a GEC mismatch, and to switch it back off should the ensuing retry cure the GEC error.
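As an aside, aggregates like the mean-of-means above can be reproduced from the raw log with a short script (a quick sketch keyed to the "Roundoff:" line format shown; adjust the regex if a build's output differs):

```python
import re

# Reduce gpuowl "Roundoff:" log lines to the aggregates quoted above:
# mean-of-means, mean-of-max, max-of-max. Quick sketch based on the
# captured log format, not an official tool.

LINE_RE = re.compile(r"Roundoff: N=\d+, mean ([\d.]+), .* max ([\d.]+),")

def roe_summary(log_text: str):
    means, maxes = [], []
    for m in LINE_RE.finditer(log_text):
        means.append(float(m.group(1)))
        maxes.append(float(m.group(2)))
    return (sum(means) / len(means),   # mean-of-means
            sum(maxes) / len(maxes),   # mean-of-max
            max(maxes))                # max-of-max

# Two lines from the 105821119 snip above, as a smoke test:
sample = ("Roundoff: N=200900, mean 0.226482, SD 0.012404, CV 0.054769, "
          "max 0.327033, z 22.1 (pErr 0.003102%)\n"
          "Roundoff: N=200900, mean 0.226494, SD 0.012430, CV 0.054879, "
          "max 0.353670, z 22.0 (pErr 0.003292%)\n")
print(roe_summary(sample))
```

Pointed at the full snips, this reproduces the quoted mean-of-mean/max-of-max figures directly, which makes comparisons across builds or intervals a one-liner.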
[QUOTE=ewmayer;548118]George, Mihai, do you guys do larger-scale such tests for each ROE-affecting code change?
The slowdown involved was sufficiently modest that I would like to again suggest a code tweak to turn on the stats-gathering for every interval retry resulting from a run hitting a GEC mismatch, and to switch it back off should the ensuing retry cure the GEC error.[/QUOTE] In the past, we have not done large-scale ROE testing. It's only been in the last 2 or 3 months that we've put in the hard work to gather stats, come up with different code paths to increase accuracy, and come up with sensible FFT crossovers. Note that we should do ROE sanity checks for major code changes as well as ROCm releases. No attempt has been made to compare ROE on Windows to ROE on Linux. No attempt has been made to look at nVidia ROE. In the last release, I only really studied ROE for FFT lengths from 3*512K to 15*512K. Those testing 100M-digit numbers with an 18M FFT might experience some less-than-optimal crossovers. BTW, the latest version introduced MIDDLE=13,14,15 so that we should have optimized code for first-time tests for quite a while. Yes, the 3-errors code path could be better. It could retry once, then switch to the next higher accuracy level, then try STATS, etc. I'll leave that to Mihai, who is up to his eyeballs in PRP proofs. Don't expect anything soon!