mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Marin's Mersenne-aries (https://www.mersenneforum.org/forumdisplay.php?f=30)
-   -   Strategic Double Clicking (https://www.mersenneforum.org/showthread.php?t=20372)

nofaith628 2017-05-28 03:31

Taken unassigned exponents below 55M.

ewmayer 2017-05-28 07:42

[QUOTE=ewmayer;459249]The 4 expos ~53M I grabbed are roughly 1/3rd done, each running at 21 msec/iter 2-threaded on David Stanfill's AMD Ryzen @2816K for a total throughput of 190 iters/sec. I lost nearly a day because I started them off using the new LOACC carry-math option in the as-yet-unreleased Mlucas v17, which gains 5-7% speed on Intel, but not, as it turns out, AMD (at least not on Ryzen). That mode gave a worrisome number of ROEs during the first few Miters, so I rebuilt the carry modules using HIACC and started from scratch. Once the fresh runs passed the original set I verified that all of the residues-at-time-of-suspension of the LOACC runs matched the HIACC ones. The latter have suffered only a handful of ROEs > 0.4 through the first ~16M iterations:

ewmayer@RyzenBeast:~$ grep "MaxErr = 0.4" mlucas/run*/*at
mlucas/run0/p53647547.stat:[May 15 00:20:09] M53647547 Iter# = 4370000 [ 8.15% complete] clocks = 00:00:00.000 [ 0.0216 sec/iter] Res64: 941DDF47EDABCE91. AvgMaxErr = 0.212353957. MaxErr = 0.427734375.
mlucas/run0/p53647547.stat:[May 16 02:12:57] M53647547 Iter# = 8650000 [16.12% complete] clocks = 00:00:00.000 [ 0.0215 sec/iter] Res64: FC79A210D061517F. AvgMaxErr = 0.212396012. MaxErr = 0.421875000.
mlucas/run0/p53647547.stat:[May 17 17:24:50] M53647547 Iter# = 15060000 [28.07% complete] clocks = 00:00:00.000 [ 0.0222 sec/iter] Res64: F5B5F463C3975209. AvgMaxErr = 0.212299970. MaxErr = 0.429687500.
mlucas/run1/p53648423.stat:[May 15 13:54:59] M53648423 Iter# = 6720000 [12.53% complete] clocks = 00:00:00.000 [ 0.0213 sec/iter] Res64: 10F71B565CC23530. AvgMaxErr = 0.211920335. MaxErr = 0.417968750.
mlucas/run2/p53648893.stat:[May 15 07:47:09] M53648893 Iter# = 5660000 [10.55% complete] clocks = 00:00:00.000 [ 0.0216 sec/iter] Res64: AD13CD5BB99ED5DD. AvgMaxErr = 0.211574025. MaxErr = 0.406250000.[/QUOTE]

Just finished these DCs [53647547,53648423,53648893,53648981] on the Ryzen system - all 4 final residues mismatch those of the first-test submission.

However, while the expected roundoff error warnings (due to these exponents being right at or slightly above the 2816K FFT upper limit - I forced 2816K for all via command-line FFT-length specification, which overrides the default) seem benign in terms of number and size, my earlier grep pattern above failed to alert me to something worrisome occurring on this system, namely frequent (slightly more than one per 1M iterations, on average) instantly-fatal ROEs as demonstrated below, which I only discovered on manual inspection of the final run-status files:
[code]
[May 14 13:53:53] M53648981 Iter# = 2630000 [ 4.90% complete] clocks = 00:00:00.000 [ 0.0212 sec/iter] Res64: 96D0B09F8743EBA5. AvgMaxErr = 0.213127022. MaxErr = 0.281250000.
[May 14 13:57:26] M53648981 Iter# = 2640000 [ 4.92% complete] clocks = 00:00:00.000 [ 0.0213 sec/iter] Res64: 01D333BEF917C27C. AvgMaxErr = 0.213169920. MaxErr = 0.312500000.
M53648981 Roundoff warning on iteration 2644687, maxerr = 0.492187500000
Retrying iteration interval to see if roundoff error is reproducible.
Restarting M53648981 at iteration = 2640000. Res64: 01D333BEF917C27C
M53648981: using FFT length 2816K = 2883584 8-byte floats.
this gives an average 18.604965556751598 bits per digit
Retry of iteration interval with fatal roundoff error was successful.
[May 14 14:02:39] M53648981 Iter# = 2650000 [ 4.94% complete] clocks = 00:00:00.000 [ 0.0213 sec/iter] Res64: 1D5D3BF1B65111A9. AvgMaxErr = 0.213237492. MaxErr = 0.312500000.
[May 14 14:06:12] M53648981 Iter# = 2660000 [ 4.96% complete] clocks = 00:00:00.000 [ 0.0213 sec/iter] Res64: D66A5FF8E31B6ECC. AvgMaxErr = 0.212939164. MaxErr = 0.281250000.
[/code]
Here, "Retry of iteration interval with fatal roundoff error was successful" means that the code went back to the most-recent checkpoint file, restarted from there, and failed to encounter the same ROE (or any other kind of fatal ROE) in the ensuing rerun of the 10000-iteration interval starting from said checkpoint. If any of these 0.5ish errors were simply due to an inadequate FFT length or "unlucky" [in terms of conspiring to give an anomalously high ROE] set of FFT inputs for the iteration in question, the same retry mechanism would have reproduced the error on the retry and as a result switched to the next-larger FFT length and restarted from the same last-checkpoint file using the larger length.
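
A minimal Python sketch of the retry policy just described. This is not Mlucas's actual code: `handle_interval`, `run_interval`, `RoundoffError` and the list of FFT lengths are hypothetical stand-ins used only to make the control flow concrete.

```python
class RoundoffError(Exception):
    """Raised by the (hypothetical) iteration routine on a fatal ROE."""


def handle_interval(run_interval, checkpoint, fft_lengths, fft_index=0):
    """Run one checkpoint interval, following the retry policy described above.

    run_interval(checkpoint, fft_len) is a caller-supplied callable that
    returns the next checkpoint or raises RoundoffError.
    Returns (new_checkpoint, fft_index, outcome).
    """
    fft_len = fft_lengths[fft_index]
    try:
        return run_interval(checkpoint, fft_len), fft_index, "ok"
    except RoundoffError:
        pass  # fall through and retry from the same checkpoint

    try:
        # Same checkpoint, same FFT length: an error caused by the inputs
        # or by a too-short FFT would recur here.
        return run_interval(checkpoint, fft_len), fft_index, "retry succeeded"
    except RoundoffError:
        # Reproducible error: move to the next-larger FFT length and restart
        # from the same checkpoint. (This sketch assumes a larger length
        # exists; otherwise the indexing below raises IndexError.)
        fft_index += 1
        fft_len = fft_lengths[fft_index]
        return run_interval(checkpoint, fft_len), fft_index, "switched FFT length"
```

An outcome of "retry succeeded" corresponds to the log excerpt above: the error did not reproduce, which points away from the FFT length and toward a transient fault.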

These out-of-nowhere-and-instantly-fatal errors are of the variety which I usually associate with marginal/flaky/old hardware, none of which seems to apply here - new high-end system, not overclocked. Perhaps the GPUs attached to the Ryzen mobo which David is doing his 24/7 crunching on are throwing transient glitches? I need to do more sleuthing to try to uncover the cause. While all the ensuing retry-interval attempts were successful like the above exemplar, my worry is that if such data-corruption issues are happening at all, not all of them may result in a detectable fatal ROE, i.e. some may be of the 'silent' variety w.r.t. the program's internal data-integrity checks.
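
Since a grep for `MaxErr = 0.4` only matches checkpoint lines, the retried errors are easy to miss. Here is a rough Python sketch that collects both kinds of event from a run-status file, assuming the line formats match the excerpts quoted in this post. The function name `scan_status` and the 0.4 threshold are illustrative choices, not part of Mlucas.

```python
import re

# Checkpoint lines end with "... AvgMaxErr = 0.212353957. MaxErr = 0.427734375."
# The \b keeps this from matching inside "AvgMaxErr".
CHECKPOINT_RE = re.compile(r"Iter# = (\d+) .*\bMaxErr = (\d+\.\d+)\.?\s*$")
# Retried errors: "M53648981 Roundoff warning on iteration 2644687, maxerr = 0.4921875..."
WARNING_RE = re.compile(r"Roundoff warning on iteration (\d+), maxerr = (\d+\.\d+)")


def scan_status(lines, threshold=0.4):
    """Return (high_checkpoints, warnings), each a list of (iteration, maxerr)."""
    high, warnings = [], []
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            warnings.append((int(m.group(1)), float(m.group(2))))
            continue
        m = CHECKPOINT_RE.search(line)
        if m and float(m.group(2)) >= threshold:
            high.append((int(m.group(1)), float(m.group(2))))
    return high, warnings
```

Passing an open file object for one of the `.stat` files as `lines` should work, since the function only iterates over lines.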

So I would appreciate it if someone could grab 'em and do a third run on each, if at all possible with the code used being set up to print interim Res64s every 10000 iterations (or some multiple thereof), permitting cross-checking against my interim-Res64 data to localize the point of divergence in case your final results mismatch mine.
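
The cross-check requested above amounts to finding the first shared iteration count at which two runs report different Res64 values. A rough Python sketch, assuming both runs log checkpoint lines in the Mlucas format shown earlier (other clients print interim residues differently, so the regex and the helper names here are illustrative only):

```python
import re

# Matches checkpoint lines such as
# "... Iter# = 2640000 [ 4.92% complete] ... Res64: 01D333BEF917C27C. ..."
RES64_RE = re.compile(r"Iter# = (\d+) .*Res64: ([0-9A-Fa-f]{16})")


def interim_residues(lines):
    """Map iteration count -> Res64 (upper-case hex) for each checkpoint line."""
    residues = {}
    for line in lines:
        m = RES64_RE.search(line)
        if m:
            residues[int(m.group(1))] = m.group(2).upper()
    return residues


def first_divergence(run_a, run_b):
    """Smallest iteration logged by both runs whose Res64s differ, else None."""
    for iteration in sorted(run_a.keys() & run_b.keys()):
        if run_a[iteration] != run_b[iteration]:
            return iteration
    return None
```

If `first_divergence` returns an iteration, the corruption lies somewhere between the last matching checkpoint and that one, which is the interval to rerun with finer-grained output.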

GP2 2017-05-28 16:00

[QUOTE=ewmayer;459904]Just finished these DCs [53647547,53648423,53648893,53648981] on the Ryzen system - all 4 final residues mismatch those of the first-test submission.

...

So I would appreciate it if someone could grab 'em and do a third run on each, if at all possible with the code used being set up to print interim Res64s every 10000 iterations (or some multiple thereof), permitting cross-checking against my interim-Res64 data to localize the point of divergence in case your final results mismatch mine.[/QUOTE]

OK, I will do the triple checks with interim residues.

ewmayer 2017-05-28 21:08

1 Attachment(s)
[QUOTE=GP2;459926]OK, I will do the triple checks with interim residues.[/QUOTE]

Thanks - here is a text file of the 1M-iter Res64s from my 4 runs, for you to grep against as your runs pass 1M multiples. If we encounter a divergence, we can do a fine-grained comparison in the preceding 1M iters:

GP2 2017-05-29 09:33

[QUOTE=ewmayer;459937]Thanks - here is a text file of the 1M-iter Res64s from my 4 runs, for you to grep against as your runs pass 1M multiples. If we encounter a divergence, we can do a fine-grained comparison in the preceding 1M iters:[/QUOTE]

OK. I'm monitoring each milestone and at the 4M mark, all four exponents match so far.

kladner 2017-05-29 11:46

[URL="https://www.mersenne.org/report_exponent/?exp_lo=40267771&exp_hi=&full=1"]40267771[/URL]
Mismatched.

ET_ 2017-05-29 13:33

[QUOTE=kladner;459978][URL="https://www.mersenne.org/report_exponent/?exp_lo=40267771&exp_hi=&full=1"]40267771[/URL]
Mismatched.[/QUOTE]

...and no real P-1 done as well...

Mark Rose 2017-05-29 15:35

[QUOTE=kladner;459978][URL="https://www.mersenne.org/report_exponent/?exp_lo=40267771&exp_hi=&full=1"]40267771[/URL]
Mismatched.[/QUOTE]

Queued. Will start tonight.

Madpoo 2017-05-30 20:02

Here's a new list... it's only exponents below 60M.

Details: broken down at a monthly level, it's any machine that doesn't have any good results for that month and whose [B]total[/B] bad results outnumber its [B]total[/B] good results. I've included those total bad/good counts in the list (BadT/GoodT) so you can see what I'm talking about (I don't think I've included those before when looking at the monthly stats for a cpu).

Each exponent listed is the smallest unknown exponent from that cpu for that month. If they have multiple unknowns in the month and this first one comes back bad, then we can chase down the rest. Or maybe it matches and we can chalk up a "good" result for that computer and move on to the more promising "bad" stuff:

[CODE]Exponent Bad Good BadT GoodT Unk worktodo
40545083 0 0 25 24 1 DoubleCheck=40545083,72,1
41245613 0 0 7 6 1 DoubleCheck=41245613,73,1
41395223 0 0 7 6 1 DoubleCheck=41395223,72,1
41922103 0 0 25 24 1 DoubleCheck=41922103,72,1
42202247 0 0 1 0 1 DoubleCheck=42202247,72,1
42971651 0 0 27 21 1 DoubleCheck=42971651,72,1
43209487 0 0 3 1 2 DoubleCheck=43209487,72,1
43246309 0 0 7 6 1 DoubleCheck=43246309,72,1
43339489 0 0 2 1 2 DoubleCheck=43339489,72,1
43767511 0 0 13 11 3 DoubleCheck=43767511,72,1
44607023 0 0 1 0 1 DoubleCheck=44607023,72,1
45445843 0 0 3 2 1 DoubleCheck=45445843,72,1
45910157 0 0 1 0 1 DoubleCheck=45910157,72,1
45938873 0 0 2 1 1 DoubleCheck=45938873,73,1
46009031 0 0 2 1 1 DoubleCheck=46009031,72,1
46082471 0 0 1 0 1 DoubleCheck=46082471,72,1
46129771 0 0 11 5 1 DoubleCheck=46129771,72,1
46891709 0 0 2 1 1 DoubleCheck=46891709,72,1
47111417 0 0 5 4 1 DoubleCheck=47111417,72,1
47205713 0 0 2 1 1 DoubleCheck=47205713,72,1
47538119 0 0 5 4 1 DoubleCheck=47538119,72,1
47944483 0 0 2 1 1 DoubleCheck=47944483,72,1
48188069 0 0 3 2 4 DoubleCheck=48188069,72,1
49428013 0 0 3 2 1 DoubleCheck=49428013,72,1
49560479 0 0 4 3 6 DoubleCheck=49560479,72,1
49592923 0 0 2 1 1 DoubleCheck=49592923,72,1
49868879 0 0 13 11 3 DoubleCheck=49868879,72,1
50074643 0 0 9 5 7 DoubleCheck=50074643,73,1
50110699 0 0 3 2 1 DoubleCheck=50110699,73,1
50473831 0 0 3 2 1 DoubleCheck=50473831,73,1
50535253 0 0 2 1 1 DoubleCheck=50535253,73,1
50602247 0 0 23 21 1 DoubleCheck=50602247,73,1
50662103 0 0 2 1 2 DoubleCheck=50662103,73,1
50683939 0 0 6 5 2 DoubleCheck=50683939,73,1
50902297 0 0 23 21 1 DoubleCheck=50902297,73,1
51076063 0 0 2 1 1 DoubleCheck=51076063,73,1
51090239 0 0 3 1 3 DoubleCheck=51090239,73,1
51458051 0 0 3 2 5 DoubleCheck=51458051,73,1
51833161 0 0 23 21 1 DoubleCheck=51833161,73,1
51851221 0 0 3 2 5 DoubleCheck=51851221,73,1
51951451 1 0 37 21 15 DoubleCheck=51951451,73,1
52147999 0 0 13 11 3 DoubleCheck=52147999,73,1
52335989 0 0 6 5 1 DoubleCheck=52335989,73,1
52445741 0 0 13 11 1 DoubleCheck=52445741,73,1
52573463 0 0 13 11 3 DoubleCheck=52573463,73,1
52600231 0 0 2 1 4 DoubleCheck=52600231,73,1
52633459 0 0 9 5 8 DoubleCheck=52633459,73,1
52717579 0 0 3 1 1 DoubleCheck=52717579,73,1
52857713 0 0 40 25 6 DoubleCheck=52857713,73,1
52875601 0 0 2 1 4 DoubleCheck=52875601,73,1
52917121 0 0 2 1 1 DoubleCheck=52917121,73,1
53026177 0 0 3 2 1 DoubleCheck=53026177,73,1
53524609 0 0 23 21 1 DoubleCheck=53524609,73,1
53717549 0 0 4 3 4 DoubleCheck=53717549,73,1
53737139 0 0 4 3 4 DoubleCheck=53737139,73,1
53741773 0 0 39 29 1 DoubleCheck=53741773,73,1
54121889 0 0 2 1 3 DoubleCheck=54121889,73,1
54134383 0 0 2 1 4 DoubleCheck=54134383,73,1
54264983 0 0 36 23 4 DoubleCheck=54264983,73,1
54441767 0 0 39 29 2 DoubleCheck=54441767,73,1
54872291 0 0 3 2 2 DoubleCheck=54872291,73,1
55006123 0 0 36 23 2 DoubleCheck=55006123,73,1
55061557 0 0 2 1 4 DoubleCheck=55061557,73,1
55100737 0 0 4 3 1 DoubleCheck=55100737,73,1
55297247 0 0 39 29 1 DoubleCheck=55297247,73,1
55298843 0 0 2 1 4 DoubleCheck=55298843,73,1
55299281 0 0 39 29 1 DoubleCheck=55299281,73,1
55449209 0 0 36 23 2 DoubleCheck=55449209,73,1
55684813 0 0 2 1 2 DoubleCheck=55684813,73,1
55816259 0 0 36 23 2 DoubleCheck=55816259,73,1
55956871 0 0 36 23 2 DoubleCheck=55956871,73,1
55983827 0 0 2 1 3 DoubleCheck=55983827,73,1
56004757 0 0 2 1 4 DoubleCheck=56004757,73,1
56088629 0 0 4 3 3 DoubleCheck=56088629,73,1
56566841 0 0 3 2 1 DoubleCheck=56566841,73,1
56604923 0 0 4 3 3 DoubleCheck=56604923,73,1
56644061 0 0 3 2 2 DoubleCheck=56644061,73,1
56736437 0 0 36 23 2 DoubleCheck=56736437,73,1
57091651 0 0 39 29 4 DoubleCheck=57091651,73,1
57143657 0 0 23 21 1 DoubleCheck=57143657,73,1
57255067 0 0 2 1 3 DoubleCheck=57255067,73,1
57374939 0 0 36 23 2 DoubleCheck=57374939,73,1
57404041 0 0 3 2 1 DoubleCheck=57404041,73,1
57423991 0 0 2 1 1 DoubleCheck=57423991,73,1
57435919 0 0 37 21 6 DoubleCheck=57435919,73,1
57580739 0 0 3 2 1 DoubleCheck=57580739,73,1
57607213 0 0 4 3 1 DoubleCheck=57607213,73,1
57655019 0 0 2 1 2 DoubleCheck=57655019,73,1
57885211 0 0 19 14 2 DoubleCheck=57885211,73,1
57995117 0 0 3 2 2 DoubleCheck=57995117,73,1
58351453 0 0 23 21 1 DoubleCheck=58351453,73,1
58382663 0 0 36 23 1 DoubleCheck=58382663,73,1
58463819 0 0 36 23 2 DoubleCheck=58463819,73,1
59332439 0 0 4 3 4 DoubleCheck=59332439,73,1
59378471 0 0 39 29 4 DoubleCheck=59378471,73,1
59485343 0 0 2 1 1 DoubleCheck=59485343,75,1
59574169 0 0 4 3 3 DoubleCheck=59574169,73,1
59626481 0 0 3 2 2 DoubleCheck=59626481,73,1[/CODE]
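
The selection rule described above can be sketched in Python. This is only an illustration: the record fields and the function name `pick_candidates` are hypothetical, since the actual query runs against the server database, which is not shown here. The worktodo line mirrors the `DoubleCheck=exponent,bits,1` form used in the list, with the middle field (which I take to be the trial-factoring depth) and the trailing 1 copied through as-is.

```python
def pick_candidates(cpu_months):
    """Pick one exponent per suspicious cpu-month.

    cpu_months: iterable of dicts, one per cpu per month, with keys
      good       - verified-good results that month
      bad_total  - bad results over the cpu's whole history
      good_total - good results over the cpu's whole history
      unknown    - list of (exponent, bits) still unverified that month
    Returns worktodo lines sorted by exponent.
    """
    picks = []
    for m in cpu_months:
        suspicious = m["good"] == 0 and m["bad_total"] > m["good_total"]
        if suspicious and m["unknown"]:
            # Smallest unknown exponent from that cpu for that month.
            picks.append(min(m["unknown"]))
    return [f"DoubleCheck={exp},{bits},1" for exp, bits in sorted(picks)]
```

If the picked exponent comes back bad, the remaining entries in that cpu-month's `unknown` list are the natural next ones to queue.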

rudi_m 2017-05-30 22:04

[QUOTE=Madpoo;460080]Here's a new list... it's only exponents below 60M.
[/QUOTE]

I took all below 50M.

ewmayer 2017-05-30 22:14

[QUOTE=Madpoo;460080]Here's a new list... it's only exponents below 60M.[/QUOTE]

Grabbed these:

57580739
57607213
57655019
57885211
57995117
58351453
58382663
58463819


All times are UTC.