2021-03-07, 23:51 | #1
∂^{2}ω=0
Sep 2002
República de California
11,633 Posts
Mlucas v20: Preview of coming detractions, or something :)
As I've noted elsewhere, v20 will have p-1 factoring support as its major feature add, with some details such as "fused p-1 stage 1 and PRP testing?" not yet finalized. (PRP-proof support will be in a later follow-on release.)
By way of performance improvements, v20 will have much-improved accuracy for exponents near the FFT-length breakover points. Some background: back in Mlucas v17, I deployed a streamlined chained-DWT-weights computation in the carry-propagation routines, which gives a ~3-7% speedup for exponents not near an FFT breakover point and handily allows runtime "dial-in accuracy" as things get near such a breakover point. The problem? Even at the maximal accuracy setting - shortest chain length - said carry routines are a little less accurate than the non-chained ones they replaced, and their associated ROEs tend to be noisier. The original high-accuracy carry macros are in fact still there, but they require a special preprocessor directive at compile time to invoke. I foolishly let the fact that the newer chained-carry macros use somewhat different data layouts keep me from making both options available at runtime.

In v20 the layouts have been harmonized: for exponents near the limit for each FFT length, the code starts with the slightly faster chained carries in shortest-chain mode; if it still hits dangerous ROEs (>= 0.4375), it switches to the high-accuracy carries, and only ups the FFT length if those still don't manage to keep the ROEs under control. Moreover, in PRP-test mode, since we do both every-iteration ROE checking and the Gerbicz check, we allow ROE = 0.4375, because the latter check will catch the rare-but-not-unheard-of instances where such a 0.4375 error is really a fatal 0.5625 one in disguise.

At least that's the idea - I am currently subjecting the modified carry code to a really limit-pushing test: a PRP test of M107353937 using a 5.5M FFT. This case was one I inadvertently let slip into a special 5.5M-FFT-run queue I had running under gpuowl on one of my GPUs; I meant to limit things to p < 107.3M, based on several dozen exponents ~107M run that way, but this one got in.
I noticed its progress seemed stuck; turns out gpuowl got ~15% done using the 5.5M FFT before hitting its first Gerbicz-check "EE" and going into flailing-around mode. I killed it and finished the run using a 6M FFT, so I have a complete set of checkpoint Res64s for cross-comparison purposes. The Mlucas v20 run is on the oldest of my little Intel NUC minis, a 2-core/4-thread Broadwell-CPU one, running an AVX2 build of the code. So far, 100Kiters in, 27 ROEs = 0.4375 but none higher, and the interim result matches the reference run:

[2021-03-07 15:41:46] M107353937 Iter# = 100000 [ 0.09% complete] clocks = 00:07:48.699 [ 46.8699 msec/iter] Res64: F3FC9D3BEB7987FF. AvgMaxErr = 0.322385850. MaxErr = 0.437500000. Residue shift count = 24310597.

I fully expect the run will hit an ROE > 0.4375 and/or a G-check error at some point; that will help me properly dial in some of the attendant logic. The actual default limit for the 5.5M FFT will be somewhat lower than this, around p = 106.5M for FMA-using builds, a smidge lower for non-FMA. Further slight accuracy improvements may be possible beyond this, e.g. in the FFT-twiddle and DWT-weights computations, but the above is the low-hanging-fruit one.
2021-03-08, 20:01 | #2
∂^{2}ω=0
Sep 2002
República de California
2D71_{16} Posts
M107353937 proved a bit too large to handle @5.5M FFT, even with the newly-reinstated HIACC carry code - here are the ROEs > 0.4375 hit in the first half-million iterations (I omit the 0.4375 ones, of which there were over 60):
M107353937 Roundoff warning on iteration 120001, maxerr = 0.453125000000
M107353937 Roundoff warning on iteration 245828, maxerr = 0.439453125000
M107353937 Roundoff warning on iteration 327741, maxerr = 0.450927734375
M107353937 Roundoff warning on iteration 444934, maxerr = 0.500000000000

Each of those triggered a switch to a re-run of the affected 10000-iteration interval @6M, after which I killed the run and restarted @5.5M, by way of data gathering. The Res64 at 500,000, where I finally stopped, matched that of the previously completed gpuowl run.

Next I tried an exponent just over 107M, 107001617, which got through over a half-million iterations @5.5M before hitting the tripwire:

M107001617 Roundoff warning on iteration 72215, maxerr = 0.437500000000
M107001617 Roundoff warning on iteration 98979, maxerr = 0.437500000000
M107001617 Roundoff warning on iteration 124197, maxerr = 0.437500000000
M107001617 Roundoff warning on iteration 542021, maxerr = 0.468750000000

Another thing I have already added to v20 is per-run counting of ROEs (specifically, ones large enough to trigger a switch to the next-larger FFT length) and Gerbicz-check errors, for inclusion in the 'errors' subfield of the end-of-run JSON report to the Primenet server.

The above data suggest that an effective strategy in such borderline cases would be to switch to the next-larger FFT length for the re-run of an ROE-affected subinterval, but on successful completion of such, to drop back down if the accumulated error rate remains below some threshold - say, no more than 1 such error per million iterations.

Last fiddled with by ewmayer on 2021-03-08 at 20:03 Reason: Mein schpelink ist furchtbar, ja!