Old 2021-10-10, 16:16   #17
kriesel
 
V20.1.x P-1 run time scaling

Based on very limited data, run time scales approximately as p^2.1, where p is the exponent, for typical recommended bounds. That is in line with expectations from other applications and from first principles. (So twice the exponent is more than four times the run time, for nontrivial exponents, where fixed-duration or low-order-scaling setup time does not affect scaling much.)
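As a quick illustration of the p^2.1 rule of thumb above, here is a minimal Python sketch. The function name and the default power are illustrative assumptions, not anything taken from Mlucas; the 2.1 figure is the approximate fit quoted above.

```python
def runtime_ratio(p_new: float, p_ref: float, power: float = 2.1) -> float:
    """Predicted run time ratio, assuming run time is proportional to p**power."""
    return (p_new / p_ref) ** power


# Doubling the exponent: a bit more than 4x the run time.
print(f"2x exponent -> {runtime_ratio(2, 1):.2f}x run time")

# Going from the ~107M wavefront to a 332M (100Mdigit) exponent.
print(f"107M -> 332M: {runtime_ratio(332e6, 107e6):.1f}x run time")
```

This ignores setup time, so it only applies to nontrivial exponents, as noted above.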

When selecting exponents for run-time scaling tests, I recommend at least one with a known factor that should be found with usual bounds. That goes first to anchor the low end of the scaling. M10000831 works well. Widely spaced other exponents of use to GIMPS compose the rest: the current first-test wavefront (~107M), ~220M, 332M (100Mdigit), and ideally higher (~500M-700M). Running them in that order allows a scaling fit to develop in a spreadsheet with the least compute time expenditure. That helps avoid single data points costing months, or the appearance of a hung application. If running on WSL & Windows, pause Windows updating for long enough that the scaling runs complete without interruption; that makes tabulating compute time per exponent much easier.
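The spreadsheet fit described above can also be sketched in Python as an ordinary least-squares line through (log p, log run time). All names here are hypothetical, and the run times in the example are synthetic values generated from an exact power law purely to exercise the code; they are not measurements.

```python
import math


def fit_power_law(exponents, runtimes):
    """Fit runtime = c * p**k by least squares in log-log space; return (c, k)."""
    xs = [math.log(p) for p in exponents]
    ys = [math.log(t) for t in runtimes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx                         # slope = scaling power
    c = math.exp(mean_y - k * mean_x)     # intercept = prefactor
    return c, k


def predict(c, k, p):
    """Extrapolate the fitted power law to exponent p."""
    return c * p ** k


# SYNTHETIC data (not measured): exact power law with k = 2.1.
ps = [10000831, 107e6, 220e6, 332e6]
ts = [1e-12 * p ** 2.1 for p in ps]

c, k = fit_power_law(ps, ts)
print(f"fitted power k = {k:.3f}")
print(f"extrapolated run time at 3321928171: {predict(c, k, 3321928171):.3e} (arbitrary units)")
```

Adding points in increasing exponent order, as suggested above, lets the fitted power settle before committing to the longest runs.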

The first attachment shows results of runs while also running other GIMPS loads, and brief stage 1 tests without the other loads. Run time scaling for Mlucas v20.1 on a dual-xeon-e5-2697v2 system on Ubuntu atop WSL & Windows 10, 128 GiB ECC ram, is consistent with OBD P-1 factoring whole attempts of ~10 months standalone duration, ~15 months with other usual loads. Running stage 2 in parallel on multiple systems can be used to ensure OBD P-1 completion in under a year. (Note the fit to points including a 10M exponent is inaccurate, because that point was run with 8 cores, unlike 16 for the others in that set.)

An experimental sequence of self test at 192M fft length for varying thread counts indicated 20 threads was latency optimal for this system for that length. See https://www.mersenneforum.org/showpo...8&postcount=20

A second run time scaling set for the same dual-xeon-e5-2697v2 system on Ubuntu atop WSL & Windows 10 was run with 20 threads and Mlucas V20.1.1. See the second attachment. Several run time estimates for gigadigit P-1 were computed, all shorter than one year. There are a few available ways to shorten run time relative to those tests and estimates, listed in the attachment.
A comparison of the stat files of M3321928171 from the first scaling run in Mlucas V20.1 and the second in V20.1.1 with differing core counts but matching B1=17,000,000 shows stage 1 iteration 100,000 res64 values match. (Res64: C51C82322FC7CBE6) See also https://www.mersenneforum.org/showpo...6&postcount=35 "requirements for comparability of interim residues" post.
This system's run time scaling may be revisited and improved later based on lessons learned in the following.

An additional run time scaling on a similar system (dual-xeon-e5-2690, 64 GiB ECC ram) on Ubuntu atop WSL & Windows 10, with 16 threads, indicates solo completion of both stages for OBD P-1 would take ~1.9 years, and split stage 2 with an equal speed system would take ~1.25 years overall. Experiments varying the number of cores for a solo instance yielded local minima at 9, 19 and 32 cores (19 fastest), providing ~10.6% improvement for an estimated ~1.7 years solo both stages, and with a split stage 2 ~1.1 years (or <1 year with a faster system such as the one in the previous paragraph). Limited testing of dual instances indicated aggregate throughput ~2.65 iterations/sec is possible, corresponding to an effective timing of 377. msec/iter. This could be either a stage 1 and a stage 2 or two stage 1s, and with latency ~2 years, equivalent aggregate throughput >1 OBD P-1/year. Some of these figures could be improved by using a native Linux boot, avoiding various detrimental aspects of the WSL environment. See the third attachment, which includes some CentOS 8 Stream selftest timings for comparison to Ubuntu/WSL/Win10 on the identical hardware.
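The conversions in the paragraph above are simple enough to check in a few lines. This is a sketch with hypothetical helper names; the inputs (2.65 iterations/sec, 1.9 years, 10.6%) are the figures quoted above.

```python
def msec_per_iter(iters_per_sec: float) -> float:
    """Convert aggregate throughput (iterations/sec) to effective msec/iteration."""
    return 1000.0 / iters_per_sec


def apply_improvement(duration_years: float, improvement_fraction: float) -> float:
    """Shorten a duration estimate by a fractional speed-up, e.g. 0.106 for 10.6%."""
    return duration_years * (1.0 - improvement_fraction)


print(round(msec_per_iter(2.65)))                # 377
print(round(apply_improvement(1.9, 0.106), 2))   # 1.7
```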

A separate run time scaling was conducted on an i5-7600T (4 core, no HT), 64 GiB (nonECC) ram, CentOS 7.9 system. This indicates OBD P-1 may be feasible in 2 years on that hardware, or 1.4 years with stage 2 split with another equal or faster system, although the lack of ECC is a concern. It may be used for some OBD stage 1 progress and then that system switched to P-1 for a 2.25G exponent with estimated completion ~9.8 months solo. See the fourth attachment.

Given the run time scalings obtained so far, and mlucas.cfg timing for 192M fft length, we can estimate that a timing under ~500 ms/iter is required to qualify for OBD P-1 to designated bounds solo within a year. As a rough magnitude check, assume run time is 1/3 stage 1, 2/3 stage 2, so stage 1 may take no more than 4 months: (4 months * 365/12 days/month * 24 hours/day * 3600 seconds/hour) / (17000000 * 1.442 iters) = 10512000 sec / 24514000 iters = 0.429 sec/iter = 429. msec/iter. So far I have observed s2/s1 duration ratios lower than 2 during most run time scaling at exponents < 1G (and those would improve with greater memory allowed for stage 2), which would permit a somewhat longer stage 1 than 4 months, and 192M fft length selftest times longer than 429. msec/iter.
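The magnitude check above can be reproduced with a short Python sketch. The function and parameter names are hypothetical. The 1.442 factor is the one used above; it is close to 1/ln 2, the approximate number of bits in the stage 1 exponent product per unit of B1.

```python
SECONDS_PER_MONTH = 365 / 12 * 24 * 3600  # average month length, as in the text


def stage1_budget_msec_per_iter(months: float, b1: int, iters_per_b1: float = 1.442) -> float:
    """Slowest per-iteration time (msec) that still finishes stage 1 in `months`."""
    iterations = b1 * iters_per_b1            # approximate stage 1 iteration count
    return months * SECONDS_PER_MONTH / iterations * 1000.0


# 4 months of stage 1 at B1 = 17,000,000, the case worked above.
print(f"{stage1_budget_msec_per_iter(4, 17_000_000):.0f} msec/iter")   # 429

# If the s2/s1 ratio were 1.5 instead of 2, stage 1 could take 12 / 2.5 = 4.8 months.
print(f"{stage1_budget_msec_per_iter(12 / 2.5, 17_000_000):.0f} msec/iter")
```

The second call only illustrates the direction of the effect noted above: a lower s2/s1 ratio leaves more of the year for stage 1, which relaxes the msec/iter requirement. The 1.5 ratio is an assumed value for illustration.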


Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2022-02-13 at 17:29 Reason: updated third attachment & related text