2022-09-25, 17:08   #15
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

1D6616 Posts
P-1 performance

Mprime / prime95 v30.8 introduces a P-1 performance enhancement, using polynomials to achieve almost 100% pairing of primes in stage 2. This makes factoring to higher stage 2 bounds cost-effective, raising the probability of finding a factor and thereby saving more primality tests.
Use v30.8 or later for P-1 factoring, and allow adequate memory to enable the gains; a few GB is better than running only stage 1. The savings (defined here, over many very similar exponents, as the estimated probability of finding a factor times the estimated cost per primality test avoided, minus the estimated cost of the P-1 factoring attempt as a function of exponent, ram, and bounds) grow approximately logarithmically with allowed memory: 4 GiB is better than nothing, 16 GiB is good, 32 GiB is better, and more is better yet, up to the point where paging/swapping sets in, which can cut performance drastically and turn the expected gain into a large loss. Prime95's GUI limits allowed stage 2 ram to 90% of installed physical system ram. That limit can be overridden by editing local.txt's Memory= line with a text editor, then restarting prime95.
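As a sketch of that override (the value here is hypothetical, for a machine with 64 GiB installed), the edit is a one-line change in local.txt, with the allowance given in MB:

```
Memory=58000
```

The prime95 readme also documents a day/night variant of this line for giving stage 2 less ram while the machine is in use; consult it for the exact syntax before editing.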

Use adequate memory and bounds the first time P-1 is run on an exponent. "Optimizing" by running P-1 to low bounds first, selected to maximize factors found per unit of initial computing effort, is actually a DE-optimization for the project. Avoid inadequate-bounds factoring attempts whenever feasible. (Even if it means asking someone else to do the P-1 run.)

Typically, on the CPUs I've benchmarked at the wavefront of DC or first-time testing, fewer cores per worker produce the highest aggregate throughput, but the difference is slight. The response of prime95 v30.8 P-1 to a large ram allowance is larger. So run two workers on a single-CPU-package system; they will take about the same amount of time through stages 1 and 2, and alternate using large quantities of memory for stage 2, fully employing available memory and maximizing the expected net savings of computing time. See the second attachment. Comparing its first-try and retry curves for similar exponents (current first-primality-test wavefront) shows clearly that the expected time saved is much larger for a first try, even at considerably less allowed ram, than for a retry with nearly 64 GiB of ram. It also indicates there is not much difference in expected time saved versus retried exponent for the same allowed ram, from near the current DC wavefront (66M) to the first-test wavefront (110M).

Multi-socket systems (dual-Xeon, quad-Xeon, etc.) may have nonuniform memory access (NUMA). Specifying a large amount of allowed ram that causes significant traffic over the NUMA interconnect (QPI, UPI, etc.) may be slower than using a smaller amount of ram all attached to one processor socket. On a dual-Xeon system, performance may be better running 4 workers with ~45% of system ram allowed per worker, so that each worker's stage 2 buffers can sit entirely on the near side of the NUMA interconnect. There was a noticeable dip in performance on a dual-Xeon system with 2 workers when a single worker used more than half the total system ram in stage 2, which requires some of it to be accessed across the NUMA boundary. See the first attachment.
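As a rough sketch (my own model, not prime95's allocation logic: it assumes ram is split evenly across NUMA nodes and reserves ~10% of each node as headroom, which reproduces the ~45%-of-system-ram rule of thumb above for a two-node box):

```python
def per_worker_allowance_gib(total_ram_gib: float, numa_nodes: int,
                             headroom: float = 0.9) -> float:
    """Stage 2 ram allowance per worker such that one worker's
    buffers fit within a single NUMA node's local memory.

    Assumes ram is split evenly across nodes; `headroom` leaves a
    fraction of each node free for the OS and the other workers.
    """
    per_node = total_ram_gib / numa_nodes
    return per_node * headroom

# dual-socket box with 128 GiB total: ~57.6 GiB per worker,
# i.e. about 45% of system ram, matching the figure above
allowance = per_worker_allowance_gib(128, 2)
```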

In prime95 v30.8b14, with prime95 optimizing bounds freely for the given exponent and allowed stage 2 ram, the selected B1 is observed to increase slightly with allowed ram (~O(ram^0.15)); the selected B2 increases nearly linearly with allowed ram (~O(ram^0.77 to ram^0.98)); and the B2/B1 ratio increases substantially with allowed ram (~O(ram^0.62 to ram^0.89)). I believe the stage 2 performance increase results in selecting a somewhat lower B1, as well as a much higher B2, for the same exponent and allowed ram than would have occurred in v30.7 or earlier. Perhaps counterintuitively, these optimizations for total probable compute time result in longer P-1 stage 1 and stage 2 times as allowed ram increases.
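Such empirical exponents can be recovered from bound selections at two different ram allowances. A minimal sketch (the observations below are hypothetical, chosen only to illustrate the calculation):

```python
import math

def fitted_exponent(ram_a: float, bound_a: float,
                    ram_b: float, bound_b: float) -> float:
    """Power-law exponent k in bound ~ C * ram**k, recovered
    from two (allowed ram, selected bound) observations."""
    return math.log(bound_b / bound_a) / math.log(ram_b / ram_a)

# hypothetical: quadrupling allowed ram multiplies selected B2 by 4**0.8,
# recovering an exponent of ~0.8, inside the observed 0.77-0.98 range
k = fitted_exponent(16, 250_000_000, 64, 250_000_000 * 4**0.8)
```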

See also https://www.mersenneforum.org/showpo...&postcount=724 and https://www.mersenneforum.org/showpo...&postcount=727

The preceding is all in the context of mprime/prime95 doing both stage 1 and stage 2. A recent development in gpuowl supports passing the results of P-1 stage 1 performed standalone in gpuowl, into an mprime/prime95 folder and worktodo file for performance of stage 2 by prime95. See https://mersenneforum.org/showpost.p...postcount=2870 and some posts following it for more info, relevant to gpuowl ~v7.2-129 and configuring prime95 v30.8+ to work with that.

A brief comparison of v30.7b9 and v30.8b14, on an i5-1035G1 under Windows 10 with two 32 GiB DIMMs installed, on the same P-1 assignment at 60 GiB allowed stage 2 ram, running a single worker with 4 physical cores, yielded the following:
Code:
for worktodo line PFactor=(AID),1,2,118970857,-1,77,1.3

version       B1           B2       factor odds   runtime estimate      computed odds/day of P-1
v30.7b9    587,000    25,716,000       3.59%      12 hours 26 minutes          6.93%/day
v30.8b14   840,000   267,531,000       5.49%      10 hours 41 minutes         12.33%/day
ratio        1.431       10.403        1.529             0.859                   1.779
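As a check on the table's last column, the odds/day figures are just the factor odds divided by the runtime expressed in days. A minimal Python sketch (function and variable names are mine):

```python
def odds_per_day(factor_odds_pct: float, hours: int, minutes: int) -> float:
    """Probability (in percent) of finding a factor per day of
    P-1 runtime, from the odds and runtime columns of the table."""
    runtime_days = (hours * 60 + minutes) / (24 * 60)
    return factor_odds_pct / runtime_days

v307 = odds_per_day(3.59, 12, 26)   # ~6.93 %/day for v30.7b9
v308 = odds_per_day(5.49, 10, 41)   # ~12.33 %/day for v30.8b14
```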
Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 prime95 v30.8 P-1 retry projected savings versus stage 2 memory.pdf (35.6 KB)
 martinette p-1 benchmarking p95 30.8b14.pdf (58.9 KB)

Last fiddled with by kriesel on 2023-01-23 at 20:24 Reason: updated second attachment, defined savings