2019-04-29, 20:38   #8
kriesel

Why don't we use several computing devices together to primality test one exponent faster?

Depending on point of view, the answer is either,
a) we already do that, or
b) time and money.

First, (a). Cpu GIMPS applications such as prime95, mprime, and Mlucas can use multiple cores, or even multiple cpu packages in a system, to perform the fft computations in each single iteration. To do it efficiently, high memory bandwidth is necessary. It is often possible to get more total throughput from a given multicore system by using a lower number of cores per exponent. It generally lowers performance to spread a computation across multiple processor packages. Fast as the data communication between chip packages on a well-designed motherboard is, it isn't fast enough for these extremely demanding applications.
Gpu GIMPS applications use the many cores present on a gpu in parallel, by default.
Furthermore, the various bit levels of TF, P-1 factoring, first primality test, and double check get assigned and run separately. At the project level, there is pipelining and multiple processing, with a variety of task sizes assigned separately. In the normal course of events, at most one assignment type is being worked on for a given exponent at a time.
It is possible to relatively efficiently parallelize some of the work on one exponent, as follows. The chance of the other parallel runs being a wasted effort is the chance of a factor being found by any of the runs. Choose gpu models according to availability and relative TF (int32) performance or PRP & P-1 (DP) performance.

For completing multiple TF levels remaining, split the workload according to gpu relative speeds across as many gpus as are available. If for example two identical gpus were available, allocate all TF levels except the highest to one, and the highest to the other.
If performing separate P-1, start that simultaneously on another gpu or cpu, and also start a PRP with proof run on a cpu or gpu.
If running combined PRP & P-1 such as gpuowl V7, run that on the fastest available gpu.
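
The TF split described above can be sketched as follows. This is only an illustration, not anything the GIMPS software does for you: the function name allocate_tf_levels and the gpu labels are made up, and it assumes each TF bit level costs roughly twice the work of the level below it. Levels are handed out largest first, each to the gpu that would finish it soonest given its relative speed.

```python
def allocate_tf_levels(bit_levels, gpu_speeds):
    """Assign TF bit levels to gpus, largest level first.

    bit_levels: iterable of ints, e.g. [73, 74, 75, 76]
    gpu_speeds: dict of hypothetical gpu label -> relative TF speed
    Assumption: work for a bit level doubles with each level.
    """
    levels = sorted(bit_levels, reverse=True)
    lowest = min(levels)
    load = {gpu: 0.0 for gpu in gpu_speeds}      # work units assigned so far
    plan = {gpu: [] for gpu in gpu_speeds}
    for level in levels:
        work = 2.0 ** (level - lowest)           # relative work for this level
        # Pick the gpu with the earliest projected finish time for this level.
        best = min(gpu_speeds, key=lambda g: (load[g] + work) / gpu_speeds[g])
        load[best] += work
        plan[best].append(level)
    return plan


# Two identical gpus: the highest level goes to one, the rest to the other.
plan = allocate_tf_levels([73, 74, 75, 76], {"gpu0": 1.0, "gpu1": 1.0})
print(plan)
```

With two identical gpus this reproduces the rule of thumb above, since the top level is about as much work as all lower levels combined. With unequal speeds the split shifts toward the faster gpu.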

And to quote Ernst Mayer, for verification of a run that's already done,
"Note that a secondary verification *can* readily be split across multiple independent machines, if the first run saves interim checkpoint files at regular intervals. Then each DC machine can start with one of said files, run to the next checkpoint iteration, if the result matches that of the first run there, and all said subinterval-DCs match thusly, one has a valid DC."
This would be a fast way of verifying or disproving an LL test or PRP run that indicated prime or probably prime the first time through. Only if all residues can be replicated from the set of previous saved residues is the run verified; a mismatch in any of the partial reruns is cause for doubt of the final residue.
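
A toy sketch of that checkpoint-split verification, using a plain Lucas-Lehmer iteration on a tiny exponent. The function names and the checkpoint layout (a dict of iteration number to residue) are invented for illustration and do not correspond to any real save-file format. In practice each segment would run on a different machine; here they simply run one after another.

```python
def ll_step(s, p):
    """One Lucas-Lehmer iteration modulo 2^p - 1."""
    return (s * s - 2) % ((1 << p) - 1)


def first_run(p, interval):
    """Run the full LL test, saving residues every `interval` iterations."""
    s = 4
    total = p - 2
    checkpoints = {0: s}
    for i in range(1, total + 1):
        s = ll_step(s, p)
        if i % interval == 0 or i == total:
            checkpoints[i] = s
    return checkpoints


def verify_segment(p, start_iter, start_res, end_iter, end_res):
    """What one DC machine does: rerun from one checkpoint to the next."""
    s = start_res
    for _ in range(end_iter - start_iter):
        s = ll_step(s, p)
    return s == end_res


def verify_run(p, checkpoints):
    """The run is verified only if every sub-interval rerun matches."""
    its = sorted(checkpoints)
    return all(
        verify_segment(p, a, checkpoints[a], b, checkpoints[b])
        for a, b in zip(its, its[1:])
    )


p = 127                                   # M127 is a known Mersenne prime
saved = first_run(p, interval=25)
print(saved[p - 2] == 0, verify_run(p, saved))
```

Because the segments are independent, the wall-clock time of the verification is roughly that of the longest segment rather than that of the whole run.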

Now, (b). Even the fastest available interconnection between computers is not fast enough to keep up with the demands of an application that would use multiple computers to compute one iteration of a PRP test, LL test, or P-1 factoring using ffts. There is overhead in shipping data over network connections, and no such application exists. Similarly, there is overhead in communicating between gpus via the PCIe interface, or from host to gpu and vice versa, and the data rates are much lower than the bandwidth available on an individual gpu. It's more efficient to run one exponent per gpu, to take advantage of the gpu-local bandwidth and the inherent parallelism of having many exponents to test or factor. Madpoo has empirically determined that on a dual-14-core-cpu system, maximum iteration speed for a single exponent test is reached at all cores of one socket plus six of the other socket. The communication between cpu sockets was not fast enough to employ more cores effectively. Total throughput is better if runs occupy sockets separately. Inefficiency means wasted power and time, and power x time x utility rate = money.
There are good reasons to not spend precious programmer time to support such configurations. There is so much work to do in the exponent range p < 10^9 that advances in hardware speed in future years will make the higher exponents' run times acceptable at some point on single systems or gpus, before the wavefront of most user activity reaches them.
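
The power x time x utility rate relation can be put into numbers with a few lines of arithmetic. All figures below (300 W, 100 versus 125 hours, 0.12 per kWh) are hypothetical and chosen only to show the calculation; they are not measurements from any system mentioned here.

```python
def energy_cost(watts, hours, rate_per_kwh):
    """Cost of running a load of `watts` for `hours` at `rate_per_kwh`."""
    kwh = watts * hours / 1000.0
    return kwh * rate_per_kwh


# Hypothetical: the same test takes 25% longer in an inefficient configuration.
efficient = energy_cost(watts=300, hours=100, rate_per_kwh=0.12)
inefficient = energy_cost(watts=300, hours=125, rate_per_kwh=0.12)
extra = inefficient - efficient
print(f"extra cost per test: {extra:.2f}")
```

The extra cost repeats for every exponent tested, so even a modest loss of efficiency adds up over many assignments.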

Last fiddled with by kriesel on 2021-02-16 at 18:49 Reason: added dual-cpu-socket Madpoo experience