mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Information & Answers (https://www.mersenneforum.org/forumdisplay.php?f=38)
-   -   combining different PCs to search same exponent (https://www.mersenneforum.org/showthread.php?t=23959)

ssybesma 2019-01-02 19:54

combining different PCs to search same exponent
 
Was poking around and did not find a direct answer to this.


Is this possible? If so I would like to do it for one of the 100-million digit primes (anything over 332192810*) since they take so long to test.

* - doing the Spock thing, more accurately over 332192809.4887361824512481... (this is from the [URL]https://mrob.com/pub/comp/hypercalc/hypercalc-javascript.html[/URL] website)
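The footnote's threshold can be reproduced in a couple of lines of Python (a quick sanity check, not part of the original post):

```python
import math

# Smallest real-valued exponent p for which 2^p reaches 10^(10^8),
# i.e. the 100-million-digit boundary quoted in the footnote.
threshold = 10**8 / math.log10(2)
print(f"{threshold:.4f}")    # ≈ 332192809.4887
print(math.ceil(threshold))  # 332192810, hence "anything over"
```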

Thanks!

chalsall 2019-01-02 20:01

[QUOTE=ssybesma;504689]Is this possible?[/QUOTE]

No.

science_man_88 2019-01-02 20:08

[QUOTE=ssybesma;504689]Was poking around and did not find a direct answer to this.


Is this possible? If so I would like to do it for one of the 100-million digit primes (anything over 332192810) since they take so long to test.



Thanks![/QUOTE]

In theory yes, in practice no, and it's likely to not help at all and actually hurt performance by a million miles.

Since the LL test, as implemented, in effect chains together values two less than quadratic residues in a certain way, individual steps could in principle be computed in parallel. But since there are up to half as many quadratic residues mod a number as there are remainders mod that number, in practice the data will not all fit at once, even for that first stage, on any set of computers.
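The sequential dependence is easy to see in a toy version of the test (a minimal sketch of my own; real implementations like prime95 square via FFT/IBDWT rather than bignum arithmetic):

```python
def lucas_lehmer(p: int) -> bool:
    """Lucas-Lehmer test: M(p) = 2^p - 1 is prime iff s hits 0."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        # Each iteration needs the full previous s: the chain cannot
        # be split across machines without shipping the whole residue.
        s = (s * s - 2) % m
    return s == 0

# M(7) = 127 is prime; M(11) = 2047 = 23 * 89 is not.
print(lucas_lehmer(7), lucas_lehmer(11))  # True False
```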

If you are asking whether it could do TF, P-1, PRP, and LL all at once, then again: in theory yes. However, in practice the memory requirements are too great, and if TF finds a factor, the other tests are wasted.

Plus, the code for that LL test variant is not implemented anyway.

ssybesma 2019-01-02 20:18

[QUOTE=science_man_88;504691]In theory yes, in practice no, and it's likely to not help at all and actually hurt performance by a million miles.

<snipped> [/QUOTE]

OK, that was a good explanation. Thank you.

Uncwilly 2019-01-02 21:10

To add to what has been stated: unless the cores are in the same box and have equal access to the same memory, it is not practical. Multiple CPUs on the same motherboard can work together. But if the objective is to test as many of those numbers as fast as possible, having each CPU work on a different exponent is better (provided the RAM can handle it).

M344587487 2019-01-02 21:47

The most practical way to "combine" PCs for primality testing is to use two similarly specced PCs and run the same test independently, comparing interim results to catch and recover from errors, which are much more likely to occur with larger exponents. Even that is of dubious merit, particularly now that PRP and better error detection are used for first-time tests.
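The cross-check itself is simple: each machine logs a small fingerprint of its residue at fixed checkpoints (GIMPS tools conventionally report the low 64 bits, the "res64"), and the logs are compared. A hypothetical sketch (the checkpoint values below are made up):

```python
def res64(s: int) -> int:
    """Low 64 bits of a full residue, the usual interim fingerprint."""
    return s & 0xFFFFFFFFFFFFFFFF

def first_divergence(run_a, run_b):
    """Compare checkpoint fingerprints from two independent runs.

    run_a, run_b: lists of (iteration, res64) pairs taken at the same
    checkpoints. Returns the first mismatching iteration, or None.
    """
    for (i, a), (j, b) in zip(run_a, run_b):
        assert i == j, "checkpoints must line up"
        if a != b:
            return i
    return None

# Hypothetical logs: machine B's residue flips a bit at iteration 2000.
a = [(1000, 0x1234), (2000, 0xBEEF), (3000, 0xCAFE)]
b = [(1000, 0x1234), (2000, 0xBEEE), (3000, 0xCAFE)]
print(first_divergence(a, b))  # 2000
```

In practice a mismatch means rolling both runs back to the last agreeing checkpoint and resuming from there.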

kriesel 2019-01-02 21:56

A crash program to attack one 100M-digit candidate might be done as follows:
TF from the current bit level to goal level - 1 bit, on one gpu.
TF from goal level - 1 to the final bit level, in parallel on a second gpu.
PRP/GC with P-1 built in on a third, AMD gpu with gpuowl 5.0. (Or it might be faster to run P-1 separately in parallel on CUDAPm1 and run PRP/GC with gpuowl v3.5 to 3.8.)
If the gpus are equal or close in speed, the factoring runs finish much earlier than the primality test.
(TF time or P-1 time is typically somewhere around ~1/40 of the primality test time for the same exponent and hardware, as I recall.)
The odds of finding a factor by TF or P-1 are low but not zero, around 3% for P-1.
If gpu one or two finds a factor, the time spent by the others was unnecessary.
Therefore it's better to pipeline: one gpu screens candidate exponents by TF, and other gpus and cpus follow with P-1 or a primality test (LL or PRP, not both) in parallel, on exponents that were previously TF-screened.

Throwing multiple _cores_ at a single exponent is standard operating procedure in prime95/mprime when multiple cores are available. There is a requirement for a very high data rate between such cooperating cores. Package-local cache is best. Memory transfer rates are typically the limiting factor in performance. (George experiments with different code sequences that trade more instructions for fewer memory accesses.) Even the package-to-package transfer rate on dual-Xeon PCs is slow by comparison, and shows up as a slowdown in benchmarks when a prime95 worker uses cores from both packages. (See [URL]https://www.mersenneforum.org/showpost.php?p=504218&postcount=4[/URL] and [URL]https://www.mersenneforum.org/showpost.php?p=504219&postcount=5[/URL] for benchmark results on several different hardware configurations.) Available networking speeds between computers are slow compared to that transfer rate, so there is no point in attempting to distribute a single pseudoprime or Lucas-Lehmer iteration across multiple computers; it would produce a slowdown, not a speedup. As I recall, a similar argument has been made against using multiple gpus on a single exponent: the PCIe communication is not fast enough, and SLI or Crossfire may also lack sufficient speed. That PCIe speed limit would also constrain coprocessor cards bearing cpus and RAM.
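The networking argument can be made quantitative with a back-of-envelope estimate (illustrative numbers of my own, not from the post: a ~100M-digit exponent, a 10 GbE link, and one full-residue transfer per iteration):

```python
# Why distributing one LL/PRP iteration over a network loses: every
# iteration needs the full previous residue, so the network transfer
# alone dwarfs the local per-iteration compute time.
P = 332_192_831                 # a hypothetical ~100M-digit exponent
residue_bytes = P / 8           # ~41.5 MB per full residue
gbit_per_s = 10                 # a fast 10 GbE link
transfer_s = residue_bytes * 8 / (gbit_per_s * 1e9)
total_days = transfer_s * P / 86400  # ~P iterations, one transfer each
print(f"{transfer_s * 1000:.1f} ms per transfer, "
      f"~{total_days:.0f} days spent just moving data")
```

Roughly 33 ms per transfer and over a hundred days of pure data movement, before any arithmetic is done; local cache-to-cache traffic is orders of magnitude faster, which is why cooperating cores must share a package.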

If tackling a long run such as a 100M-digit exponent primality test, the following are recommended:
[LIST=1][*]benchmarking;[*]adjusting the number of cores per worker for a good throughput and latency tradeoff;[*]testing for memory errors, and then doing at least one successful double check first;[*]using PRP/GC for superior error detection and correction, which repeats some iterations as in gpuowl or prime95/mprime (not LL/Jacobi at 50% detection, or naked LL with no Jacobi test as in CUDALucas);[*]most importantly, first checking the exponent's current status and reserving the assignment. Check status at [URL]https://www.mersenne.org/report_exponent/[/URL] and reserve at [URL]https://www.mersenne.org/manual_assignment/[/URL] or [URL]https://www.mersenne.org/manual_gpu_assignment/[/URL][/LIST]

science_man_88 2019-01-02 22:34

[QUOTE=Uncwilly;504698]To added to what has been stated. Unless the cores are in the same box and have equal access to the same memory it is not practical. Multiple CPU's on the same motherboard can work together. But, if the objective is to test as many of those numbers as fast as possible, having each CPU work on a different exponent is better (provided the RAM can handle it.)[/QUOTE]

Yeah, using info from [url]https://www.zdnet.com/article/what-is-the-worlds-data-storage-capacity/[/url] I extrapolate current worldwide storage to be about 21 yottabits, enough to store all quadratic residues as p-bit values, with no spacing, for any single p < 81.
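The residue count behind that estimate can be checked directly for small Mersenne primes (a sketch of my own, assuming each residue is stored as a raw p-bit value):

```python
def quadratic_residues(m: int) -> int:
    """Number of distinct squares mod m (0 included)."""
    return len({x * x % m for x in range(m)})

for p in (5, 7, 13):      # Mersenne primes M(5)=31, M(7)=127, M(13)=8191
    m = (1 << p) - 1
    n = quadratic_residues(m)
    print(p, n, n * p)    # count, and total bits at p bits per residue
```

For prime M(p) there are exactly (M+1)/2 distinct squares mod M (counting 0), so the storage needed roughly doubles with each unit increase in p; yottabit-scale capacity is exhausted somewhere around p ≈ 80.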

Mysticial 2019-01-02 22:48

If the LL iteration requires significantly more memory than the system has, then it is possible for multiple computers to outperform a single computer. Especially if the networking can be made faster than the disk storage.

But in order to spill memory, we're talking exponents on the order of a trillion. So you can do a few iterations. But you're not running it to completion with today's technology.
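To see why it takes a trillion-scale exponent to spill memory, here is a back-of-envelope estimate of my own (the 18-bits-per-double packing and the three-buffer count are illustrative guesses, not figures from the post):

```python
def ll_memory_bytes(p: int, bits_per_word: float = 18.0,
                    buffers: int = 3) -> float:
    """Rough RAM needed for one LL/PRP iteration at exponent p.

    Assumes an FFT of p / bits_per_word double-precision (8-byte)
    words and `buffers` full-size working arrays.
    """
    words = p / bits_per_word
    return words * 8 * buffers

for p in (10**8, 10**9, 10**12):
    print(f"p = {p:.0e}: ~{ll_memory_bytes(p) / 2**30:.1f} GiB")
```

At p = 10^12 this already exceeds a terabyte of RAM, far past what a single consumer machine holds, while a 10^8 exponent fits comfortably in a fraction of a gigabyte.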

kriesel 2019-05-06 16:57

[QUOTE=Mysticial;504707]If the LL iteration requires significantly more memory than the system has, then it is possible for multiple computers to outperform a single computer. Especially if the networking can be made faster than the disk storage.

But in order to spill memory, we're talking exponents on the order of a trillion. So you can do a few iterations. But you're not running it to completion with today's technology.[/QUOTE]Does any such efficiently implemented code exist? As far as I know, Mlucas at p<2[SUP]32[/SUP] is currently the highest available exponent support for primality testing of Mersenne primes. And as you imply, run time on available hardware is what is limiting.
Roughly 2 million ~100Mbit candidates can be tested in the time of one 100Gbit candidate or 1/126 the time of one trillion-bit candidate. For a sense of time scale, a 100Mbit candidate takes days or weeks or months to primality test once, depending on hardware performance.


(Edit: above relates to consumer budget hardware. Non-billionaire consumers. ;)
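The ratios quoted above roughly follow from a crude cost model: one full test is ~p squarings, each costing ~p·log p via FFT (my own model, not kriesel's arithmetic):

```python
import math

def relative_cost(p: float) -> float:
    """Crude total-work model for one primality test at exponent p."""
    return p * p * math.log(p)

print(relative_cost(1e11) / relative_cost(1e8))  # ~1.4 million
print(relative_cost(1e12) / relative_cost(1e11)) # ~109
```

This gives roughly 1.4 million and 110 for the two ratios, the same order of magnitude as the "roughly 2 million" and "1/126" figures above; exact numbers depend on the FFT cost model and hardware.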

Mysticial 2019-05-06 17:32

[QUOTE=kriesel;515945]Does any such efficiently implemented code exist? As far as I know, Mlucas at p<2[SUP]32[/SUP] is currently the highest available exponent support for primality testing of Mersenne primes. And as you imply, run time on available hardware is what is limiting.
Roughly 2 million ~100Mbit candidates can be tested in the time of one 100Gbit candidate or 1/126 the time of one trillion-bit candidate. For a sense of time scale, a 100Mbit candidate takes days or weeks or months to primality test once, depending on hardware performance.[/QUOTE]

Yes and no. There do exist supercomputer implementations of large multiplication using MPI; that's how Pi computations were done in the 1990s and 2000s. But these were likely acyclic-convolution implementations for integer multiplication rather than the IBDWT used for the LL mul-mod.

At these sizes even the best supercomputer implementations will probably take minutes to do a single 1 trillion digit multiply.

