
mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   Parallelization ... (https://www.mersenneforum.org/showthread.php?t=4902)

pacionet 2005-10-27 15:54

Parallelization ...
 
Will it be possible, in the future, for a single exponent to be checked by different machines via software parallelization?

garo 2005-10-27 16:25

Never say never, but it is highly unlikely, as the Lucas-Lehmer test is sequential: each step depends on the result of the previous one.
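The recurrence in question can be written as a minimal Python sketch; the sequential dependence garo describes is the whole loop, since each iterate is computed from the previous one:

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test: M_p = 2^p - 1 (p an odd prime) is prime
    iff s_{p-2} == 0, where s_0 = 4 and s_{i+1} = s_i^2 - 2 (mod M_p)."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m   # needs the previous s: inherently sequential
    return s == 0
```

For example, `lucas_lehmer(13)` confirms M13 = 8191 is prime, while `lucas_lehmer(11)` reports M11 = 2047 = 23 x 89 composite.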

xilman 2005-10-27 17:20

[QUOTE=garo]Never say never, but it is highly unlikely, as the Lucas-Lehmer test is sequential: each step depends on the result of the previous one.[/QUOTE]
However, the computation within each iteration can be parallelized and so be done on more than one machine at once.

Paul

ewmayer 2005-10-27 18:41

[QUOTE=xilman]However, the computation within each iteration can be parallelized and so be done on more than one machine at once.[/QUOTE]However, this is of little practical benefit (unless, e.g., a specific double-check is urgently needed), since the required interprocessor communication invariably degrades performance: 4 CPUs operating in parallel might give the same overall throughput as 3 CPUs working independently.

T.Rex 2005-10-27 19:29

GLucas does LLT by parallelizing the FFT work with several threads.
With very fast interconnect (Quadrics, for example) available between several big machines with many CPUs, one may imagine replacing the threads with MPI, once it has been made clear which parts of the FFT memory are used by each thread on each machine. I guess the bigger the exponent to be checked, the more efficient such an architecture would be.
But moving mprime or GLucas or MLucas to MPI is very complicated, and very few computing centers with clusters connected by MPI/Quadrics would want to search for a new Mersenne prime. So, such an idea does not seem useful.
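To illustrate which parts of an FFT-based squaring are amenable to threads or MPI ranks, here is a hypothetical sketch (the function name and base-10 "limbs" are mine for illustration, not how GLucas or mprime actually store residues). The transforms and the pointwise squaring are the parallelizable bulk of the work; only the final carry propagation is a cheap sequential pass:

```python
import numpy as np

def fft_square(digits, base=10):
    """Square a big integer given as little-endian base-`base` digits,
    via cyclic convolution: FFT -> pointwise square -> inverse FFT -> carries."""
    n = len(digits)
    size = 1
    while size < 2 * n:          # room for the 2n-1 product coefficients
        size *= 2
    A = np.fft.rfft(np.asarray(digits, dtype=float), size)
    # forward/inverse transforms and this pointwise square parallelize well
    coeffs = np.rint(np.fft.irfft(A * A, size)).astype(np.int64)
    out, carry = [], 0
    for c in coeffs:             # carry propagation: sequential but cheap
        carry += int(c)
        out.append(carry % base)
        carry //= base
    while carry:
        out.append(carry % base)
        carry //= base
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out                   # little-endian digits of digits**2
```

For instance, `fft_square([3, 2, 1])` (i.e. 123) returns `[9, 2, 1, 5, 1]`, i.e. 15129 = 123^2.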

BUT, it may be interesting for some team to try to discover whether F33 (the 33rd Fermat number) is prime or not.
With only one very fast CPU (3.6 GHz), it should take about 4000 years of computation ...
But, if one could use 16 clusters of 128 CPUs, it should take only 1 to 2 years ....
Since vendors (IBM/PPC, Sun/SPARC, Bull/Itanium2, ...) are now delivering machines with 32 or 64 very powerful processors with HT/SMT, and since there is now a race to put several cores (2 to 8, as in Cell) on one chip, it should be possible to check the primality of F33 in less than 1 year of computation on 4 machines with 64x8 cores - in 5 years from now, I guess.

Since a Lucas-Lehmer-style test (Pepin's test) can be used to prove a Fermat number prime or composite, mprime could be used.
But is it worth parallelizing mprime and tying up 4 very big machines in a cluster for 1 year for only a 1-bit result (F33 prime or not)?
Fame ?

Tony
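For reference, the test in question for Fermat numbers is Pepin's test, which has the same sequential squaring chain as the LL test; a minimal Python sketch (only practical for tiny n, of course):

```python
def pepin(n):
    """Pepin's test: F_n = 2^(2^n) + 1 is prime iff 3^((F_n-1)/2) == -1 (mod F_n).
    The powering is a chain of 2^n - 1 modular squarings, so, as with the
    LL test, each step depends on the previous one."""
    f = (1 << (1 << n)) + 1
    return pow(3, (f - 1) // 2, f) == f - 1
```

`pepin(4)` confirms F4 = 65537 is prime, while `pepin(5)` reports F5 composite; for F33, each of the ~2^33 squarings works on a number of about 2^33 bits.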

xilman 2005-10-27 20:06

[QUOTE=ewmayer]However, this is of little practical benefit (unless, e.g., a specific double-check is urgently needed), since the required interprocessor communication invariably degrades performance: 4 CPUs operating in parallel might give the same overall throughput as 3 CPUs working independently.[/QUOTE]
You are, of course, completely correct.

However, I was discussing possibility rather than cost-effectiveness. It's possible to parallelize within an exponentiation step --- everyone who's thought about the situation agrees on that.

Most people who have thought about the situation have failed to come up with a scheme to parallelize the exponentiation itself.


Paul

ewmayer 2005-10-28 19:44

[QUOTE=xilman]Most people who have thought about the situation have failed to come up with a scheme to parallelize the exponentiation itself.[/QUOTE]That's the same problem we have, e.g., with P-1 factoring - we can easily distribute the stage 2 powerings, but not stage 1.
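The stage-1 bottleneck ewmayer mentions is visible in even a toy P-1 implementation (a hypothetical sketch; real implementations batch the exponent, but the dependence is the same): every modular exponentiation feeds the next, so the chain cannot be split across machines, whereas stage 2 consists of independent powerings.

```python
from math import gcd

def pminus1_stage1(n, b1, a=2):
    """P-1 stage 1: raise a to every prime power <= b1, one prime at a time.
    Finds a prime factor p of n whenever p - 1 is b1-smooth."""
    x = a
    for p in range(2, b1 + 1):
        if all(p % q for q in range(2, int(p**0.5) + 1)):   # p is prime
            e = 1
            while p ** (e + 1) <= b1:
                e += 1
            x = pow(x, p ** e, n)   # each powering needs the previous x
    g = gcd(x - 1, n)
    return g if 1 < g < n else None
```

For example, `pminus1_stage1(101 * 1061, 25)` recovers 101, since 101 - 1 = 2^2 * 5^2 is 25-smooth while 1061 - 1 = 2^2 * 5 * 53 is not.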

jasonp 2005-10-29 19:58

[QUOTE=T.Rex]
But, is it worth to parallelize mprime and to use 4 very big machines in cluster during 1 year for only 1 bit result (F33 prime or not) ?
[/QUOTE]
To finish F33 in one year you'd have to manage roughly 270 squarings per second (Pepin's test needs 2^33 - 1 of them). The (complex) FFT size would be 2^28, and you'd need two FFTs per squaring. Assuming an FFT of size N requires 5*N*log2(N) FPU ops (an overestimate, but we're not counting DWT weights, pointwise squaring or carry propagation), your Pepin tester would need a sustained throughput of roughly 20 teraflops.

You'd need thousands of ordinary computers to manage that, and this isn't going to scale nearly as well as linpack. You could technically use Clearspeed's chips, but I doubt you'd be able to manage the off-chip memory bandwidth needed to keep them all busy.
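The throughput figure can be re-derived from the stated assumptions (Pepin's test on F33 needs 2^33 - 1 squarings; each squaring costs two complex FFTs of size N = 2^28, charged at 5*N*log2(N) flops); a quick sanity-check sketch:

```python
# Required sustained flop rate to finish F33's Pepin test in one year,
# under the cost model stated above.
SECONDS_PER_YEAR = 365.25 * 86400

squarings = 2**33 - 1                      # squaring chain length for F33
rate = squarings / SECONDS_PER_YEAR        # squarings per second for a 1-year run
flops_per_squaring = 2 * 5 * 2**28 * 28    # two FFTs at 5*N*log2(N), N = 2^28
sustained = rate * flops_per_squaring      # required sustained flop rate

print(f"{rate:.0f} squarings/s, {sustained / 1e12:.1f} Tflop/s sustained")
```

This works out to about 272 squarings per second and roughly 20.5 Tflop/s sustained.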

jasonp

akruppa 2005-10-29 20:02

[QUOTE=ewmayer]That's the same problem we have e.g. with p-1 factoring - we can easily distribute the stage 2 powerings, but not stage 1.[/QUOTE]

Just a thought... how about using a longer FFT length which should reduce parallelisation overhead, and doing more than one squaring in frequency space?

Alex
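Alex's idea can be illustrated with a small NumPy experiment (a hypothetical demo, not a workable transform for real residues, since it omits the modular reduction between squarings): staying in frequency space, each pointwise squaring of the transform is one more self-convolution of the digit vector, so two squarings yield the coefficients of a^4 - at the price of much larger coefficients, which is where the extra FFT headroom would go.

```python
import numpy as np

a = np.array([3.0, 2.0, 1.0])           # digits of 123, little-endian, base 10
size = 16                                # room for the degree-8 product (9 coeffs)
A = np.fft.rfft(a, size)
c4 = np.rint(np.fft.irfft(A**4, size))   # two pointwise squarings: (A**2)**2

# same coefficients as convolving the digit vector with itself twice over
direct = np.convolve(np.convolve(a, a), np.convolve(a, a))
assert np.array_equal(c4[:len(direct)], np.rint(direct))

# carry propagation turns the coefficient vector back into an integer
n = sum(int(d) * 10**i for i, d in enumerate(c4))
assert n == 123**4
```

The catch, as Phil notes, is precision: the fourth-power coefficients are far larger than the squaring coefficients, so each extra squaring in frequency space eats into the bits-per-limb budget.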

fatphil 2005-10-30 14:57

[QUOTE=akruppa]Just a thought... how about using a longer FFT length which should reduce parallelisation overhead, and doing more than one squaring in frequency space?

Alex[/QUOTE]

Longer FFT length might not work at all well, as you'd end up with more overhead than real bits of data. Using much larger limbs would be more likely to work, and that leads you towards the NTT path, IMHO.

The double-word limbs (complex floating-point transform, not NTT) that I'm currently using heavily in PIES/LG unfortunately don't appear to help at all, as they're fractionally 'noisier' than a double-length transform would be.

Phil

