mersenneforum.org > Great Internet Mersenne Prime Search > Software
2005-10-27, 15:54   #1
pacionet (Oct 2005, Italy)

Parallelization ...

Will it be possible, in the future, for a single exponent to be checked by several machines through software parallelization?
2005-10-27, 16:25   #2
garo (Aug 2002, Termonfeckin, IE)

Never say never, but it is highly unlikely, as the Lucas-Lehmer test is sequential: each step depends on the result of the previous one.
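The sequential dependence is explicit in a minimal sketch of the test (plain Python with no FFT multiplication, so only usable for tiny exponents):

```python
def lucas_lehmer(p):
    """Lucas-Lehmer test for M_p = 2^p - 1, for odd prime p."""
    M = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        # Each step consumes the entire previous residue s,
        # so the loop itself cannot be split across machines.
        s = (s * s - 2) % M
    return s == 0

# Exponents of the Mersenne primes below 2^32:
print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 29, 31) if lucas_lehmer(p)])
# -> [3, 5, 7, 13, 17, 19, 31]
```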

Last fiddled with by garo on 2005-10-27 at 16:25
2005-10-27, 17:20   #3
xilman (May 2003, Down not across)

Quote:
Originally Posted by garo
Never say never but it is highly unlikely as the Lucas-Lehmer test is sequential where each step depends on the result of the previous one.
However, the computation within each iteration can be parallelized and so be done on more than one machine at once.

Paul
2005-10-27, 18:41   #4
ewmayer (Sep 2002, República de California)

Quote:
Originally Posted by xilman
However, the computation within each iteration can be parallelized and so be done on more than one machine at once.
However, this is of little practical benefit (unless, e.g., a specific double-check is urgently needed), since the required interprocessor communication invariably leads to performance degradation; e.g. 4 CPUs operating in parallel might give the same overall throughput as 3 CPUs working independently.
2005-10-27, 19:29   #5
T.Rex (Feb 2004, France)

GLucas does the LLT by parallelizing the FFT work across several threads.
With very fast interconnects (Quadrics, for example) between several big machines with many CPUs, one may imagine replacing the threads with MPI, once it is clear which parts of the FFT memory are used by each thread on each machine. I guess the bigger the exponent being checked, the more efficient such an architecture becomes.
But porting mprime or GLucas or MLucas to MPI is very complicated, and very few computing centers with clusters connected by MPI/Quadrics would want to search for a new Mersenne prime. So the idea does not seem useful.

BUT, it may be interesting for some team to try to discover whether F33 (the 33rd Fermat number) is prime or not.
With only one very fast CPU (3.6 GHz), it should take about 4000 years of computation ...
But if one could use 16 clusters of 128 CPUs, it should take only 1 to 2 years ....
Since vendors (IBM/PPC, Sun/SPARC, Bull/Itanium2, ...) are now delivering machines with 32 or 64 very powerful processors with HT/SMT, and since there is now a race to put several cores (2 to 8, as in Cell) on one chip, it should be possible to check the primality of F33 in less than 1 year of computation on 4 machines with 64x8 cores; in 5 years from now, I guess.

Since the LLT can be used to prove a Fermat number prime or composite, mprime could be used.
But is it worth parallelizing mprime and using 4 very big machines in a cluster for a year for only a 1-bit result (F33 prime or not)?
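For reference, the standard LL-style test for Fermat numbers F_n = 2^(2^n) + 1 is Pépin's test: F_n is prime exactly when 3^((F_n - 1)/2) = -1 (mod F_n), i.e. a chain of 2^n - 1 modular squarings. A sketch without FFT arithmetic, so only practical for small n:

```python
def pepin(n):
    """Pepin's test for F_n = 2^(2^n) + 1, n >= 1."""
    F = (1 << (1 << n)) + 1
    r = 3
    for _ in range((1 << n) - 1):
        # 2^n - 1 sequential modular squarings: the same
        # dependency structure as the Lucas-Lehmer loop.
        r = r * r % F
    return r == F - 1

print([n for n in range(1, 10) if pepin(n)])  # -> [1, 2, 3, 4]
```

Only F_1 through F_4 are known primes; the chain length (2^33 - 1 squarings for F33) is what makes T.Rex's estimates so large.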
Fame ?

Tony

Last fiddled with by T.Rex on 2005-10-27 at 19:31
2005-10-27, 20:06   #6
xilman (May 2003, Down not across)

Quote:
Originally Posted by ewmayer
However, this is of little practical benefit (unless, e.g., a specific double-check is urgently needed), since the required interprocessor communication invariably leads to performance degradation; e.g. 4 CPUs operating in parallel might give the same overall throughput as 3 CPUs working independently.
You are, of course, completely correct.

However, I was discussing possibility rather than cost-effectiveness. It's possible to parallelize within an exponentiation step --- everyone who's thought about the situation agrees on that.

Most people who have thought about the situation have failed to come up with a scheme to parallelize the exponentiation itself.


Paul
2005-10-28, 19:44   #7
ewmayer (Sep 2002, República de California)

Quote:
Originally Posted by xilman
Most people who have thought about the situation have failed to come up with a scheme to parallelize the exponentiation itself.
That's the same problem we have with, e.g., P-1 factoring: we can easily distribute the stage 2 powerings, but not stage 1.
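Stage 1's problem in miniature: it is one long power chain, each modular exponentiation feeding the next, whereas stage 2 is an independent check per prime. A toy sketch of stage 1 (the modulus and bound B1 below are made-up illustrations, not realistic values):

```python
from math import gcd

def pminus1_stage1(N, B1, a=2):
    """P-1 stage 1: compute a^E mod N with E the product of all
    maximal prime powers <= B1, then take a gcd.  The power chain
    is strictly sequential, which is what blocks distribution."""
    primes = [p for p in range(2, B1 + 1) if all(p % d for d in range(2, p))]
    x = a
    for p in primes:
        pk = p
        while pk * p <= B1:          # largest power of p not exceeding B1
            pk *= p
        x = pow(x, pk, N)            # each step depends on the previous x
    return gcd(x - 1, N)

# 233 - 1 = 2^3 * 29 is 29-smooth, so B1 = 29 pulls out that factor:
print(pminus1_stage1(233 * 2027, B1=29))  # -> 233
```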
2005-10-29, 19:58   #8
jasonp (Oct 2004)

Quote:
Originally Posted by T.Rex
But is it worth parallelizing mprime and using 4 very big machines in a cluster for a year for only a 1-bit result (F33 prime or not)?
To finish F33 in one year you'd have to manage 39 squarings per second. The (complex) FFT size would be 2^28, and you'd need two FFTs per squaring. Assuming an FFT of size N requires 5*N*log2(N) FPU ops (an overestimate, but we're not counting DWT weights, pointwise squaring or carry propagation), your Pepin tester would need a sustained throughput of 2.93 teraflops.
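Spelled out with the figures above (N = 2^28 complex, 5*N*log2(N) flops per FFT, two FFTs per squaring, 39 squarings per second):

```python
N = 2 ** 28                          # complex FFT size
flops_per_fft = 5 * N * 28           # 5 * N * log2(N)
flops_per_squaring = 2 * flops_per_fft
sustained = 39 * flops_per_squaring  # squarings/s * flops/squaring
print(round(sustained / 1e12, 2))    # -> 2.93 (teraflops)
```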

You'd need hundreds of ordinary computers to manage that, and this isn't going to scale nearly as well as Linpack. You could technically use a few dozen of ClearSpeed's chips, but I doubt you'd be able to manage the off-chip memory bandwidth needed to keep them all busy.

jasonp
2005-10-29, 20:02   #9
akruppa (Aug 2002, Alexandria)

Quote:
Originally Posted by ewmayer
That's the same problem we have e.g. with p-1 factoring - we can easily distribute the stage 2 powerings, but not stage 1.
Just a thought... how about using a longer FFT length which should reduce parallelisation overhead, and doing more than one squaring in frequency space?

Alex
2005-10-30, 14:57   #10
fatphil (May 2003)

Quote:
Originally Posted by akruppa
Just a thought... how about using a longer FFT length which should reduce parallelisation overhead, and doing more than one squaring in frequency space?

Alex
A longer FFT length might not work at all well, as you'd end up with more overhead than real bits of data. Using much larger limbs would be more likely to work, and that leads you toward the NTT path, IMHO.

The double-word limbs (complex floating-point transform, not NTT) that I'm currently using heavily in PIES/LG unfortunately don't appear to help at all, as they're fractionally 'noisier' than a double-length transform would be.

Phil