mersenneforum.org Is a 18M-FFT sufficient to test a 100M number?
 Register FAQ Search Today's Posts Mark Forums Read

 2010-05-18, 02:43 #1 __HRB__     Dec 2008 Boycotting the Soapbox 24·32·5 Posts Is a 18M-FFT sufficient to test a 100M number? I compute: 100.000.000/log10(2)/18M~=17.64 bits/double According to http://www.loria.fr/~gaudry/publis/issac07.pdf prime95 uses 17.76 bits for a 32M FFT, but it doesn't mention whether 80-bit temporaries (how much impact does this have?) were used or not. If an 18M FFT is sufficient, then an implementation should be at least 10% faster than a 20M FFT, because length-3 FFTs are less complicated than length-5 FFTs.
2010-05-18, 05:48   #2
Batalov

"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

100110101011012 Posts

Quote:
 Originally Posted by Prime95 At no point in time did I say I was going to implement this. I can guarantee it won't happen this decade.
...but now, a new decade started! There's a chance, man.

 2010-05-18, 06:06 #3 davieddy     "Lucan" Dec 2006 England 2·3·13·83 Posts For all we know this may be the highspot of Bolivia. Sundance PS Keep thinking Butch - that's what you're good at. Last fiddled with by davieddy on 2010-05-18 at 06:40
2010-05-18, 07:13   #4
davieddy

"Lucan"
Dec 2006
England

2×3×13×83 Posts

Quote:
 Originally Posted by davieddy For all we know this may be the highspot of Bolivia. Sundance PS Keep thinking Butch - that's what you're good at.
PPS Who are those guys?

2010-05-18, 13:58   #5
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

22·5·397 Posts

Quote:
 Originally Posted by __HRB__ According to...prime95 uses 17.76 bits for a 32M FFT, but it doesn't mention whether 80-bit temporaries (how much impact does this have?) were used or not. If an 18M FFT is sufficient, then an implementation should be at least 10% faster than a 20M FFT, because length-3 FFTs are less complicated than length-5 FFTs.
80-bit temporaries were only used in the x87 FFT code which only supported FFT lengths up to 4M. 80-bit temporaries only had a modest impact (~1%) since every write to memory rounded to a 64-bit double.

Yes, an 18M FFT is sufficient for the smallest 100M numbers.

Quote:
 but now, a new decade started! There's a chance, man
A pretty good chance.

V25.12:
Code:
Time FFTlen=20480K, Arch=3, Pass1=2560, Pass2=8192, clm=2: 522.924 ms., 523.432 ms.
V26.1:
Code:
Time FFTlen=18432K, Arch=3, Pass1=4096, Pass2=4608, clm=4: 396.830 ms., 397.333 ms.
Time FFTlen=18432K, Arch=3, Pass1=4096, Pass2=4608, clm=2: 400.698 ms., 401.327 ms.

Time FFTlen=20480K, Arch=3, Pass1=2048, Pass2=10240, clm=4: 431.619 ms., 441.531 ms.
Time FFTlen=20480K, Arch=3, Pass1=2048, Pass2=10240, clm=2: 438.762 ms., 439.651 ms.
Time FFTlen=20480K, Arch=3, Pass1=2560, Pass2=8192, clm=4: 420.561 ms., 421.185 ms.
Time FFTlen=20480K, Arch=3, Pass1=2560, Pass2=8192, clm=2: 427.008 ms., 427.514 ms.

 2010-05-18, 16:50 #6 henryzz Just call me Henry     "David" Sep 2007 Liverpool (GMT/BST) 599210 Posts Hows v26 coming along? It looks like there is quite a speed increase there which would be nice to use. Does that speed increase continue down to sizes used by things like pfgw and llr?
2010-05-18, 17:20   #7
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

22×5×397 Posts

Quote:
 Originally Posted by henryzz Hows v26 coming along? It looks like there is quite a speed increase there which would be nice to use. Does that speed increase continue down to sizes used by things like pfgw and llr?
It's proceeding nicely. All two-pass FFTs have been recoded. This means a lot of QA will be necessary. The speed increase is less pronounced at smaller FFT sizes. FFT sizes above 64K will see a benefit -- including pfgw and llr.

 2010-05-21, 20:56 #8 TheJudger     "Oliver" Mar 2005 Germany 23×139 Posts Hi, nice speedup from an allready incredible fast code. I'm just curious which is your strategy for performance optimizations. - just run one test with one thread and aim for best performance? - run N tests with one thread each and aim for best performance? - run one test with N threads and aim for best performance? (N is the number of cores of your system(s)) I remember that you said something like v24/v25 are optimized for "low memory bandwidth" like P4 (and Core 2 CPUs) and v26 will benefit from higher memory bandwidth of e.g. Core i7. So will we see those speedup on real world usage (utilizing all cores), too or does memory bandwidth bottleneck this improvement somehow? Oliver Last fiddled with by TheJudger on 2010-05-21 at 20:57
2010-05-21, 23:29   #9
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

11111000001002 Posts

Quote:
 Originally Posted by TheJudger I'm just curious which is your strategy for performance optimizations. - just run one test with one thread and aim for best performance?
Yes, this one.

Quote:
 - run N tests with one thread each and aim for best performance?
This would be better, but is much harder to simulate and get consistent results. In the past, improvements to the one worker case has always improved the N workers case.

Perhaps I can release a pre-pre-beta candidate and let you folks run the benchmark with the other N-1 cores running LL tests. It might lead to several different preferred FFT implementations (e.g. the 2560K FFT has 6 different implementations with different pass 2 sizes and therefore subtly different cache and bandwidth requirements).

Quote:
 - run one test with N threads and aim for best performance?

Quote:
 I remember that you said something like v24/v25 are optimized for "low memory bandwidth" like P4 (and Core 2 CPUs) and v26 will benefit from higher memory bandwidth of e.g. Core i7. So will we see those speedup on real world usage (utilizing all cores), too or does memory bandwidth bottleneck this improvement somehow?
I'm running v26 on a dual-core P4, a Core 2, and an i7. All are faster than v25.

The basic MASM building blocks have been optimized for 32-bit and 64-bit Core 2 architecture. I really need to take a month or two and optimize them for the P4, K8, and K10 architectures.

2010-05-22, 19:42   #10
TheJudger

"Oliver"
Mar 2005
Germany

21308 Posts

Quote:
 Originally Posted by Prime95 Perhaps I can release a pre-pre-beta candidate and let you folks run the benchmark with the other N-1 cores running LL tests. It might lead to several different preferred FFT implementations (e.g. the 2560K FFT has 6 different implementations with different pass 2 sizes and therefore subtly different cache and bandwidth requirements).
When you think that this helps you in your development I think there are some people around here with access to some big and fancy toys...

Quote:
 Originally Posted by Prime95 The basic MASM building blocks have been optimized for 32-bit and 64-bit Core 2 architecture. I really need to take a month or two and optimize them for the P4, K8, and K10 architectures.
Take the time you need! Prime95 v25 isn't that bad at all.
I'm wondering how the distribution of CPU type is within the primenet. Is it worth to spent much time on P4 optimization?

Do you plan to fix/increase the limit of 32 cores in v26?

Oliver

2010-05-22, 21:24   #11
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

22×5×397 Posts

Quote:
 Originally Posted by TheJudger Do you plan to fix/increase the limit of 32 cores in v26?
It is on my to-do list. There are two areas affected: max number of worker windows and max number of threads for one LL test.

 Similar Threads Thread Thread Starter Forum Replies Last Post Alberico Lepore Alberico Lepore 22 2018-01-03 16:17 devarajkandadai Number Theory Discussion Group 7 2017-09-23 02:58 biggerben Software 7 2014-10-24 05:47 JuanTutors Lounge 6 2012-02-21 07:36 joblack LMH > 100M 1 2009-10-08 12:31

All times are UTC. The time now is 17:13.

Mon Aug 8 17:13:18 UTC 2022 up 32 days, 12 hrs, 1 user, load averages: 1.22, 1.17, 1.25