mersenneforum.org  

Go Back   mersenneforum.org > New To GIMPS? Start Here! > Information & Answers

Reply
 
Thread Tools
Old 2010-05-18, 02:43   #1
__HRB__
 
__HRB__'s Avatar
 
Dec 2008
Boycotting the Soapbox

13208 Posts
Default Is a 18M-FFT sufficient to test a 100M number?

I compute: 100.000.000/log10(2)/18M~=17.64 bits/double

According to http://www.loria.fr/~gaudry/publis/issac07.pdf prime95 uses 17.76 bits for a 32M FFT, but it doesn't mention whether 80-bit temporaries (how much impact does this have?) were used or not.

If an 18M FFT is sufficient, then an implementation should be at least 10% faster than a 20M FFT, because length-3 FFTs are less complicated than length-5 FFTs.
__HRB__ is offline   Reply With Quote
Old 2010-05-18, 05:48   #2
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

990110 Posts
Default

Quote:
Originally Posted by Prime95 View Post
At no point in time did I say I was going to implement this. I can guarantee it won't happen this decade.
...but now, a new decade started! There's a chance, man.
Batalov is offline   Reply With Quote
Old 2010-05-18, 06:06   #3
davieddy
 
davieddy's Avatar
 
"Lucan"
Dec 2006
England

194A16 Posts
Default

For all we know this may be the highspot of Bolivia.

Sundance

PS Keep thinking Butch - that's what you're good at.

Last fiddled with by davieddy on 2010-05-18 at 06:40
davieddy is offline   Reply With Quote
Old 2010-05-18, 07:13   #4
davieddy
 
davieddy's Avatar
 
"Lucan"
Dec 2006
England

2·3·13·83 Posts
Default

Quote:
Originally Posted by davieddy View Post
For all we know this may be the highspot of Bolivia.

Sundance

PS Keep thinking Butch - that's what you're good at.
PPS Who are those guys?
davieddy is offline   Reply With Quote
Old 2010-05-18, 13:58   #5
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

22·5·397 Posts
Default

Quote:
Originally Posted by __HRB__ View Post
According to...prime95 uses 17.76 bits for a 32M FFT, but it doesn't mention whether 80-bit temporaries (how much impact does this have?) were used or not.

If an 18M FFT is sufficient, then an implementation should be at least 10% faster than a 20M FFT, because length-3 FFTs are less complicated than length-5 FFTs.
80-bit temporaries were only used in the x87 FFT code which only supported FFT lengths up to 4M. 80-bit temporaries only had a modest impact (~1%) since every write to memory rounded to a 64-bit double.

Yes, an 18M FFT is sufficient for the smallest 100M numbers.

Quote:
but now, a new decade started! There's a chance, man
A pretty good chance.

V25.12:
Code:
Time FFTlen=20480K, Arch=3, Pass1=2560, Pass2=8192, clm=2: 522.924 ms., 523.432 ms.
V26.1:
Code:
Time FFTlen=18432K, Arch=3, Pass1=4096, Pass2=4608, clm=4: 396.830 ms., 397.333 ms.
Time FFTlen=18432K, Arch=3, Pass1=4096, Pass2=4608, clm=2: 400.698 ms., 401.327 ms.

Time FFTlen=20480K, Arch=3, Pass1=2048, Pass2=10240, clm=4: 431.619 ms., 441.531 ms.
Time FFTlen=20480K, Arch=3, Pass1=2048, Pass2=10240, clm=2: 438.762 ms., 439.651 ms.
Time FFTlen=20480K, Arch=3, Pass1=2560, Pass2=8192, clm=4: 420.561 ms., 421.185 ms.
Time FFTlen=20480K, Arch=3, Pass1=2560, Pass2=8192, clm=2: 427.008 ms., 427.514 ms.
Prime95 is offline   Reply With Quote
Old 2010-05-18, 16:50   #6
henryzz
Just call me Henry
 
henryzz's Avatar
 
"David"
Sep 2007
Liverpool (GMT/BST)

3·1,997 Posts
Default

Hows v26 coming along? It looks like there is quite a speed increase there which would be nice to use. Does that speed increase continue down to sizes used by things like pfgw and llr?
henryzz is online now   Reply With Quote
Old 2010-05-18, 17:20   #7
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

794010 Posts
Default

Quote:
Originally Posted by henryzz View Post
Hows v26 coming along? It looks like there is quite a speed increase there which would be nice to use. Does that speed increase continue down to sizes used by things like pfgw and llr?
It's proceeding nicely. All two-pass FFTs have been recoded. This means a lot of QA will be necessary. The speed increase is less pronounced at smaller FFT sizes. FFT sizes above 64K will see a benefit -- including pfgw and llr.
Prime95 is offline   Reply With Quote
Old 2010-05-21, 20:56   #8
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

100010110002 Posts
Default

Hi,

nice speedup from an allready incredible fast code.

I'm just curious which is your strategy for performance optimizations.
- just run one test with one thread and aim for best performance?
- run N tests with one thread each and aim for best performance?
- run one test with N threads and aim for best performance?
(N is the number of cores of your system(s))

I remember that you said something like v24/v25 are optimized for "low memory bandwidth" like P4 (and Core 2 CPUs) and v26 will benefit from higher memory bandwidth of e.g. Core i7. So will we see those speedup on real world usage (utilizing all cores), too or does memory bandwidth bottleneck this improvement somehow?


Oliver

Last fiddled with by TheJudger on 2010-05-21 at 20:57
TheJudger is offline   Reply With Quote
Old 2010-05-21, 23:29   #9
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

22×5×397 Posts
Default

Quote:
Originally Posted by TheJudger View Post
I'm just curious which is your strategy for performance optimizations.
- just run one test with one thread and aim for best performance?
Yes, this one.

Quote:
- run N tests with one thread each and aim for best performance?
This would be better, but is much harder to simulate and get consistent results. In the past, improvements to the one worker case has always improved the N workers case.

Perhaps I can release a pre-pre-beta candidate and let you folks run the benchmark with the other N-1 cores running LL tests. It might lead to several different preferred FFT implementations (e.g. the 2560K FFT has 6 different implementations with different pass 2 sizes and therefore subtly different cache and bandwidth requirements).

Quote:
- run one test with N threads and aim for best performance?
I really don't care much about this case.

Quote:
I remember that you said something like v24/v25 are optimized for "low memory bandwidth" like P4 (and Core 2 CPUs) and v26 will benefit from higher memory bandwidth of e.g. Core i7. So will we see those speedup on real world usage (utilizing all cores), too or does memory bandwidth bottleneck this improvement somehow?
I'm running v26 on a dual-core P4, a Core 2, and an i7. All are faster than v25.

The basic MASM building blocks have been optimized for 32-bit and 64-bit Core 2 architecture. I really need to take a month or two and optimize them for the P4, K8, and K10 architectures.
Prime95 is offline   Reply With Quote
Old 2010-05-22, 19:42   #10
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

23·139 Posts
Default

Quote:
Originally Posted by Prime95 View Post
Perhaps I can release a pre-pre-beta candidate and let you folks run the benchmark with the other N-1 cores running LL tests. It might lead to several different preferred FFT implementations (e.g. the 2560K FFT has 6 different implementations with different pass 2 sizes and therefore subtly different cache and bandwidth requirements).
When you think that this helps you in your development I think there are some people around here with access to some big and fancy toys...

Quote:
Originally Posted by Prime95 View Post
The basic MASM building blocks have been optimized for 32-bit and 64-bit Core 2 architecture. I really need to take a month or two and optimize them for the P4, K8, and K10 architectures.
Take the time you need! Prime95 v25 isn't that bad at all.
I'm wondering how the distribution of CPU type is within the primenet. Is it worth to spent much time on P4 optimization?

Do you plan to fix/increase the limit of 32 cores in v26?

Oliver
TheJudger is offline   Reply With Quote
Old 2010-05-22, 21:24   #11
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

174048 Posts
Default

Quote:
Originally Posted by TheJudger View Post
Do you plan to fix/increase the limit of 32 cores in v26?
It is on my to-do list. There are two areas affected: max number of worker windows and max number of threads for one LL test.
Prime95 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
necessary but not sufficient condition for primes Alberico Lepore Alberico Lepore 22 2018-01-03 16:17
Devaraj numbers- necessary and sufficient condition devarajkandadai Number Theory Discussion Group 7 2017-09-23 02:58
CPU time for 100M digit prime test biggerben Software 7 2014-10-24 05:47
How far along are you in your 100M digit LL test? JuanTutors Lounge 6 2012-02-21 07:36
Who is LL-ing a mersenne number > 100M digits? joblack LMH > 100M 1 2009-10-08 12:31

All times are UTC. The time now is 15:56.


Mon Aug 8 15:56:02 UTC 2022 up 32 days, 10:43, 1 user, load averages: 1.14, 1.23, 1.25

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2022, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔