mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Closed Thread
 
Thread Tools
Old 2008-05-21, 17:23   #23
retina
Undefined
 
retina's Avatar
 
"The unspeakable one"
Jun 2006
My evil lair

24×389 Posts
Default

Quote:
Originally Posted by ewmayer View Post
If those are the numbers the above code spat out for you, then, yes - though note that some of the asymptotic scalings used in the heuristic ROE model will begin to break down at very small FFT lengths. Should still be ballpark-style-good, though.

Going in the other direction it also becomes clear why SP needs to be at least 10x faster than DP on the hardware in question in order to be of interest for the purpose of current and future GIMPS work - at 1024K SP only allows ~4.5 bits per input, nearly 5x less than DP, and the ratio gets worse as the FFT lengths get larger. To test an exponent ~50M, SP needs a 14336K FFT, DP needs just 2560K - 5.6x smaller. And since the FFT work scales roughly as N*log(N), a 14336K FFT in fact needs roughly 7x as many operations as a 2560K DP FFT.

I'm not saying SP is completely out of the running, but it is likely mainly of interest for smaller FFT lengths, and only if the hardware in question can do it on the order of 10x faster than DP.
Nice figures and comparison, thanks.

So if I play devils advocate here for a while. Lets say I write a small proggy for a GPU that uses the native SP units (there a a lot of those little SP critters in the GPUs, yeah?) then I can run ~14M FFT (assuming enough memory to hold the data). Now let's say a Core2 with two cores can do 4 DP ops/tick @ 4GHz, i.e. ~ 16 GDP ops/s. So all I need is a GPU with a SP-unit-count x clock-frequency to be > 16GDP x 7(conversion for SP) (=112G). That would mean a GPU at 500MHz would need ~224 SP units to give the same throughput as a C2Duo 4GHz using DP. Okay, all very-very-ballpark figures, but I think you will get my drift here. It might be competitive with the CPU.

And that is very theoretical and maybe not doable but, I think there is a fundamental point that has been missed with this whole thread. We don't need the GPU to be as fast or faster than the standard x86 CPU. All that is needed is to get some code running on the GPU to do useful work. It doesn't have to efficiently optimised and whatnot, because it can still contribute to the overall throughput in some small way. A PC sitting there with an unused GPU seems wasted. We can surely use both CPU and GPU (clearly on different jobs) to improve the throughput of the machine as a whole. Just take some stock C-code and compile for a GPU and start contributing. Right? What did I get wrong? Did I miss something fundamentally obvious that makes this post all just silly?
retina is online now  
Old 2008-05-21, 17:58   #24
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

2·32·647 Posts
Default

Quote:
Originally Posted by retina View Post
We don't need the GPU to be as fast or faster than the standard x86 CPU. All that is needed is to get some code running on the GPU to do useful work.
Anyone is welcome to take the best generic C-based LL test programs [Mlucas or Glucas] and try a port to the GPU of their choice. [I believe in most cases that needs the purchase on an add-on hardware board and an SDK for it]. I am happy to try to assist with such code porting on a limited-time basis, but right now all my spare coding time is taken up with code related to commodity CPUs, where the payoff is guaranteed. If someone with access to the above GPU/SDK setup does some of the initial legwork and the results look promising, I will be all ears.
ewmayer is offline  
Old 2008-05-21, 21:40   #25
RMAC9.5
 
RMAC9.5's Avatar
 
Jun 2003

32·17 Posts
Default

If Retina's very-very-ballpark figures of "224 SP units * 500 MHz = 1 Core2Duo 4 GHz PC" are in the ballpark, 320 SP units * 775 MHz = 2 Core2Duo 4 GHz PCs. The PrimeNet Server home page PC totals indicate that more than 5500 PIII thru K6 PCs are still running LL or factoring code. Most of these PCs are new enough to support AGP 4X/8X graphics cards. Many/most of these PCs are probably owned/managed by Prime95 enthusiasts like ourselves. What if 80% of us invested $150 bucks per PC in a new graphics cards and added the equivalent of 8000+ Core2Duo 4 GHz PCs to our Prime95 throughput? Wouldn't that be neat!
If implementing LL using SP units is not possible because 14M FFTs are too big to fit in the graphics memory on the card and memory transfers between the card and main memory are too slow, could the SP units still be used for factoring?

Last fiddled with by RMAC9.5 on 2008-05-21 at 22:05
RMAC9.5 is offline  
Old 2008-05-21, 22:30   #26
fivemack
(loop (#_fork))
 
fivemack's Avatar
 
Feb 2006
Cambridge, England

144278 Posts
Default

I am tempted to acquire an HD3870 board and the relevant software, but since it costs £130 I can't justify doing it until July; rather too much of my disposable income this month has gone to the natural-gas and the ADSL companies.

I just about understand what's needed for FFTs; I don't really understand what would be required for factoring. My suspicion is that a reasonable approach is to use very basic multi-precision arithmetic with 10 or 11 bits in each float (so that the SP unit does exact multiplication; 10 lets you do the large sums-of-products associated with Karatsuba algorithms without worrying too much about overflow), and at this point the implementation is fairly ugly and the performance pretty marginal; I have never successfully implemented multi-precision division, though I have a Knuth vol.2 sitting as my monitor stand at the moment.

I really don't like using the term 'invested' for buying hardware; hardware depreciates a lot faster than cars, and nobody invests in an SUV. I really doubt that all that many people would buy even in-absolute-terms-cheap hardware for prime-finding; all we can hope is that the in-absolute-terms-cheap hardware ends up to be in absolute terms reasonably efficient.
fivemack is offline  
Old 2008-05-22, 01:41   #27
cheesehead
 
cheesehead's Avatar
 
"Richard B. Woods"
Aug 2002
Wisconsin USA

769210 Posts
Default

Quote:
Originally Posted by fivemack View Post
I just about understand what's needed for FFTs; I don't really understand what would be required for factoring.
Both P-1 and ECM factoring also require FFTs; just about all the same considerations for L-L apply to them too.

Only trial factoring does not use FFTs. Its RAM working set is relatively small.

Last fiddled with by cheesehead on 2008-05-22 at 01:42
cheesehead is offline  
Old 2008-05-22, 02:28   #28
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

24×3×157 Posts
Default

BTW, trial factoring is mostly integer operations - not floating point.

Also note that if you double trial factoring speed, you only increase total GIMPS throughput by 1-2%. To see this, doubling TF speed would allow you to TF one more bit level in the same amount of time. Yet, TF to 2^70 instead of 2^69 will only eliminate an extra 1/70 of the candidates.
Prime95 is online now  
Old 2008-05-22, 10:50   #29
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

2×47×101 Posts
Default

In comparison, the 2304K FFT implementation may increase GIMPS throughput by 10% in the 39.5-44.3M exponent range LLs (and P-1s) which are being distributed and run now. (while the next range, 44.3-49.1M, would be the true domain for the 2560K FFT ...but we have to pass the swamp to get there ...and find the 45th M there).

Sorry for a non-sequitur.
Batalov is offline  
Old 2008-05-22, 10:56   #30
ldesnogu
 
ldesnogu's Avatar
 
Jan 2008
France

22616 Posts
Default

Quote:
Originally Posted by jasonp View Post
if modern graphics cards support DP at all then it is in emulated format, like Cell does (I'd love to be proven wrong on this).
1st generation Cell SPUs have hardware support for double precision floating-point but at a very high-price: 13 cycle latency of which 6 are stalling the issue stage (source Cell BE Handbook 1.0 10 May 2006, section 3.3.2).
ldesnogu is offline  
Old 2008-05-22, 16:47   #31
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

2·32·647 Posts
Default

Quote:
Originally Posted by Batalov View Post
In comparison, the 2304K FFT implementation may increase GIMPS throughput by 10% in the 39.5-44.3M exponent range LLs (and P-1s) which are being distributed and run now.
You mean, something like this?

Restarting M40009087 at iteration = 25170000. Res64: 5B4EFA758F01C842
M40009087: using FFT length 2304K = 2359296 8-byte floats.
this gives an average 16.958061642116970 bits per digit
Using complex FFT radices 36 32 32 32
[May 21 08:16:02] M40009087 Iter# = 25180000 clocks = 00:23:50.250 [ 0.1430 sec/iter] Res64: 5E2EB2A1DDD02A72. AvgMaxErr = 0.017174351. MaxErr = 0.023437500
[May 21 08:40:08] M40009087 Iter# = 25190000 clocks = 00:24:04.186 [ 0.1444 sec/iter] Res64: 3365F50CF184500A. AvgMaxErr = 0.017160620. MaxErr = 0.023681641
ewmayer is offline  
Old 2008-05-22, 19:21   #32
Batalov
 
Batalov's Avatar
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

2·47·101 Posts
Default

Quote:
Originally Posted by ewmayer View Post
You mean, something like this?

Restarting M40009087 at iteration = 25170000. Res64: 5B4EFA758F01C842
M40009087: using FFT length 2304K = 2359296 8-byte floats.
this gives an average 16.958061642116970 bits per digit
Using complex FFT radices 36 32 32 32
[May 21 08:16:02] M40009087 Iter# = 25180000 clocks = 00:23:50.250 [ 0.1430 sec/iter] Res64: 5E2EB2A1DDD02A72. AvgMaxErr = 0.017174351. MaxErr = 0.023437500
[May 21 08:40:08] M40009087 Iter# = 25190000 clocks = 00:24:04.186 [ 0.1444 sec/iter] Res64: 3365F50CF184500A. AvgMaxErr = 0.017160620. MaxErr = 0.023681641
If these two lines show the speed of 2304K compared to 2560K in Mlucas, then no, I didn't mean something like this. It is great that Mlucas has this FFT size, but if I read your example correctly (that is, that 0.1430 sec/iter line is before the shown restart with another FFT size) it is only marginally faster than 2560K. <...no, wait. Both timings are _after_ the restart with iteration = 25170000... what was the sec/iter with 2560K?>

I meant in the custom assembly gwnum and Prime95/mprime, where it will be run like (6*2^n) * (6*2^n). The (6*2^n) pass1 is already written, used, and its speed is not marginally faster than (8*2^n); its speed is fantastic. The pass2 routines are all power-of-2. Is there a reason that a (6*2^n) pass2 would be suddenly as slow as (8*2^n) ? I don't think so.

And I didn't mean that it has to be done; it's hard. Just mentioned it for the sake of contrasting the value of implementing TF in GPUs to that of the gwnum work (estimated 1% project speed increase, and ostensibly much more work than enhancing the gwnum assembly code that exists). That was my only intention.


P.S. Why the 10% estimate? I clock my 2560K as being 20% slower than 2048K calculations (on all of my comps). Well, the time/iteration is 25% larger, that means in throughput - 20% slower (1/1.25). Optimistically I expect 2304K to have the throughput in between...

Last fiddled with by Batalov on 2008-05-22 at 19:30 Reason: P.S.
Batalov is offline  
Old 2008-05-22, 19:41   #33
jrk
 
jrk's Avatar
 
May 2008

3×5×73 Posts
Default

Good point, Batalov.

Does anyone know how many un-cleared exponents there are below 44.3M? So we could estimate how much computing time would be saved by a 2304K FFT.

Last fiddled with by jrk on 2008-05-22 at 19:42
jrk is offline  
Closed Thread

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
New PC dedicated to Mersenne Prime Search Taiy Hardware 12 2018-01-02 15:54
The prime-crunching on dedicated hardware FAQ (II) jasonp Hardware 46 2016-07-18 16:41
How would you design a CPU/GPU for prime number crunching? emily Hardware 4 2012-02-20 18:46
DSP hardware for number crunching? ixfd64 Hardware 15 2011-08-09 01:11
Optimal Hardware for Dedicated Crunching Computer Angular Hardware 5 2004-01-16 12:37

All times are UTC. The time now is 21:24.


Sun Aug 1 21:24:41 UTC 2021 up 9 days, 15:53, 0 users, load averages: 1.37, 1.51, 1.54

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.