#23
Undefined
"The unspeakable one"
Jun 2006
My evil lair
622410 Posts |
Quote:
So if I play devil's advocate here for a while: let's say I write a small proggy for a GPU that uses the native SP units (there are a lot of those little SP critters in GPUs, yeah?), then I can run a ~14M FFT (assuming enough memory to hold the data). Now let's say a Core2 with two cores can do 4 DP ops/tick @ 4GHz, i.e. ~16 GDP ops/s. So all I need is a GPU whose SP-unit-count x clock-frequency is > 16G x 7 (conversion factor for SP) = 112G. That would mean a GPU at 500MHz would need ~224 SP units to give the same throughput as a 4GHz C2Duo using DP. Okay, all very-very-ballpark figures, but I think you get my drift: it might be competitive with the CPU.
And that is very theoretical and maybe not doable, but I think there is a fundamental point that has been missed in this whole thread. We don't need the GPU to be as fast as or faster than the standard x86 CPU. All that is needed is to get some code running on the GPU that does useful work. It doesn't have to be efficiently optimised and whatnot, because it can still contribute to the overall throughput in some small way. A PC sitting there with an unused GPU seems wasted. We can surely use both CPU and GPU (clearly on different jobs) to improve the throughput of the machine as a whole. Just take some stock C code, compile it for a GPU, and start contributing. Right?
What did I get wrong? Did I miss something fundamentally obvious that makes this post all just silly?
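For what it's worth, the ballpark arithmetic above can be written out explicitly. This is a sketch using only the figures quoted in the post (4 DP ops/tick, 4GHz, a 7x SP-emulation penalty, a 500MHz GPU clock), none of them measured:

```python
# Ballpark figures from the post above -- assumptions, not measurements.
cpu_dp_ops_per_tick = 4            # Core2, two cores, DP ops per clock
cpu_clock_ghz = 4.0
cpu_gdp_per_s = cpu_dp_ops_per_tick * cpu_clock_ghz   # ~16 GDP ops/s

sp_to_dp_penalty = 7               # assumed cost of emulating DP with SP

# SP throughput the GPU needs in order to match the CPU:
needed_gsp_per_s = cpu_gdp_per_s * sp_to_dp_penalty   # ~112 G SP ops/s

gpu_clock_ghz = 0.5
sp_units_needed = needed_gsp_per_s / gpu_clock_ghz
print(sp_units_needed)  # -> 224.0
```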
#24
∂²ω=0
Sep 2002
República de California
2·3²·647 Posts |
Anyone is welcome to take the best generic C-based LL test programs [Mlucas or Glucas] and try a port to the GPU of their choice. [I believe in most cases that requires the purchase of an add-on hardware board and an SDK for it.] I am happy to assist with such code porting on a limited-time basis, but right now all my spare coding time is taken up with code for commodity CPUs, where the payoff is guaranteed. If someone with access to the above GPU/SDK setup does some of the initial legwork and the results look promising, I will be all ears.
#25
Jun 2003
3²·17 Posts |
If Retina's very-very-ballpark figure of "224 SP units × 500 MHz = 1 Core2Duo 4 GHz PC" is in the ballpark, then 320 SP units × 775 MHz = 2 Core2Duo 4 GHz PCs. The PrimeNet server home page PC totals indicate that more than 5500 PIII-through-K6-class PCs are still running LL or factoring code. Most of these PCs are new enough to support AGP 4X/8X graphics cards, and many/most of them are probably owned/managed by Prime95 enthusiasts like ourselves. What if 80% of us invested $150 per PC in new graphics cards and added the equivalent of 8000+ Core2Duo 4 GHz PCs to our Prime95 throughput? Wouldn't that be neat!
If implementing LL using SP units is not possible because 14M FFTs are too big to fit in the graphics memory on the card and memory transfers between the card and main memory are too slow, could the SP units still be used for factoring?
Last fiddled with by RMAC9.5 on 2008-05-21 at 22:05
#26
(loop (#_fork))
Feb 2006
Cambridge, England
144278 Posts |
I am tempted to acquire an HD3870 board and the relevant software, but since it costs £130 I can't justify doing it until July; rather too much of my disposable income this month has gone to the natural-gas and the ADSL companies.
I just about understand what's needed for FFTs; I don't really understand what would be required for factoring. My suspicion is that a reasonable approach is to use very basic multi-precision arithmetic with 10 or 11 bits in each float (so that the SP unit does exact multiplication; 10 bits lets you do the large sums-of-products associated with Karatsuba-style algorithms without worrying too much about overflow). At that point the implementation is fairly ugly and the performance pretty marginal; I have never successfully implemented multi-precision division, though I have a Knuth vol. 2 sitting as my monitor stand at the moment.
I really don't like using the term 'invested' for buying hardware; hardware depreciates a lot faster than cars, and nobody invests in an SUV. I really doubt that all that many people would buy even in-absolute-terms-cheap hardware for prime-finding; all we can hope is that the in-absolute-terms-cheap hardware ends up being, in absolute terms, reasonably efficient.
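As a hedged illustration of the small-limb idea: an SP float has a 24-bit significand, so products of two 11-bit limbs (≤ 22 bits) are exactly representable. The sketch below simulates that limb scheme in plain Python integers (the limb size follows the post; the helper names and schoolbook multiply are my own illustration, not code from any actual GPU implementation):

```python
# Sketch of multi-precision multiply with small limbs, as suggested
# above. An 11-bit limb keeps each partial product (<= 22 bits) exactly
# representable in a 24-bit SP significand; 10 bits would leave extra
# headroom for the long sums-of-products of Karatsuba-style algorithms.
B = 11                              # bits per limb (from the post)
MASK = (1 << B) - 1

def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian limbs."""
    return [(n >> (B * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    """Reassemble an integer from little-endian limbs."""
    return sum(l << (B * i) for i, l in enumerate(limbs))

def mp_mul(a, b):
    """Schoolbook multiply; each a[i]*b[j] fits in 22 bits, so a real
    SP implementation could compute it exactly in floats."""
    out = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    carry = 0                       # propagate carries back to B bits
    for k in range(len(out)):
        t = out[k] + carry
        out[k] = t & MASK
        carry = t >> B
    return out

x, y = 123456789, 987654321
assert from_limbs(mp_mul(to_limbs(x, 3), to_limbs(y, 3))) == x * y
```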
#27
"Richard B. Woods"
Aug 2002
Wisconsin USA
2²·3·641 Posts |
Quote:
Only trial factoring does not use FFTs. Its RAM working set is relatively small. Last fiddled with by cheesehead on 2008-05-22 at 01:42 |
#28
P90 years forever!
Aug 2002
Yeehaw, FL
5×11×137 Posts |
BTW, trial factoring is mostly integer operations, not floating point.
Also note that if you double trial-factoring speed, you only increase total GIMPS throughput by 1-2%. To see why: doubling TF speed would let you TF one more bit level in the same amount of time, yet TFing to 2^70 instead of 2^69 only eliminates an extra ~1/70 of the candidates.
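The 1-2% claim can be sanity-checked with the usual heuristic that a Mersenne candidate has a factor between 2^(b-1) and 2^b with probability about 1/b; that heuristic is a rough standard estimate, not something stated in this thread:

```python
# Sanity check of the 1-2% figure above. Heuristically, one extra TF
# bit level (69 -> 70 here) removes ~1/70 of the surviving candidates,
# and each removed candidate saves roughly one LL test's worth of work.
bit_level = 70
extra_eliminated = 1.0 / bit_level
print(f"{extra_eliminated:.1%}")   # -> 1.4%, i.e. the quoted 1-2%
```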
#29
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
2×47×101 Posts |
In comparison, the 2304K FFT implementation may increase GIMPS throughput by 10% for the LLs (and P-1s) in the 39.5M-44.3M exponent range, which are being distributed and run now (while the next range, 44.3M-49.1M, would be the true domain of the 2560K FFT... but we have to cross that swamp to get there... and find the 45th Mersenne prime along the way).
Sorry for the non sequitur.
#30
Jan 2008
France
2·52·11 Posts |
1st-generation Cell SPUs have hardware support for double-precision floating point, but at a very high price: 13-cycle latency, of which 6 cycles stall the issue stage (source: Cell BE Handbook 1.0, 10 May 2006, section 3.3.2).
#31
∂²ω=0
Sep 2002
República de California
265768 Posts |
Quote:
Restarting M40009087 at iteration = 25170000. Res64: 5B4EFA758F01C842
M40009087: using FFT length 2304K = 2359296 8-byte floats.
this gives an average 16.958061642116970 bits per digit
Using complex FFT radices 36 32 32 32
[May 21 08:16:02] M40009087 Iter# = 25180000 clocks = 00:23:50.250 [ 0.1430 sec/iter] Res64: 5E2EB2A1DDD02A72. AvgMaxErr = 0.017174351. MaxErr = 0.023437500
[May 21 08:40:08] M40009087 Iter# = 25190000 clocks = 00:24:04.186 [ 0.1444 sec/iter] Res64: 3365F50CF184500A. AvgMaxErr = 0.017160620. MaxErr = 0.023681641
#32
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2
224268 Posts |
Quote:
I meant in the custom-assembly gwnum and Prime95/mprime, where it will be run like (6*2^n) * (6*2^n). The (6*2^n) pass 1 is already written and in use, and its speed is not just marginally faster than (8*2^n); its speed is fantastic. The pass 2 routines are all power-of-2. Is there a reason that a (6*2^n) pass 2 would suddenly be as slow as (8*2^n)? I don't think so.
And I didn't mean that it has to be done; it's hard. I just mentioned it for the sake of contrasting the value of implementing TF on GPUs with that of the gwnum work (an estimated 1% project speed increase, and ostensibly much more work than enhancing the gwnum assembly code that already exists). That was my only intention.
P.S. Why the 10% estimate? I clock my 2560K runs as 20% slower than 2048K calculations (on all of my comps): the time/iteration is 25% larger, which means 20% lower throughput (1/1.25). Optimistically I expect 2304K to have throughput in between...
Last fiddled with by Batalov on 2008-05-22 at 19:30 Reason: P.S.
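The P.S. arithmetic, written out; the 2304K figure is an optimistic midpoint interpolation, as the post itself hedges, not a measurement:

```python
# If 2560K takes 25% more time per iteration than 2048K, its
# throughput is 1/1.25 = 0.80 of the 2048K rate, i.e. 20% lower.
time_ratio = 1.25
throughput_2560k = 1.0 / time_ratio        # relative to 2048K = 1.00

# Optimistic midpoint assumption for 2304K, halfway between the
# 2048K (1.00) and 2560K (0.80) throughputs -- an assumption only.
est_2304k = (1.0 + throughput_2560k) / 2
print(f"{est_2304k:.0%}")                  # -> 90% of the 2048K rate
```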
#33
May 2008
3·5·73 Posts |
Good point, Batalov.
Does anyone know how many uncleared exponents there are below 44.3M? Then we could estimate how much computing time a 2304K FFT would save.
Last fiddled with by jrk on 2008-05-22 at 19:42
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| New PC dedicated to Mersenne Prime Search | Taiy | Hardware | 12 | 2018-01-02 15:54 |
| The prime-crunching on dedicated hardware FAQ (II) | jasonp | Hardware | 46 | 2016-07-18 16:41 |
| How would you design a CPU/GPU for prime number crunching? | emily | Hardware | 4 | 2012-02-20 18:46 |
| DSP hardware for number crunching? | ixfd64 | Hardware | 15 | 2011-08-09 01:11 |
| Optimal Hardware for Dedicated Crunching Computer | Angular | Hardware | 5 | 2004-01-16 12:37 |