mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
2007-10-13, 08:23   #12
davieddy ("Lucan", Dec 2006, England)
Thank you. That was more or less what I was trying to say.
Does the incorporation of an FPU within the main processor have
advantages which militate against the development of a separate
dedicated chip?

2007-10-13, 10:20   #13
davieddy
PS: I do appreciate that the inclusion of a DP
FPU within the world's computers is the "raison d'être" for GIMPS.
PPS: And in 128-bit FP I suspect we would get more than double
the mantissa size compared with 64-bit DP.

2007-10-13, 19:20   #14
jasonp ("Tribal Bullet", Oct 2004)
Quote:
Originally Posted by davieddy
Thank you. That was more or less what I was trying to say.
Does the incorporation of an FPU within the main processor have
advantages which militate against the development of a separate
dedicated chip?
The biggest advantage is that if someone else designs an FPU that has 106-bit mantissa precision and gets it to run at 2.X GHz, then you don't need to :) Otherwise, programmable logic is large enough nowadays that if you really wanted that kind of precision in dedicated hardware you could design it yourself. But then you are limited to a few hundred MHz at most, and you have to design a board to hold the logic chip.

The chip costs as much as 2-10 complete PCs just by itself, so to make a better business case than just buying 2-10 PCs you have to get 2-10x the throughput of a single PC. In this you are competing against CPUs that large companies have invested billions of dollars and man-centuries of design work to build. Ernst's customers can get that kind of speedup for their own custom problems, but the LL test maps too well to general-purpose hardware for a custom solution to compete with double precision on an Intel/AMD processor.
2007-10-14, 20:43   #15
retina ("The unspeakable one", Jun 2006, My evil lair)
Quote:
Originally Posted by jasonp
To answer the original question, if any of those graphics chips can perform 32-bit integer multiplies really fast, as in faster than floating point multiplies, then it's possible to use number theoretic integer FFTs to perform an LL test at high speed. I think this 'if' is also unlikely enough that we can assume it will never happen.
I thought that an NTT requires a modular reduction during the computation. This would mean either an integer division (slow), multiplication by a floating-point inverse, or multiplication by the "magic number" inverse (a double-width integer product) followed by a right shift. None of the three options is very attractive.
2007-10-15, 01:10   #16
jasonp
Quote:
Originally Posted by retina
I thought that an NTT requires a modular reduction during the computation. This would mean either an integer division (slow), multiplication by a floating-point inverse, or multiplication by the "magic number" inverse (a double-width integer product) followed by a right shift. None of the three options is very attractive.
Correct. My preference is the last of those methods, because as long as the hardware supports integer multiplication with a double-size product then integer arithmetic can be used throughout. If you have a floating-point multiply-add and do not round between the multiply and the add, you can still implement the reduction. Finally, special moduli can simplify the reduction down to some shifts and adds.

The advantage here is that even general-purpose hardware can do multiple integer additions in parallel, but usually not multiple floating-point additions, and the latter dominate any well-designed FFT code. So you're in a position where some parts of the NTT multiply are faster than a floating point equivalent and some parts are (possibly a lot) slower. My experiments on 64-bit x86 indicate that a well-designed NTT is about 15% slower than a well-designed FFT. Of course George's code is phenomenally well-designed.
2007-10-15, 07:47   #17
davieddy
Quote:
Originally Posted by jasonp
Of course George's code is phenomenally well-designed.
Isn't this the real reason that pentiums seem so well suited
to the LLT?

2007-10-15, 08:05   #18
sdbardwick (Aug 2002, North San Diego County)
Quote:
Originally Posted by davieddy
Isn't this the real reason that pentiums seem so well suited
to the LLT?
That, and Intel's implementation of SIMD (SSE2) is better than AMD's at the moment. I don't know if that will change with AMD's Barcelona core and successors, but it could; AMD chips have been quicker than Intel in the past (circa P2, P3 and P4 prior to SSE2 optimization by George).
2007-10-15, 09:39   #19
davieddy
Quote:
Originally Posted by sdbardwick
That, and Intel's implementation of SIMD (SSE2) is better than AMD's at the moment. I don't know if that will change with AMD's Barcelona core and successors, but it could; AMD chips have been quicker than Intel in the past (circa P2, P3 and P4 prior to SSE2 optimization by George).
I used "pentiums" in the generic sense.
For the uninitiated, what (in a nutshell preferably) does
SSE2 do?

2007-10-15, 14:18   #20
fivemack ((loop (#_fork)), Feb 2006, Cambridge, England)
SSE2 is the instruction-set extension that lets you request two double-precision FP operations at a time.

On Pentium4 and Opteron, the two DP operations are performed consecutively in the pipeline, so each SSE2 instruction takes up two pipeline slots. On Core2 and Barcelona, they are performed simultaneously.
2007-10-15, 18:45   #21
davieddy
My understanding of FFT is that it replaces multiplication
operations with additions/subtractions. Specifically "butterfly"
operations where from inputs a and b we seek (a+b) and (a-b).
Can this not be accomplished in hardware (preferably in fixed point)?
2007-10-16, 03:32   #22
jasonp
Quote:
Originally Posted by davieddy
My understanding of FFT is that it replaces multiplication
operations with additions/subtractions. Specifically "butterfly"
operations where from inputs a and b we seek (a+b) and (a-b).
Can this not be accomplished in hardware (preferably in fixed point)?
The main part of an FFT replaces (a,b) with ( (a+b), (a-b)*w ) or ( (a + b*w), (a - b*w) ) for complex floating point numbers a,b,w. Sometimes w has a simple form that reduces the amount of arithmetic needed for one butterfly, and judicious grouping together of butterflies leads to clusters of unrelated work that a superscalar processor executes nicely. You cannot replace the floating point values with fixed point, because the required dynamic range is too large. Of course this can all be implemented in hardware; the real question is whether your custom solution can do 5 billion of them per second, as Prime95 on a Core2 can.

Technically the FFT replaces a dense matrix multiplication (the discrete Fourier transform) with a sparse matrix multiplication that happens to have a nice recursive structure. A good introduction to the FFT is in Numerical Recipes (the whole book is online at www.nr.com); see also the links page at www.fftw.org.

Number-theoretic FFTs use integers modulo a prime p for all these computations, with all results reduced modulo p. a,b, and w in this case are integers, or possibly (depending on p) complex integers with the real and imaginary parts separately reduced mod p.

There are 'Fast Walsh Transforms' where w is always 1, but they cannot be used for arbitrary size convolutions.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.