mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   so what GIMPS work can single precision do? (https://www.mersenneforum.org/showthread.php?t=9278)

davieddy 2007-10-13 08:23

Thank you. That was more or less what I was trying to say.
Does the incorporation of an FPU within the main processor have
advantages which militate against the development of a separate
dedicated chip?

davieddy 2007-10-13 10:20

PS I do appreciate that the inclusion of a DP
FPU in the world's computers is the "raison d'être" for GIMPS.
PPS And with 128-bit FP I suspect we would get more than double
the mantissa size compared with 64-bit DP.

jasonp 2007-10-13 19:20

[QUOTE=davieddy;116253]Thank you. That was more or less what I was trying to say.
Does the incorporation of an FPU within the main processor have
advantages which militate against the development of a separate
dedicated chip?[/QUOTE]
The biggest advantage is that if someone else designs an FPU that has 106-bit mantissa precision and gets it to run at 2.X GHz, then you don't need to :) Otherwise, programmable logic is large enough nowadays that if you really wanted that kind of precision in dedicated hardware you could design it yourself. But then you are limited to a few hundred MHz at most, and you have to design a board to hold the logic chip. The chip by itself costs as much as 2-10 complete PCs, so to make a better business case than just buying 2-10 PCs you have to get 2-10x the throughput of a single PC. In this you are competing against CPUs that large companies have invested billions of dollars and man-centuries of design work to build. Ernst's customers can get that kind of speedup for their own custom problems, but the LL test maps too well to general-purpose hardware for a custom solution to compete with double precision on an Intel/AMD processor.

retina 2007-10-14 20:43

[QUOTE=jasonp;116172]To answer the original question, if any of those graphics chips can perform 32-bit integer multiplies really fast, as in faster than floating point multiplies, then it's possible to use number theoretic integer FFTs to perform an LL test at high speed. I think this 'if' is also unlikely enough that we can assume it will never happen.[/QUOTE]I thought that an NTT requires a modular reduction during the computation. This would mean either an integer division (slow), multiplication by an inverse (floating point needed), or multiplication by the "magic number" inverse (double-width integer multiply) and then a right shift. None of the three options is very attractive.

jasonp 2007-10-15 01:10

[QUOTE=retina;116344]I thought that an NTT requires a modular reduction during the computation. This would mean either an integer division (slow), multiplication by an inverse (floating point needed), or multiplication by the "magic number" inverse (double-width integer multiply) and then a right shift. None of the three options is very attractive.[/QUOTE]
Correct. My preference is the last of those methods, because as long as the hardware supports integer multiplication with a double-size product then integer arithmetic can be used throughout. If you have a floating-point multiply-add and do not round between the multiply and the add, you can still implement the reduction. Finally, special moduli can simplify the reduction down to some shifts and adds. The advantage here is that even general-purpose hardware can do multiple integer additions in parallel, but usually not multiple floating-point additions, and the latter dominate any well-designed FFT code. So you're in a position where some parts of the NTT multiply are faster than a floating point equivalent and some parts are (possibly a lot) slower. My experiments on 64-bit x86 indicate that a well-designed NTT is about 15% slower than a well-designed FFT. Of course George's code is phenomenally well-designed.
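The two reduction tricks described above can be sketched in a few lines of Python. This is an illustrative model only (Python integers are arbitrary precision, so it merely mimics what a CPU would do with a 64x64 -> 128-bit integer multiply); the function names and the choice of moduli are my own, not anything from Prime95 or George's code.

```python
# Hypothetical sketch of the "magic number" reduction: precompute
# m = floor(2^64 / p) once, then reduce x mod p (for x < 2^64) with one
# double-width multiply and a shift instead of a hardware divide.

def barrett_reduce(x, p, m):
    """Reduce x mod p using the precomputed magic number m = 2**64 // p."""
    q = (x * m) >> 64          # q approximates x // p (slightly low)
    r = x - q * p
    while r >= p:              # at most a couple of correction subtractions
        r -= p
    return r

def mod_mersenne61(x):
    """Special modulus p = 2^61 - 1: the reduction collapses to shifts
    and adds, because 2^61 = 1 (mod p)."""
    p = (1 << 61) - 1
    r = (x & p) + (x >> 61)    # fold the high bits back down
    r = (r & p) + (r >> 61)    # one more fold absorbs the carry
    return r if r < p else r - p

p = 2**31 - 1
m = 2**64 // p
assert barrett_reduce(12345678901234, p, m) == 12345678901234 % p

p61 = 2**61 - 1
x = (p61 - 2) * (p61 - 3)      # a full double-width product
assert mod_mersenne61(x) == x % p61
print("reductions OK")
```

Both functions use only multiplies, shifts, and adds, which is exactly why the last option in the post above is attractive when the hardware offers a double-size integer product.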

davieddy 2007-10-15 07:47

[quote=jasonp;116355]
Of course George's code is phenomenally well-designed.[/quote]

Isn't this the real reason that pentiums seem so well suited
to the LLT?

sdbardwick 2007-10-15 08:05

[quote=davieddy;116371]Isn't this the real reason that pentiums seem so well suited
to the LLT?[/quote]
That, and Intel's implementation of SIMD (SSE2) is better than AMD's at the moment. I don't know if that will change with AMD's Barcelona core and successors, but it could; AMD chips have been quicker than Intel in the past (circa P2, P3 and P4 prior to SSE2 optimization by George).

davieddy 2007-10-15 09:39

[quote=sdbardwick;116373]That, and Intel's implementation of SIMD (SSE2) is better than AMD's at the moment. I don't know if that will change with AMD's Barcelona core and successors, but it could; AMD chips have been quicker than Intel in the past (circa P2, P3 and P4 prior to SSE2 optimization by George).[/quote]

I used "pentiums" in the generic sense.
For the uninitiated, what (in a nutshell preferably) does
SSE2 do?

fivemack 2007-10-15 14:18

SSE2 is the instruction-set extension that lets you request two double-precision FP operations at a time.

On Pentium4 and Opteron, the two DP operations are performed consecutively in the pipeline, so each SSE2 instruction takes up two pipeline slots. On Core2 and Barcelona, they are performed simultaneously.

davieddy 2007-10-15 18:45

My understanding of the FFT is that it replaces multiplication
operations with additions/subtractions. Specifically "butterfly"
operations where from inputs a and b we seek (a+b) and (a-b).
Can this not be accomplished in hardware (preferably in fixed point)?

jasonp 2007-10-16 03:32

[QUOTE=davieddy;116409]My understanding of the FFT is that it replaces multiplication
operations with additions/subtractions. Specifically "butterfly"
operations where from inputs a and b we seek (a+b) and (a-b).
Can this not be accomplished in hardware (preferably in fixed point)?[/QUOTE]
The main part of an FFT replaces (a,b) with ( (a+b), (a-b)*w ) or ( (a + b*w), (a - b*w) ) for complex floating point numbers a, b, w. Sometimes w has a simple form that reduces the amount of arithmetic needed for one butterfly, and judicious grouping of butterflies leads to clusters of unrelated work that a superscalar processor executes nicely. You cannot replace the floating point values with fixed point, because the required dynamic range is too large. Of course this can all be implemented in hardware; the real question is whether your custom solution can do 5 billion of them per second, as Prime95 on a Core2 can.
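An illustrative sketch (emphatically not Prime95's code) of a radix-2 decimation-in-frequency FFT built entirely from the first butterfly form above, (a, b) -> (a + b, (a - b) * w), checked against a direct DFT:

```python
import cmath

def dif_fft(x):
    """Radix-2 decimation-in-frequency FFT; len(x) must be a power of 2."""
    x = list(x)
    n = len(x)
    length = n
    while length > 1:
        half = length // 2
        for start in range(0, n, length):
            for j in range(half):
                w = cmath.exp(-2j * cmath.pi * j / length)   # twiddle factor
                a, b = x[start + j], x[start + j + half]
                x[start + j]        = a + b          # (a + b)
                x[start + j + half] = (a - b) * w    # (a - b) * w
        length = half
    # DIF leaves the outputs in bit-reversed order; undo it
    bits = n.bit_length() - 1
    return [x[int(format(i, f'0{bits}b')[::-1], 2)] for i in range(n)]

data = [1, 2, 3, 4, 5, 6, 7, 8]
got = dif_fft(data)
# direct O(n^2) DFT for comparison
ref = [sum(data[j] * cmath.exp(-2j * cmath.pi * j * k / 8)
           for j in range(8)) for k in range(8)]
assert all(abs(g - r) < 1e-9 for g, r in zip(got, ref))
print("FFT matches direct DFT")
```

Note how each inner iteration is one butterfly with one complex multiply; the "clusters of unrelated work" in the post are the independent butterflies within a stage, which a superscalar CPU can overlap.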

Technically the FFT replaces a dense matrix multiplication (the discrete Fourier transform) with a sparse matrix multiplication that happens to have a nice recursive structure. A good introduction to the FFT is in Numerical Recipes (the whole book is online at [url]www.nr.com[/url]); see also the links page at [url]www.fftw.org[/url].

Number-theoretic FFTs use integers modulo a prime p for all these computations, with all results reduced modulo p. In this case a, b, and w are integers, or possibly (depending on p) complex integers with the real and imaginary parts separately reduced mod p.
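A toy sketch of the number-theoretic case: the same transform idea with every operation done modulo a prime. The prime p = 17 and root w = 9 (a primitive 8th root of unity mod 17, since 9^4 = -1 and 9^8 = 1 mod 17) are chosen purely for illustration. A naive O(n^2) transform keeps the modular arithmetic visible; a real NTT would use the same butterfly recursion as a floating-point FFT.

```python
P, N, W = 17, 8, 9   # prime modulus, transform length, primitive Nth root

def ntt(x, w):
    """Length-N transform mod P with root w (pass w^-1 for the inverse)."""
    return [sum(x[j] * pow(w, j * k, P) for j in range(N)) % P
            for k in range(N)]

# cyclic convolution via the convolution theorem, entirely in integers:
# (1 + x) * (1 + 2x + 3x^2) = 1 + 3x + 5x^2 + 3x^3
a = [1, 1, 0, 0, 0, 0, 0, 0]
b = [1, 2, 3, 0, 0, 0, 0, 0]
A, B = ntt(a, W), ntt(b, W)
C = [(u * v) % P for u, v in zip(A, B)]     # pointwise product mod P
w_inv = pow(W, P - 2, P)                    # 9^-1 = 2  (mod 17)
n_inv = pow(N, P - 2, P)                    # 8^-1 = 15 (mod 17)
c = [(v * n_inv) % P for v in ntt(C, w_inv)]
assert c == [1, 3, 5, 3, 0, 0, 0, 0]
print("NTT convolution OK")
```

No floating point appears anywhere, which is the whole appeal: results are exact as long as the true convolution coefficients stay below the modulus.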

There are 'Fast Walsh Transforms' where w is always 1, but they cannot be used for arbitrary-size convolutions.
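For completeness, a sketch of that w = 1 case: the fast Walsh-Hadamard transform is the same butterfly structure with every twiddle equal to 1, so it needs no multiplies at all (which is also why it cannot produce the cyclic convolutions needed for multiplication).

```python
def fwht(x):
    """Fast Walsh-Hadamard transform: every butterfly twiddle w is 1."""
    x = list(x)
    h = 1
    while h < len(x):
        for start in range(0, len(x), h * 2):
            for j in range(start, start + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b    # butterfly with w = 1
        h *= 2
    return x

# self-inverse up to scaling: applying it twice multiplies by n
v = [3, 1, 4, 1, 5, 9, 2, 6]
assert fwht(fwht(v)) == [8 * t for t in v]
print("FWHT OK")
```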

