so what GIMPS work can single precision do?
As we know, single precision sucks, because Lucas-Lehmer tests require double precision. While it's possible to emulate double precision on single-precision hardware, it is terribly inefficient: only about 10% of the hardware's raw performance survives the emulation.
However, I've heard that trial factoring can be done in single precision. Is this true? If so, it might be a good idea to add GPU support for Prime95 if the user chooses to do trial factoring. |
In fact, LL tests require multi-million-bit precision. We don't have processors capable of such operations natively, so we emulate them in software, using the operations they do provide. Single-precision hardware could be programmed in much the same way. There would be no need to emulate double precision instructions.
The performance would be abysmal. The same is true for trial factoring, except that multi-million-bit precision isn't necessary. Double precision is enough. I've no doubt that a GPU [i]could[/i] be programmed to do useful work for GIMPS. The question is: is it a worthwhile project for the programmers to expend their scarce time on? |
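Trial factoring does indeed need only modest precision: testing whether a candidate q divides M_p = 2^p - 1 is just a modular exponentiation, checking whether 2^p ≡ 1 (mod q). Here is a minimal sketch in Python (not GIMPS's actual code; real trial factoring sieves candidates far more aggressively), using the fact that any factor of M_p has the form 2kp + 1 and is ±1 mod 8:

```python
# Toy trial factoring of M_p = 2^p - 1: q divides M_p iff 2^p ≡ 1 (mod q).
# Any factor of M_p has the form q = 2*k*p + 1 with q ≡ 1 or 7 (mod 8).

def trial_factor(p, max_k=10000):
    """Smallest factor of M_p = 2^p - 1 below sqrt(M_p), or None."""
    m = (1 << p) - 1
    for k in range(1, max_k + 1):
        q = 2 * k * p + 1
        if q * q > m:
            return None               # only factors up to sqrt(M_p) matter
        if q % 8 in (1, 7) and pow(2, p, q) == 1:
            return q
    return None

print(trial_factor(11))   # 23 divides 2047 = 2^11 - 1
print(trial_factor(23))   # 47 divides 2^23 - 1
```

The three-argument `pow` does the whole job in word-sized modular arithmetic, which is why trial factoring fits narrow hardware so much better than an LL test does.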
[QUOTE=Mr. P-1;114262]In fact, LL tests require multi-million-bit precision. We don't have processors capable of such operations natively, so we emulate them in software, using the operations they do provide. Single-precision hardware could be programmed in much the same way. There would be no need to emulate double precision instructions.
The performance would be abysmal. The same is true for trial factoring, except that multi-million-bit precision isn't necessary. Double precision is enough. I've no doubt that a GPU [i]could[/i] be programmed to do useful work for GIMPS. The question is: is it a worthwhile project for the programmers to expend their scarce time on?[/QUOTE] Given the cost of a Lucas-Lehmer test, and the fact that CPUs currently doing trial factoring would be freed to run Lucas-Lehmer tests, I think this would be a worthwhile project. For example, say half of the CPUs in PrimeNet are running Lucas-Lehmer tests and half are trial factoring. Trial factoring eliminates many composite Mersenne numbers from consideration, which results in an immense reduction of the processing time needed to handle a given segment of the Mersenne number line versus the case where there is no trial factoring. If GPUs were recruited for trial factoring and the CPUs that were trial factoring were reassigned to Lucas-Lehmer tests, then the time needed to handle a given segment of the Mersenne number line would be halved, as the trial factoring would be done for "free." I make a few assumptions in that example: that the CPUs in PrimeNet are evenly divided between Lucas-Lehmer tests and trial factoring, and that either all of the CPUs are single-core with the same specifications or they are evenly divided along lines of processing power. I also assume that the GPUs recruited for PrimeNet will be able to trial factor candidates at a rate sufficient to produce more potential Mersenne primes that have passed trial factoring than the CPUs will be able to test; this is the most important assumption, as it is my basis for claiming that trial factoring would be done for "free." |
[quote=Mr. P-1;114262]In fact, LL tests require multi-million-bit precision. We don't have processors capable of such operations natively, so we emulate them in software, using the operations they do provide. Single-precision hardware could be programmed in much the same way. There would be no need to emulate double precision instructions.
The performance would be abysmal. [/quote] I would like to present my "understanding" of this topic, and welcome correction where appropriate. In a typical LL test, each iteration needs to square a 40M-bit number. In a 2048K FFT the number is first split into 2^21 "elements" each containing ~20 bits. At the midway stage of the FFT these elements are squared. These (by now) irrational quantities are represented to sufficient accuracy by the standard 53-bit "double precision". If we were to try this in single precision, most of the bits would be needed for the "precision", leaving few bits (if any) to be the "significant" bits in each element. If precision greater than "double" (= 53 bits) were available, how much help would this be? I can see that a Law of Diminishing Returns applies here. David PS I have read that the mod(2^exponent - 1) step comes "free". At what point does it come in? Does it mean that we don't need 40 significant bits at any stage? |
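For concreteness, the iteration being described can be run end to end at toy sizes. This sketch uses Python's native big integers in place of the FFT squaring under discussion; at 40M bits, that one squaring line is exactly the FFT's job:

```python
# Toy Lucas-Lehmer test: M_p is prime iff s_{p-2} ≡ 0 (mod M_p),
# where s_0 = 4 and s_{i+1} = s_i^2 - 2. Python big integers stand
# in for the FFT-based squaring used at real GIMPS sizes.

def lucas_lehmer(p):
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m      # the 40M-bit squaring, in miniature
    return s == 0

print([p for p in (3, 5, 7, 11, 13, 17, 19) if lucas_lehmer(p)])
# [3, 5, 7, 13, 17, 19] — 11 fails because 2047 = 23 * 89
```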
[QUOTE=davieddy;116136]I would like to present my "understanding" of this topic,
and welcome correction where appropriate. In a typical LL test, each iteration needs to square a 40M bit number.[/QUOTE] Yes, for 40M exponents. The size of the number in bits is exactly the exponent. [QUOTE]In a 2048K FFT the number is first split into 2^21 "elements" each containing ~20 bits.[/QUOTE] I don't know how many bits there are per element. [QUOTE]At the midway stage of the FFT these elements are squared.[/QUOTE] It's not clear whether you are saying that the result of performing an FFT to "the midway stage" is that the elements are squared, or that squaring is an operation performed on the elements "at the midway stage". In fact, what happens is that at the midway stage [i]of the entire process[/i], i.e., when the FFT transform is complete, the transformed elements have to be explicitly squared. Then an inverse FFT is performed. At no point are the untransformed elements squared. The entire process is FFT, element-by-element squaring, inverse FFT. This may be what you meant. [QUOTE]These (by now) irrational quantities are represented to sufficient accuracy by the standard 53-bit "double-precision".[/QUOTE] The transformed quantities are complex, i.e., have real and imaginary components, not merely irrational. I don't know whether the imaginary part automagically disappears in the inverse transform, or if it has to be folded into the real part somehow. [QUOTE]If we were to try this in single precision, most of the bits would be needed for the "precision" leaving few bits (if any) to be the "significant" bits in each element.[/QUOTE] I'm not sure that "most" would be. I would expect that the number of bits available would be about half that of a double, meaning that the total number of FLOPS would be multiplied by 4. [QUOTE]If precision greater than "double" (=53 bit) were available, how much help would this be? 
I can see that a Law of Diminishing Returns applies here.[/QUOTE] I don't think that law does, at least not at the level of the software implementation. (It probably does at the hardware/microcode level.) The limiting case is where the entire number can be represented within a single huge-precision operand. Clearly this would require only a few FLOPS to transform, square, and inverse-transform (and of course we could just square it directly). [QUOTE]PS I have read that the mod(2^exponent-1) step comes "free". At what point does it come in? Does it mean that we don't need 40 significant bits at any stage?[/QUOTE] My understanding is that the result at the end of the inverse transform is the square of the operand (mod Mp), rather than just the square of the operand. That's how I interpret the claim that this step comes "free". |
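The "free" mod can be seen directly from the arithmetic: since 2^p ≡ 1 (mod M_p), any bits above position p simply wrap around and add onto the low bits. A sketch of the integer folding (shown explicitly here for clarity; in Prime95's weighted transform this wraparound is absorbed into the FFT itself):

```python
# Why reducing mod M_p = 2^p - 1 is cheap: 2^p ≡ 1 (mod M_p), so the
# high half of a square wraps around and adds onto the low half.

def mod_mersenne(x, p):
    m = (1 << p) - 1
    while x > m:
        x = (x & m) + (x >> p)   # fold bits above position p back down
    return 0 if x == m else x

p, a = 13, 6000
assert mod_mersenne(a * a, p) == (a * a) % ((1 << p) - 1)
```

No division ever happens: a mask, a shift, and an add replace the entire reduction.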
[QUOTE=davieddy;116136]
At the midway stage of the FFT these elements are squared. These (by now) irrational quantities are represented to sufficient accuracy by the standard 53-bit "double-precision". If we were to try this in single precision, most of the bits would be needed for the "precision" leaving few bits (if any) to be the "significant" bits in each element. [/QUOTE] This is correct, except that there is an inverse FFT performed to get the results of the squaring. If the numbers were split into 2^k chunks of B bits each, then each chunk of the product would need 2B+k bits when the squaring finishes. If 2B+k exceeds 53-bit double precision, then at least some of the results get truncated and the computation is spoiled. It's actually more complicated than this, because we can use balanced representation, where chunks take values up to ±2^(B-1). In this case the squaring results need at most 2B+k-2 bits, but the average value is zero, so this expression can exceed 53 bits in size and we can still recover the bits of the product with overwhelming probability. [QUOTE] If precision greater than "double" (=53 bit) were available, how much help would this be? I can see that a Law of Diminishing Returns applies here. David PS I have read that the mod(2^exponent-1) step comes "free". At what point does it come in? Does it mean that we don't need 40 significant bits at any stage?[/QUOTE] To a certain extent this is idle speculation, because no one except Intel/AMD can build an extended-precision FPU that's as fast as the ordinary FPU on an Intel/AMD processor, and there is no business case to add such an expensive feature to the processors. Nonetheless, I doubt that it would help all that much. The mod is free in that you do not need 2^(k+1) chunks to hold the squaring result, but each chunk is still going to need extra bits to be represented exactly. |
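The balanced-representation point is easy to see numerically: re-centering the chunk values around zero (ignoring the compensating carries, for simplicity of this sketch) turns each convolution sum into a random walk instead of a sum of positives, so the peaks shrink. Illustrative parameters only, not Prime95's:

```python
# Compare plain vs balanced chunk values: each element of the cyclic
# self-convolution (the squaring) is a sum of many 2B-bit products;
# balancing the digits around zero makes those sums cancel on average.
import random

random.seed(1)
B, n = 10, 256                    # 2^k = 256 chunks of B = 10 bits each
plain = [random.randrange(1 << B) for _ in range(n)]
balanced = [d - (1 << B) if d >= (1 << (B - 1)) else d for d in plain]

def conv_peak(digits):
    """Largest |element| of the cyclic self-convolution of digits."""
    return max(abs(sum(digits[j] * digits[(i - j) % n] for j in range(n)))
               for i in range(n))

print(conv_peak(plain).bit_length())     # near 2B + k = 28 bits
print(conv_peak(balanced).bit_length())  # several bits fewer
```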
[quote=jasonp;116161]This is correct, except that there is an inverse FFT performed to get the results of the squaring. If the numbers were split into 2^k chunks of B bits each, then each chunk of the product would need 2B+k bits when the squaring finishes. If 2B+k exceeds 53-bit double precision, then at least some of the results get truncated and the computation is spoiled.
[/quote] I can't see where the "+k" comes into it, though since B ~ 20 and k = 21, 2B+k = 61, so it is just as well. All that is clear is that the exact square of a 40M-bit number contains 80M bits, suggesting that the representation of each element must hold a 2B-bit number plus sufficient precision for its accurate derivation to the nearest integer. |
[quote=Mr. P-1;116159]
The entire process is FFT, element-by-element squaring, inverse-FFT. This may be what you meant. [/quote] It was. BTW 40M/2^21 ~ 19 |
[QUOTE=davieddy;116166]I can't see where "+k" comes into it, and since B~20 and k=21
2B+k = 61 so it is just as well. All that is clear is that the exact square of a 40M bit number contains 80M bits, suggesting that the representation of each element must represent a 2B bit number + sufficient precision for its accurate derivation to the nearest integer.[/QUOTE] Multiplying two numbers, each with 2^k elements each B bits in size, produces a 'multiplication pyramid' where each element of the product is the sum of at most 2^k numbers, each with 2*B bits ([url="http://en.wikipedia.org/wiki/Multiplication_ALU"]see here[/url]). That sum will have at most 2*B+k bits. For the Mersenne-mod case, the high half of the product wraps around to the low half, which means every element of the product is the sum of exactly 2^k 2B-bit numbers. You are correct that if all the elements are positive, then double precision is not enough to represent the result without fatal errors. However, we use balanced representation: each input element is a positive or negative (B-1)-bit integer. The multiplication behaves the same way: you still have to add up 2^k of these 2*(B-1)-bit numbers for each element in the product, but now approximately half the numbers are positive and half are negative, so their sum is going to be zero on average. Of course it won't be precisely zero, but the exact value of the sum is most likely to be a small number near zero, with values farther from zero being exponentially less likely. Sums so large in absolute value that they exceed the size of a 53-bit integer are considered so unlikely that they are assumed never to happen, even after all the squarings that a single LL test performs. To answer the original question: if any of those graphics chips can perform 32-bit integer multiplies really fast, as in faster than floating-point multiplies, then it's possible to use number-theoretic integer FFTs to perform an LL test at high speed. I think this 'if' is also unlikely enough that we can assume it will never happen. |
[quote=jasonp;116161]
To a certain extent this is idle speculation, because no one except Intel/AMD can build an extended-precision FPU that's as fast as the ordinary FPU on an Intel/AMD processor, and there is no business case to add such an expensive feature to the processors. Nonetheless, I doubt that it would help all that much. [/quote] I thought this speculation was highly relevant to the original question: if 53-bit precision is needed to handle elements whose starting length is ~19 bits, one can envisage that single precision can't do the task at all. It is natural to speculate that 106-bit precision might enable elements with more than 38 bits to be handled. Maybe my suggested "Law of Diminishing Returns" has already set in at 53-bit precision. |
[QUOTE=davieddy;116186]if 53-bit precision is needed to handle elements whose starting length is ~19 bits, one can envisage that single precision can't do the task at all. It is natural to speculate that 106-bit precision might enable elemnts with > 38 bits to be handled. Maybe my suggested "Law of diminishing returns" has already set in at 53-bit precision.[/QUOTE]
Actually, in terms of allowable-bits-per-input word as a function of FPU mantissa precision, you in fact get *accelerating* returns [though things asymptote quite quickly to linear once you get to DP and beyond] - this is because with increasing precision, the average fraction of bits of each float which is "noise" becomes relatively smaller with respect to the fraction that is "signal", for a suitably larger input word size. [And there is also a slight secondary error-reduction effect which comes from larger-input-word --> smaller-transform-length --> less-RO-error-accumulation.] But as I and others have noted, the main gain in this respect is in going from SP to DP, and there is no market case for chip makers to add dedicated high-speed 128-bit-float support - the market for it is simply too small to justify the silicon needed. And the folks who really, really do need it can just use 128-bit-float emulation, and throw more CPUs at the problem if they need more throughput. |
Thank you. That was more or less what I was trying to say.
Does the incorporation of an FPU within the main processor have advantages which militate against the development of a separate dedicated chip? |
PS I do appreciate that the inclusion of a DP
FPU in the world's computers is the "raison d'être" for GIMPS. PPS And in 128-bit FP I suspect we would get more than double the mantissa size compared with 64-bit DP. |
[QUOTE=davieddy;116253]Thank you. That was more or less what I was trying to say.
Does the incorporation of an FPU within the main processor have advantages which militate against the development of a separate dedicated chip?[/QUOTE] The biggest advantage is that if someone else designs an FPU that has 106-bit mantissa precision and gets it to run at 2.X GHz, then you don't need to :) Otherwise, programmable logic is large enough nowadays that if you really wanted that kind of precision in dedicated hardware you could design it yourself. But then you are limited to a few hundred MHz at most, and you have to design a board to hold the logic chip. The chip costs as much as 2-10 complete PCs just by itself, and so to make a better business case than just buying 2-10 PCs you have to get 2-10x the throughput of a single PC. In this you are competing against CPUs that large companies have invested billions of dollars and man-centuries of design work to build. Ernst's customers can get that kind of speedup for their own custom problems, but the LL test maps too well to general-purpose hardware for a custom solution to compete with double precision on an Intel/AMD processor. |
[QUOTE=jasonp;116172]To answer the original question, if any of those graphics chips can perform 32-bit integer multiplies really fast, as in faster than floating point multiplies, then it's possible to use number theoretic integer FFTs to perform an LL test at high speed. I think this 'if' is also unlikely enough that we can assume it will never happen.[/QUOTE]I thought that NTT requires a modular reduction during the computation. This would mean either an integer division (slow), multiplication by an inverse (floating point needed), or multiplication by the "magic number" inverse (integer double precision) and then a right shift. All three options are not very attractive. |
[QUOTE=retina;116344]I thought that NTT requires a modular reduction during the computation. This would mean either an integer division (slow), multiplication by an inverse (floating point needed), or multiplication by the "magic number" inverse (integer double precision) and then a right shift. All three options are not very attractive.[/QUOTE]
Correct. My preference is the last of those methods, because as long as the hardware supports integer multiplication with a double-size product then integer arithmetic can be used throughout. If you have a floating-point multiply-add and do not round between the multiply and the add, you can still implement the reduction. Finally, special moduli can simplify the reduction down to some shifts and adds. The advantage here is that even general-purpose hardware can do multiple integer additions in parallel, but usually not multiple floating-point additions, and the latter dominate any well-designed FFT code. So you're in a position where some parts of the NTT multiply are faster than a floating point equivalent and some parts are (possibly a lot) slower. My experiments on 64-bit x86 indicate that a well-designed NTT is about 15% slower than a well-designed FFT. Of course George's code is phenomenally well-designed. |
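The "magic number" option described above can be sketched as Barrett reduction: precompute a scaled inverse of the modulus once, and from then on each x mod p costs a double-width multiply, a shift, and at most one conditional subtract. A generic sketch (the modulus and bit widths here are just examples, not anyone's production choices):

```python
# Barrett ("magic number") reduction: x mod p without runtime division.
# Precompute mu = floor(2^(2*BITS) / p); then (x * mu) >> (2*BITS) equals
# floor(x / p) or one less, for any 0 <= x < p^2 with p < 2^BITS.

BITS = 32
p = 2013265921                      # an NTT-friendly prime, 15 * 2^27 + 1
mu = (1 << (2 * BITS)) // p         # the precomputed magic number

def barrett_reduce(x):
    q = (x * mu) >> (2 * BITS)      # cheap estimate of x // p
    r = x - q * p
    return r - p if r >= p else r   # at most one correction needed

x = 123456789 * 987654321
assert barrett_reduce(x) == x % p
```

On hardware this maps to exactly what the post asks for: an integer multiply with a double-size product and a shift, with no division unit involved.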
[quote=jasonp;116355]
Of course George's code is phenomenally well-designed.[/quote] Isn't this the real reason that pentiums seem so well suited to the LLT? |
[quote=davieddy;116371]Isn't this the real reason that pentiums seem so well suited
to the LLT?[/quote] That, and Intel's implementation of SIMD (SSE2) is better than AMD's at the moment. I don't know if that will change with AMD's Barcelona core and successors, but it could; AMD chips have been quicker than Intel in the past (circa P2, P3 and P4 prior to SSE2 optimization by George). |
[quote=sdbardwick;116373]That, and Intel's implementation of SIMD (SSE2) is better than AMD's at the moment. I don't know if that will change with AMD's Barcelona core and successors, but it could; AMD chips have been quicker than Intel in the past (circa P2, P3 and P4 prior to SSE2 optimization by George).[/quote]
I used "pentiums" in the generic sense. For the uninitiated, what (in a nutshell preferably) does SSE2 do? |
SSE2 is the instruction-set extension that lets you request two double-precision FP operations at a time.
On Pentium4 and Opteron, the two DP operations are performed consecutively in the pipeline, so each SSE2 instruction takes up two pipeline slots. On Core2 and Barcelona, they are performed simultaneously. |
My understanding of FFT is that it replaces multiplication
operations with addition/subtractions. Specifically "butterfly" operations where from inputs a and b we seek (a+b) and (a-b). Can this not be accomplished in hardware (preferably in fixed-point)? |
[QUOTE=davieddy;116409]My understanding of FFT is that it replaces multiplication
operations with addition/subtractions. Specifically "butterfly" operations where from inputs a and b we seek (a+b) and (a-b). Can this not be accomplished in hardware (preferably in fixed-point)?[/QUOTE] The main part of an FFT replaces (a,b) with ((a+b), (a-b)*w) or ((a + b*w), (a - b*w)) for complex floating-point numbers a, b, w. Sometimes w has a simple form that reduces the amount of arithmetic needed for one butterfly, and judicious grouping of butterflies leads to clusters of unrelated work that a superscalar processor executes nicely. You cannot replace the floating-point values with fixed point, because the required dynamic range is too large. Of course this can all be implemented in hardware; the real question is whether your custom solution can do 5 billion of them per second, as Prime95 on a Core2 can. Technically the FFT replaces a dense matrix multiplication (the discrete Fourier transform) with a sparse matrix multiplication that happens to have a nice recursive structure. A good introduction to the FFT is in Numerical Recipes (the whole book is online at [url]www.nr.com[/url]) and also the links page at [url]www.fftw.org[/url]. Number-theoretic FFTs use integers modulo a prime p for all these computations, with all results reduced modulo p. Here a, b, and w are integers, or possibly (depending on p) complex integers with the real and imaginary parts separately reduced mod p. There are 'Fast Walsh Transforms' where w is always 1, but they cannot be used for arbitrary-size convolutions. |
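To make the number-theoretic variant concrete, here is a tiny sketch of the same transform / pointwise-square / inverse-transform pipeline over integers mod a prime, where every step is exact. Naive O(n^2) transforms and toy sizes are used deliberately so the arithmetic is obvious; this is nothing like a production NTT:

```python
# Toy NTT squaring over Z/257Z. 3 is a primitive root of 257, so w has
# exact order n = 16. Inputs are limited to 16 bits (8 base-4 digits)
# so the length-16 cyclic convolution never wraps around, and every
# convolution element stays below 257, so no information is lost.

P, n = 257, 16
w = pow(3, (P - 1) // n, P)                    # n-th root of unity mod P

def transform(a, root):
    """Naive O(n^2) number-theoretic transform of a with the given root."""
    return [sum(a[j] * pow(root, i * j, P) for j in range(n)) % P
            for i in range(n)]

def ntt_square(x):
    assert 0 <= x < 1 << 16
    digits = [(x >> (2 * i)) & 3 for i in range(n)]      # base-4 chunks
    t = transform(digits, w)                             # forward NTT
    sq = transform([v * v % P for v in t], pow(w, P - 2, P))  # square, inverse
    inv_n = pow(n, P - 2, P)                             # n^-1 mod P
    return sum((c * inv_n % P) << (2 * i) for i, c in enumerate(sq))

assert ntt_square(1234) == 1234 ** 2
```

Everything here is integer multiplies and modular reductions, which is exactly why the earlier posts care whether GPUs can do fast 32-bit integer multiplies.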