#1
Bemusing Prompter
"Danny"
Dec 2002
California
45338 Posts
As we know, single precision sucks, because Lucas-Lehmer tests require double precision. While it's possible to emulate double precision on single-precision hardware, it is terribly inefficient: only about 10% of the hardware's single-precision performance survives the emulation.

However, I've heard that trial factoring can be done in single precision. Is this true? If so, it might be a good idea to add GPU support to Prime95 for users who choose to do trial factoring.
#2
Jun 2003
7×167 Posts
In fact, LL tests require multi-million-bit precision. We don't have processors capable of such operations natively, so we emulate them in software using the operations they do provide. Single-precision hardware could be programmed in much the same way; there would be no need to emulate double-precision instructions first.

The performance would be abysmal, though. The same is true for trial factoring, except that multi-million-bit precision isn't necessary there: double precision is enough. I've no doubt that a GPU could be programmed to do useful work for GIMPS. The question is: is it a worthwhile project for the programmers to expend their scarce time on?
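To make the emulation point concrete, here is a minimal sketch (my own illustration, not GIMPS code) of how extended precision is built from a processor's native floats, using the classic error-free transformations (Knuth's TwoSum and Dekker/Veltkamp splitting). Python floats are IEEE doubles, so the same trick a GPU would use to stretch single precision is shown here stretching double precision:

```python
from fractions import Fraction  # used only to verify exactness below

def two_sum(a, b):
    """Knuth's TwoSum: s + err == a + b exactly, where s = fl(a + b)."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

def split(a):
    """Veltkamp split of a double into two non-overlapping halves."""
    c = 134217729.0 * a        # 2^27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Dekker's TwoProduct: p + err == a * b exactly, where p = fl(a * b)."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    err = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, err

# The low-order bits that a plain addition/multiplication would discard
# are captured exactly in the second component:
s, e = two_sum(1.0, 2.0 ** -60)
p, f = two_prod(1.0 + 2.0 ** -30, 1.0 + 2.0 ** -30)
assert Fraction(s) + Fraction(e) == Fraction(1) + Fraction(2) ** -60
assert Fraction(p) + Fraction(f) == (Fraction(1) + Fraction(2) ** -30) ** 2
```

Chaining these primitives gives "double-double" arithmetic with roughly twice the native significand width, at a many-fold slowdown per operation, which is exactly why the emulation route performs so poorly.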
#3
Dec 2005
2²·23 Posts
Quote:
For example, say half of the CPUs on PrimeNet are running Lucas-Lehmer tests and half are trial factoring. Trial factoring eliminates many composite Mersenne numbers from consideration, which results in an immense reduction of the processing time needed to handle a given segment of the Mersenne number line versus the case where there is no trial factoring.

If GPUs were recruited for trial factoring and the CPUs that were trial factoring were reassigned to running Lucas-Lehmer tests, then the amount of time needed to handle a given segment of the Mersenne number line would be halved, as the trial factoring would be done for "free."

I make a few assumptions in that example: that the CPUs on PrimeNet are evenly divided between Lucas-Lehmer tests and trial factoring, and that either all of the CPUs are single core with the same specifications or they are evenly divided along lines of processing power. I also assume that the GPUs recruited for PrimeNet will be able to trial factor numbers at a rate sufficient to produce more potential Mersenne primes that have passed trial factoring than the CPUs will be able to test, which is the most important assumption, as it is my basis for claiming that trial factoring would be done for "free."

Last fiddled with by ShiningArcanine on 2007-10-11 at 17:04
#4
"Lucan"
Dec 2006
England
2·3·13·83 Posts
Quote:
and welcome correction where appropriate. In a typical LL test, each iteration needs to square a ~40M-bit number. With a 2048K FFT the number is first split into 2^21 "elements", each containing ~20 bits. At the midway stage of the FFT these elements are squared. These (by now) irrational quantities are represented to sufficient accuracy by the standard 53-bit double precision.

If we were to try this in single precision, most of the bits would be needed for the "precision", leaving few bits (if any) to be the "significant" bits in each element. If precision greater than double (= 53 bits) were available, how much help would this be? I can see that a Law of Diminishing Returns applies here.

David

PS I have read that the mod(2^exponent - 1) step comes "free". At what point does it come in? Does it mean that we don't need 40 significant bits at any stage?

Last fiddled with by davieddy on 2007-10-11 at 19:17
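The arithmetic behind those figures can be checked directly. Here is a back-of-the-envelope bound (my own sketch; the exact limits Prime95 uses are more refined than this):

```python
import math

exponent = 40_000_000          # bits in the number being squared (~40M)
fft_len = 2 ** 21              # the "2048K" FFT
B = exponent / fft_len         # bits carried per element (~19.07, i.e. the "~20 bits")

# Squaring doubles each element's bit length, and the convolution sums
# fft_len of those products, adding up to log2(fft_len) more bits:
worst_case = 2 * math.ceil(B) + int(math.log2(fft_len))
print(B, worst_case)           # ~19.07 and 61 -- already past the 53-bit mantissa
```

So a naive worst-case bound already exceeds double precision; the point of the replies below is that balanced representation and the statistics of the sums are what make it work anyway.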
#5
Jun 2003
7·167 Posts
Quote:

In fact, what happens is that at the midway stage of the entire process, i.e., when the FFT transform is complete, the transformed elements have to be explicitly squared. Then an inverse FFT is performed. At no point are the untransformed elements squared. The entire process is FFT, element-by-element squaring, inverse FFT. This may be what you meant.
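The pipeline described here, forward transform, element-by-element squaring, inverse transform, can be sketched end to end. This toy version (my own illustration: tiny 4-bit elements, a plain complex FFT, and none of the balanced representation or mod-2^p-1 weighting a real LL iteration uses) squares an ordinary integer the same way:

```python
import cmath

def fft(a, invert=False):
    """Recursive radix-2 FFT; invert=True applies the conjugate transform
    (the final division by n is left to the caller)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out

def fft_square(x, B=4):
    # 1. split x into "elements" of B bits each
    digits, t = [], x
    while t:
        digits.append(t & ((1 << B) - 1))
        t >>= B
    n = 1
    while n < 2 * len(digits):   # room for the double-length product
        n *= 2
    a = [complex(d) for d in digits] + [0j] * (n - len(digits))
    # 2. forward FFT, 3. square element by element, 4. inverse FFT
    fa = fft(a)
    sq = fft([v * v for v in fa], invert=True)
    # 5. round the (slightly fuzzy) reals back to integers and carry
    return sum(round(v.real / n) << (B * i) for i, v in enumerate(sq))

assert fft_square(123456789) == 123456789 ** 2
```

The rounding in step 5 is exactly where the 53-bit budget gets spent: if the accumulated error in any `v.real` ever reaches 0.5, the result is silently wrong, which is why element sizes are chosen so conservatively.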
#6
Tribal Bullet
Oct 2004
110111010111₂ Posts
Quote:
It's actually more complicated than this, because we can use balanced representation, where chunks take values up to ±2^(B-1). In this case the squaring results need at most 2B+k-2 bits, but the average value is zero, so this expression can exceed 53 bits in size and still recover the bits of the product with overwhelming probability.

Quote:
The mod is free in that you do not need 2^(k+1) chunks to hold the squaring result, but each chunk is still going to need extra bits to be represented exactly.
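The conversion into balanced representation is a simple carrying pass. A sketch (my own, with an arbitrarily small B for readability): any ordinary base-2^B digit that is ≥ 2^(B-1) is replaced by its negative counterpart, pushing a carry into the next chunk, which leaves the represented value unchanged:

```python
def to_balanced(digits, B):
    """Map base-2^B digits in [0, 2^B) onto balanced digits in
    [-2^(B-1), 2^(B-1)); returns (balanced_digits, final_carry)."""
    half, base = 1 << (B - 1), 1 << B
    out, carry = [], 0
    for d in digits:
        d += carry
        carry = 0
        if d >= half:
            d -= base          # value changes by -base at this position...
            carry = 1          # ...exactly offset by +1 at the next position
        out.append(d)
    return out, carry

# round-trip check: the balanced form encodes the same integer
B = 4
x = 123456789
digits, t = [], x
while t:
    digits.append(t & ((1 << B) - 1))
    t >>= B
bal, carry = to_balanced(digits, B)
rebuilt = sum(d << (B * i) for i, d in enumerate(bal))
rebuilt += carry << (B * len(bal))
assert rebuilt == x
assert all(-(1 << (B - 1)) <= d < (1 << (B - 1)) for d in bal)
```

Because the convolution sums now mix positive and negative terms, their expected value is zero, which is the statistical effect the post above relies on.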
#7
"Lucan"
Dec 2006
England
1100101001010₂ Posts
Quote:
2B+k = 61, so it is just as well. All that is clear is that the exact square of a 40M-bit number contains 80M bits, suggesting that the representation of each element must hold a 2B-bit number plus sufficient precision for its accurate derivation to the nearest integer.
#8
"Lucan"
Dec 2006
England
2·3·13·83 Posts
Quote:
BTW 40M/2^21 ≈ 19

Last fiddled with by davieddy on 2007-10-12 at 00:42
#9
Tribal Bullet
Oct 2004
3×1,181 Posts
Quote:
You are correct that if all the elements are positive, then double precision is not enough to represent the result without fatal errors. However, we use balanced representation: each input element is a positive or negative (B-1)-bit integer. The multiplication behaves the same way; you still have to add up 2^k of these 2(B-1)-bit numbers for each element in the product, but now approximately half the numbers are positive and half are negative, so their sum is going to be zero on average.

Of course it won't be precisely zero, but the exact value of the sum is most likely to be a small number near zero, with values farther away from zero being exponentially less likely. Sums that are so large in absolute value that they exceed the size of a 53-bit integer are considered so unlikely that they are assumed never to happen, even after all the squarings that a single LL test performs.

To answer the original question: if any of those graphics chips can perform 32-bit integer multiplies really fast, as in faster than floating-point multiplies, then it's possible to use number-theoretic integer FFTs to perform an LL test at high speed. I think this 'if' is also unlikely enough that we can assume it will never happen.
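For completeness, here is what a "number-theoretic integer FFT" means in miniature (my own O(n²) sketch, not a production transform): the same FFT / element-by-element squaring / inverse-FFT pipeline, but over integers mod a prime chosen so that roots of unity exist, so every step is exact and no precision question arises at all. The prime 998244353 = 119·2^23 + 1 with primitive root 3 is a standard choice:

```python
P = 998244353              # 119 * 2^23 + 1, an NTT-friendly prime
G = 3                      # a primitive root mod P

def dft_mod(a, root):
    """Schoolbook O(n^2) discrete transform over Z/P (clarity over speed)."""
    n = len(a)
    return [sum(a[j] * pow(root, i * j, P) for j in range(n)) % P
            for i in range(n)]

def ntt_square(x, B=8):
    # split x into B-bit chunks
    digits, t = [], x
    while t:
        digits.append(t & ((1 << B) - 1))
        t >>= B
    n = 1
    while n < 2 * len(digits):   # power of two, so n divides P - 1
        n *= 2
    a = digits + [0] * (n - len(digits))
    w = pow(G, (P - 1) // n, P)          # primitive n-th root of unity mod P
    fa = dft_mod(a, w)
    fa = [v * v % P for v in fa]         # exact element-by-element squaring
    inv_n = pow(n, P - 2, P)
    sq = [v * inv_n % P for v in dft_mod(fa, pow(w, P - 2, P))]
    # every coefficient is exact (all < P), so carrying is just shift-and-add
    return sum(c << (B * i) for i, c in enumerate(sq))

assert ntt_square(123456789) == 123456789 ** 2
```

The catch, as the post says, is throughput: on hardware whose fast path is floating point, modular integer multiplies like these are far slower than the double-precision pipeline Prime95 already exploits.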
#10
"Lucan"
Dec 2006
England
194A₁₆ Posts
Quote:
question: if 53-bit precision is needed to handle elements whose starting length is ~19 bits, one can envisage that single precision can't do the task at all. It is natural to speculate that 106-bit precision might enable elements with > 38 bits to be handled. Maybe my suggested "Law of Diminishing Returns" has already set in at 53-bit precision.
#11
∂²ω=0
Sep 2002
República de California
2×3²×647 Posts
Quote:
But as I and others have noted, the main gain in this respect is in going from SP to DP, and there is no market case for chip makers to add dedicated high-speed 128-bit-float support: the market for it is simply too small to justify the silicon needed. And the folks who really, really do need it can just use 128-bit-float emulation, and throw more CPUs at the problem if they need more throughput.

Last fiddled with by ewmayer on 2007-10-12 at 17:58