2007-09-14, 11:36  #1
Bemusing Prompter
"Danny"
Dec 2002
California
4554_{8} Posts 
so what GIMPS work can single precision do?
As we know, single precision sucks for GIMPS, because Lucas-Lehmer tests require double precision. While it's possible to emulate double precision on single-precision hardware, it is terribly inefficient: the emulated arithmetic runs at only about 10% of the hardware's native single-precision speed.
However, I've heard that trial factoring can be done in single precision. Is this true? If so, it might be a good idea to add GPU support to Prime95 for users who choose to do trial factoring.
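For what it's worth, the reason trial factoring suits narrow hardware is that each candidate divisor test only needs arithmetic mod q. A minimal Python sketch (the function name and `k_limit` parameter are mine, not Prime95's; it uses the known facts that any factor q of M_p has the form 2kp+1 with q ≡ ±1 mod 8):

```python
# Hedged sketch: trial factoring a Mersenne number M_p = 2^p - 1.
# Any factor q must have the form q = 2*k*p + 1 with q = +/-1 (mod 8);
# q divides M_p exactly when 2^p == 1 (mod q).  The modular squarings
# inside pow() only ever need enough precision to hold q^2, which is
# why trial factoring fits narrow hardware far better than an LL test.

def trial_factor(p, k_limit):
    """Search for a factor of 2^p - 1 among candidates q = 2*k*p + 1."""
    for k in range(1, k_limit + 1):
        q = 2 * k * p + 1
        if q % 8 in (1, 7) and pow(2, p, q) == 1:
            return q
    return None

print(trial_factor(11, 10))   # 23 divides 2^11 - 1 = 2047 = 23 * 89
```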
2007-09-14, 16:40  #2
Jun 2003
1169_{10} Posts 
In fact, LL tests require multimillion-bit precision. We don't have processors capable of such operations natively, so we emulate them in software using the operations they do provide. Single-precision hardware could be programmed in much the same way; there would be no need to emulate double-precision instructions.

The performance would be abysmal. The same is true for trial factoring, except that multimillion-bit precision isn't necessary: double precision is enough. I've no doubt that a GPU could be programmed to do useful work for GIMPS. The question is whether it's a worthwhile project for the programmers to spend their scarce time on.
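As an aside, the standard building block for stretching narrow floats to higher precision is the error-free "two-sum" transformation; here it is on float32 purely as an illustration of the emulation idea, not as GIMPS code:

```python
import numpy as np

# Hedged sketch (not GIMPS code): the classic error-free "two-sum"
# trick, here on float32, shows how wider precision can be emulated
# from pairs of narrow floats.  s is the rounded sum and e the exact
# rounding error, so the pair (s, e) represents a + b exactly.

def two_sum(a, b):
    a, b = np.float32(a), np.float32(b)
    s = a + b                       # rounded float32 sum
    bb = s - a                      # reconstruct b's contribution to s
    e = (a - (s - bb)) + (b - bb)   # exact rounding error of s
    return s, e

s, e = two_sum(np.float32(1.0), np.float32(1e-8))
print(s, e)   # s rounds to 1.0; e recovers the lost 1e-8
```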
2007-10-11, 17:03  #3
Dec 2005
2^{2}·23 Posts 
For example, say half of the CPUs on PrimeNet are running Lucas-Lehmer tests and half are trial factoring. Trial factoring eliminates many composite Mersenne numbers from consideration, which results in an immense reduction of the processing time needed to handle a given segment of the Mersenne number line versus the case where there is no trial factoring. If GPUs were recruited for trial factoring and the CPUs that had been trial factoring were reassigned to running Lucas-Lehmer tests, then the time needed to handle a given segment of the Mersenne number line would be halved, as the trial factoring would be done for "free."

I make a few assumptions in that example: that the CPUs on PrimeNet are evenly divided between Lucas-Lehmer tests and trial factoring, and that the CPUs are either all single-core with the same specifications or evenly divided along lines of processing power. I also assume that the GPUs recruited for PrimeNet will be able to trial factor numbers at a rate sufficient to produce more candidate Mersenne primes that have passed trial factoring than the CPUs will be able to test. That is the most important assumption, as it is my basis for claiming that the trial factoring would be done for "free."

Last fiddled with by ShiningArcanine on 2007-10-11 at 17:04

2007-10-11, 18:17  #4
"Lucan"
Dec 2006
England
2×3×13×83 Posts 
…and welcome correction where appropriate.

In a typical LL test, each iteration needs to square a ~40M-bit number. With a 2048K FFT the number is first split into 2^21 "elements," each containing ~20 bits. At the midway stage of the FFT these elements are squared. These (by now) irrational quantities are represented to sufficient accuracy by the standard 53-bit "double precision." If we were to try this in single precision, most of the bits would be needed for the "precision," leaving few bits (if any) to be the "significant" bits in each element.

If precision greater than "double" (= 53 bits) were available, how much help would this be? I can see that a Law of Diminishing Returns applies here.

David

PS I have read that the mod (2^exponent − 1) step comes "free." At what point does it come in? Does it mean that we don't need 40 significant bits at any stage?

Last fiddled with by davieddy on 2007-10-11 at 19:17
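The splitting step described here is easy to show concretely. A hedged Python illustration (helper names are mine; B = 20 bits per element, so a 40M-bit number would need 2^21 of them):

```python
# Illustrative only: split a number into fixed-width B-bit "elements"
# and reassemble it.  In a real FFT multiply these elements become the
# floating-point inputs of the transform.

B = 20                      # bits per element

def split(n, count):
    """Return `count` base-2^B digits of n, least significant first."""
    mask = (1 << B) - 1
    return [(n >> (B * i)) & mask for i in range(count)]

def join(digits):
    return sum(d << (B * i) for i, d in enumerate(digits))

x = 123456789012345678901234567890
digits = split(x, 8)        # 8 * 20 = 160 bits, enough for this x
assert join(digits) == x
print(digits)
```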

2007-10-11, 22:57  #5
Jun 2003
7·167 Posts 
In fact, what happens is that at the midway stage of the entire process, i.e., when the FFT transform is complete, the transformed elements have to be explicitly squared. Then an inverse FFT is performed. At no point are the untransformed elements squared. The entire process is FFT, element-by-element squaring, inverse FFT. This may be what you meant.
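The forward-FFT / element-by-element squaring / inverse-FFT process can be sketched in miniature. This is a toy Python squarer for illustration only, using tiny 8-bit elements and none of the weighted-transform tricks real LL code relies on:

```python
import numpy as np

B = 8  # bits per element (toy size; real LL code uses ~20-bit elements)

def fft_square(n, length):
    """Square n by splitting it into `length` base-2^B digits,
    convolving them via FFT, and then propagating carries."""
    digits = [(n >> (B * i)) & ((1 << B) - 1) for i in range(length)]
    f = np.fft.fft(digits, 2 * length)      # forward transform (zero-padded)
    f = f * f                               # element-by-element squaring
    conv = np.rint(np.fft.ifft(f).real).astype(np.int64)
    result, carry = 0, 0
    for i, c in enumerate(conv):            # release carries, rebuild integer
        c = int(c) + carry
        result += (c & ((1 << B) - 1)) << (B * i)
        carry = c >> B
    return result

x = 123456789
print(fft_square(x, 4) == x * x)            # True
```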


2007-10-11, 23:03  #6
Tribal Bullet
Oct 2004
DD7_{16} Posts 
It's actually more complicated than this, because we can use balanced representation, where chunks take values up to ±2^(B−1). In this case the squaring results need at most 2B+k−2 bits, but the average value is zero, so this expression can exceed 53 bits in size and we can still recover the bits of the product with overwhelming probability.
The mod is free in that you do not need 2^(k+1) chunks to hold the squaring result, but each chunk is still going to need extra bits to be represented exactly. 
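A small sketch of the balanced representation described here may help. This is illustrative Python, not jasonp's code, and the helper names are mine:

```python
# Hedged sketch of balanced representation: ordinary base-2^B digits
# in [0, 2^B) are shifted into [-2^(B-1), 2^(B-1)) by borrowing from
# the next digit.  The represented value is unchanged, but products
# of digits now average to zero, which is what keeps the convolution
# sums small enough for 53-bit doubles.

B = 20
HALF = 1 << (B - 1)

def to_balanced(digits):
    out, carry = [], 0
    for d in digits:
        d += carry
        if d >= HALF:               # push the excess into the next digit
            d -= 1 << B
            carry = 1
        else:
            carry = 0
        out.append(d)
    out.append(carry)               # a top digit absorbs the final borrow
    return out

value = lambda ds: sum(d << (B * i) for i, d in enumerate(ds))

digits = [0xFFFFF, 0x00001, 0x80000]        # ordinary 20-bit digits
bal = to_balanced(digits)
assert value(bal) == value(digits)          # same integer, new digit range
print(bal)
```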

2007-10-12, 00:30  #7
"Lucan"
Dec 2006
England
2·3·13·83 Posts 
2B+k = 61, so it is just as well. All that is clear is that the exact square of a 40M-bit number contains 80M bits, suggesting that the representation of each element must represent a 2B-bit number plus sufficient precision for its accurate derivation to the nearest integer.
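A quick sanity check of the 80M-bit claim (Python, illustration only): the square of an n-bit number always has exactly 2n or 2n−1 bits.

```python
# The smallest n-bit value, 2^(n-1), squares to 2^(2n-2) (2n-1 bits);
# the largest, 2^n - 1, squares to just under 2^2n (2n bits).  So the
# square of a 40M-bit number indeed needs ~80M bits.
for n in (5, 19, 40):
    lo, hi = 1 << (n - 1), (1 << n) - 1
    print(n, (lo * lo).bit_length(), (hi * hi).bit_length())
# each row prints n, 2n-1, 2n
```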

2007-10-12, 00:40  #8
"Lucan"
Dec 2006
England
2·3·13·83 Posts 
BTW 40M / 2^21 ≈ 19

Last fiddled with by davieddy on 2007-10-12 at 00:42

2007-10-12, 03:24  #9
Tribal Bullet
Oct 2004
3×1,181 Posts 
You are correct that if all the elements are positive, then double precision is not enough to represent the result without fatal errors. However, we use balanced representation: each input element is a positive or negative (B−1)-bit integer. The multiplication behaves the same way; you still have to add up 2^k of these 2(B−1)-bit numbers for each element in the product, but now approximately half the numbers are positive and half are negative, so their sum is going to be zero on average. Of course it won't be precisely zero, but the exact value of the sum is most likely to be a small number near zero, with values farther from zero being exponentially less likely. Sums so large in absolute value that they exceed the size of a 53-bit integer are considered so unlikely that they are assumed never to happen, even after all the squarings that a single LL test performs.

To answer the original question: if any of those graphics chips can perform 32-bit integer multiplies really fast, as in faster than floating-point multiplies, then it's possible to use number-theoretic integer FFTs to perform an LL test at high speed. I think this 'if' is also unlikely enough that we can assume it will never happen.
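The cancellation argument is easy to simulate. A hedged Python experiment (parameters B = 20, k = 21 from earlier in the thread; the random pairing is mine, a stand-in for one output element of the squaring convolution):

```python
import numpy as np

rng = np.random.default_rng(1)
B, k = 20, 21
n = 1 << k                    # number of products summed per output element

# signed (balanced) digits in [-2^(B-1), 2^(B-1)); unsigned is the
# all-positive case jasonp says would be fatal
signed = rng.integers(-(1 << (B - 1)), 1 << (B - 1), size=n, dtype=np.int64)
unsigned = np.abs(signed)

# pair each digit with another and sum the 2^k products
balanced_sum = abs(int(np.sum(signed * signed[::-1])))
positive_sum = int(np.sum(unsigned * unsigned[::-1]))

print(balanced_sum.bit_length())   # comfortably under 53 with overwhelming probability
print(positive_sum.bit_length())   # > 53: would lose bits in a double
```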

2007-10-12, 09:11  #10
"Lucan"
Dec 2006
England
194A_{16} Posts 
…question: if 53-bit precision is needed to handle elements whose starting length is ~19 bits, one can envisage that single precision can't do the task at all. It is natural to speculate that 106-bit precision might enable elements with >38 bits to be handled. Maybe my suggested "Law of Diminishing Returns" has already set in at 53-bit precision.

2007-10-12, 17:50  #11
∂^{2}ω=0
Sep 2002
República de California
10110110001010_{2} Posts 
But as I and others have noted, the main gain in this respect is in going from SP to DP, and there is no market case for chip makers to add dedicated high-speed 128-bit-float support: the market for it is simply too small to justify the silicon needed. And the folks who really, really do need it can just use 128-bit-float emulation, and throw more CPUs at the problem if they need more throughput.

Last fiddled with by ewmayer on 2007-10-12 at 17:58
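The 128-bit-float emulation mentioned here is usually done with "double-double" arithmetic: a value stored as an unevaluated sum (hi, lo) of two ordinary doubles, giving a ~106-bit significand. A hedged Python sketch of the textbook addition building block (function names are mine; a full library, like the classic dd/qd packages, adds multiplication, division, etc.):

```python
# Error-free "two-sum": s is the rounded double sum of a and b, and e
# is the exact rounding error, so (s, e) together represent a + b exactly.
def two_sum(a, b):
    s = a + b
    bb = s - a
    return s, (a - (s - bb)) + (b - bb)

def dd_add(x, y):
    """Add two double-double values x = (hi, lo), y = (hi, lo)."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]          # fold in the low-order parts
    return two_sum(s, e)      # renormalize into a (hi, lo) pair

# 1 + 2^-60 is unrepresentable in one double but survives as a pair:
hi, lo = dd_add((1.0, 0.0), (2.0 ** -60, 0.0))
print(hi, lo == 2.0 ** -60)   # 1.0 True
```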
