2003-12-22, 09:09  #1 
Mar 2003
10_{16} Posts 
Floating point precision
Every time a potential new hardware platform appears that might be suitable for doing GIMPS work (or rather the LL test), one of the first things to be checked is the precision of its floating-point hardware, and the standard verdict is that a float (32 bits) has too little precision and that the size of a double (64 bits) is needed.
But my question is: is it really so? When "The math" page (http://www.mersenne.org/math.htm) talks about convolution errors, it suggests using a larger FFT size when the convolution error exceeds 0.5 — but the size of the doubles stays the same. The benchmark page has two sections: machines using SSE2 instructions and machines without them. When using SSE2 it says there is less precision, so the exponent ranges are a little different. Shouldn't it therefore be possible to use hardware with less precision (for example a float) for the floating-point values, simply using a larger FFT size (twice as large?) from the beginning? If so, some hardware types should be reevaluated for their usefulness for GIMPS work: graphics cards, ClearSpeed, etc. 
2003-12-22, 21:45  #2 
∂^{2}ω=0
Sep 2002
República de California
3·7·13·43 Posts 
The problem is that, due to roundoff errors and convolution output sizes, there is an effective precision threshold below which the required vector length grows far faster than linearly in the input number size. See the following related thread:
http://www.mersenneforum.org/showthr...light=equation

Now, the formula in the paper I mention in that thread can equally well be applied to single-precision floats as to doubles. IEEE single-precision floats have 24 mantissa bits (including the usual hidden bit), so equation 8 in the CMP paper becomes

2*log2(W) + log2(C) + (1/2)*[log2(N) + log2(log2(N))] + (3/2)*[log2(ln(ln(N)))] < 24 .

The asymptotic constant C is roughly 2 for doubles, and there's no reason to assume it will be appreciably different for singles. So taking C = 2, we see that for an FFT length of 2^20 the maximum wordsize is bounded by W ~= 4.4 bits, which allows one to test M(p) only as large as ~4.5M. Compare that to a maximum wordsize of ~19 bits for doubles, which allows numbers with p up to ~20M to be tested.

The other problem is that for every doubling of the input vector length N, we lose roughly an added 0.25-0.3 bits per input word. So even if we quadruple the length of our SP input vector to 2^22, we're only allowed ~3.8-3.9 bits per input, meaning we can only test numbers with p up to roughly 16M. So even if we could make perfect use of a 4-way SP SIMD execution unit (as is available on the x86 MMX unit and the Apple PPC's AltiVec unit) and thus achieve 4x the SP floating throughput per cycle compared to DP floats (which seems dubious in light of the fact that we've more than doubled our bandwidth demands on main memory because of the very same nonlinear increase in vector length with reduced precision), we're still worse off, because we still can't even test numbers as large with that extra work. (And to make matters worse, most such SIMD units stall the main processor when executing, i.e. we can't efficiently do both DP and SP operations in parallel.)

If anyone wants to play with the above formula using, say, the Unix bc utility (invoke with bc -l to get the real-number version), here is the bc input I used for the N = 2^20 single-precision calculation. 
The first line, b, is the number of bits in the floating mantissa; the second is the input vector length to the FFT (these are the only required user inputs); the third is an intermediate step which computes log2(n); and the fourth line outputs the number of bits in the maximum average input wordsize:

b = 24
n = 2^20
x = l(n)/l(2)
((b-1) - ((x + l(x)/l(2))/2 + (1.5)*(l(l(l(n)))/l(2))))/2
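For anyone without bc handy, the same calculation can be written as a short Python sketch. The function name and the C default are mine, not from the CMP paper; the body is just equation 8 rearranged to solve for log2(W):

```python
import math

def max_bits_per_word(mantissa_bits, n, c=2.0):
    # Rearranged form of equation 8 from the CMP paper:
    # 2*log2(W) + log2(C) + (1/2)*(log2(N) + log2(log2(N)))
    #   + (3/2)*log2(ln(ln(N))) < mantissa_bits
    x = math.log2(n)
    return (mantissa_bits - math.log2(c)
            - (x + math.log2(x)) / 2
            - 1.5 * math.log2(math.log(math.log(n)))) / 2

print(round(max_bits_per_word(24, 2**20), 1))  # single precision: ~4.4 bits
print(round(max_bits_per_word(53, 2**20), 1))  # double precision: ~18.9 bits
```

Plugging in 24 and 53 mantissa bits at N = 2^20 reproduces the ~4.4-bit and ~19-bit wordsizes quoted above.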
2003-12-23, 01:48  #3 
Sep 2003
13·199 Posts 
Any hope for integer LL testing?
For instance: http://www.craig-wood.com/nick/armprime/math.html I suppose graphics cards tend to be floating-point oriented, though... 
2003-12-23, 18:36  #4 
Mar 2003
10000_{2} Posts 
When I asked the question I had graphics cards especially in mind, so I will write from that perspective here.
An input vector length of 2^22 would alone take 16 MB of memory. My understanding of FFTs isn't exactly excellent, but I assume we only need one of those in a read-only capacity, right? Targeting a 128 MB card, and allowing for the actual display memory and the memory for the number under test, would probably still permit a 64 MB vector, in other words an input vector length of 2^24. I couldn't figure out what the limit would be for the numbers to be tested, though.

Is there any other algorithm that can be used? Maybe slower versions that don't need as large input vectors? Slicing up the number or FFT and performing multiple passes per iteration? Or using two floats as an atomic unit in order to increase the precision?

GP2: Doing integer LL testing might actually be a possibility for the Wildcat VP cards from 3Dlabs. When glancing at the specs for the VP560 card I noticed there are 32-bit floating point and integer units. Unless they mean 32 bits per pixel, i.e. 8 bits each for RGBA, that card could possibly do the trick. But I haven't checked how to program that card. Doing integer work on the common NVIDIA or ATI cards is a no-go, I think. From what I understand they translate all integers to the floating-point realm. 
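The memory arithmetic above can be sanity-checked with a trivial sketch (assuming 4 bytes per single-precision FFT vector element; the function name is mine):

```python
def fft_vector_mb(log2_n, bytes_per_elem=4):
    # A single-precision float is 4 bytes, so a vector of 2^log2_n
    # elements occupies 2^log2_n * 4 bytes; convert to mebibytes.
    return (1 << log2_n) * bytes_per_elem // (1 << 20)

print(fft_vector_mb(22))  # 16 (MB), as mentioned above
print(fft_vector_mb(24))  # 64 (MB), plausible on a 128 MB card
```
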
2003-12-24, 06:49  #5 
Aug 2002
Portland, OR USA
422_{8} Posts 
There is another thread about a mixed floating-point/integer transform for faster LL testing.
Perhaps one VP560, or two of the other cards, could be used for this.

Last fiddled with by Maybeso on 2003-12-24 at 06:49 
2003-12-26, 09:52  #6 
Mar 2003
10_{16} Posts 
I have found a double-double and quad-double library for the CPU. Basically it treats 2 or 4 doubles as a single unit, thereby increasing the precision.
http://www.cs.berkeley.edu/~yozo/

Looking at the code and paper, it seems the same could be done with single-precision float values. I will probably check whether a float-float version is doable, unless someone has any objections or sees any flaws. It won't happen until next year, but that's soon enough.

Would a quad-float version be of any value? For example, by allowing a decreased FFT input vector size? 
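As a taste of what a float-float version involves, here is Knuth's error-free two-sum, the basic building block such libraries are built from, sketched in Python with numpy float32 standing in for the GPU's single-precision floats (the function name is mine):

```python
import numpy as np

def two_sum_f32(a, b):
    # Knuth's error-free transformation: s + e equals a + b exactly,
    # where s is the rounded float32 sum and e is the rounding error.
    a, b = np.float32(a), np.float32(b)
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

s, e = two_sum_f32(1.0, 2.0**-30)
print(s, e)  # s rounds to 1.0; e recovers the lost 2**-30
```

A float-float "add" then combines several such transformations and renormalizes the (hi, lo) pair, which is why the emulated arithmetic costs a handful of native operations per add.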
2003-12-26, 17:16  #7  
∂^{2}ω=0
Sep 2002
República de California
11739_{10} Posts 
Quote:
The best part of the mixed float/int transform is that it avoids precisely this emulation problem, i.e. it gives the same kind of precision boost as quad floats do, but without the expense of emulating non-hardware-supported operations. The major potential bottleneck (at least as far as video-style SIMD processors are concerned) is that it needs good hardware support for integer multiply.

For instance, we could contemplate doing a single-precision floating FFT in parallel with a pure-integer transform modulo a 32-bit prime, say p = M31 = 2^31 - 1, which allows an efficient modulo. The parallel integer transform would allow us to use roughly log2(p)/2 ~= 31/2 = 15.5 more bits per input than if we were doing a strictly floating-point transform. As we've already seen, single-precision math would allow only roughly 4 bits per input word at a vector length of 2^20. Adding a parallel transform mod M31 would boost this to ~19.5 bits per input, which is about as good as (in fact a tad better than) we could get by doing a double-precision floating-point FFT.

A 4-way SIMD unit like the AltiVec or MMX would allow most 32-bit floating and integer operations to be done very quickly, but the sticking point is that we'll need to calculate lots of full-length products of 31-bit ints, which AFAIK neither AltiVec nor MMX can do efficiently. If we could do at least one such product per cycle in the core CPU we might still be OK, but only if we can continue to do 32-bit SIMD arithmetic in the graphics coprocessor at the same time. I'm no hardware expert, but I seem to recall that neither the PPC nor the x86 allows the core CPU and the AltiVec/MMX coprocessor to blast away simultaneously. There is also the problem of how efficiently data can be shuffled between the two functional units: a fast 64-bit integer multiply in the core CPU is useless to us if the result (after modding) can't quickly be transferred back to the 32-bit SIMD unit, where it is needed.

I'm not saying it's impossible to get this kind of scheme to run fast, just that there are a *lot* of highly nontrivial coding and hardware-architecture issues to be solved along the way. 
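To illustrate the "efficient modulo" for p = M31: since 2^31 ≡ 1 (mod 2^31 - 1), the high bits of a product simply fold back in additively, so reduction needs only shifts, masks, and adds, with no division. A minimal Python sketch (the function name is mine):

```python
M31 = (1 << 31) - 1  # the Mersenne prime 2^31 - 1

def mod_m31(x):
    # 2^31 == 1 (mod M31), so x = hi*2^31 + lo == hi + lo (mod M31);
    # fold until x fits in 31 bits, then map M31 itself to 0.
    while x > M31:
        x = (x >> 31) + (x & M31)
    return 0 if x == M31 else x

print(mod_m31(M31 * 12345 + 678))  # 678
```

For 62-bit products of two 31-bit ints, at most two folds suffice, which is why a Mersenne modulus is so attractive for an integer transform.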

2003-12-27, 10:37  #8 
Mar 2003
10_{16} Posts 
People have different reasons for participating in GIMPS: some are in it for the mathematical value, some are stats junkies, some are in it for the chance of winning the prize money, some want to stress test their system to ensure it is behaving correctly, some hate seeing good processing power go to waste by being idle, and there are certainly more reasons than these.
Whatever the reason is, there are still some unused areas in a computer system that could be put to good use, and what I would like to do is utilize a standard part of the system that isn't currently used for finding the next largest known prime number.

Some of the limits/constraints for doing calculations on the graphics card are:
* The GPU works with single-precision floats only
* The program (vertex shader or pixel shader) sent to the GPU may only take up a few instruction slots; how few is card-specific and depends on which generation it is: a DirectX 9 part has more slots than a DirectX 8 part, for example
* The instruction set is limited and geared for graphics use
* The graphics processor is parallel by nature, operating on up to eight pixels at a time
* The graphics processor has much higher bandwidth to its memory than an ordinary CPU has to system memory
* Uploading data to the graphics card is relatively slow; reading data back to the CPU is even slower
* There is a limited amount of memory on the card, with no chance of upgrading it without replacing the whole card
* If a screen mode change occurs, the buffers may be destroyed
* ... and so on ...

Instead of demanding that someone implement a method to do calculations on the graphics card, I am volunteering to do so myself. For it to be of any use I don't intend to disturb the Prime95 program too much; it is a tried and true way of working. The biggest chance of success, as I see it, is to do as much of the calculation as possible on the card itself, with the CPU only directing the card. While the iterations run on the card the CPU will be idle (or, more likely, running Prime95). If both the CPU and GPU needed access to the data it would use up too much memory bandwidth.

The showstopper has thus been that the GPU only has single-precision floats, but if I can work around that... 
Sure enough, the emulated double precision is slower than if it had been implemented in hardware, but right now the GPU isn't used at all, so even the emulated way is faster than what we have currently. The first step is to make this work. When it does, it is time to think about how/if to incorporate it into Prime95.  As I said previously, this is about the graphics hardware; some of it can be applied to other hardware too. 
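For the "work around that" part, the other classic building block (alongside two-sum) is Dekker's error-free product: split each float into high and low halves, then recover the exact rounding error of a multiply. A sketch in Python, with numpy float32 standing in for the GPU's floats (names are mine, and this assumes no overflow/underflow):

```python
import numpy as np

SPLIT = np.float32(2**12 + 1)  # Dekker split constant for a 24-bit mantissa

def split_f32(a):
    # Split a into hi + lo, each exactly representable in ~12 mantissa bits.
    t = SPLIT * a
    hi = t - (t - a)
    return hi, a - hi

def two_product_f32(a, b):
    # p + e equals a * b exactly: p is the rounded float32 product,
    # e is the rounding error reconstructed from the partial products.
    a, b = np.float32(a), np.float32(b)
    p = a * b
    ahi, alo = split_f32(a)
    bhi, blo = split_f32(b)
    e = ((ahi * bhi - p) + ahi * blo + alo * bhi) + alo * blo
    return p, e
```

Since 24 + 24 = 48 bits fit in a double's 53-bit mantissa, p + e evaluated in higher precision recovers the exact product of the two float32 inputs, which is exactly the precision boost a float-float multiply needs.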
2003-12-28, 00:14  #9 
Aug 2002
2^{6}×5 Posts 
The precision problem is definitely the #1 hurdle. I'd be very curious if there's a way to emulate a double using the stream programming model of a GPU.

2003-12-28, 02:01  #10 
Aug 2003
2^{4}·3 Posts 
I think there are some platforms out there with a 128-bit 'long double' type (SPARC, at least). I'm curious whether these would be particularly well suited for LL testing.

2003-12-28, 05:55  #11  
Aug 2002
2^{6}·5 Posts 
Quote:


Similar Threads  
Thread | Thread Starter | Forum | Replies | Last Post 
floating point operations | ATH | Lounge | 3 | 2006-01-01 20:29 
Floating point options for Windows XP 64 | dsouza123 | Hardware | 2 | 2005-03-12 17:45 
LL tests: Integer or floating point? | E_tron | Math | 4 | 2004-01-13 19:44 
Integer and Floating point Trial Factoring in parallel ? | dsouza123 | Software | 3 | 2003-09-21 12:46 
floating point exception in Version 23.4.2 | mda2376 | Software | 2 | 2003-06-12 04:45 