mersenneforum.org Floating point precision

 2003-12-22, 09:09 #1 lunna   Mar 2003 24 Posts Floating point precision Every time a potential new hardware platform comes up that might be suitable for GIMPS work (or rather for the LL test), one of the first things checked is the precision of its floating-point hardware, and the standard verdict is that a float (32 bits) has too little precision and that a double (64 bits) is needed. But my question is: is that really so? When "The math" page (http://www.mersenne.org/math.htm) talks about convolution errors, it suggests using a larger FFT size when the convolution error exceeds 0.5; but the size of a double stays the same. The benchmark page has two sections: machines using SSE2 instructions and those without. When SSE2 is used it says there is less precision, so the exponent ranges are a little different. Shouldn't it therefore be possible to use hardware with less precision (for example a float) for the floating-point values, simply using a larger FFT size (twice as large?) from the start? If so, some hardware types should be re-evaluated for their usefulness for GIMPS work: graphics cards, ClearSpeed, etc.
 2003-12-22, 21:45 #2 ewmayer ∂2ω=0     Sep 2002 República de California 2×11×13×41 Posts The problem is that, due to roundoff errors and convolution output sizes, there is an effective precision threshold below which the vector length that would be required grows far faster than linearly in the input number size. See the following related thread: http://www.mersenneforum.org/showthr...light=equation

Now, the formula in the paper I mention in that thread applies equally well to single-precision floats as to doubles. IEEE single-precision floats have 24 mantissa bits (including the usual hidden bit), so equation 8 in the CMP paper becomes

2*log2(W) + log2(C) + (1/2)*[log2(N) + log2(log2(N))] + (3/2)*[log2(ln(ln(N)))] < 24 .

The asymptotic constant C is roughly 2 for doubles, and there's no reason to assume it will be appreciably different for singles. So taking C = 2, we see that for an FFT length of 2^20 the maximum wordsize is bounded by W ~= 4.4 bits, which allows one to test M(p) only as large as p ~= 4.5M. Compare that to a maximum wordsize of ~19 bits for doubles, which allows numbers with p up to ~20M to be tested.

The other problem is that for every doubling of the input vector length N, we lose roughly an added 0.25-0.3 bits per input word. So even if we quadruple the length of our SP input vector to 2^22, we're only allowed ~3.8-3.9 bits per input, meaning we can only test numbers with p up to roughly 16M. So even if we could make perfect use of a 4-way SP SIMD execution unit (as is available in the x86 MMX unit and the Apple PPC's AltiVec unit) and thus achieve 4x the SP floating-point throughput per cycle of DP floats (which seems dubious, given that we've more than doubled our bandwidth demands on main memory because of the very same nonlinear increase in vector length with reduced precision), we're still worse off, because even with all that extra work we can't test numbers as large.
(And to make matters worse, most such SIMD units stall the main processor when executing, i.e. we can't efficiently do both DP and SP operations in parallel.)

If anyone wants to play with the above formula using, say, the Unix bc utility (invoke with bc -l to get the real-number version), here is the bc input I used for the N = 2^20 single-precision calculation. The first line, b, is the number of bits in the floating mantissa, the second is the input vector length to the FFT (these are the only required user inputs), the third is an intermediate step which computes log2(n), and the fourth line outputs the number of bits in the maximum average input wordsize:

b = 24
n = 2^20
x = l(n)/l(2)
((b-1) - ((x + l(x)/l(2))/2 + (1.5)*(l(l(l(n)))/l(2))))/2
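For anyone who prefers Python to bc, the same calculation can be sketched as follows. This is a direct transcription of the inequality above, with C = 2 assumed as in the post; the function name is mine:

```python
from math import log, log2

def max_wordsize_bits(mantissa_bits, n, c=2.0):
    """Solve the roundoff-error inequality above for log2(W), the
    maximum average number of bits per input word at FFT length n:
      2*log2(W) + log2(C) + (1/2)*(log2(N) + log2(log2(N)))
                + (3/2)*log2(ln(ln(N))) < mantissa_bits
    with the asymptotic constant C taken to be ~2."""
    x = log2(n)                           # log2(N)
    slack = (mantissa_bits - log2(c)
             - 0.5 * (x + log2(x))
             - 1.5 * log2(log(log(n))))
    return slack / 2.0                    # log2(W), bits per word

w_sp = max_wordsize_bits(24, 2**20)   # singles: ~4.4 bits per word
w_dp = max_wordsize_bits(53, 2**20)   # doubles: ~18.9 bits per word
```

At W ~= 4.4 bits and N = 2^20 this caps the testable exponent near 4.4 × 2^20 ~= 4.6M, matching the figures quoted above.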
 2003-12-23, 01:48 #3 GP2     Sep 2003 5×11×47 Posts Any hope for integer LL testing? For instance http://www.craig-wood.com/nick/armprime/math.html I suppose graphics cards tend to be floating-point oriented though...
 2003-12-23, 18:36 #4 lunna   Mar 2003 24 Posts When I asked the question I had graphics cards especially in mind, so I will write from that perspective here. An input vector length of 2^22 would alone take 16MB of memory. My understanding of FFTs isn't exactly excellent, but I assume we only need one of those in a read-only capacity, right? Targeting a 128MB card, and allowing for the memory actually used by the display and for the number being tested, would probably still allow a 64MB FFT vector, in other words an input vector length of 2^24. I couldn't figure out what the limit would be for the numbers that could be tested, though. Is there any other algorithm that could be used? Maybe slower versions that don't need such large input vectors? Slicing up the number or the FFT and performing multiple passes per iteration? Or using two floats as an atomic unit in order to increase the precision?

GP2: Doing integer LL testing might actually be a possibility for the Wildcat VP cards from 3Dlabs. Glancing at the specs for the VP560 card, I noticed there are 32-bit floating-point and integer units. Unless they mean 32 bits per pixel, i.e. 8 bits each for RGBA, that card could possibly do the trick. But I haven't checked how to program it. Doing integer work on the common NVIDIA or ATI cards is a no-go, I think; from what I understand they translate all integers into the floating-point realm.
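The memory figures in this post check out under the assumption of 4 bytes per single-precision element and one resident FFT vector; a quick sanity check:

```python
BYTES_PER_SP = 4                          # one single-precision float

def vector_mbytes(length):
    """Memory footprint, in MB, of an FFT input vector of the given
    length, assuming one single-precision value per element."""
    return length * BYTES_PER_SP // 2**20

mb_22 = vector_mbytes(1 << 22)            # 2^22 elements -> 16 MB

# With roughly 64 MB of a 128 MB card left over for the vector:
max_len = (64 * 2**20) // BYTES_PER_SP    # -> 2^24 elements
```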
 2003-12-24, 06:49 #5 Maybeso     Aug 2002 Portland, OR USA 100010010_2 Posts There is another thread about a mixed floating-point/integer transform for faster LL testing. Perhaps one VP560, or two of the other cards, could be used for this. Last fiddled with by Maybeso on 2003-12-24 at 06:49
 2003-12-26, 09:52 #6 lunna   Mar 2003 24 Posts I have found a double-double and a quad-double library for the CPU. Basically it treats 2 or 4 doubles as a single unit, thereby increasing the precision: http://www.cs.berkeley.edu/~yozo/ Looking at the code and the paper, it appears the same could be done with single-precision float values. I will probably check whether a float-float version is doable, unless someone has objections or sees any flaws. It won't happen until next year, but that's soon enough. Would a quad-float version be of any value, for example by allowing a decreased FFT input vector size?
 2003-12-26, 17:16 #7 ewmayer ∂2ω=0     Sep 2002 República de California 10110111001110_2 Posts
Quote:
 Originally posted by lunna I have found a double-double and quad-double library for the CPU. Basically it treats 2 or 4 doubles as a single unit thereby increasing the precision. http://www.cs.berkeley.edu/~yozo/ When looking at the code and paper it looks like the same could be done with single-precision float values. I will probably check if a float-float version is doable unless someone has any objections or sees any flaws. It won't happen until next year, but that's soon enough Would a quad-float version be of any value? For example by allowing a decreased FFT input vector size?
The problem with this kind of extended-precision math is that the overhead needed to emulate the higher-precision operations more than wipes out any gains from a reduced input vector length. When I first started coding LL tests, about 7 years ago, I played with an FFT based on the quad-precision math library available for the (then) DEC Alpha. That allowed inputs of up to nearly 50 bits per word, but it absolutely crawled relative to the code that used strictly doubles, because doubles are fully supported in hardware, whereas quads are emulated.
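For the curious, the heart of any such float-float or double-double scheme is a pair of "error-free transforms". A minimal sketch in Python (whose floats are IEEE doubles, so this shows the double-double flavor; the algorithms are the standard two-sum and two-product of Knuth and Dekker, and the function names are mine):

```python
def two_sum(a, b):
    """Knuth's error-free addition: returns (s, e) with s = fl(a+b)
    and a + b == s + e exactly; a, b, s, e are all plain doubles."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def two_prod(a, b):
    """Dekker's error-free multiplication: returns (p, e) with
    p = fl(a*b) and a * b == p + e exactly."""
    SPLIT = 134217729.0              # 2^27 + 1, splits a 53-bit double
    p = a * b
    ta = SPLIT * a
    ah = ta - (ta - a)               # high 26-bit half of a
    al = a - ah                      # low half of a
    tb = SPLIT * b
    bh = tb - (tb - b)
    bl = b - bh
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e
```

Each extended-precision multiply thus costs on the order of a dozen hardware operations, plus the bookkeeping needed to combine the (hi, lo) pairs, which is exactly the emulation overhead described above.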

The best part of the mixed float-int transform is that it avoids precisely this emulation problem, i.e. it gives the same kind of precision boost as quad floats do, but without the expense of emulating operations the hardware doesn't support. The major potential bottleneck (at least as far as video-style SIMD processors are concerned) is that it needs good hardware support for integer multiply.

For instance, we could contemplate doing a single-precision floating FFT in parallel with a pure-integer transform modulo a 32-bit prime, say p = M31 = 2^31 - 1, which allows an efficient modulo. The parallel integer transform would allow us to use roughly log2(p)/2 ~= 31/2 = 15.5 more bits per input than if we were doing a strictly floating-point transform. As we've already seen, single-precision math would allow only roughly 4 bits per input word at a vector length of 2^20; adding a parallel transform mod M31 would boost this to ~19.5 bits per input, which is about as good as (in fact a tad better than) what we could get from a double-precision floating-point FFT.

A 4-way SIMD unit like the AltiVec or MMX would allow most 32-bit floating and integer operations to be done very quickly, but the sticking point is that we'll need to calculate lots of full-length products of 31-bit ints, which AFAIK neither AltiVec nor MMX can do efficiently. If we could do at least one such product per cycle in the core CPU we might still be OK, but only if we can continue to do 32-bit SIMD arithmetic in the graphics coprocessor at the same time. I'm no hardware expert, but I seem to recall that neither the PPC nor the x86 allows the core CPU and the AltiVec/MMX coprocessor to blast away simultaneously. There is also the problem of how efficiently data can be shuffled between the two functional units: a fast 64-bit integer multiply in the core CPU is useless to us if the result (after modding) can't quickly be transferred back to the 32-bit SIMD unit, where it is needed.
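The "efficient modulo" that makes M31 attractive follows from 2^31 ≡ 1 (mod M31): reduction needs only shifts, masks and adds, no division. A sketch, in Python for clarity (the helper names are mine):

```python
M31 = (1 << 31) - 1                  # the Mersenne prime 2^31 - 1

def mod_m31(x):
    """Reduce a non-negative integer of up to ~62 bits (e.g. the
    product of two residues) modulo 2^31 - 1 without dividing:
    since 2^31 = 1 (mod M31), the high bits fold onto the low bits."""
    x = (x & M31) + (x >> 31)        # first fold: at most 32 bits left
    x = (x & M31) + (x >> 31)        # second fold: result in [0, M31]
    return 0 if x == M31 else x

def mul_m31(a, b):
    """One modular multiply of the kind the integer transform needs."""
    return mod_m31(a * b)
```

A hardware version would use the same two fold steps on 32/64-bit registers; the expensive part, as noted above, is the full-length 31×31-bit product feeding it.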

I'm not saying it's impossible to get this kind of scheme to run fast, just that there are a *lot* of highly nontrivial coding and hardware-architecture issues to be solved along the way.

 2003-12-28, 00:14 #9 ColdFury     Aug 2002 140_16 Posts The precision problem is definitely the #1 hurdle. I'd be very curious whether there's a way to emulate a double using the stream programming model of a GPU.
 2003-12-28, 02:01 #10 aaronl     Aug 2003 2^4×3 Posts I think there are some platforms out there with a 128-bit 'long double' type (SPARC at least). I'm curious whether these would be particularly well suited for LL testing.
2003-12-28, 05:55 #11 ColdFury     Aug 2002 2^6×5 Posts

Quote:
 I think there are some platforms out there with a 128 bit 'long double' type (sparc at least).
AFAIK, the 128-bit float on SPARC is emulated: using it traps to a software routine that emulates the 128-bit operations.

