mersenneforum.org Floating point precision

 2003-12-22, 09:09 #1 lunna   Mar 2003 10₁₆ Posts Floating point precision Every time a potential new hardware platform appears that might be suitable for doing GIMPS work (or rather the LL test), one of the first things to be checked is the precision of its floating-point hardware, and the standard criterion is that a float has too little precision (32 bits) and that the size of a double (64 bits) is needed. But my question is: is that really so? When "The math" page (http://www.mersenne.org/math.htm) talks about convolution errors it suggests using a larger FFT size when the convolution error exceeds 0.5, but the size of a double stays the same. The benchmark page has two sections: machines using SSE2 instructions and machines without them. When using SSE2 it says there is less precision, so the exponent ranges are a little different. Shouldn't it therefore be possible to use hardware with less precision (for example a float) for the floating-point values, simply using a larger FFT size (twice as large?) from the beginning? If so, some hardware types should be re-evaluated for their usefulness for GIMPS work: graphics cards, ClearSpeed, etc.
 2003-12-22, 21:45 #2 ewmayer ∂2ω=0     Sep 2002 República de California 3·7·13·43 Posts The problem is that, due to roundoff errors and convolution output sizes, there is an effective precision threshold below which the required vector length grows far faster than linearly in the input number size. See the following related thread: http://www.mersenneforum.org/showthr...light=equation Now, the formula in the paper I mention in that thread can be applied equally well to single-precision floats as to doubles. IEEE single-precision floats have 24 mantissa bits (including the usual hidden bit), so equation 8 in the CMP paper becomes 2*log2(W) + log2(C) + (1/2)*[log2(N) + log2(log2(N))] + (3/2)*[log2(ln(ln(N)))] < 24. The asymptotic constant C is roughly 2 for doubles, and there's no reason to assume it will be appreciably different for singles. So taking C = 2 we see that for an FFT length of 2^20, the maximum wordsize is bounded by W ~= 4.4 bits, which allows one to test M(p) only as large as p ~= 4.5M. Compare that to a maximum wordsize of ~19 bits for doubles, which allows numbers with p up to ~20M to be tested. The other problem is that for every doubling of the input vector length N, we lose roughly an additional 0.25-0.3 bits per input word. So even if we quadruple the length of our SP input vector to 2^22, we're only allowed ~3.8-3.9 bits per input, meaning we can only test numbers with p up to roughly 16M. So even if we could make perfect use of a 4-way SP SIMD execution unit (as is available on the x86 MMX unit and the Apple PPC's AltiVec unit) and thus achieve 4x the SP floating throughput per cycle compared with DP floats (which seems dubious in light of the fact that we've more than doubled our bandwidth demands on main memory because of the very same nonlinear increase in vector length with reduced precision), we're still worse off, because even with all that extra work we can't test numbers as large.
(And to make matters worse, most such SIMD units stall the main processor when executing, i.e. we can't efficiently do both DP and SP operations in parallel.) If anyone wants to play with the above formula using, say, the Unix bc utility (invoke with bc -l to get the real-number version), here is the bc input I used for the N = 2^20 single-precision calculation. The first line, b, is the number of bits in the floating mantissa, the second is the input vector length to the FFT (these are the only required user inputs), the third is an intermediate step which computes log2(n), and the fourth line outputs the number of bits in the maximum average input wordsize:

b = 24
n = 2^20
x = l(n)/l(2)
((b-1) - ((x + l(x)/l(2))/2 + (1.5)*(l(l(l(n)))/l(2))))/2
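[The same calculation is easy to reproduce outside bc. Below is a sketch in Python of the bound as given above (the function name and structure are mine, not from the CMP paper); with C = 2 it reproduces the ~4.4-bit single-precision and ~19-bit double-precision figures quoted in the previous post:

```python
import math

def max_word_bits(mantissa_bits, n, c=2.0):
    """Maximum average bits per input word, log2(W), for an FFT of
    length n with the given mantissa width, solving the inequality
    2*log2(W) + log2(c) + (log2(n) + log2(log2(n)))/2
        + 1.5*log2(ln(ln(n))) < mantissa_bits for log2(W)."""
    log2 = lambda v: math.log(v, 2)
    x = log2(n)
    return (mantissa_bits - log2(c)
            - (x + log2(x)) / 2
            - 1.5 * log2(math.log(math.log(n)))) / 2

n = 2**20
print(round(max_word_bits(24, n), 1))  # single precision: 4.4
print(round(max_word_bits(53, n), 1))  # double precision: 18.9
```

Running it at n = 2^22 with 24 mantissa bits gives roughly 3.9 bits per word, matching the loss of ~0.25 bits per doubling noted above.]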
 2003-12-23, 01:48 #3 GP2     Sep 2003 13·199 Posts Any hope for integer LL testing? For instance http://www.craig-wood.com/nick/armprime/math.html I suppose graphics cards tend to be floating-point oriented though...
 2003-12-23, 18:36 #4 lunna   Mar 2003 10000₂ Posts When I asked the question I had graphics cards especially in mind, so I will write from that perspective here. An input vector length of 2^22 would by itself take 16MB of memory. My understanding of FFTs isn't exactly excellent, but I assume we only need one of those in a read-only capacity, right? Targeting a 128MB card, and allowing for the actual display memory and the memory for the number being tested, would probably still leave room for a 64MB FFT, in other words an input vector length of 2^24. I couldn't figure out what the limit on the numbers to be tested would be, though. Is there any other algorithm that can be used? Maybe slower versions that don't need as large input vectors? Slicing up the number or FFT and performing multiple passes of iterations? Or using two floats as an atomic unit in order to increase the precision? GP2: Doing integer LL testing might actually be a possibility for the WildcatVP cards from 3Dlabs. Glancing at the specs for the VP560 card, I noticed there are 32-bit floating-point and integer units. Unless they mean 32 bits per pixel, i.e. 8 bits each for RGBA, that card could possibly do the trick. But I haven't checked how to program that card. Doing integer work on the common NVIDIA or ATI cards is a no-go, I think. From what I understand they translate all integers into the floating-point realm.
 2003-12-24, 06:49 #5 Maybeso     Aug 2002 Portland, OR USA 422₈ Posts There is another thread about Mixed floating/integer transform for faster LL testing. Perhaps one VP560, or two of the other cards, could be used for this.
 2003-12-26, 09:52 #6 lunna   Mar 2003 10₁₆ Posts I have found a double-double and a quad-double library for the CPU. Basically they treat 2 or 4 doubles as a single unit, thereby increasing the precision. http://www.cs.berkeley.edu/~yozo/ Looking at the code and paper, it seems the same could be done with single-precision float values. I will probably check whether a float-float version is doable, unless someone has any objections or sees any flaws. It won't happen until next year, but that's soon enough. Would a quad-float version be of any value? For example by allowing a decreased FFT input vector size?
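[For reference, the core primitive of libraries like that one is an error-free addition: two native floats together hold a sum and the exact rounding error of that sum. A minimal sketch in Python, which uses 64-bit doubles, so this shows the double-double building block; the same algorithm would work for a float-float version in a language with 32-bit floats (function name mine):

```python
def two_sum(a, b):
    """Knuth's TwoSum: return (s, err) where s = fl(a + b) and
    s + err equals a + b exactly; err is the rounding error of the add."""
    s = a + b
    a_virtual = s - b          # what a "looks like" after the rounded add
    b_virtual = s - a_virtual  # what b "looks like" after the rounded add
    err = (a - a_virtual) + (b - b_virtual)
    return s, err

# 1e-17 is far below the precision of a double near 1.0, so the rounded
# sum loses it entirely -- but TwoSum captures it exactly in err.
s, err = two_sum(1.0, 1e-17)
print(s, err)  # 1.0 1e-17
```

Chaining such (hi, lo) pairs through every add and multiply is what makes double-double roughly as slow as ewmayer describes for emulated quads below: each high-precision operation costs many hardware operations.]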
 2003-12-26, 17:16 #7 ewmayer ∂2ω=0     Sep 2002 República de California 11739₁₀ Posts

Quote:
 Originally posted by lunna I have found a double-double and quad-double library for the CPU. Basically it treats 2 or 4 doubles as a single unit thereby increasing the precision. http://www.cs.berkeley.edu/~yozo/ When looking at the code and paper it looks like the same could be done with single-precision float values. I will probably check if a float-float version is doable unless someone has any objections or sees any flaws. It won't happen until next year, but that's soon enough Would a quad-float version be of any value? For example by allowing a decreased FFT input vector size?
The problem with this kind of extended-precision math is that the overhead needed to emulate the higher-precision ops more than wipes out any gains that come from a reduced input vector length. When I first started coding LL tests about 7 years ago I played with an FFT based on the quad-precision math library available for the (then) DEC Alpha. That allowed inputs up to nearly 50 bits per word, but absolutely crawled relative to the code that used strictly doubles, because doubles are fully supported in hardware, whereas quads are emulated.

The best part of the mixed float-int transform is that it avoids precisely this emulation problem, i.e. it gives the same kind of precision boost as quad floats do, but without the expense of emulating non-hardware-supported operations. The major potential bottleneck (at least as far as video-style SIMD processors are concerned) is that it needs good hardware support for integer multiply. For instance, we could contemplate doing a single-precision floating FFT in parallel with a pure-integer transform modulo a 32-bit prime, say p = M31 = 2^31 - 1, which allows an efficient modulo. The parallel integer transform would allow us to use roughly log2(p)/2 ~= 31/2 = 15.5 more bits per input than if we were doing a strictly floating-point transform. As we've already seen, single-precision math would allow only roughly 4 bits per input word at a vector length of 2^20. Adding a parallel transform mod M31 would boost this to ~19.5 bits per input, which is about as good as (in fact a tad better than) we could get by doing a double-precision floating-point FFT. A 4-way SIMD unit like the AltiVec or MMX would allow most 32-bit floating and integer operations to be done very quickly, but the sticking point is that we'll need to calculate lots of full-length products of 31-bit ints, which AFAIK neither AltiVec nor MMX can do efficiently. If we could do at least one such product per cycle in the core CPU we might still be OK, but only if we can continue to do 32-bit SIMD arithmetic in the graphics coprocessor at the same time. I'm no hardware expert, but I seem to recall that neither the PPC nor the x86 allows the core CPU and the AltiVec/MMX coprocessor to blast away simultaneously. There is also the problem of how efficiently data can be shuffled between the two functional units - a fast 64-bit integer multiply in the core CPU is useless to us if the result (after modding) can't quickly be transferred back to the 32-bit SIMD unit, where it is needed.
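[To illustrate why M31 gives an efficient modulo: since 2^31 ≡ 1 (mod M31), reduction needs no division at all, just shifts and adds that fold the high bits back into the low 31. A minimal sketch in Python (the function name is mine; a real transform would apply this to whole vectors of residues):

```python
M31 = (1 << 31) - 1  # the Mersenne prime 2^31 - 1

def mod_m31(x):
    """Reduce a nonnegative integer mod M31 using only shifts and adds,
    exploiting the identity 2^31 == 1 (mod 2^31 - 1)."""
    while x >> 31:
        x = (x & M31) + (x >> 31)  # fold high bits into the low 31 bits
    return 0 if x == M31 else x

# The full 62-bit product of two 31-bit residues reduces in two folds:
a, b = 123456789, 987654321
print(mod_m31(a * b) == (a * b) % M31)  # True
```

This is exactly the kind of operation that is cheap if the hardware has a fast full-width integer multiply, and expensive if, as noted above, the SIMD unit can't produce full-length products of 31-bit ints.]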

I'm not saying it's impossible to get this kind of scheme to run fast, just that there are a *lot* of highly nontrivial coding and hardware-architecture issues to be solved along the way.

 2003-12-27, 10:37 #8 lunna   Mar 2003 10₁₆ Posts People have different reasons for participating in GIMPS: some are in it for the mathematical value, some are stats junkies, some are in it for the chance of winning the prize money, some want to stress-test their system to ensure it is behaving correctly, some hate seeing good processing power go to waste by being idle, and there are certainly more reasons than these. Whatever the reason, there are still some unused areas in a computer system that could be put to good use, and what I would like to do is utilize a standard part of the system that isn't being used for finding the next largest known prime number. Some of the limits/constraints for doing calculations on the graphics card are:
* The GPU works with single-precision floats only
* The program (vertex shader or pixel shader) sent to the GPU may only take up a few instruction slots; how few is card-specific and dependent on what generation it is: a DirectX 9 part has more slots than a DirectX 8 part, for example
* The instruction set is limited and geared toward graphics use
* The graphics processor is parallel by nature, operating on up to eight pixels at a time
* The graphics processor has much higher bandwidth to its memory than an ordinary CPU has to system memory
* Uploading data to the graphics card is relatively slow; reading data back to the CPU is even slower
* There is a limited amount of memory on the card, with no chance of upgrading it without replacing the whole card
* If a screen mode change occurs, the buffers may be destroyed
* ... and so on ...
Instead of demanding that someone implement a method to do calculations on the graphics card, I am volunteering to do so myself. For it to be of any use I don't intend to disturb the Prime95 program too much; it is a tried and true way of working. The biggest chance of success, as I see it, is to do as much of the calculation as possible on the card itself, with the CPU only directing the card.
While doing the iterations on the card the CPU will be idle (or, more likely, running Prime95). If both the CPU and GPU needed access to the data, it would use up too much memory bandwidth. The showstopper has thus been that the GPU only has single-precision floats, but if I can work around that... Sure enough, emulated double precision is slower than if it had been implemented in hardware, but right now the GPU isn't used at all, so even the emulated way is faster than what we have currently. The first step is to make this work. When it does, it is time to think about how/if to incorporate it into Prime95. --- As I said previously, this is about the graphics hardware; some of it can be applied to other hardware too.
 2003-12-28, 00:14 #9 ColdFury     Aug 2002 2⁶×5 Posts The precision problem is definitely the #1 hurdle. I'd be very curious whether there's a way to emulate a double using the stream programming model of a GPU.
 2003-12-28, 02:01 #10 aaronl     Aug 2003 2⁴·3 Posts I think there are some platforms out there with a 128-bit 'long double' type (sparc at least). I'm curious whether these would be particularly well suited for LL testing.
 2003-12-28, 05:55 #11 ColdFury     Aug 2002 2⁶×5 Posts

Quote:
 I think there are some platforms out there with a 128 bit 'long double' type (sparc at least).
AFAIK, the 128-bit float on a sparc is emulated: using it traps to a software routine that carries out the 128-bit operation.
