#34
"Mihai Preda"
Apr 2015
2533₈ Posts
#35
"/X\(‘-‘)/X\"
Jan 2013
2²·733 Posts
Agreed, but I was thinking section 5.1's techniques, starting on page 104, for large integer addition and subtraction may also be useful for floating point? He found performance benefits using a second kernel for carry propagation, and I wonder if there's an opportunity to do something similar here. Though you may well already be doing something similar.
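For reference, the two-kernel split can be sketched as follows (this is my own illustration of the general idea, not code from the thesis referenced above): kernel 1 adds limbs independently and records carry-outs; kernel 2 folds the carries in as a second pass. Plain loops stand in for the GPU kernels here.

```c
#include <stdint.h>

#define NLIMBS 4

// Kernel 1: limb-wise add, no propagation; carry[i] is the carry OUT of limb i.
void add_kernel(const uint32_t *a, const uint32_t *b, uint32_t *sum, uint32_t *carry) {
    for (int i = 0; i < NLIMBS; i++) {
        uint64_t s = (uint64_t)a[i] + b[i];
        sum[i] = (uint32_t)s;
        carry[i] = (uint32_t)(s >> 32);
    }
}

// Kernel 2: fold each carry into the next limb. A carry can ripple further;
// real implementations bound the ripple (or iterate), since long chains are rare.
void carry_kernel(uint32_t *sum, const uint32_t *carry) {
    for (int i = 0; i + 1 < NLIMBS; i++) {
        uint32_t c = carry[i];
        for (int j = i + 1; j < NLIMBS && c; j++) {
            uint64_t s = (uint64_t)sum[j] + c;
            sum[j] = (uint32_t)s;
            c = (uint32_t)(s >> 32);
        }
    }
}
```

The appeal on a GPU is that kernel 1 is perfectly parallel per limb, and kernel 2 touches only the (usually short) carry chains.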
#36
"Mihai Preda"
Apr 2015
3×457 Posts
#37
"Mihai Preda"
Apr 2015
3·457 Posts
I've been thinking some more about a practical SP FFT implementation on GPUs, and here are some problems/ideas:
1. FFT twiddles, i.e. the trigonometric constants (sin+cos) used in the FFT. These need to be computed to full precision, meaning to the same number of bits as the word-size; we can't use twiddles that have significantly fewer bits. So if the word-size is 2xSP, we need to compute sin/cos to 2xSP precision; if the word is 3xSP, we need sin/cos to 3xSP precision, and so on.

Now, I don't readily know how to compute trigs to 2xSP precision in SP (i.e. using SP fma() and SP coefficients). I could compute DP trigs and truncate to 2xSP, but that requires DP fma() with DP coefficients, which defeats the point. A practical solution I've found is this: compute 1xSP trigs (this is very fast), and extend that to the required precision using a precomputed table lookup. For example, for 2xSP words, we compute 1xSP trigs and combine that with a table lookup of 1xSP. This simple trick saves 1xSP from the required table-lookup size. Nevertheless, the main cost of this technique is the additional memory access (memory already being the bottleneck). So "SP+lookup" twiddles increase the memory pressure, but that's still the only solution for twiddles that I have at this point.

2. Compact memory representation. Memory access is already the bottleneck, we're adding memory pressure with the twiddles lookup, and SP values are less dense than DP values (SP density == 24/32 == 0.75, DP density == 53/64 == 0.83). So I'm thinking of using a compressed memory representation which stores 3xSP in 1xSP+48bits (this allows storing 2x(3xSP) values in 5x32b words). The idea is that in a 3xSP value we only need the sign and the exponent once (from the leading SP), and from the follow-up SPs we only need to store the mantissa bits. I think it is possible to compress/decompress 3xSP to this representation reasonably efficiently on GPU, but I haven't validated this yet. Another thing I don't particularly like is the stride of *5* 32-bit words of the compressed representation.

With such a setup, one FFT word (3xSP) would have about 70 bits of precision stored in 80 bits (0.875 density), a bit better density than 1xDP. What worries me is the high cost & complexity of 3xSP arithmetic, particularly the ADD (which is frequent in FFTs).
#38
"Mihai Preda"
Apr 2015
10101011011₂ Posts
Funnily, I'm struggling even with implementing a 3xSP ADD :).
Here is a good paper with the solution for quad-double: https://web.mit.edu/tabbott/Public/q...ld/docs/qd.pdf

I'm trying to adapt Figure 6 there to tri-word, and this is what I came up with (a[k], b[k] input, c[k] output):

c0,e0 = twoSum(a0, b0)
d1,e11 = twoSum(a1, b1)
c1,e12 = twoSum(d1, e0)
d2,e21 = twoSum(a2, b2)
f2,e22 = twoSum(d2, e11)
c2,e23 = twoSum(f2, e12)
c3 = e21 + e22 + e23

followed by: result(3 words) = renormalize(c0, c1, c2, c3)

Does this sound reasonable?

Last fiddled with by preda on 2021-02-12 at 20:44
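In C, the building block and the sequence above look like this (a sketch: renormalize() is omitted, so the output is the unnormalized 4-word expansion):

```c
// Knuth's TwoSum in SP: s + e == a + b exactly, with s = fl(a + b).
// 6 ADDs, no branches.
static void twoSum(float a, float b, float *s, float *e) {
    *s = a + b;
    float bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

// The 7-twoSum sequence from the post, producing an unnormalized 4-word
// result; renormalize() (omitted) would compress it back to 3 words.
static void triAdd(const float a[3], const float b[3], float c[4]) {
    float d1, d2, f2, e0, e11, e12, e21, e22, e23;
    twoSum(a[0], b[0], &c[0], &e0);
    twoSum(a[1], b[1], &d1, &e11);
    twoSum(d1, e0, &c[1], &e12);
    twoSum(a[2], b[2], &d2, &e21);
    twoSum(d2, e11, &f2, &e22);
    twoSum(f2, e12, &c[2], &e23);
    c[3] = e21 + e22 + e23;  // the only line that can lose bits
}
```

Every twoSum is error-free, so the only rounding in the whole sequence happens in the final two plain additions.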
#39
"Mihai Preda"
Apr 2015
3·457 Posts
Quote:
c0,e0 = twoSum(a0, b0)
d1,e11 = twoSum(a1, b1)
c1,e12 = twoSum(d1, e0)
c2 = a2 + b2 + e11 + e12

which looks pretty good (i.e. simpler than I was expecting).
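The cheaper variant quoted above, as a self-contained sketch: 3 twoSums plus 3 plain ADDs, at the cost of folding all low-order errors into c2 with ordinary (rounding) additions.

```c
// Knuth's TwoSum: s + e == a + b exactly, s = fl(a + b).
static void twoSum(float a, float b, float *s, float *e) {
    *s = a + b;
    float bb = *s - a;
    *e = (a - (*s - bb)) + (b - bb);
}

// Cheap tri-word add: exact on the top two words, approximate on the last.
static void triAddCheap(const float a[3], const float b[3], float c[3]) {
    float d1, e0, e11, e12;
    twoSum(a[0], b[0], &c[0], &e0);
    twoSum(a[1], b[1], &d1, &e11);
    twoSum(d1, e0, &c[1], &e12);
    c[2] = a[2] + b[2] + e11 + e12;  // rounding errors accumulate only here
}
```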
#40
"Mihai Preda"
Apr 2015
1371₁₀ Posts
Unfortunately the sum() I have up to now is a beast: 54 ADDs.
To see some of the corner cases that sum() must handle, here is one example: given "x", we'd like to compute the fractional part of x using x - round(x), i.e. frac(x) = sum(x, -round(x)).

Last fiddled with by preda on 2021-02-20 at 08:55
#41
"Mihai Preda"
Apr 2015
3·457 Posts
Some interesting threads on the topic:
https://www.mersenneforum.org/showthread.php?t=19486
https://www.mersenneforum.org/showthread.php?t=22622
#42
"Mihai Preda"
Apr 2015
3·457 Posts
(sorry for the below being so trivial)
At the core of an FFT there are "small multiplications", of word-size or some small multiple of word-size, and I've been thinking a bit about their cost. Let's say the basic word-size is B=32 bits. (I chose 32 bits because it's still the basic unit on GPUs, while modern CPUs may move towards 64 bits.)

There are two fundamental multiplications: low-mul (produces output of the same bit-size as the inputs) and wide-mul (output of double the bit-size of the inputs). We can assume that the cost of a wide-mul is double the cost of a low-mul, and that doubling the bit-size quadruples these costs. Taking as cost-unit the cost of a low-mul at B=32 bits, we have:

Low(32)=1, Wide(32)=2
Low(64)=4, Wide(64)=8
Low(128)=16, Wide(128)=32

About the cost of a high-mul (which produces output of the same bit-size as the inputs, representing the high half of the wide-mul): computing a low-mul and a high-mul together is equivalent to a wide-mul. OTOH computing *just* the high-mul is a bit harder than computing the low-mul, because of the carries that transition from the low to the high half. Note that when it's implemented in hardware, the high-mul is usually as fast as the low-mul. On GPUs (AMD) we have high-mul(32) in hardware, so: High(32)=1.

We can compute an approximation of the high-mul by ignoring the carries coming in from the low half; let's call it HighNC (from "No Carry"). It then has the same cost as the Low of the same bit-size: HighNC(x)==Low(x).
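As an illustration of the cost model (function names are mine): the three 32-bit primitives, an exact 64-bit high-mul for reference, and HighNC(64) built from the 32-bit pieces. The only thing HighNC drops is the carry flowing in from the low 64 bits of the full product, so it can undershoot the exact high half by at most a couple of units.

```c
#include <stdint.h>

uint32_t low32 (uint32_t a, uint32_t b) { return a * b; }
uint64_t wide32(uint32_t a, uint32_t b) { return (uint64_t)a * b; }
uint32_t high32(uint32_t a, uint32_t b) { return (uint32_t)(((uint64_t)a * b) >> 32); }

// Exact high 64 bits of a 64x64 product, via 128-bit arithmetic
// (__uint128_t is a GCC/Clang extension), for reference only.
uint64_t high64(uint64_t a, uint64_t b) {
    return (uint64_t)(((__uint128_t)a * b) >> 64);
}

// HighNC(64): sum the high-half partial products, ignoring the carries
// that flow in from the low half of the full product.
uint64_t high64_nc(uint64_t a, uint64_t b) {
    uint32_t a0 = (uint32_t)a, a1 = (uint32_t)(a >> 32);
    uint32_t b0 = (uint32_t)b, b1 = (uint32_t)(b >> 32);
    return wide32(a1, b1) + high32(a1, b0) + high32(a0, b1);
}
```

Counting the cost units above: high64_nc is one Wide(32) plus two High(32), matching HighNC(64) == Low(64) == 4 in the model.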
#43
"Mihai Preda"
Apr 2015
10101011011₂ Posts
Lately I've been experimenting with some non-FP64 FFT transforms.
These are, briefly, some directions I've looked into for representing the values that the FFT operates on:
1. a set of 4 SP floating-point values representing a larger-precision MP floating value
2. a fixed-precision (integer) 128-bit value
3. an emulated ("software") floating point on 128 bits, with a 7-bit exponent
4. an NTT with a 64-bit modulus

The expected weak points of these were:
1. (4xSP): slow addition/subtraction, which are common operations in the FFT.
2. some bits are sacrificed to the fixed-precision "guard bits" in the FFT. Also seems difficult to get an efficient implementation.
3. a lot of bit-shifting involved in aligning the values for ADD/SUB.

At this point, I'm confident option 1 (4xSP) is a bad idea and I would not investigate it further. Options 2 and 3 might both work to some degree. Between the two, I tend to favor option 3, which involves emulating a huge 128-bit floating-point format with a tiny exponent (e.g. 7 bits), thus allocating a lot of bits to the mantissa (for increased precision). Options 1-3 would still use the usual complex-numbers FFT structure (with sin/cos twiddles), just with a different kind of value instead of FP64.

I went a bit deeper experimenting with option 4, which consists of an NTT modulo the prime P := 2^64 - 2^32 + 1. This particular prime has a few qualities:
a) it is very close to 2^64, thus using "maximally" the information represented on 64 bits (i.e. efficient memory use).
b) it has roots of both 1 and 2, of degree at least 2^26 (so it is appropriate for large IBDWT NTT).
c) it affords simpler modular reduction, due to the regular bit-structure involving 2^32.

In my experiments I focused initially on a 4M FFT. It seems that at this FFT size we can achieve a bit over 25 bits/word, which is pretty good (being also the expected value). Next I'll see if I can achieve competitive speed at this particular FFT size (ideally we'd like better performance than the existing FP64 implementation).
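Quality (c) can be made concrete with a sketch of the modular multiplication (my own illustration, not GpuOwl code): since 2^64 ≡ 2^32 - 1 and 2^96 ≡ -1 (mod P), the 128-bit product folds down with shifts and one small multiply, no generic division.

```c
#include <stdint.h>

static const uint64_t P = 0xFFFFFFFF00000001ULL;  // 2^64 - 2^32 + 1

// a*b mod P. Splitting the 128-bit product x = lo + mid*2^64 + top*2^96
// (mid, top each 32 bits) gives x ≡ lo + mid*(2^32 - 1) - top (mod P).
// __uint128_t is a GCC/Clang extension; GPU code would use mul-hi instead.
uint64_t mulmod(uint64_t a, uint64_t b) {
    __uint128_t x = (__uint128_t)a * b;
    uint64_t lo  = (uint64_t)x;
    uint64_t hi  = (uint64_t)(x >> 64);
    uint64_t mid = (uint32_t)hi;   // bits 64..95 of the product
    uint64_t top = hi >> 32;       // bits 96..127
    __uint128_t r = (__uint128_t)lo + mid * 0xFFFFFFFFULL;  // lo + mid*(2^32-1)
    if (r >= top) r -= top; else r += P - top;  // subtract top mod P
    if (r >= P) r -= P;            // r < 2P here, so one subtraction suffices
    return (uint64_t)r;
}
```

Note the intermediate r is at most (2^64-1) + (2^32-1)^2 < 2P, which is why a single conditional subtraction finishes the reduction.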
PS: I can go into more details about the implementation of the options mentioned, if anybody's interested.

Last fiddled with by preda on 2021-05-25 at 00:58
#44
P90 years forever!
Aug 2002
Yeehaw, FL
2×5³×71 Posts
When I looked at this many years ago, I played with ....
3a) Float implemented as a 128-bit mantissa and an implied exponent (or was it 96-bit mantissas with an implied exponent??). This can reduce the amount of bit shifting: adds and subtracts require no shifts, as all mantissas use the same implied exponent. Every FFT level (or maybe every few FFT levels), shift all the mantissas, changing the implied exponent.

IIRC, on a GTX 580 (or was it a GTX 660?), my testing indicated I'd get roughly the same performance as the cuFFT library using 64-bit floats.
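The implied-exponent scheme can be sketched as follows (my reconstruction of the idea, not the original code): every value is a signed 128-bit mantissa, and one global exponent applies to all of them, so add/sub are plain integer ops with no aligning shifts.

```c
#include <stdint.h>

typedef __int128 Mant;       // 128-bit mantissa (GCC/Clang extension)

static int implied_exp = 0;  // one exponent shared by every mantissa

// Addition needs no shift: both operands already use the same implied exponent.
Mant fxp_add(Mant a, Mant b) { return a + b; }

// The periodic rescale: shift every mantissa together and bump the shared
// exponent, making headroom for the growth of the next FFT level(s).
void rescale(Mant *v, int n, int bits) {
    for (int i = 0; i < n; i++) v[i] >>= bits;
    implied_exp += bits;
}
```

The trade-off is that values much smaller than the current scale lose low bits at each rescale, which is what the per-level (rather than per-value) exponent buys in shift-free adds.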