2020-10-29, 01:36  #34  
"Mihai Preda"
Apr 2015
3×11×41 Posts 
2020-10-29, 04:15  #35 
"/X\(‘‘)/X\"
Jan 2013
2929_{10} Posts 
Agreed, but I was thinking section 5.1's techniques, starting on page 104, for large-integer addition and subtraction may also be useful for floating point. He found performance benefits from using a second kernel for carry propagation, and I wonder if there's an opportunity to do something similar here. Though you may well already be doing something like that.

2020-10-29, 05:16  #36  
"Mihai Preda"
Apr 2015
3×11×41 Posts 
2021-02-12, 08:42  #37 
"Mihai Preda"
Apr 2015
3×11×41 Posts 
SP plan
I've been thinking some more about a practical SP FFT implementation on GPUs, and here are some problems/ideas:
1. FFT twiddles, i.e. the trigonometric constants (sin/cos) used in the FFT. These need to be computed to full precision, meaning to the same number of bits as the word size; we can't use twiddles with significantly fewer bits. So if the word is 2xSP, we need sin/cos to 2xSP precision; if the word is 3xSP, we need sin/cos to 3xSP precision, and so on. Now, I don't readily know how to compute trigs to 2xSP precision using only SP (i.e. SP fma() with SP coefficients). I could compute DP trigs and truncate to 2xSP, but that requires DP fma() with DP coefficients, which defeats the point. A practical solution I've found is this: compute 1xSP trigs (this is very fast), and extend that to the required precision using a precomputed table lookup. For example, for 2xSP words, we compute 1xSP trigs and combine them with a 1xSP table lookup. This simple trick saves 1xSP off the required lookup-table size. Nevertheless, the main cost of this technique is the additional memory access (memory already being the bottleneck). So "SP + lookup" twiddles increase the memory pressure, but that's still the only solution for twiddles that I have at this point.

2. Compact memory representation. Memory access is already the bottleneck, we're adding memory pressure with the twiddle lookup, and SP values are less dense than DP values (SP density == 24/32 == 0.75, DP density == 53/64 == 0.83). So I'm thinking of using a compressed memory representation which stores 3xSP in 1xSP + 48 bits (this allows storing 2x(3xSP) values in 5x32-bit words). The idea is that in a 3xSP value we only need the sign and the exponent once (from the leading SP), and from the follow-up SPs we only need to store the mantissa bits. I think it is possible to compress/decompress 3xSP to this representation reasonably efficiently on a GPU, but I haven't validated this yet. Another thing I don't particularly like is the stride of *5* 32-bit words in the compressed representation. 
With such a setup, one FFT word (3xSP) would have about 70 bits of precision stored in 80 bits (0.875 density), a bit better density than 1xDP. What worries me is the high cost and complexity of 3xSP arithmetic, particularly the ADD (which is frequent in FFTs). 
2021-02-12, 19:45  #38 
"Mihai Preda"
Apr 2015
3·11·41 Posts 
Funnily, I'm struggling even with implementing a 3xSP ADD :).
Here is a good paper with the solution for quad-double: https://web.mit.edu/tabbott/Public/q...ld/docs/qd.pdf

I'm trying to adapt Figure 6 there to tri-word, and this is what I come up with (a[k], b[k] input, c[k] output):

c0, e0  = twoSum(a0, b0)
d1, e11 = twoSum(a1, b1)
c1, e12 = twoSum(d1, e0)
d2, e21 = twoSum(a2, b2)
f2, e22 = twoSum(d2, e11)
c2, e23 = twoSum(f2, e12)
c3 = e21 + e22 + e23

followed by:

result (3 words) = renormalize(c0, c1, c2, c3)

Does this sound reasonable?

Last fiddled with by preda on 2021-02-12 at 20:44 
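For reference, the twoSum used above is Knuth's error-free addition: s is the rounded sum and e the exact rounding error, so a + b == s + e holds exactly. A minimal SP version (textbook algorithm, not code from the thread):

```c
#include <assert.h>
#include <math.h>

/* Knuth twoSum: s = fl(a+b), e = the exact rounding error, so a + b == s + e.
   Branch-free (6 adds): works regardless of which operand is larger,
   assuming IEEE round-to-nearest. */
static void twoSum(float a, float b, float *s, float *e) {
    float sum = a + b;
    float bb  = sum - a;   /* the part of the sum that came from b */
    float aa  = sum - bb;  /* the part that came from a */
    *e = (a - aa) + (b - bb);
    *s = sum;
}
```

The cheaper quickTwoSum (3 adds) additionally requires |a| >= |b|, which is why it only shows up inside renormalize(), where the words are already ordered.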
2021-02-12, 21:02  #39  
"Mihai Preda"
Apr 2015
3·11·41 Posts 
Quote:
c0, e0  = twoSum(a0, b0)
d1, e11 = twoSum(a1, b1)
c1, e12 = twoSum(d1, e0)
c2 = a2 + b2 + e11 + e12

which looks pretty good (i.e. simpler than I was expecting) 
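Putting the pieces together, here is one possible C rendering of this simplified tri-float add (my own sketch, not GpuOwl code): renormalize is done with the usual quickTwoSum passes from the qd paper, and the whole thing assumes round-to-nearest and the usual magnitude ordering of the words.

```c
#include <assert.h>
#include <math.h>

typedef struct { float x0, x1, x2; } F3;  /* |x0| >= |x1| >= |x2|, non-overlapping */

/* Knuth twoSum: a + b == s + e exactly, no ordering requirement. */
static void twoSum(float a, float b, float *s, float *e) {
    float sum = a + b, bb = sum - a, aa = sum - bb;
    *e = (a - aa) + (b - bb);
    *s = sum;
}

/* quickTwoSum: 3 adds, requires |a| >= |b| (or a == 0). */
static void quickTwoSum(float a, float b, float *s, float *e) {
    float sum = a + b;
    *e = b - (sum - a);
    *s = sum;
}

/* The simplified 3xSP ADD: 3 twoSum + 3 plain adds, then renormalization
   back to three non-overlapping words via quickTwoSum passes. */
static F3 add3(F3 a, F3 b) {
    float c0, e0, d1, e11, c1, e12, c2;
    twoSum(a.x0, b.x0, &c0, &e0);
    twoSum(a.x1, b.x1, &d1, &e11);
    twoSum(d1, e0, &c1, &e12);
    c2 = a.x2 + b.x2 + e11 + e12;
    /* renormalize(c0, c1, c2) */
    quickTwoSum(c1, c2, &c1, &c2);
    quickTwoSum(c0, c1, &c0, &c1);
    quickTwoSum(c1, c2, &c1, &c2);
    return (F3){c0, c1, c2};
}

/* Host-side helpers for testing against double. */
static F3 f3FromDouble(double d) {
    float x0 = (float)d; d -= x0;
    float x1 = (float)d; d -= x1;
    return (F3){x0, x1, (float)d};
}
static double f3ToDouble(F3 a) { return (double)a.x0 + a.x1 + a.x2; }
```

A 3xSP word carries roughly 72 bits, more than a double, so comparing against DP only validates the top ~53 bits, but that's enough to catch structural bugs in the word shuffling.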

2021-02-20, 08:48  #40  
"Mihai Preda"
Apr 2015
549_{16} Posts 
3xSP sum()
Unfortunately the sum() I have up to now is a beast: 54 ADDs.
To see some corner cases that sum() must handle, here is one example: given x, we'd like to compute the fractional part of x as x - round(x), i.e. frac(x) = sum(x, -round(x)).

Last fiddled with by preda on 2021-02-20 at 08:55 
