mersenneforum.org  

Old 2020-10-29, 01:36   #34
preda
 

Quote:
Originally Posted by Mark Rose
Thank you. While the thesis contains a wide overview, it does not go very deep into multi-precision FP (at least based on my cursory inspection).
Old 2020-10-29, 04:15   #35
Mark Rose
 

Quote:
Originally Posted by preda
Thank you. While the thesis contains a wide overview, it does not go very deep into multi-precision FP (at least based on my cursory inspection).
Agreed, but I was thinking section 5.1's techniques starting on page 104 for large-integer addition and subtraction may also be useful for floating point? He found performance benefits using a second kernel for carry propagation, and I wonder if there's an opportunity to do something similar here. Though you may well already be doing something similar.
Old 2020-10-29, 05:16   #36
preda
 

Quote:
Originally Posted by Mark Rose
Agreed, but I was thinking section 5.1's techniques starting on page 104 for large-integer addition and subtraction may also be useful for floating point? He found performance benefits using a second kernel for carry propagation, and I wonder if there's an opportunity to do something similar here. Though you may well already be doing something similar.
Yes, I think we're state of the art as far as carry propagation is concerned :)
Old 2021-02-12, 08:42   #37
preda
 
SP plan

I've been thinking some more about a practical SP FFT implementation on GPUs, and here are some problems/ideas:

1. FFT twiddles, i.e. the trigonometric constants (sin+cos) used in the FFT.

These need to be computed to full precision, meaning to the same number of bits as the word-size. We can't use twiddles that have significantly less bits. So if the word-size is 2xSP, we need to compute sin/cos to 2xSP precision, if the word is 3xSP we need sin/cos to 3xSP precision and so on.

Now, I don't readily know how to compute trigs to 2xSP precision using only SP operations (i.e. SP fma() with SP coefficients). I could compute DP trigs and truncate to 2xSP, but that requires DP fma() with DP coefficients, which defeats the point.

A practical solution I've found is this: compute 1xSP trigs (this is very fast), and extend them to the required precision using a precomputed table lookup. For example, for 2xSP words, we compute 1xSP trigs and combine them with a 1xSP table lookup. This simple trick shaves 1xSP off the required table-entry size.

Nevertheless, the main cost of this technique is the additional memory access (memory already being the bottleneck). So "SP + lookup" twiddles increase the memory pressure, but that's still the only solution for twiddles that I have at this point.


2. Compact memory representation

Memory access is already the bottleneck, we're adding memory pressure with the twiddle lookup, and SP values are less dense than DP values (SP density == 24/32 == 0.75, DP density == 53/64 ≈ 0.83).

So I'm thinking of using a compressed memory representation which stores 3xSP in 1xSP + 48 bits (this allows storing two 3xSP values in five 32-bit words). The idea is that in a 3xSP value we only need the sign and the exponent once (from the leading SP); from the follow-up SPs we only need to store the mantissa bits. I think it is possible to compress/decompress 3xSP to this representation reasonably efficiently on a GPU, but I haven't validated this yet. Another thing I don't particularly like is the stride of five 32-bit words in the compressed representation.

With such a setup, one FFT word (3xSP) would have about 70 bits of precision stored in 80 bits (0.875 density), a slightly better density than 1xDP.

What worries me is the high cost and complexity of 3xSP arithmetic, particularly the ADD (which is frequent in FFTs).
Old 2021-02-12, 19:45   #38
preda
 

Funnily enough, I'm struggling even with implementing a 3xSP ADD :).

Here is a good paper with the solution for quad-double:
https://web.mit.edu/tabbott/Public/q...ld/docs/qd.pdf

I'm trying to adapt Figure 6 there to tri-word, and this is what I came up with:
a[k], b[k] input, c[k] output

c0,e0 = twoSum(a0, b0)

d1,e11 = twoSum(a1, b1)
c1,e12 = twoSum(d1, e0)

d2,e21 = twoSum(a2, b2)
f2,e22 = twoSum(d2, e11)
c2,e23 = twoSum(f2, e12)

c3 = e21 + e22 + e23

followed by: result(3 words) = renormalize(c0,c1,c2,c3)

Does this sound reasonable?
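For what it's worth, the cascade above seems to check out numerically; here is a host-side C sketch (add3 is a hypothetical name; twoSum is the standard Knuth 6-ADD error-free transformation), with a[] and b[] the input triples and c[] the four words before renormalization:

```c
#include <math.h>

// Knuth's branch-free twoSum: s + *err == a + b exactly (6 ADDs).
static float twoSum(float a, float b, float *err) {
  float s = a + b;
  float t = s - a;
  *err = (a - (s - t)) + (b - t);
  return s;
}

// The twoSum cascade from the post: every step is an exact
// transformation except the final three-way add into c[3].
static void add3(const float a[3], const float b[3], float c[4]) {
  float e0, e11, e12, e21, e22, e23, d1, d2, f2;

  c[0] = twoSum(a[0], b[0], &e0);

  d1   = twoSum(a[1], b[1], &e11);
  c[1] = twoSum(d1, e0, &e12);

  d2   = twoSum(a[2], b[2], &e21);
  f2   = twoSum(d2, e11, &e22);
  c[2] = twoSum(f2, e12, &e23);

  c[3] = e21 + e22 + e23;  // the only rounded step
}
```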

Last fiddled with by preda on 2021-02-12 at 20:44
Old 2021-02-12, 21:02   #39
preda
 

Quote:
Originally Posted by preda
Funnily enough, I'm struggling even with implementing a 3xSP ADD :).

Here is a good paper with the solution for quad-double:
https://web.mit.edu/tabbott/Public/q...ld/docs/qd.pdf
Figure 10 seems to indicate:

c0,e0 = twoSum(a0, b0)

d1,e11 = twoSum(a1, b1)
c1,e12 = twoSum(d1, e0)

c2 = a2 + b2 + e11 + e12

which looks pretty good (i.e. simpler than I was expecting)
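In the same host-side C sketch style (add3_fast is a hypothetical name; twoSum is the standard 6-ADD error-free transformation), the Figure-10 variant trades the last three exact steps for plain adds:

```c
#include <math.h>

// Knuth's branch-free twoSum: s + *err == a + b exactly.
static float twoSum(float a, float b, float *err) {
  float s = a + b;
  float t = s - a;
  *err = (a - (s - t)) + (b - t);
  return s;
}

// Figure-10-style tri-word add: the low word simply accumulates
// a2 + b2 and the two rounding errors from above; cheaper than the
// full cascade, at the price of a little accuracy in the last word.
static void add3_fast(const float a[3], const float b[3], float c[3]) {
  float e0, e11, e12, d1;

  c[0] = twoSum(a[0], b[0], &e0);

  d1   = twoSum(a[1], b[1], &e11);
  c[1] = twoSum(d1, e0, &e12);

  c[2] = ((a[2] + b[2]) + e11) + e12;
}
```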
Old 2021-02-20, 08:48   #40
preda
 
3xSP sum()

Unfortunately, the sum() I have so far is a beast: 54 ADDs.
Quote:
// 16 ADD.
float3 renormalize(float a, float b, float c, float d) {
  c = quickTwoSum(c, d, &d);
  b = quickTwoSum(b, c, &c);
  a = quickTwoSum(a, b, &b);

  c = quickTwoSum(c, d, &d);
  b = quickTwoSum(b, c, &c);

  return (float3) (a, b, c + d);
}

// 54 ADD.
OVERLOAD float3 sum(float3 u, float3 v) {
  float a, b, c, d, e, f;
  a = twoSum(u.x, v.x, &e);

  b = twoSum(u.y, v.y, &f);
  b = twoSum(b, e, &e);

  c = twoSum(u.z, v.z, &d);
  c = twoSum(c, f, &f);
  c = twoSum(c, e, &e);

  return renormalize(a, b, c, d + (f + e));
}
This seems a rather expensive sum()...

To see some corner cases that sum() must handle, here is one example: given x, we'd like to compute the fractional part of x as x - round(x), i.e. frac(x) = sum(x, -round(x)).

Last fiddled with by preda on 2021-02-20 at 08:55