#12
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts
Yeah, about 5-15% in my experience, especially with dual channel systems. The faster the RAM clock, the more likely dual rank will help saturate the RAM channel bandwidth, but I've seen a big improvement even with 2 core Haswell running DDR3-1333.
I remember mackerel saying there is less benefit on quad channel systems, but I think with high core count Intel chips, it is still a good idea. On Threadripper, you still want to saturate the RAM channels when needed to avoid CPU stalls, so dual rank is still a good idea. Once Prime95 workers use NUMA aware memory allocation it will matter more. Quote:
When Prime95 fetches from memory, it wants it hard and fast.
#13
Sep 2016
2²·5·19 Posts
Quote:
32-bit NTT is very fast - I believe faster than DD FFT. But 32/31-bit primes don't have the necessary modular roots of two. 64-bit NTT is about 2x slower, due to the lack of an efficient way to get the upper 64 bits of a 64 x 64 multiply. (If it weren't for the memory bottleneck on the 32-bit NTT, it would be more like 3-4x slower.) My best implementation of this needs something like 10-15 instructions to get that upper half. Also, on my first (and only) pass so far, I wasn't able to find enough suitable primes with both roots-of-unity and roots-of-two of sufficient depth.

Ernst's DP + NTT: this one is tricky. If we use M61 as the modulus and work over the Gaussian space, we lose the ability to fit into the double-precision multiplier hardware, so it hits the same problem as the 64-bit NTTs. Dropping down to 50-52-bit primes won't work in the real-number space, since I had enough trouble finding such primes going up to 2^64. And my math is not good enough in this area to search for Gaussian primes < 2^50 that would work - we'll need Ernst's input here. Assuming we do find a suitable prime, I haven't yet worked out the computational cost of doing such an NTT. AVX512-IFMA will help here, but that's not on Skylake X.

Last fiddled with by Mysticial on 2017-10-04 at 19:24
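To make the upper-half problem concrete, here is a minimal portable-C sketch of the decomposition a SIMD implementation is forced into when there is no 64x64->high-64 vector multiply (function and variable names are my own illustration, not anyone's actual code):

```c
#include <stdint.h>

// High 64 bits of an unsigned 64x64-bit product, built only from
// 32x32->64 multiplies -- the primitive that SIMD lacks a cheap form of.
uint64_t umulhi64(uint64_t a, uint64_t b) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;

    // Propagate the carries out of the middle 32-bit columns.
    uint64_t mid  = hi_lo + (lo_lo >> 32);
    uint64_t mid2 = lo_hi + (uint32_t)mid;

    return hi_hi + (mid >> 32) + (mid2 >> 32);
}
```

On x86-64 scalar code this is a single `mul`/`mulx`, but spelled out in 32-bit pieces like this it is easy to see where the 10-15 vector instructions go.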
#14
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
@Above: Unclear to me precisely how memory-bandwidth limitations factor into this, but I get ~1.6x speedup for my AVX-512 implementation over the AVX2 one, running on the same hardware. On Intel CPUs that is roughly the performance differential between George's code and mine, so the result is that my AVX-512 code is competitive with George's AVX2 code on Intel systems supporting both. (But note comparative timings exist only for a small sample of CPUs to date - KNL and Google Cloud's Skylake Xeon instances.)
Interestingly, even though I did all my AVX2 development on Intel systems (mainly my Haswell quad), on AMD's latest chips (which support AVX2 but not AVX-512), George's and my AVX2 codes run very close in terms of maximum achievable total system throughput. I will surely revisit these interesting performance issues late this year and early next, but currently my coding bandwidth is saturated with my ongoing ARM Neon 128-bit SIMD assembler port.
#15
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Quote:
#16
Sep 2016
17C₁₆ Posts
Quote:
So the natural question is: are your LL FFT and George's "memory optimal"? As in, do they achieve (or come very close to) the theoretical minimum # of bytes pulled through memory? Admittedly, I have no experience implementing real -> complex float FFTs at this level of optimization (let alone an LL iteration), so I might be spitting out garbage. But the vision that I currently have of a "memory optimal" LL is this:
I have no idea how accurate this is. Some of these hold true for complex->complex FFTs and real-integer NTTs. But I honestly have no idea how the real->complex shift affects things. (especially in terms of memory layout and the "in-placeness" - something I've never personally solved) Quote:
Though I'd expect around a 2x penalty going to complex due to the multiplies. Last fiddled with by Mysticial on 2017-10-05 at 00:15
#17
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts
Quote:
I estimate the real/complex wrapper step costs me ~10% overhead versus George's real-vector-optimized implementation.
#18
"Kieren"
Jul 2011
In My Own Galaxy!
2·3·1,693 Posts
Quote:
#19
Sep 2016
2²·5·19 Posts
Quote:
Code:
Word Size = 64 + 32 = 96 bits = 12 bytes
Usable Bits / word = 53 + 31 = 84 bits

For transform length 2^24:

Code:
32 * 2 + 24/2 = 76 bit coefficients (8 bits of safety)

So summarizing the memory efficiencies:
Computationally, the best I can come up with is:
Since word-size is half for the M31 NTT, the amount of work done is doubled. Once we combine that with the cost of the radix 4 DP FFT we get: 170 / 2 + 24 = 109 AVX512 uops. So the relative "computation/useful work" comparison is now:
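The coefficient-size bookkeeping above is easy to mechanize. This trivial helper just re-derives the numbers quoted earlier in the post; the 2*b + lgN/2 formula (balanced digits saving half of lg N bits) is taken directly from the arithmetic above, not from any reference implementation:

```c
/* Bits in one output coefficient of a length-2^lgN convolution of
   b-bit balanced digits, per the accounting used in this post:
   2*b for the products, plus lgN/2 for the accumulation. */
static int coeff_bits(int b, int lgN) {
    return 2 * b + lgN / 2;
}

/* Safety margin against the usable bits per word. */
static int safety_bits(int usable, int b, int lgN) {
    return usable - coeff_bits(b, lgN);
}
```

Plugging in the post's own parameters (b = 32, lgN = 24, 84 usable bits) reproduces the quoted 76-bit coefficients with 8 bits of safety.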
-------- This is the (untested) pseudo-code that I used to arrive at 52 uops per radix 2 butterfly. It may be possible to do better. Or it may be worse if I got something wrong. Code:
// a + b*w
// a - b*w
// p = 2^31 - 1
// Input Domain: r0, i0, r1, i1 ∈ [0, 2p)
// Output Domain: R0, I0, R1, I1 ∈ [0, 2p)
// Precomputed Twiddle Factor Values
// wr, wi ∈ [0, p)
// reverse(wr)
// reverse(wi)
// wr' = floor(wr 2^32 / p)
// wi' = floor(wi 2^32 / p)
// Inputs:
a = {r0 + i0*i};
b = {r1 + i1*i};
w = {wr + wi*i};
// ---- Begin Computation ----
r0 = reduce_positive(r0);
i0 = reduce_positive(i0);
tr = mul(r1, wr, wr') - mul(i1, wi, wi');
ti = mul(r1, wi, wi') + mul(i1, wr, wr');
tr = reduce_negative(tr);
ti = reduce_positive(ti);
R0 = r0 + tr;
I0 = i0 + ti;
R1 = r0 - tr + p;
I1 = i0 - ti + p;
// ---- End Computation ----
// Outputs:
a = {R0 + I0*i}
b = {R1 + I1*i}
Code:
Operator "+" = vpaddd (1 uop)
Operator "-" = vpsubd (1 uop)
reduce_positive(x){
// 2 uops
__mmask16 m = _mm512_cmpge_epu32_mask(x, p);
return _mm512_mask_sub_epi32(x, m, x, p);
}
reduce_negative(x){
// 2 uops
__mmask16 m = _mm512_cmplt_epi32_mask(x, _mm512_setzero_si512());
return _mm512_mask_add_epi32(x, m, x, p);
}
reverse(x){
// 1 uop
return _mm512_shuffle_epi32(x, _MM_PERM_CDAB);
}
mul(x, w, w'){
// x ∈ [0, 2^32)
// Shoup-style: q ≈ floor(x*w / p) comes from the high 32 bits of x*w'
// (w' = floor(w * 2^32 / p)), then the result is x*w - q*p (mod 2^32)
// 9 uops
q0 = vpmuludq(x, w');
q1 = vpmuludq(reverse(x), reverse(w'));
q = _mm512_mask_shuffle_epi32(q1, 0x5555, q0, _MM_PERM_CDAB);
return vpmulld(x, w) - vpmulld(q, p);
}
Last fiddled with by Mysticial on 2017-10-06 at 04:40
#20
"Mihai Preda"
Apr 2015
2²·3·11² Posts
On the GPU side, the cheap GPUs (gaming/consumer) have severely penalized double performance (relative to float). I wonder if the techniques discussed here could be used to some benefit by using a pair-of-floats instead of a double in a classical FFT setup.
#21
"Mihai Preda"
Apr 2015
2²×3×11² Posts
I have this question about the NTT performance:
For Mersenne multiplication, would the NTT size need to be twice the size of the corresponding floating-point FFT, because the NTT can't make use of the "irrational base" trick (which is based on a pre-weighting with a floating-point weight)? If there is no equivalent of the irrational-base weighting for the NTT to halve the transform size, then I think it's hard for the NTT to compete. (I'm interested in NTT because it would avoid the double-precision penalty of the GPUs.)
#22
P90 years forever!
Aug 2002
Yeehaw, FL
2057₁₆ Posts
Quote: