mersenneforum.org  

Old 2017-10-04, 18:54   #12
Mark Rose
 
 
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/

2⁴×199 Posts

Quote:
Originally Posted by kladner View Post
IIRC, the boost from Dual Rank is around 15%.
Yeah, about 5-15% in my experience, especially with dual-channel systems. The faster the RAM clock, the more likely dual rank is to help saturate the RAM channel bandwidth, but I've seen a big improvement even with a 2-core Haswell running DDR3-1333.

I remember mackerel saying there is less benefit on quad-channel systems, but I think it is still a good idea with high-core-count Intel chips.

On Threadripper, you still want to saturate the RAM channels when needed to avoid CPU stalls, so dual rank is still a good idea. Once Prime95 workers use NUMA-aware memory allocation it will matter more.

Quote:
Subject to some limitations, ranks can be accessed independently, although not simultaneously as the data lines are still shared between ranks on a channel. For example, the controller can send write data to one rank while it awaits read data previously selected from another rank. While the write data is consumed from the data bus, the other rank could perform read-related operations such as the activation of a row or internal transfer of the data to the output drivers. Once the CA bus is free from noise from the previous read, the DRAM can drive out the read data. Controlling interleaved accesses like so is done by the memory controller.

There is a small performance reduction for multi-rank systems as they require some pipeline stalls between accessing different ranks. For two ranks on a single DIMM it might not even be required, but this parameter is often programmed independently of the rank location in the system (if on the same DIMM or different DIMMs). Nevertheless, this pipeline stall is negligible compared to the aforementioned effects.
https://en.wikipedia.org/wiki/Memory_rank


When Prime95 fetches from memory, it wants it hard and fast.
Old 2017-10-04, 19:12   #13
Mysticial
 
 
Sep 2016

2²·5·19 Posts

Quote:
Originally Posted by Prime95 View Post
Question: Is 3.73X computational difference the best we can do? How about Ernst's DP+INT? An NTT? I once experimented with a 128-bit fixed-point FP CUDA implementation -- we can make add fast, but muls are slower. It was awhile ago, so I don't know how well it would work in AVX-512.
I don't know actually. But running down the other possibilities doesn't leave much room.

32-bit NTT is very fast - I believe faster than DD FFT. But 32/31-bit primes don't have the necessary modular roots of two.

64-bit NTT is about 2x slower, due to the lack of an efficient way to get the upper 64 bits of a 64 x 64 multiply. (If it weren't for the memory bottleneck on the 32-bit NTT, it would be more like 3-4x slower.) My best implementation of this needs something like 10 - 15 instructions to get that upper half - see the sketch below. Also, on my first (and only) pass so far, I wasn't able to find enough suitable primes with both roots-of-unity and roots-of-two of sufficient depth.
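For concreteness, here is a minimal sketch - my reconstruction of the standard technique, not Mysticial's actual kernel - of synthesizing the upper half of a 64 x 64 multiply from AVX512's 32 x 32 -> 64 multiplies (vpmuludq). It lands squarely in that 10-15 instruction range:

Code:
#include <immintrin.h>

// High 64 bits of an unsigned 64x64 multiply, one per 64-bit lane.
// a*b = hh*2^64 + (lh + hl)*2^32 + ll; the carry out of bits [32,64)
// is recovered explicitly. Function name is mine (hypothetical).
static inline __m512i mulhi_epu64(__m512i a, __m512i b) {
    const __m512i lo32 = _mm512_set1_epi64(0xFFFFFFFF);
    __m512i a_hi = _mm512_srli_epi64(a, 32);
    __m512i b_hi = _mm512_srli_epi64(b, 32);

    __m512i ll = _mm512_mul_epu32(a,    b);      // a_lo * b_lo
    __m512i lh = _mm512_mul_epu32(a,    b_hi);   // a_lo * b_hi
    __m512i hl = _mm512_mul_epu32(a_hi, b);      // a_hi * b_lo
    __m512i hh = _mm512_mul_epu32(a_hi, b_hi);   // a_hi * b_hi

    // Carry out of the middle 32-bit column.
    __m512i cross = _mm512_add_epi64(_mm512_srli_epi64(ll, 32),
                                     _mm512_and_si512(lh, lo32));
    cross = _mm512_add_epi64(cross, _mm512_and_si512(hl, lo32));

    __m512i hi = _mm512_add_epi64(hh, _mm512_srli_epi64(lh, 32));
    hi = _mm512_add_epi64(hi, _mm512_srli_epi64(hl, 32));
    return _mm512_add_epi64(hi, _mm512_srli_epi64(cross, 32));
}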

Ernst's DP + NTT:

This one is tricky. If we use M61 as the modulus and go over the Gaussian space, we lose the ability to fit into double-precision multiplier hardware. So it will encounter the same problem as the 64-bit NTTs.

Dropping down to 50-52-bit primes won't work in the real-number space, since I had enough trouble finding such primes even going up to 2^64. But my math is not good enough in this area to search for Gaussian primes < 2^50 that would work.

We'll need Ernst's input here.

Assuming we do find a suitable prime, I haven't yet worked out the computational cost of doing such an NTT. AVX512-IFMA will help here. But that's not on Skylake X.

Last fiddled with by Mysticial on 2017-10-04 at 19:24
Old 2017-10-04, 22:54   #14
ewmayer
2ω=0
 
 
Sep 2002
República de California

2²·2,939 Posts

@Above: Unclear to me precisely how memory-bandwidth limitations factor into this, but I get ~1.6x speedup for my AVX-512 implementation over the AVX2 one, running on the same hardware. On Intel CPUs that is roughly the performance differential between George's and my code, so the result is that my AVX-512 code is competitive with George's AVX2 code on Intel systems supporting both. (But note comparative timings exist only for a small sample of CPUs to date - KNL and Google Cloud's Skylake Xeon instances.)

Interestingly, even though I did all my AVX2 development on Intel systems (mainly my Haswell quad), on AMD's latest chips (which support AVX2 but not AVX-512), George's and my AVX2 codes run very close in terms of maximum achievable total system throughput. I will surely revisit these interesting performance issues late this year and early next, but currently my coding bandwidth is saturated with my ongoing ARM NEON 128-bit SIMD assembler port.
Old 2017-10-04, 23:08   #15
ewmayer
2ω=0
 
 
Sep 2002
República de California

2²·2,939 Posts

Quote:
Originally Posted by Mysticial View Post
Ernst's DP + NTT:

This one is tricky. If we use M61 as the modulus and go over the Gaussian space, we lose the ability to fit into double-precision multiplier hardware. So it will encounter the same problem as the 64-bit NTTs.

Dropping down to 50-52-bit primes won't work in the real-number space, since I had enough trouble finding such primes even going up to 2^64. But my math is not good enough in this area to search for Gaussian primes < 2^50 that would work.

We'll need Ernst's input here.
My earlier-favored M61-based hybrid FFT/NTT stuff was based on the leading 64-bit architectures of the late 1990s, which all supported decently fast 64x64 -> 128-bit integer MUL, and when SSE2/AVX-style vector arithmetic was just a gleam in some CPU-roadmappers' eyes. Based on how vector units have evolved since, using M31 as the complex-arithmetic modulus makes much more sense - we have to do modular reductions more often (due to just one 'guard bit' at the high end for 31-bit data in a 32-bit slot versus 3 for 61-bit in a 64-bit slot) and lose a few small-odd-prime transform-length options, but the crucial integer-MUL bottleneck is far less severe for 32-bit inputs. So much interesting experimentation to do, so little time (sigh).
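To illustrate the guard-bit tradeoff, here is a sketch of the generic Mersenne-mod folding trick (illustrative, not Mlucas code): reduction mod p = 2^31 - 1 needs no division or high-half multiply, because 2^31 ≡ 1 (mod p) means x = hi·2^31 + lo ≡ hi + lo.

Code:
#include <stdint.h>

// Fully reduce a 64-bit value mod p = 2^31 - 1 with two folds and one
// conditional subtract: after the first fold x < 2^34, after the
// second x < p + 8.
static inline uint32_t reduce_m31(uint64_t x) {
    const uint64_t p = 0x7FFFFFFF;
    x = (x & p) + (x >> 31);
    x = (x & p) + (x >> 31);
    return (uint32_t)(x >= p ? x - p : x);
}

With only one guard bit, the sum of two 31-bit residues must be folded like this almost immediately; 61-bit residues in 64-bit slots can absorb a few additions before reducing.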
Old 2017-10-04, 23:47   #16
Mysticial
 
 
Sep 2016

17C₁₆ Posts

Quote:
Originally Posted by ewmayer View Post
@Above: Unclear to me precisely how memory-bandwidth limitations factor into this, but I get ~1.6x speedup for my AVX-512 implementation over the AVX2 one, running on the same hardware. On Intel CPUs that is roughly the performance differential between George's and my code, so the result is that my AVX-512 code is competitive with George's AVX2 code on Intel systems supporting both. (But note comparative timings exist only for a small sample of CPUs to date - KNL and Google Cloud's Skylake Xeon instances.)

Interestingly, even though I did all my AVX2 development on Intel systems (mainly my Haswell quad), on AMD's latest chips (which support AVX2 but not AVX-512), George's and my AVX2 codes run very close in terms of maximum achievable total system throughput. I will surely revisit these interesting performance issues late this year and early next, but currently my coding bandwidth is saturated with my ongoing ARM NEON 128-bit SIMD assembler port.
Based on your description of being able to match George's AVX2 code with AVX512, the first thing that popped to mind was that you're hitting the same memory bottleneck. But that would only make sense if both implementations have the same amount of memory access.

So the natural question is: are your LL FFT and George's "memory optimal"? As in, do they achieve (or come very close to) the theoretical minimum # of bytes pulled through memory?

Admittedly, I have no experience in implementing real -> complex float FFTs at this level of optimization (let alone an LL iteration). So I might be spitting out garbage. But the vision that I currently have of a "memory optimal" LL is this:
  • For an FFT size that doesn't fit in cache (but not ridiculously large), it can be done in two read/write passes over the dataset.
  • The size of the "dataset" in this case is the FFT-length x 8 bytes. (let's ignore twiddle factor tables for now)
  • Each FFT consists of two passes. But, the "end" of each forward FFT is fused with the "start" of the inverse FFT. And the "end" of the inverse FFT is fused with the "start" of the forward FFT of the next iteration. Therefore, there is effectively only one read/write pass per transform - two per squaring iteration.

I have no idea how accurate this is. Some of this holds true for complex->complex FFTs and real-integer NTTs. But I honestly have no idea how the real->complex shift affects things (especially in terms of memory layout and the "in-placeness" - something I've never personally solved).
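To make that schema concrete, here is a skeleton of the fused access pattern - every function name here is hypothetical, and this shows only the memory-touch structure, not anyone's actual code:

Code:
#include <stddef.h>

// Hypothetical kernels standing in for the real FFT passes.
void inv_fft_pass2(double *blk, size_t len);
void carry_propagate(double *blk, size_t len);
void fwd_fft_pass1(double *blk, size_t len);
void fwd_fft_pass2(double *x, size_t n, size_t b, size_t nblocks);
void pointwise_square(double *x, size_t n, size_t b, size_t nblocks);
void inv_fft_pass1(double *x, size_t n, size_t b, size_t nblocks);

// One LL squaring iteration: two read/write passes over the dataset,
// because pass boundaries are fused across the inverse FFT, carry
// step, forward FFT, and dyadic squaring.
void ll_iteration(double *x, size_t n, size_t nblocks) {
    size_t bs = n / nblocks;
    for (size_t b = 0; b < nblocks; b++) {
        double *blk = &x[b * bs];
        inv_fft_pass2(blk, bs);    // tail of previous iteration's iFFT
        carry_propagate(blk, bs);  // release carries while in cache
        fwd_fft_pass1(blk, bs);    // head of this iteration's FFT
    }
    for (size_t b = 0; b < nblocks; b++) {
        fwd_fft_pass2(x, n, b, nblocks);     // finish forward FFT
        pointwise_square(x, n, b, nblocks);  // dyadic squaring
        inv_fft_pass1(x, n, b, nblocks);     // start inverse FFT
    }
}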

Quote:
Originally Posted by ewmayer View Post
My earlier-favored M61-based hybrid FFT/NTT stuff was based on the leading 64-bit architectures of the late 1990s, which all supported decently fast 64x64 -> 128-bit integer MUL, and when SSE2/AVX-style vector arithmetic was just a gleam in some CPU-roadmappers' eyes. Based on how vector units have evolved since, using M31 as the complex-arithmetic modulus makes much more sense - we have to do modular reductions more often (due to just one 'guard bit' at the high end for 31-bit data in a 32-bit slot versus 3 for 61-bit in a 64-bit slot) and lose a few small-odd-prime transform-length options, but the crucial integer-MUL bottleneck is far less severe for 32-bit inputs. So much interesting experimentation to do, so little time (sigh).
A 31-bit prime works perfectly with Victor Shoup's NTT algorithm. I have an implementation of this (albeit over the real numbers instead of Gaussian). And it's memory-bound on my 7900X with AVX512 - though the implementation isn't memory-optimal.

Though I'd expect around a 2x penalty going to complex due to the multiplies.
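For reference, a scalar sketch of Shoup's modular multiplication (variable names mine): precompute w' = floor(w * 2^32 / p) once per twiddle factor, after which each multiply mod p costs two multiplies and a subtract - no division - with the result in [0, 2p):

Code:
#include <stdint.h>

static const uint32_t P = 0x7FFFFFFF;   // p = 2^31 - 1

// One-time precomputation per twiddle factor w in [0, p).
uint32_t shoup_precompute(uint32_t w) {
    return (uint32_t)(((uint64_t)w << 32) / P);
}

// Returns a value congruent to x*w (mod p) in [0, 2p), for any 32-bit x.
// The wrap mod 2^32 is harmless since the true result fits: 2p < 2^32.
uint32_t shoup_mulmod(uint32_t x, uint32_t w, uint32_t wp) {
    uint32_t q = (uint32_t)(((uint64_t)x * wp) >> 32);  // quotient estimate
    return x * w - q * P;                               // computed mod 2^32
}

This is exactly the shape of the vectorized mul() in the butterfly pseudo-code later in the thread.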

Last fiddled with by Mysticial on 2017-10-05 at 00:15
Old 2017-10-05, 03:38   #17
ewmayer
2ω=0
 
 
Sep 2002
República de California

2²×2,939 Posts

Quote:
Originally Posted by Mysticial View Post
Based on your description of being able to match George's AVX2 code with AVX512, the first thing that popped to mind was that you're hitting the same memory bottleneck. But that would only make sense if both implementations have the same amount of memory access.

So the natural question is: are your LL FFT and George's "memory optimal"? As in, do they achieve (or come very close to) the theoretical minimum # of bytes pulled through memory?

Admittedly, I have no experience in implementing real -> complex float FFTs at this level of optimization (let alone an LL iteration). So I might be spitting out garbage. But the vision that I currently have of a "memory optimal" LL is this:
  • For an FFT size that doesn't fit in cache (but not ridiculously large), it can be done in two read/write passes over the dataset.
  • The size of the "dataset" in this case is the FFT-length x 8 bytes. (let's ignore twiddle factor tables for now)
  • Each FFT consists of two passes. But, the "end" of each forward FFT is fused with the "start" of the inverse FFT. And the "end" of the inverse FFT is fused with the "start" of the forward FFT of the next iteration. Therefore, there is effectively only one read/write pass per transform - two per squaring iteration.

I have no idea how accurate this is. Some of this holds true for complex->complex FFTs and real-integer NTTs. But I honestly have no idea how the real->complex shift affects things (especially in terms of memory layout and the "in-placeness" - something I've never personally solved).
There is a forum thread in which I describe exactly what you discuss; George replies, and we conclude that, aside from the real/complex wrapper step needed by my underlying complex FFT (better for Fermat numbers), both of our codes more or less follow the above schema. Alas, the forum's search functionality has been so !%%#%*( broken for so long now that I am unable to find the thread.

I estimate the real/complex wrapper step costs me ~10% overhead versus George's real-vector-optimized implementation.
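For readers following along, here is a textbook sketch of that real/complex wrapper (the standard formulation, not Mlucas's actual code): pack 2N reals pairwise into an N-point complex FFT, then untangle the spectrum with one extra O(N) pass - roughly where that ~10% goes:

Code:
#include <complex.h>
#include <math.h>

// Given Z = FFT_N(z), where z[n] = x[2n] + i*x[2n+1] packs the real
// signal x[0..2N-1], recover bins 0..N-1 of the length-2N real FFT.
void untangle(const double complex *Z, double complex *X, int N) {
    X[0] = creal(Z[0]) + cimag(Z[0]);   // DC bin (Nyquist bin is re - im)
    for (int k = 1; k < N; k++) {
        double complex Ze = 0.5 * (Z[k] + conj(Z[N - k]));       // even-index part
        double complex Zo = -0.5 * I * (Z[k] - conj(Z[N - k]));  // odd-index part
        X[k] = Ze + cexp(-I * M_PI * k / (double)N) * Zo;        // e^{-2*pi*i*k/(2N)}
    }
}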
Old 2017-10-05, 03:59   #18
kladner
 
 
"Kieren"
Jul 2011
In My Own Galaxy!

2·3·1,693 Posts

Quote:
Originally Posted by ewmayer View Post
There is a forum thread in which I describe exactly what you discuss; George replies, and we conclude that, aside from the real/complex wrapper step needed by my underlying complex FFT (better for Fermat numbers), both of our codes more or less follow the above schema. Alas, the forum's search functionality has been so !%%#%*( broken for so long now that I am unable to find the thread.

I estimate the real/complex wrapper step costs me ~10% overhead versus George's real-vector-optimized implementation.
Google>site:mersenneforum.org [search term]
Old 2017-10-06, 03:46   #19
Mysticial
 
 
Sep 2016

2²·5·19 Posts

Quote:
Originally Posted by ewmayer View Post
My earlier-favored M61-based hybrid FFT/NTT stuff was based on the leading 64-bit architectures of the late 1990s, which all supported decently fast 64x64 -> 128-bit integer MUL, and when SSE2/AVX-style vector arithmetic was just a gleam in some CPU-roadmappers' eyes. Based on how vector units have evolved since, using M31 as the complex-arithmetic modulus makes much more sense - we have to do modular reductions more often (due to just one 'guard bit' at the high end for 31-bit data in a 32-bit slot versus 3 for 61-bit in a 64-bit slot) and lose a few small-odd-prime transform-length options, but the crucial integer-MUL bottleneck is far less severe for 32-bit inputs. So much interesting experimentation to do, so little time (sigh).
I did some calculations for the DP + M31 method. (I've edited this multiple times to fix several calculation errors.)

Code:
Word Size = 64 + 32 = 96 bits = 12 bytes
Usable Bits / word = 53 + 31 = 84 bits
The most bits of data we can fit into each word is about 32 bits.

For transform length 2^24:
Code:
32 * 2 + 24/2 = 76 bit coefficients (8 bits of safety)
(However, we may lose more bits in order to properly resolve the combining step.)

So summarizing the memory efficiencies:
  • DP FFT: 17/64 = 27%
  • DD FFT: 43/128 = 34%
  • DP+M31: 32/96 = 33%
So the memory usage (and therefore the upper-bound on improvement) is about the same for DD FFT and DP+M31.

Computationally, the best I can come up with is:
  • Twiddled Radix 2 Butterfly over M31: 52 integer AVX512 uops
  • Untwiddled Radix 2 Butterfly over M31: 14 integer AVX512 uops
  • Radix 4 Butterfly over M31: 3*52 + 14 = 170 integer AVX512 uops
Unlike the FP instructions, integer uops don't have identical capabilities over the two 512-bit ports (0+5). But I don't see any of the imbalances being a bottleneck. So we can assume the same 2/cycle throughput as FP.

Since the word size is half for the M31 NTT, each AVX512 vector covers twice as many words, halving the per-element butterfly cost. Combining that with the cost of the radix-4 DP FFT we get: 170 / 2 + 24 = 109 AVX512 uops.

So the relative "computation/useful work" comparison is now:
  • DP FFT: 24 / 17 = 1.41 AVX512 uops / bit
  • DD FFT: 226 / 43 = 5.26 AVX512 uops / bit
  • DP+M31: 109 / 32 = 3.41 AVX512 uops / bit
So DP+M31 looks promising, as it's computationally only 2.42x (3.41 / 1.41) slower than the DP FFT.


--------

This is the (untested) pseudo-code that I used to arrive at 52 uops per radix 2 butterfly. It may be possible to do better. Or it may be worse if I got something wrong.

Code:
//  a + b*w
//  a - b*w
//  p = 2^31 - 1

//  Input  Domain: r0, i0, r1, i1 ∈ [0, 2p)
//  Output Domain: R0, I0, R1, I1 ∈ [0, 2p)

//  Precomputed Twiddle Factor Values
//  wr, wi ∈ [0, p)
//  wr' = floor(wr 2^32 / p)
//  wi' = floor(wi 2^32 / p)
//  reverse(wr')
//  reverse(wi')

//  Inputs:
a = {r0 + i0*i};
b = {r1 + i1*i};
w = {wr + wi*i};


// ---- Begin Computation ----

r0 = reduce_positive(r0);
i0 = reduce_positive(i0);

tr = mul(r1, wr, wr') - mul(i1, wi, wi');
ti = mul(r1, wi, wi') + mul(i1, wr, wr');

tr = reduce_negative(tr);
ti = reduce_positive(ti);

R0 = r0 + tr;   //  combine with the twiddled term (tr, ti), not (r1, i1)
I0 = i0 + ti;

R1 = r0 - tr + p;
I1 = i0 - ti + p;

// ---- End Computation ----

//  Outputs:
a = {R0 + I0*i}
b = {R1 + I1*i}
Code:
Operator "+" = vpaddd (1 uop)
Operator "-" = vpsubd (1 uop)

reduce_positive(x){
    //  2 uops
    __mmask16 m = _mm512_cmpge_epu32_mask(x, p);
    return _mm512_mask_sub_epi32(x, m, x, p);
}
reduce_negative(x){
    //  2 uops
    __mmask16 m = _mm512_cmplt_epi32_mask(x, _mm512_setzero_si512());
    return _mm512_mask_add_epi32(x, m, x, p);
}

reverse(x){
    // 1 uop
    return _mm512_shuffle_epi32(x, _MM_PERM_CDAB);
}

mul(x, w, w'){
    //  x ∈ [0, 2^32), w ∈ [0, p), w' = floor(w 2^32 / p)
    //  Shoup: q = hi32(x * w'), result = x*w - q*p (mod 2^32) ≡ x*w (mod p), in [0, 2p)
    //  9 uops (vpmulld is 2 uops each)
    q0 = vpmuludq(x, w');
    q1 = vpmuludq(reverse(x), reverse(w'));
    q = _mm512_mask_shuffle_epi32(q1, 0x5555, q0, _MM_PERM_CDAB);
    return vpmulld(x, w) - vpmulld(q, p);
}

Last fiddled with by Mysticial on 2017-10-06 at 04:40
Old 2017-10-09, 12:27   #20
preda
 
 
"Mihai Preda"
Apr 2015

2²·3·11² Posts

On the GPU side, the cheap (gaming/consumer) GPUs have severely penalized double-precision performance relative to float. I wonder if the techniques discussed here could be used to some benefit by using a pair of floats instead of a double in a classical FFT setup.
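For what it's worth, pair-of-floats ("float-float") arithmetic is built on two classic error-free transforms; a sketch of the generic technique (not a worked FFT), giving roughly 48 significand bits from full-rate float ops:

Code:
#include <math.h>

typedef struct { float hi, lo; } ff;   // value = hi + lo, |lo| <= 0.5 ulp(hi)

// Knuth's 2Sum: hi + lo == a + b exactly.
static ff two_sum(float a, float b) {
    float s  = a + b;
    float bv = s - a;                  // the part of b that made it into s
    float e  = (a - (s - bv)) + (b - bv);
    return (ff){ s, e };
}

// 2Prod via FMA: hi + lo == a * b exactly.
static ff two_prod(float a, float b) {
    float p = a * b;
    return (ff){ p, fmaf(a, b, -p) };
}

The catch is that a full float-float multiply-add chains several of these, so the win over native double depends on just how penalized the card's double throughput is.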
Old 2017-10-09, 12:34   #21
preda
 
 
"Mihai Preda"
Apr 2015

2²×3×11² Posts

I have this question about the NTT performance:

For Mersenne multiplication, would the NTT size need to be twice the size of the corresponding floating-point FFT, because the NTT can't make use of the "irrational base" trick (which is based on a pre-weighting with a floating-point weight)?

If there is no equivalent trick to the "irrational base weighting" for the NTT to halve the transform size, then I think it's hard for NTT to compete.

(I'm interested in NTT because it would avoid the double-precision penalty of the GPUs).
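For context, the pre-weighting in question is the Crandall-Fagin IBDWT: for an N-word transform of a residue mod 2^q - 1, word j carries a variable number of bits and is scaled by an irrational power of two before the transform. A sketch of the standard weight formula (illustrative, not any program's source):

Code:
#include <math.h>

// Crandall-Fagin IBDWT weight for word j of an N-word transform of a
// residue mod 2^q - 1: a_j = 2^(ceil(q*j/N) - q*j/N). These weights
// are irrational powers of two whenever N does not divide q*j, which
// is what an integer-only NTT cannot represent directly.
double ibdwt_weight(unsigned j, unsigned q, unsigned N) {
    double t = (double)q * j / N;
    return pow(2.0, ceil(t) - t);
}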
Old 2017-10-09, 13:29   #22
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2057₁₆ Posts

Quote:
Originally Posted by preda View Post
If there is no equivalent trick to the "irrational base weighting" for the NTT to halve the transform size, then I think it's hard for NTT to compete.
See https://www.mersenne.org/various/intfft.txt
