2017-10-04, 00:22  #1
Sep 2016
2^{3}×43 Posts 
Double-Double Arithmetic
In light of the massive memory bottlenecks that I've observed with my 10-core 7900X, along with the launch of 18-core chips, I felt the urge to bring back the topic of double-double arithmetic, even if it's much earlier than I had originally anticipated.
Going back to my post here, the motivation behind DD (double-double) is that DP FFTs are completely memory bound. DD has a larger "word" and therefore allows a higher memory efficiency. (I estimate about 20% lower memory/bandwidth consumption.) The tradeoff of course is that DD is much slower. But maybe not quite as much as it looks like at first glance.
Double-double addition and multiplication can be done in as few as 7 and 4 instructions respectively if you cut all the corners (see bottom of post for specifics). But in cutting those corners, you'll need to rebalance the numbers every once in a while at the cost of 3 instructions. So a hypothetical radix-4 butterfly (3 twiddle factors) would be:
Code:
8 complex DD adds + 3 complex DD multiply
= 16 DD add + 3 (4 DD mul + 2 DD add)
= 22 DD add + 12 DD mul
If we rebalance every two layers, the computational cost becomes:
Code:
22 (7) + 12 (4) + 8 * 3 = 226 FPU instructions
By comparison, a DP radix-4 butterfly can be done in about 24 FMAs. (Not sure what's actually achieved now with P95 and Mlucas.) However, a DP FFT does ~17 bits/word of work while a DD FFT does ~43 bits/word. So the relative "computation/useful work" is:
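The ratio can be recomputed from the figures quoted above. This is a minimal Python sketch, not the original calculation; the function and variable names are made up for illustration, and only the numbers (7, 4, 3, 22, 12, 8, 24, 17, 43) come from the post.

```python
# Instruction costs as quoted in the post (names are illustrative only).
DD_ADD_COST = 7       # FP instructions per double-double add
DD_MUL_COST = 4       # FP instructions per double-double multiply
REBALANCE_COST = 3    # FP instructions per rebalance


def dd_butterfly_cost(adds=22, muls=12, rebalances=8):
    """FP instruction count for one radix-4 double-double butterfly."""
    return (adds * DD_ADD_COST
            + muls * DD_MUL_COST
            + rebalances * REBALANCE_COST)


def work_ratio(dd_cost, dp_cost=24, dd_bits=43, dp_bits=17):
    """Instructions per useful bit for DD, divided by the same for DP."""
    return (dd_cost / dd_bits) / (dp_cost / dp_bits)


dd_cost = dd_butterfly_cost()
print("DD butterfly:", dd_cost, "FP instructions")
print("DD instructions per bit:", round(dd_cost / 43, 2))
print("DP instructions per bit:", round(24 / 17, 2))
print("DD/DP ratio:", round(work_ratio(dd_cost), 2))
```

Unrounded, the ratio comes out at roughly 3.72. Dividing the two per-bit costs after rounding them to two decimals (5.26 / 1.41) gives about 3.73, which is presumably where the figure used in the rest of the post comes from.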
So we start off with a 3.73x computational difference between DP and DD right now. The memory difference is 20% in the other direction. So in a world where computation is free and memory access is the only factor, DD will be 20% faster.
As far as I understand right now, P95 (and LL in general) is completely memory-bound with just AVX2. Now we have AVX512. So that 3.73x margin cuts in half to about 2x.
Now the question is: How memory bound is LL right now with AVX2? Would dropping the memory footprint by 20% at the cost of 2x the computation lead to a net speedup? If not for the low-core chips, what about the 18-core Skylake X?
Admittedly, my analysis is overly simplified. I have neither an implementation nor a benchmark. So there's a lot of room for error. But if it's reasonably accurate, I'm thinking we might be getting close to crossing that threshold, if we haven't already with the 18-core 7980XE.
Double-double Multiply: 4 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}
H = AH * BH;
L = AH * BH - H;   // FMA
L = AL * BH + L;   // FMA
L = AH * BL + L;   // FMA
return {L, H};
Double-double Add: 7 FP instructions
Code:
A = {AL, AH}
B = {BL, BH}
IH = abs_max(AH, BH);   // AVX512DQ "vrangepd"
IL = abs_min(AH, BH);   // AVX512DQ "vrangepd"
H = AH + BH;
L = H - IH;
L = IL - L;
L = L + AL;
L = L + BL;
return {L, H};
Rebalance: 3 FP instructions
Code:
A = {AL, AH}
H = AL + AH;
L = H - AH;
L = AL - L;
return {L, H};
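The three routines can be tried out in scalar form. The following is a rough Python sketch rather than a vectorized implementation: `fma` is emulated with exact rational arithmetic (an assumption standing in for the hardware instruction), the `abs_max`/`abs_min` pair is replaced by an ordinary comparison, and the helper names `dd_mul`, `dd_add`, `dd_rebalance` and `exact` are hypothetical.

```python
from fractions import Fraction


def fma(a, b, c):
    """Emulated fused multiply-add: a*b + c with a single rounding."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def dd_mul(a, b):
    """Corner-cutting multiply; a and b are (lo, hi) pairs."""
    al, ah = a
    bl, bh = b
    h = ah * bh
    l = fma(ah, bh, -h)   # exact rounding error of ah * bh
    l = fma(al, bh, l)    # cross terms; the al * bl term is dropped
    l = fma(ah, bl, l)
    return (l, h)


def dd_add(a, b):
    """Corner-cutting add; the result may need rebalancing later."""
    al, ah = a
    bl, bh = b
    # Stand-in for the two vrangepd operations (abs-max / abs-min).
    if abs(ah) >= abs(bh):
        ih, il = ah, bh
    else:
        ih, il = bh, ah
    h = ah + bh
    l = h - ih
    l = il - l            # rounding error of ah + bh
    l = l + al
    l = l + bl
    return (l, h)


def dd_rebalance(a):
    """Restore hi = round(hi + lo); assumes |hi| >= |lo| on entry."""
    al, ah = a
    h = al + ah
    l = h - ah
    l = al - l
    return (l, h)


def exact(x):
    """Exact rational value of a (lo, hi) pair, for checking only."""
    return Fraction(x[0]) + Fraction(x[1])


a = (0.0, 1.0 + 2.0**-30)
product = dd_mul(a, a)
print("double product exact:", Fraction(a[1] * a[1]) == exact(a) * exact(a))
print("DD product exact:    ", exact(product) == exact(a) * exact(a))
```

In this example the plain double product loses the 2^-60 term while the double-double result keeps it. Inputs with nonzero low words are only approximated, since the multiply ignores the low-by-low product.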
2017-10-04, 00:48  #2
"Forget I exist"
Jul 2009
Dumbassville
8384_{10} Posts 
I'll only say this: markup vs. margin. A 20% markdown (a margin of the original time, etc.) means only a total of 25% more computation time per computation (assuming the same number of computations, of course). So no: unless something really magical cuts things down a bit more, you can't do a 20% markdown (margin) and then a 100% markup and expect to take less of anything.
Last fiddled with by science_man_88 on 2017-10-04 at 00:48
2017-10-04, 01:04  #3
Sep 2016
101011000_{2} Posts 
Quote:
Instead of the DD algorithm being faster, the computational power goes up another factor of 2x without an improvement in memory bandwidth (which is kind of what just happened with the 18-core Skylake). But yes, even when both algorithms are completely memory-bound, the DD algorithm won't be less than 80% of the runtime of the DP algorithm.
Last fiddled with by Mysticial on 2017-10-04 at 01:06
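The 80% floor and the break-even point can be illustrated with a crude roofline-style model. This Python sketch assumes runtime is simply the larger of compute time and memory time (no overlap effects, no caches); the function names and the default factors 3.73 and 0.8 are taken from the thread's estimates, not from any measurement.

```python
def runtime(compute_time, memory_time):
    """Crude model: whichever resource is slower sets the runtime."""
    return max(compute_time, memory_time)


def dd_over_dp(compute_to_memory, dd_compute_factor=3.73,
               dd_memory_factor=0.8):
    """DD runtime divided by DP runtime.

    compute_to_memory is the DP algorithm's compute time divided by its
    memory time, so values below 1 mean DP is memory bound.
    """
    dp = runtime(compute_to_memory, 1.0)
    dd = runtime(dd_compute_factor * compute_to_memory, dd_memory_factor)
    return dd / dp


for ratio in (0.1, 0.2, 0.25, 0.3, 0.5, 1.0):
    print(f"DP compute/memory = {ratio:4.2f} -> DD/DP runtime = "
          f"{dd_over_dp(ratio):.2f}")
```

Under these assumptions DD only comes out ahead when DP's compute time is below about 1/3.73 (roughly 27%) of its memory time, and the ratio never drops below 0.8 no matter how memory bound DP gets. Passing `dd_compute_factor=2.0` models the halved margin discussed for AVX512.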

2017-10-04, 01:14  #4
"Forget I exist"
Jul 2009
Dumbassville
20300_{8} Posts 
Quote:


2017-10-04, 11:56  #5
Feb 2016
UK
2^{3}·5·11 Posts 
The implementation side is beyond me, but I do have observations on performance. Some scenarios to consider:
For a "fast" modern Intel quad, say >=4 GHz, I found 3200-rated dual-channel, dual-rank RAM to give pretty good scaling (relatively small FFTs). Basic RAM at 2133 just cripples it.
For the Skylake-X 7800X (6 cores) I'm running 3200 quad-channel single-rank, and on paper that should be comparable in CPU:RAM ratio to the quad above for AVX2. For AVX512 I think it'll hurt without some change. Was there any timeline for an AVX512 implementation?
The 8700K, expected to launch tomorrow, gives us 6 cores at a high potential clock but still with dual-channel RAM. As a consumer line it won't have AVX512 to worry about, but I think AVX2 will be starved without insane RAM speeds.
I get the feeling consumer-level core counts will tend to increase more now, as will overall potential peak throughput, but they will remain at dual-channel RAM. We're going to have even more cores than we can feed.
2017-10-04, 17:49  #6
Einyen
Dec 2003
Denmark
6357_{8} Posts 
Quote:
http://www.mersenneforum.org/showthread.php?t=20575
In post #8 you can see that only at 2240K FFT and 1 worker can dual channel keep up with quad channel; with more workers and/or larger FFTs, quad channel is always faster. The question then is at what point it becomes memory bound with quad channel as well.

2017-10-04, 18:00  #7
Sep 2016
2^{3}·43 Posts 
Quote:
If 3200 dual-channel is "enough" for a 4 GHz Skylake (AVX2), then the 7800X shouldn't have any issues at all with 4 channels (AVX2).
Quote:
Quote:
IOW, there's a core-count war between AMD and Intel with little to no attention given to memory. For now, the chips with the largest memory bottleneck will still be the HCC Skylake X line with 18 cores + AVX512. Personally, I'd like to see quad-channel mainstream and 8-channel HEDT. (IOW, every DIMM is a channel.) Though I don't know if it's even possible to route that many traces on a motherboard.
Quote:
Once my 7900X becomes available again, I'll need to do some more targeted tests to see how bad it is right now with P95 AVX2. VTune has the ability to record memory bandwidth usage. In my application (with AVX512), the bandwidth usage is basically a backwards Poisson distribution where the "hump" kisses the limit of 75 GB/s on my system.
Last fiddled with by Mysticial on 2017-10-04 at 18:14

2017-10-04, 18:18  #8
"Robert Gerbicz"
Oct 2005
Hungary
3037_{8} Posts 

2017-10-04, 18:19  #9
"Kieren"
Jul 2011
In My Own Galaxy!
27AE_{16} Posts 
IIRC, the boost from Dual Rank is around 15%.

2017-10-04, 18:24  #10
Sep 2016
158_{16} Posts 
AL=BL=1, AH=BH=0 is an invalid double-double. AH must be greater in absolute magnitude than AL (unless both are zero, in which case the entire double-double is zero). (Or more precisely, AH should be around 2^52 times larger than AL.) The same applies to BH/BL.
Last fiddled with by Mysticial on 2017-10-04 at 18:27
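One way to test that condition in code is the usual normalization check: the pair is valid when adding the low word to the high word leaves the high word unchanged. This Python sketch assumes that stricter textbook criterion, which is not necessarily the exact rule meant above, and `is_valid_dd` is a hypothetical helper name.

```python
def is_valid_dd(lo, hi):
    """True if (lo, hi) is normalized, i.e. hi == round(hi + lo).

    This also accepts the all-zero pair, since 0.0 + 0.0 == 0.0.
    """
    return hi + lo == hi


examples = [
    (1.0, 0.0),        # the rejected input: low word larger than high word
    (0.0, 0.0),        # zero is fine
    (2.0**-60, 1.0),   # low word far below the last bit of the high word
    (2.0**-52, 1.0),   # low word overlaps the last bit of the high word
]
for lo, hi in examples:
    print(f"lo={lo!r}, hi={hi!r}: valid={is_valid_dd(lo, hi)}")
```

The corner-cutting add can return pairs that fail this check, which is why the rebalance step is needed every so often.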
2017-10-04, 18:35  #11
P90 years forever!
Aug 2002
Yeehaw, FL
7886_{10} Posts 
Quote:
My experience with dual-channel DDR4-2400 Kaby Lake is that prime95 becomes memory bound at roughly 3 cores. Some back-of-the-envelope estimates: Using 3200, about 4 cores. Skylake X with quad-channel memory bound at about 8 cores using. A perfect AVX512 implementation on Skylake-X would be memory bound at 4 cores. So on a 16-core chip that gives you 1/4 FPU utilization, enough to handle the 3.73x computational difference.
Question: Is a 3.73x computational difference the best we can do? How about Ernst's DP+INT? An NTT? I once experimented with a 128-bit fixed-point FP CUDA implementation. We can make adds fast, but muls are slower. It was a while ago, so I don't know how well it would work in AVX512.
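Those estimates follow from simple proportional scaling. A Python sketch of the arithmetic, under the assumptions that the saturation point scales linearly with memory speed and channel count, that AVX512 doubles per-core throughput, and that the quad-channel case also runs at 3200; `memory_bound_cores` is an illustrative name.

```python
def memory_bound_cores(base_cores, mem_speed, channels, simd_speedup=1.0,
                       base_speed=2400, base_channels=2):
    """Scale a measured saturation point to another configuration.

    More bandwidth feeds proportionally more cores; faster cores
    saturate the same bandwidth proportionally sooner.
    """
    bandwidth_scale = (mem_speed / base_speed) * (channels / base_channels)
    return base_cores * bandwidth_scale / simd_speedup


BASE = 3  # cores at which dual-channel DDR4-2400 saturates, per the post

dual_3200 = memory_bound_cores(BASE, 3200, 2)
quad_3200 = memory_bound_cores(BASE, 3200, 4)
quad_avx512 = memory_bound_cores(BASE, 3200, 4, simd_speedup=2.0)

print("dual-channel 3200, AVX2:  ", round(dual_3200, 1), "cores")
print("quad-channel 3200, AVX2:  ", round(quad_3200, 1), "cores")
print("quad-channel 3200, AVX512:", round(quad_avx512, 1), "cores")
print("idle FPU factor on 16 cores:", round(16 / quad_avx512, 1))
```

With these inputs the 16-core case leaves about a 4x compute surplus, which is the headroom being compared against the 3.73x figure.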
