#1 | Oct 2017 | ++41 | 53 Posts
How much does AVX512 improve LL performance at the same clock rate compared to AVX2?
#2 | ewmayer (∂²ω=0) | Sep 2002 | República de California | 5·2,351 Posts
#3 | Mysticial | Sep 2016 | 2·5·37 Posts
FWIW, y-cruncher's untuned AVX512 gained anywhere from 5% to -10% on my system out-of-box, depending on how badly it throttled. (That minus 10% is not a typo. Once the system throttles, it throttles hard.)

Once I fixed the throttling, the AVX2 -> AVX512 gain hovered around 10%-ish. Once I tuned the AVX512 binary, that grew to about 15%. Once I overclocked the cache and memory, it grew to about 25%. Version 0.7.4 (ETA end of weekend) makes both the AVX2 and AVX512 binaries faster, but more so the AVX512, and I think it widens the gap by another percent or two.

By comparison, my BBP benchmark gained 90% (1.9x faster) going from AVX2 -> AVX512 in the absence of throttling. Cache and memory are irrelevant there since it's L1-only.

Last fiddled with by Mysticial on 2017-10-13 at 23:09
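For context, the "BBP benchmark" refers to Bailey-Borwein-Plouffe hex-digit extraction for pi. A minimal scalar sketch (plain C, nothing like y-cruncher's vectorized implementation; position n and digit count are arbitrary choices here) shows why such a benchmark never leaves L1: the hot path is pure register arithmetic.

Code:
/* BBP: pi = sum_{k>=0} 16^-k (4/(8k+1) - 2/(8k+4) - 1/(8k+5) - 1/(8k+6)).
   Double-precision fmod(r*b, m) stays exact as long as m^2 < 2^53, i.e.
   for digit positions up to roughly 10^7.  Build: gcc -O2 bbp.c -lm */
#include <math.h>
#include <stdio.h>

/* 16^p mod m via binary exponentiation */
static double pow16_mod(long p, double m) {
    double r = 1.0, b = fmod(16.0, m);
    while (p > 0) {
        if (p & 1) r = fmod(r * b, m);
        b = fmod(b * b, m);
        p >>= 1;
    }
    return r;
}

/* fractional part of sum_k 16^(n-k) / (8k+j) */
static double series(int j, long n) {
    double s = 0.0;
    for (long k = 0; k <= n; k++) {          /* terms with positive exponent */
        double m = 8.0 * k + j;
        s = fmod(s + pow16_mod(n - k, m) / m, 1.0);
    }
    for (long k = n + 1; k <= n + 32; k++)   /* rapidly vanishing tail */
        s += pow(16.0, (double)(n - k)) / (8.0 * k + j);
    return s - floor(s);
}

int main(void) {
    long n = 1000;   /* extract hex digits of pi starting at position n+1 */
    double x = 4*series(1, n) - 2*series(4, n) - series(5, n) - series(6, n);
    x -= floor(x);
    printf("pi hex digits from position %ld: ", n + 1);
    for (int i = 0; i < 8; i++) {            /* peel off 8 hex digits */
        x *= 16.0;
        int d = (int)x;
        printf("%X", d);
        x -= d;
    }
    putchar('\n');
    return 0;
}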
#4 | ewmayer (∂²ω=0) | Sep 2002 | República de California | 5×2,351 Posts

Quote:
It would be interesting to compare your so-far-disappointing AVX512 gains for y-cruncher to Mlucas - back in July in this same thread you posted a bunch of Mlucas timings for an AVX512 build in a Ubuntu sandbox you installed, but I don't believe we considered doing an AVX2 build on your hardware. Comparing those 2 binaries would tell us whether your AVX512 gains for y-cruncher are in line with those for my code running on the same hardware.

Note that if you're running Win10, building under Linux is now greatly eased by MSFT having actually done something right for once, by way of adding native Linux-sandbox support to that version of their OS. The Mlucas readme page has info on that, but bottom line: once you open such a native shell, everything works beautifully.
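In case it saves anyone a search: with a stock Ubuntu-on-Windows image (an assumption on my part - any recent WSL Ubuntu should behave the same), the only setup needed before running the gcc build commands later in the thread is:

Code:
# one-time, inside the WSL/Ubuntu shell: pull in gcc, make, etc.
sudo apt-get update
sudo apt-get install build-essential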
#5 | Mysticial | Sep 2016 | 2×5×37 Posts

Quote:
When I ran the Mlucas benchmarks, I had already fixed the throttling, so it doesn't invalidate those results. But I have significantly changed the overclock settings since then. Now I have all 128GB running at 3800 MT/s - which is almost enough to match 6-channel 2666 MT/s on the servers. (This Samsung B-die stuff really is that good. Too bad it's so expensive now. And it's sold out everywhere, so I can't even get another fix if I wanted to.)
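The back-of-envelope bandwidth comparison, assuming a quad-channel desktop platform and 8 bytes per transfer per channel:

Code:
4 channels x 3800 MT/s x 8 bytes = 121.6 GB/s   (this desktop)
6 channels x 2666 MT/s x 8 bytes = 128.0 GB/s   (the 6-channel servers)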
Last fiddled with by Mysticial on 2017-10-15 at 19:44
#6 | ewmayer (∂²ω=0) | Sep 2002 | República de California | 5·2,351 Posts

Quote:
AVX2:
Code:
gcc -c -O3 -mavx2 -DUSE_AVX2 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[Assuming above grep comes up empty]
gcc -o Mlucas *.o -lm -lpthread -lrt

AVX512:
Code:
gcc -c -O3 -march=skylake-avx512 -DUSE_AVX512 -DUSE_THREADS ../*.c >& build.log
grep -i error build.log
[Assuming above grep comes up empty]
gcc -o Mlucas *.o -lm -lpthread -lrt

A simple set of comparative timings at a GIMPS-representative FFT length should suffice for our basic purposes, but you are of course free to spend as much or as little time playing with this as you like. I suggest using 100 iterations only for self-tests in 1-thread mode; 1000 iterations will give more accurate timings for anything beyond that.

I further suggest opening the mlucas.cfg file resulting from the initial set of self-tests in an editor and annotating each cfg-file best-timing line as it is printed with the -cpu options you used for that set of timings. (You can replace the reference-residues stuff that gets printed to the right of the 10-entry FFT radix set on each line with your annotations; I find that a convenient method in my own work.)

So let's say we want to do comparative timings @4096K FFT length. On an Intel 10-physical-core CPU I would proceed like so, each line producing a new cfg-file entry. The code is heavily geared toward power-of-2 thread counts, so I would deviate from that only at the "what the heck - let's try all 10 cores" end:

1 thread per physical core:
Code:
./Mlucas -fftlen 4096 -iters 100 -cpu 0
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:1
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:3
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:7
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:9

2 threads per physical core:
Code:
./Mlucas -fftlen 4096 -iters 100 -cpu 0,10
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:1,10:11
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:3,10:13
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:7,10:17
./Mlucas -fftlen 4096 -iters 1000 -cpu 0:19

Last fiddled with by ewmayer on 2017-10-15 at 21:38
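If you'd rather script those ten runs, a minimal bash sketch (an illustration only; assumes the Mlucas binary sits in the current directory and the 10-core/HT-sibling CPU numbering above):

Code:
#!/bin/bash
# First run of each set uses 100 self-test iterations, the rest 1000,
# matching the sequence suggested above.
run_set () {
    local iters=100
    for cpus in "$@"; do
        ./Mlucas -fftlen 4096 -iters $iters -cpu "$cpus"
        iters=1000
    done
}
run_set 0 0:1 0:3 0:7 0:9                        # 1 thread per physical core
run_set 0,10 0:1,10:11 0:3,10:13 0:7,10:17 0:19  # 2 threads per physical core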
#7 | Mysticial | Sep 2016 | 2·5·37 Posts
AVX2:
Code:
17.0
4096  msec/iter = 27.05  ROE[avg,max] = [0.278125000, 0.312500000]  radices = 64 32 32 32 0 0 0 0 0 0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 16.42  ROE[avg,max] = [0.265871720, 0.328125000]  radices = 256 8 8 8 16 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter = 10.70  ROE[avg,max] = [0.276559480, 0.343750000]  radices = 64 32 32 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  6.54  ROE[avg,max] = [0.276559480, 0.343750000]  radices = 64 32 32 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  6.19  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter = 24.78  ROE[avg,max] = [0.264118304, 0.296875000]  radices = 256 8 8 8 16 0 0 0 0 0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 14.92  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  9.64  ROE[avg,max] = [0.320515867, 0.406250000]  radices = 256 32 16 16 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.95  ROE[avg,max] = [0.320564191, 0.406250000]  radices = 256 32 16 16 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.84  ROE[avg,max] = [0.265964343, 0.328125000]  radices = 256 8 8 8 16 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018

AVX512:
Code:
17.0
4096  msec/iter = 21.77  ROE[avg,max] = [0.245026507, 0.281250000]  radices = 32 16 16 16 16 0 0 0 0 0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 15.21  ROE[avg,max] = [0.244451338, 0.304687500]  radices = 32 16 16 16 16 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  8.70  ROE[avg,max] = [0.275551707, 0.375000000]  radices = 64 32 32 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.37  ROE[avg,max] = [0.275551707, 0.375000000]  radices = 64 32 32 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.36  ROE[avg,max] = [0.300239107, 0.375000000]  radices = 256 16 16 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter = 18.90  ROE[avg,max] = [0.301116071, 0.375000000]  radices = 256 16 16 32 0 0 0 0 0 0    100-iteration Res mod 2^64, 2^35-1, 2^36-1 = 8CC30E314BF3E556, 22305398329, 64001568053
4096  msec/iter = 12.86  ROE[avg,max] = [0.300271323, 0.375000000]  radices = 256 16 16 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  7.91  ROE[avg,max] = [0.300093503, 0.406250000]  radices = 64 8 16 16 16 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.42  ROE[avg,max] = [0.275567816, 0.375000000]  radices = 64 32 32 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
4096  msec/iter =  5.00  ROE[avg,max] = [0.300271323, 0.375000000]  radices = 256 16 16 32 0 0 0 0 0 0    1000-iteration Res mod 2^64, 2^35-1, 2^36-1 = 5F87421FA9DD8F1F, 26302807323, 54919604018
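For a rough read on the gap: the best AVX2 entry above is 5.84 msec/iter vs. 5.00 for AVX512 - about a 17% speedup at this FFT length, in the same ballpark as the ~15% tuned y-cruncher gain reported earlier in the thread.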
Overall, this is very similar to what I'm seeing.
#8 | ewmayer (∂²ω=0) | Sep 2002 | República de California | 2DEB₁₆ Posts
Thanks, Alex - so on the good-news front, "It's not you." :)
Does your big-FFT y-cruncher code get any benefit from running 2 threads per physical core, as we see in the Mlucas timings?

Last fiddled with by ewmayer on 2017-10-18 at 00:41
#9 | Mysticial | Sep 2016 | 370₁₀ Posts

Quote:
But this is taken over an entire Pi computation/benchmark. That workload is a lot less homogeneous, so there's a mix of more- and less-optimized code.
#10 | Mysticial | Sep 2016 | 370₁₀ Posts
My 3800 MT/s memory overclock soft-errored for a second time in the past 2 months. When I tested with the base clock bumped to 102 MHz (2% safety margin), it soft-errored within a minute. In retrospect, I never actually tested this overclock with any safety margin at all. So this prompted me to retest the entire overclock.
During this process, I lifted the temperature throttle limits and noticed that my 3.8 GHz AVX512 overclock was shooting way past the usual throttle point and well above 100C. Five months ago, I picked 3.8 GHz because it was the highest speed that wouldn't exceed 85C. But in those 5 months, there have been enough memory improvements (both in software and in the overclock) that the code simply runs a hella lot hotter than before. I've also noticed an increase in benchmark inconsistency recently, but I didn't realize how badly it was throttling since I don't usually have CPU-Z open during benchmarks.

Without the throttling, the CPU utilization seems to have gone up - presumably because the throttling was causing a load imbalance between the cores (since the hotter cores throttle harder).

Going by the same 85C limit, I seem to have lost about 200 MHz during these 5 months. Yet the net speedup is more than 40%.
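On the Linux side, one low-effort way to catch this kind of throttling in the act - assuming turbostat from the kernel's linux-tools package is installed - is to sample per-core clocks and temperatures while a benchmark runs:

Code:
# per-core average clock, load, and temperature, sampled once a second
sudo turbostat --quiet --interval 1 --show Core,Avg_MHz,Busy%,CoreTmp
# or, with no extra tools, just watch the kernel's reported clocks:
watch -n1 "grep 'cpu MHz' /proc/cpuinfo"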
#11 | Prime95 (P90 years forever!) | Aug 2002 | Yeehaw, FL | 41×199 Posts
@Mysticial: Do you know if the vblendmpd instruction runs on the same ports as the FMA units?

http://users.atw.hu/instlatx64/Genui...InstLatX64.txt shows vblendmpd with a latency of one and a throughput of two.
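One way to probe that empirically rather than from the tables: time a throughput-bound pure-FMA loop against the same loop with blends mixed in. The sketch below is my own, not instlatx64's methodology, and bakes in the assumption that 512-bit ops issue only on ports 0 and 5 on Skylake-X; if vblendmpd shares those ports with the FMAs, the second loop should slow down roughly in proportion to the extra uops, and if not, the timings should stay close.

Code:
/* Crude port-contention probe.  Build: gcc -O2 -march=skylake-avx512 probe.c */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
    const long iters = 50000000;
    const __m512d m = _mm512_set1_pd(0.9999999), c = _mm512_set1_pd(1e-12);
    const __mmask8 k = 0xAA;
    /* 8 independent FMA chains: enough to keep both 512-bit FMA ports busy */
    __m512d a0 = _mm512_set1_pd(1.1), a1 = _mm512_set1_pd(1.2),
            a2 = _mm512_set1_pd(1.3), a3 = _mm512_set1_pd(1.4),
            a4 = _mm512_set1_pd(1.5), a5 = _mm512_set1_pd(1.6),
            a6 = _mm512_set1_pd(1.7), a7 = _mm512_set1_pd(1.8);
    /* separate registers for the blend chains, so the blends add port
       pressure without lengthening the FMA dependency chains */
    __m512d b0 = a0, b1 = a1, b2 = a2, b3 = a3;

    double t0 = now();
    for (long i = 0; i < iters; i++) {       /* pure FMA: 8 FMAs per iter */
        a0 = _mm512_fmadd_pd(a0, m, c); a1 = _mm512_fmadd_pd(a1, m, c);
        a2 = _mm512_fmadd_pd(a2, m, c); a3 = _mm512_fmadd_pd(a3, m, c);
        a4 = _mm512_fmadd_pd(a4, m, c); a5 = _mm512_fmadd_pd(a5, m, c);
        a6 = _mm512_fmadd_pd(a6, m, c); a7 = _mm512_fmadd_pd(a7, m, c);
    }
    double t1 = now();
    for (long i = 0; i < iters; i++) {       /* same FMAs + 4 blends per iter */
        a0 = _mm512_fmadd_pd(a0, m, c); a1 = _mm512_fmadd_pd(a1, m, c);
        a2 = _mm512_fmadd_pd(a2, m, c); a3 = _mm512_fmadd_pd(a3, m, c);
        a4 = _mm512_fmadd_pd(a4, m, c); a5 = _mm512_fmadd_pd(a5, m, c);
        a6 = _mm512_fmadd_pd(a6, m, c); a7 = _mm512_fmadd_pd(a7, m, c);
        b0 = _mm512_mask_blend_pd(k, b0, b1);
        b1 = _mm512_mask_blend_pd(k, b1, b2);
        b2 = _mm512_mask_blend_pd(k, b2, b3);
        b3 = _mm512_mask_blend_pd(k, b3, b0);
    }
    double t2 = now();

    /* fold every chain into the output so the loops can't be elided */
    __m512d sum = _mm512_add_pd(
        _mm512_add_pd(_mm512_add_pd(a0, a1), _mm512_add_pd(a2, a3)),
        _mm512_add_pd(_mm512_add_pd(a4, a5), _mm512_add_pd(a6, a7)));
    sum = _mm512_add_pd(sum, _mm512_add_pd(_mm512_add_pd(b0, b1),
                                           _mm512_add_pd(b2, b3)));
    double s[8];
    _mm512_storeu_pd(s, sum);
    printf("pure FMA : %.3f s\nFMA+blend: %.3f s\n(sink: %g)\n",
           t1 - t0, t2 - t1, s[0] + s[7]);
    return 0;
}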