2019-10-16, 19:14   #47
ewmayer
Thanks for the build & test data - I see this particular new instance supports avx-512, so you'll want to prepare a second build that invokes those inline-asm macros in the code:

gcc -c -O3 -DUSE_AVX512 -march=skylake-avx512 -DUSE_THREADS ../src/*.c >& build.log

...and use a different name for the resulting executable - you could call the two binaries mlucas_avx2 and mlucas_avx512, say. Running "grep avx512 /proc/cpuinfo" on whatever system you get during a particular session will tell you which binary to use. Rerun the self-tests on the new instance to see what kind of speedup you get from using avx-512.
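A tiny sh wrapper can automate that grep-and-pick step - a sketch only, assuming the binary names mlucas_avx2/mlucas_avx512 suggested above and that both sit in the run directory:

```shell
#!/bin/sh
# Pick the best Mlucas binary the current instance's CPU supports.
# (Binary names are the ones suggested above; adjust paths as needed.)
if grep -q avx512 /proc/cpuinfo; then
    BIN=./mlucas_avx512
else
    BIN=./mlucas_avx2
fi
echo "CPU check: using $BIN"
# exec "$BIN" "$@"   # launch with whatever arguments you normally use
```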

(Wait - while working through your selftest.log data further down in this note, I came across these infoprints @7168K:

radix28_ditN_cy_dif1: No AVX-512 support; Skipping this leading radix.

So it appears you did prepare and use an avx-512 build with the above compile flags for this set of runs? If so, that obviates the avx2-vs-avx512 parts of the commentary below.)

As to your avx2-build timings: I realized after posting my "seems slow" comment yesterday that I was thinking in terms of multicore runs on hardware like my Haswell. For a single physical core running at 2 GHz, ~50 msec/iter at the current GIMPS wavefront (5120K) is not at all bad. For comparison, here is the mlucas.cfg file for all 4 physical cores (no hyperthreading on this CPU) of my 3.3 GHz Haswell. On a single core the runtimes would be perhaps ~3.5x as large, so at (say) 5120K we'd expect ~47 msec/iter, only ~10% faster than your 1-core/2-thread timings, and that is at 3.3 GHz vs your 2 GHz:
Code:
18.0
      2048  msec/iter =    5.25  ROE[avg,max] = [0.222878714, 0.312500000]  radices =  64 16 32 32  0  0  0  0  0  0
      2304  msec/iter =    5.85  ROE[avg,max] = [0.259770659, 0.375000000]  radices = 144 16 16 32  0  0  0  0  0  0
      2560  msec/iter =    6.28  ROE[avg,max] = [0.252363335, 0.312500000]  radices = 160 16 16 32  0  0  0  0  0  0
      2816  msec/iter =    7.44  ROE[avg,max] = [0.239182557, 0.312500000]  radices = 176 16 16 32  0  0  0  0  0  0
      3072  msec/iter =    8.35  ROE[avg,max] = [0.251998996, 0.312500000]  radices = 192 16 16 32  0  0  0  0  0  0
      3328  msec/iter =    9.02  ROE[avg,max] = [0.243424657, 0.312500000]  radices = 208 16 16 32  0  0  0  0  0  0
      3584  msec/iter =    9.25  ROE[avg,max] = [0.248507344, 0.312500000]  radices = 224 16 16 32  0  0  0  0  0  0
      3840  msec/iter =   10.17  ROE[avg,max] = [0.256763639, 0.343750000]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =   10.63  ROE[avg,max] = [0.279075387, 0.343750000]  radices = 256 16 16 32  0  0  0  0  0  0
      4608  msec/iter =   12.21  ROE[avg,max] = [0.269211099, 0.343750000]  radices = 288 16 16 32  0  0  0  0  0  0
      5120  msec/iter =   13.48  ROE[avg,max] = [0.300527545, 0.375000000]  radices = 320 16 16 32  0  0  0  0  0  0
      5632  msec/iter =   15.42  ROE[avg,max] = [0.230105748, 0.281250000]  radices = 176 16 32 32  0  0  0  0  0  0
      6144  msec/iter =   17.51  ROE[avg,max] = [0.246608585, 0.312500000]  radices = 192 16 32 32  0  0  0  0  0  0
      6656  msec/iter =   18.60  ROE[avg,max] = [0.231292347, 0.312500000]  radices = 208 16 32 32  0  0  0  0  0  0
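To spell out the arithmetic behind that single-core estimate - just my 4-core 5120K timing from the cfg data above, scaled by the assumed ~3.5x single-core slowdown:

```shell
# 13.48 msec/iter on 4 cores at 5120K, assumed ~3.5x slower on one core:
awk 'BEGIN { printf "%.1f msec/iter\n", 13.48 * 3.5 }'   # prints 47.2 msec/iter
```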
Further, using an avx-512 build on this type of instance should give a nice added speedup, perhaps as much as 1.6x. And if/when a Prime95/mprime build for these systems comes online, that should be faster still.

Looking more closely at your selftest.log and mlucas.cfg files, I see "Excessive level of roundoff error detected" messages for individual FFT radix sets at 2816K, 3328K, 5120K and 7168K, but in none of those cases did the skipped radix set(s) happen to be the fastest one(s) at the FFT length in question.