Thread: Mlucas v18 available
2019-10-16, 19:14  #47
ewmayer (Sep 2002, República de California)

Thanks for the build & test data - I see this particular new instance supports avx-512, so you'll want to prepare a second build that invokes those inline-asm macros in the code:

Code:
gcc -c -O3 -DUSE_AVX512 -march=skylake-avx512 -DUSE_THREADS ../src/*.c >& build.log

...and use a different name for the resulting executable - you could call the two binaries mlucas_avx2 and mlucas_avx512, say. Running "grep avx512 /proc/cpuinfo" on whatever system you get during a particular session will tell you which binary to use. Rerun the self-tests on this new system to see what kind of speedup you get from using avx-512.

(Wait - while working through your selftest.log data further down in this note, I came across these infoprints @7168K:

Code:
radix28_ditN_cy_dif1: No AVX-512 support; Skipping this leading radix.

So you did prepare and use an avx-512 build per the above compile flags for this set of runs? If so, that obviates the avx2-vs-avx512 parts of the commentary below.)

As to your avx2-build timings, I realized after posting my "seems slow" comment yesterday that I was thinking in terms of multicore running on hardware like my Haswell. For a single physical core running at 2 GHz, ~50 msec/iter at the current GIMPS wavefront (5120K) is not at all bad. For comparison, here is the mlucas.cfg file for all 4 physical cores (no hyperthreading on this CPU) of my 3.3 GHz Haswell.
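A minimal launcher sketch for the two-binary setup, assuming the binary names mlucas_avx2/mlucas_avx512 suggested above and a Linux /proc/cpuinfo; the actual launch line is left commented since the paths are hypothetical:

```shell
# Pick the Mlucas binary matching this session's CPU capabilities.
# Assumes the mlucas_avx2 / mlucas_avx512 naming suggested above.
if grep -q avx512 /proc/cpuinfo 2>/dev/null; then
    BIN=mlucas_avx512
else
    BIN=mlucas_avx2
fi
echo "selected binary: $BIN"
# exec "./$BIN" "$@"   # uncomment once both binaries are built
```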
On a single CPU the runtimes would be perhaps ~3.5x as large, so (say) at 5120K we'd expect ~47 msec/iter, only ~10% faster than your 1-core/2-thread timings, and this is at 3.3 GHz vs your 2 GHz:

Code:
18.0
2048  msec/iter =  5.25  ROE[avg,max] = [0.222878714, 0.312500000]  radices = 64 16 32 32 0 0 0 0 0 0
2304  msec/iter =  5.85  ROE[avg,max] = [0.259770659, 0.375000000]  radices = 144 16 16 32 0 0 0 0 0 0
2560  msec/iter =  6.28  ROE[avg,max] = [0.252363335, 0.312500000]  radices = 160 16 16 32 0 0 0 0 0 0
2816  msec/iter =  7.44  ROE[avg,max] = [0.239182557, 0.312500000]  radices = 176 16 16 32 0 0 0 0 0 0
3072  msec/iter =  8.35  ROE[avg,max] = [0.251998996, 0.312500000]  radices = 192 16 16 32 0 0 0 0 0 0
3328  msec/iter =  9.02  ROE[avg,max] = [0.243424657, 0.312500000]  radices = 208 16 16 32 0 0 0 0 0 0
3584  msec/iter =  9.25  ROE[avg,max] = [0.248507344, 0.312500000]  radices = 224 16 16 32 0 0 0 0 0 0
3840  msec/iter = 10.17  ROE[avg,max] = [0.256763639, 0.343750000]  radices = 240 16 16 32 0 0 0 0 0 0
4096  msec/iter = 10.63  ROE[avg,max] = [0.279075387, 0.343750000]  radices = 256 16 16 32 0 0 0 0 0 0
4608  msec/iter = 12.21  ROE[avg,max] = [0.269211099, 0.343750000]  radices = 288 16 16 32 0 0 0 0 0 0
5120  msec/iter = 13.48  ROE[avg,max] = [0.300527545, 0.375000000]  radices = 320 16 16 32 0 0 0 0 0 0
5632  msec/iter = 15.42  ROE[avg,max] = [0.230105748, 0.281250000]  radices = 176 16 32 32 0 0 0 0 0 0
6144  msec/iter = 17.51  ROE[avg,max] = [0.246608585, 0.312500000]  radices = 192 16 32 32 0 0 0 0 0 0
6656  msec/iter = 18.60  ROE[avg,max] = [0.231292347, 0.312500000]  radices = 208 16 32 32 0 0 0 0 0 0

Further, using an avx-512 build on this type of instance should give a nice added speedup, perhaps as much as 1.6x. And if/when a Prime95/mprime build for these systems comes online, that should be faster still.
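The ~47 msec/iter figure comes from scaling the 4-core 5120K timing in the cfg data above by the assumed ~3.5x single-core slowdown; a quick check of that arithmetic:

```shell
# Extrapolate 1-core time from the 4-core mlucas.cfg entry at 5120K.
# 13.48 msec/iter (4 cores) and the 3.5x factor are from the post;
# the result should land near the quoted ~47 msec/iter.
awk 'BEGIN {
    four_core = 13.48            # msec/iter at 5120K on 4 cores
    est_1core = four_core * 3.5  # assumed single-core scaling factor
    printf "estimated 1-core msec/iter at 5120K: %.1f\n", est_1core
}'
```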
Looking more closely at your selftest.log and mlucas.cfg files, I see "Excessive level of roundoff error detected" messages for individual FFT radix sets at 2816K, 3328K, 5120K and 7168K, but in none of those cases did the skipped radix set(s) happen to be the fastest one(s) at the FFT length in question.