Cut the runtime of my Mlucas build on KNL by 10% via tweaked compile options ... a few % came from adding -mavx2 to the compile flags, but most of the gain is from rebuilding the various fused final-iFFT-pass/carry/initial-fFFT-pass routines using the new LOACC flag I added support for this year. That triggers use of a carry macro streamlined by way of a chained-DWT-weights-multiply, instead of the default macro using a 2-table multiply to generate DWT weights and their inverses. One has to be careful to keep the multiply-chain length reasonably short via periodic high-accuracy weights-reinit, since roundoff errors increase in roughly geometric fashion in the chained algo. Using my current chain-length settings, the max. exponent at each given FFT length is roughly 0.5% smaller using LOACC, which is well worth it, given the speedup.
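A toy illustration of the chained-multiply idea, in Python rather than the SIMD carry macros Mlucas actually uses. It assumes the textbook Crandall-Fagin form of the DWT weights, w_j = 2^(ceil(j*p/n) - j*p/n), and compares weights generated by a running multiply (with a periodic re-init) against weights evaluated independently. The function names, the toy parameters P and N, and the use of a direct power evaluation in place of the 2-table multiply are all assumptions made for the sketch.

```python
def direct_weights(p, n):
    """Reference weights w_j = 2**(((-j*p) mod n) / n), each evaluated on its own."""
    return [2.0 ** (((-j * p) % n) / n) for j in range(n)]


def chained_weights(p, n, chain_len):
    """Same weights via a running multiply, re-initialized every chain_len terms."""
    s = p % n
    step = 2.0 ** (-s / n)        # per-index multiplier, computed once
    weights = []
    e = 0                         # exponent numerator: (-j*p) mod n
    w = 1.0
    for j in range(n):
        if j % chain_len == 0:
            w = 2.0 ** (e / n)    # high-accuracy re-init ends the current chain
        weights.append(w)
        if e >= s:                # no wraparound: w_{j+1} = w_j * step
            e -= s
            w *= step
        else:                     # wraparound: extra factor of 2 (exact in binary)
            e += n - s
            w *= 2.0 * step
    return weights


def max_rel_err(approx, exact):
    """Largest relative deviation of approx from exact."""
    return max(abs(a - b) / b for a, b in zip(approx, exact))


P, N = 1000003, 4096              # toy sizes, not a real exponent/FFT-length pair
exact = direct_weights(P, N)
for chain_len in (16, 256, N):
    err = max_rel_err(chained_weights(P, N, chain_len), exact)
    print(f"chain length {chain_len:5d}: max relative error {err:.3e}")
```

The sketch only shows that the error of a chained weight depends on how many multiplies separate it from the last re-init. It says nothing about how that error propagates through the FFT, which is what sets the usable max. exponent.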
The runtimes of my four side-by-side 16-threaded DC runs @ 2304K dropped from 9 ms/iter to 8: Code:
[Sep 25 01:32:29] M40****** Iter# = 29800000 [72.77% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: 3B5EE3838B8FD15F. AvgMaxErr = 0.040980597. MaxErr = 0.058593750.
[Sep 25 01:47:50] M40****** Iter# = 29900000 [73.01% complete] clocks = 00:00:00.000 [ 0.0092 sec/iter] Res64: 3BB36C01A1540D7A. AvgMaxErr = 0.040969252. MaxErr = 0.058593750.
[Sep 25 02:03:02] M40****** Iter# = 30000000 [73.25% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: C849F61FDD6E2180. AvgMaxErr = 0.040978765. MaxErr = 0.058593750.
[Sep 25 02:18:17] M40****** Iter# = 30100000 [73.50% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: 334342BC216510AE. AvgMaxErr = 0.040982743. MaxErr = 0.058593750.
Restarting M40****** at iteration = 30100000. Res64: 334342BC216510AE
M40******: using FFT length 2304K = 2359296 8-byte floats.
Using complex FFT radices 144 16 16 32
[Sep 25 02:36:38] M40****** Iter# = 30200000 [73.74% complete] clocks = 00:00:00.000 [ 0.0083 sec/iter] Res64: E3D39125EA274FA1. AvgMaxErr = 0.045169654. MaxErr = 0.070312500.
[Sep 25 02:50:22] M40****** Iter# = 30300000 [73.99% complete] clocks = 00:00:00.000 [ 0.0082 sec/iter] Res64: E8C9F31475196409. AvgMaxErr = 0.045193331. MaxErr = 0.070312500.

Back to the TF code tomorrow ... I have vectorized all the various floating-double-based modpow routines (using various candidate-factor batch sizes, depending on the vector width of the SIMD build mode), but have to finish tracking down a memory-corruption bug in the 16-candidates-at-a-time routine, which gives the best overall throughput for AVX/AVX2 builds.
Last fiddled with by ewmayer on 2016-09-25 at 07:16
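To put a number on the before/after timings in that log, here is a small Python sketch that averages the sec/iter fields on either side of the restart. The log lines are trimmed to the iteration and timing fields; the helper name and the regular expression are illustrative assumptions, not part of Mlucas.

```python
import re

# Timing fields copied from the log above, other fields trimmed for brevity.
LOG = """\
Iter# = 29800000 [ 0.0091 sec/iter]
Iter# = 29900000 [ 0.0092 sec/iter]
Iter# = 30000000 [ 0.0091 sec/iter]
Iter# = 30100000 [ 0.0091 sec/iter]
Restarting at iteration = 30100000
Iter# = 30200000 [ 0.0083 sec/iter]
Iter# = 30300000 [ 0.0082 sec/iter]
"""


def mean_sec_per_iter(text):
    """Average of every '[ x.xxxx sec/iter]' field found in text."""
    values = [float(v) for v in re.findall(r"\[\s*([0-9.]+) sec/iter\]", text)]
    return sum(values) / len(values)


# Everything before the restart line is the old build, everything after is the new one.
before_text, after_text = LOG.split("Restarting", 1)
before = mean_sec_per_iter(before_text)
after = mean_sec_per_iter(after_text)
reduction = 1.0 - after / before

print(f"before: {before * 1e3:.3f} ms/iter, after: {after * 1e3:.3f} ms/iter")
print(f"per-iteration time reduced by {reduction:.1%}")
```

With only two checkpoints after the restart and timings reported to 4 decimal places, the result is a rough figure rather than a benchmark.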
The 4 DCs I ran over the last few days on the KNL just completed: 3 of the results match the first-time test, one does not.
Anybody feel like running a triple-check on 40953091?
Thanks - here is the PrimeNet exponent status for that one, showing the 2 mismatching Res64s.
@xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code.
@anyone: For some reason I'm having trouble statically linking a binary. To rule out syntax fubars, where should -static go in the following link sequence?
gcc -o Mlucas *o -lm -lrt -lpthread
Last fiddled with by ewmayer on 2016-09-26 at 22:18
The solution that works for me was: Code:
-static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive

========================
On a 64-core KNL, running four DC jobs, each 16-threaded, as I did makes sense ... I just played with some other thread counts on the now-unloaded system, and see that a 72-core system would have been nice for the DCs I was running at FFT length 2304K = 18 * 128K. That's because my FFT code breaks each FFT-mul into 2 steps, each with a different memory access pattern and optimal thread count. The bigger step (~2/3 of runtime) needs #threads to divide [leading FFT radix]/2 in order to keep all threads equally busy, with no threads idling. In my case a leading FFT radix of 144 is best at 2304K, so we'd like to use 18 threads rather than 16, since 144/2 = 72 gets done in 'waves' of this many parallel-executing threads for those 2 thread counts:

16-thread: 16,16,16,16,8 (i.e. final wave uses only half the threads)
18-thread: 18,18,18,18

(Note 36 is out because I see a rapid dropoff in parallel efficiency once I go over ~16 threads using my current code on this system.) The smaller step needs a power-of-2 thread count, i.e. 16. With nothing else running, here are timings for the two options (the 'real' component of each 3-line linux 'time' result reflects wall-clock time):

16+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real 0m8.316s
user 1m31.282s
sys 0m2.219s

18+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real 0m7.878s
user 1m31.222s
sys 0m2.338s

On a 64-core system, four runs at [18+16] threads is out because the 18-thread phases of the four jobs end up competing for the same 'overlap pairs' of physical cores, e.g. job1 might use cpu 0:17 and job2 use 16:33, so cores 16 and 17 are oversubscribed, and the result is slower than just running all four jobs using [16+16] threads with no core-set overlap.
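The wave arithmetic and the core-overlap argument can be checked with a few lines of Python. This is an idealized sketch: the function names are made up for illustration, and it assumes every work unit takes the same time, which real FFT passes only approximate.

```python
def waves(work_units, threads):
    """Split work_units equal-cost tasks into successive waves of at most `threads`."""
    full, rest = divmod(work_units, threads)
    return [threads] * full + ([rest] if rest else [])


def wave_efficiency(work_units, threads):
    """Fraction of thread-slots doing useful work across all waves."""
    return work_units / (len(waves(work_units, threads)) * threads)


def oversubscribed(core_ranges):
    """Cores claimed by more than one job; ranges are inclusive (first, last) pairs."""
    counts = {}
    for first, last in core_ranges:
        for core in range(first, last + 1):
            counts[core] = counts.get(core, 0) + 1
    return sorted(core for core, k in counts.items() if k > 1)


work = 144 // 2  # leading FFT radix 144 gives 72 work units in the bigger step
for t in (16, 18):
    print(t, "threads:", waves(work, t), f"efficiency {wave_efficiency(work, t):.2f}")

# Two 18-thread jobs laid out as in the post: cpu 0:17 and cpu 16:33
print("oversubscribed cores:", oversubscribed([(0, 17), (16, 33)]))
```

The model covers only the bigger step, so the wall-clock gain from 18 threads is smaller than the wave count alone would suggest, as the 'real' timings above show.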
On a 72-core system it would (will) be interesting to compare which is faster: four jobs using [18+16], or five jobs using 4x[16,16], 1x[8,8] threads.

[ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lrt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status

Here's the command line from one of my projects: Code:
g++ ../Main.cpp -I ../Source -std=c++14 -fno-rtti -Wall -Wno-unused-function -Wno-unused-variable -save-temps -O2 -D YMP_STANDALONE -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -march=knl -D X64_16_KnightsLanding -D YMP_BUILD_RELEASE -o "y-cruncher/Binaries/x64 AVX512CD"
Last fiddled with by Mysticial on 2016-09-27 at 01:57

Anyhoo, @xathor, I've attached a zipped copy of my shared-lib binary, in hopes your setup is the same as, or similar enough to, the CentOS+GCC5.1 install on our shared-dev KNL system to allow it to run. Note this is the faster but slightly less accurate build I mentioned getting a 10% speedup from. Since I've not yet modified my self-test functions to use suitably smaller self-test exponents for such LOACC builds, if you try the default self-tests, most will fail with fatal roundoff errors. So if the binary does run for you, just go ahead and propagate the following mlucas.cfg file (based on the higher-accuracy default build) to your various rundirs. I don't expect LOACC mode to affect the best-FFT-radix-set, so no worries about suboptimal FFT params on that account. The per-iter timings here reflect 16-threaded run mode, but I found the same FFT params to be best at smaller thread counts as well:

16.0
2304 msec/iter = 8.38 ROE[avg,max] = [0.277738559, 0.375000000] radices = 144 16 16 32 0 0 0 0 0 0
2560 msec/iter = 8.93 ROE[avg,max] = [0.275268696, 0.328125000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 10.25 ROE[avg,max] = [0.260132906, 0.343750000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 11.44 ROE[avg,max] = [0.269535088, 0.343750000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 11.97 ROE[avg,max] = [0.269532634, 0.343750000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 12.23 ROE[avg,max] = [0.259974261, 0.312500000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 12.57 ROE[avg,max] = [0.285017820, 0.375000000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 13.71 ROE[avg,max] = [0.285594508, 0.343750000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 14.53 ROE[avg,max] = [0.349673808, 0.437500000] radices = 288 16 16 32 0 0 0 0 0 0

