#199
∂²ω=0
Sep 2002
República de California
26753₈ Posts
Cut the runtime of my Mlucas build on KNL by 10% via tweaked compile options ... a few % came from adding -mavx2 to the compile flags, but most of the gain came from rebuilding the various fused final-iFFT-pass/carry/initial-fFFT-pass routines using the new LOACC flag I added support for this year. That triggers use of a carry macro streamlined by way of a chained multiply of the DWT weights, instead of the default macro, which uses a 2-table multiply to generate the DWT weights and their inverses. One has to be careful to keep the multiply-chain length reasonably short via periodic high-accuracy re-init of the weights, since roundoff errors grow in roughly geometric fashion in the chained algo. Using my current chain-length settings, the max exponent at each given FFT length is roughly 0.5% smaller using LOACC, which is well worth it given the speedup.
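To make the chained-weights idea concrete, here is a minimal standalone sketch (illustrative C only, not the actual Mlucas carry macro; the exponent, FFT length and re-init stride are placeholder values). The IBDWT weight of word j is w_j = 2^(ceil(j*p/n) - j*p/n), so successive weights differ by one of just two precomputable multipliers, and the geometric error growth of the chain can be capped by periodically recomputing an exact weight:

Code:
#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const int64_t p = 44000003;	/* Mersenne exponent (placeholder) */
	const int64_t n = 2359296;	/* FFT length: 2304K 8-byte floats */
	const int     REINIT = 64;	/* chain length between exact re-inits */

	/* Successive weights satisfy w_{j} = w_{j-1} * 2^(b - p/n), where
	   b = ceil(j*p/n) - ceil((j-1)*p/n) is the bit-size of the word and
	   takes only the two values floor(p/n) and floor(p/n)+1: */
	const double  frac = (double)p/(double)n;
	const int64_t bw   = p/n;
	const double  mul_small = pow(2.0, (double)bw     - frac);
	const double  mul_big   = pow(2.0, (double)(bw+1) - frac);

	double  w = 1.0;	/* exact: w_0 = 2^0 */
	int64_t c = 0;  	/* c_j = ceil(j*p/n) */
	double  maxerr = 0.0;

	for (int64_t j = 1; j <= 4096; j++) {	/* sample of the full length-n loop */
		int64_t cnew = (j*p + n - 1)/n;	/* ceil(j*p/n) via integer math */
		w *= (cnew - c == bw) ? mul_small : mul_big;	/* chained update */
		c = cnew;
		/* exact weight, used for the periodic re-init and to measure error;
		   (c*n - j*p) is an exact integer in [0,n), so this is high-accuracy: */
		double wex = pow(2.0, (double)(c*n - j*p)/(double)n);
		double err = fabs(w - wex)/wex;
		if (err > maxerr) maxerr = err;
		if (j % REINIT == 0) w = wex;	/* cap the geometric error growth */
	}
	printf("max relative error of chained weights = %.3e\n", maxerr);
	return 0;
}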
The runtimes of my four side-by-side 16-threaded DC runs @2304K dropped from 9 ms/iter to 8:

Code:
[Sep 25 01:32:29] M40****** Iter# = 29800000 [72.77% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: 3B5EE3838B8FD15F. AvgMaxErr = 0.040980597. MaxErr = 0.058593750.
[Sep 25 01:47:50] M40****** Iter# = 29900000 [73.01% complete] clocks = 00:00:00.000 [ 0.0092 sec/iter] Res64: 3BB36C01A1540D7A. AvgMaxErr = 0.040969252. MaxErr = 0.058593750.
[Sep 25 02:03:02] M40****** Iter# = 30000000 [73.25% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: C849F61FDD6E2180. AvgMaxErr = 0.040978765. MaxErr = 0.058593750.
[Sep 25 02:18:17] M40****** Iter# = 30100000 [73.50% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: 334342BC216510AE. AvgMaxErr = 0.040982743. MaxErr = 0.058593750.
Restarting M40****** at iteration = 30100000. Res64: 334342BC216510AE
M40******: using FFT length 2304K = 2359296 8-byte floats.
Using complex FFT radices 144 16 16 32
[Sep 25 02:36:38] M40****** Iter# = 30200000 [73.74% complete] clocks = 00:00:00.000 [ 0.0083 sec/iter] Res64: E3D39125EA274FA1. AvgMaxErr = 0.045169654. MaxErr = 0.070312500.
[Sep 25 02:50:22] M40****** Iter# = 30300000 [73.99% complete] clocks = 00:00:00.000 [ 0.0082 sec/iter] Res64: E8C9F31475196409. AvgMaxErr = 0.045193331. MaxErr = 0.070312500.

Back to the TF code tomorrow ... I have parallelized all the various floating-double-based modpow routines (using various candidate-factor batch sizes, depending on the vector width of the SIMD build mode), but still have to finish tracking down a memory-corruption bug in the 16-candidates-at-a-time routine, which gives the best overall throughput for AVX/AVX2 builds.

Last fiddled with by ewmayer on 2016-09-25 at 07:16
#200
∂²ω=0
Sep 2002
República de California
5·2,351 Posts
The 4 DCs I ran over the last few days on the KNL just completed - 3 of the results match their respective first-time tests, one does not.
Anybody feel like running a triple-check on M40953091?
#201
Jun 2016
19 Posts
#202
Sep 2016
19 Posts
#203
∂²ω=0
Sep 2002
República de California
5×2,351 Posts
Thanks - here is the primenet exponent status for that one, showing the 2 mismatching Res64s.
@xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code.

@anyone: For some reason I'm having trouble statically linking a binary - to rule out syntax fubars, where should -static go in the following link sequence?

Code:
gcc -o Mlucas *o -lm -lrt -lpthread

Last fiddled with by ewmayer on 2016-09-26 at 22:18
#204
"David"
Jul 2015
Ohio
11·47 Posts
#205
Sep 2016
2·5·37 Posts
The solution that worked for me was:

Code:
-static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive
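Applied to the link sequence from post #203, that would look something like the following (an untested sketch; note a fully static link also needs the static archives of libm/librt/libc to be installed, e.g. via a distro package such as glibc-static):

Code:
gcc -static -o Mlucas *o -lm -lrt -Wl,--whole-archive -lpthread -Wl,--no-whole-archive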
#206
∂²ω=0
Sep 2002
República de California
11755₁₀ Posts
On a 64-core KNL, running four DC jobs each 16-threaded as I did makes sense ... I just played with some other thread counts on the now-unloaded system, and see that a 72-core system would have been nice for the DCs I was running at FFT length 2304K = 18 × 128K. That's because my FFT code breaks each FFT-mul into 2 steps, each with a different memory-access pattern and optimal thread count. The bigger step (~2/3 of the runtime) needs #threads to divide [leading FFT radix]/2 in order to keep all threads equally busy, with none idling. In my case a leading FFT radix of 144 is best at 2304K, so we'd like to use 18 threads rather than 16, since the 144/2 = 72 independent work units get done in 'waves' of parallel-executing threads - for those 2 thread counts:

16-thread: 16,16,16,16,8 (i.e. the final wave uses only half the threads)
18-thread: 18,18,18,18

(Note: 36 is out because I see a rapid dropoff in parallel efficiency once I go over ~16 threads using my current code on this system.)

The smaller step needs a power-of-2 thread count, i.e. 16. With nothing else running, here are timings for the two options (the 'real' component of each 3-line linux 'time' result reflects wall-clock time):

Code:
16+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real    0m8.316s
user    1m31.282s
sys     0m2.219s

18+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real    0m7.878s
user    1m31.222s
sys     0m2.338s

On a 64-core system, four runs at [18+16] threads is out because the 18-thread phases of the four jobs end up competing for the same 'overlap pair' physical cores - e.g. job 1 might use cpus 0:17 and job 2 cpus 16:33, so cores 16 and 17 are oversubscribed - and the result is slower than just running all four jobs at [16+16] threads with no core-set overlap. On a 72-core system it would (will) be interesting to compare which is faster: four jobs using [18+16] threads, or five jobs using 4×[16,16] plus 1×[8,8] threads.
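For the curious, the wave arithmetic above can be sanity-checked with a few lines of throwaway C (illustrative only, not Mlucas code):

Code:
#include <stdio.h>

int main(void)
{
	const int units = 72;	/* leading radix 144 gives 144/2 = 72 work units */
	const int counts[] = {16, 18, 36};
	for (int i = 0; i < 3; i++) {
		const int T = counts[i];
		const int waves = (units + T - 1)/T;	/* ceil(units/T) */
		const int last  = units - (waves - 1)*T;	/* busy threads in final wave */
		printf("%2d threads: %d waves, final wave uses %2d of %2d threads\n",
			T, waves, last, T);
	}
	return 0;
}

which prints:

Code:
16 threads: 5 waves, final wave uses  8 of 16 threads
18 threads: 4 waves, final wave uses 18 of 18 threads
36 threads: 2 waves, final wave uses 36 of 36 threads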
#207
∂²ω=0
Sep 2002
República de California
11755₁₀ Posts
Code:
[ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lrt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
#208
Sep 2016
2×5×37 Posts
Here's the command line from one of my projects:

Code:
g++ ../Main.cpp -I ../Source -std=c++14 -fno-rtti -Wall -Wno-unused-function -Wno-unused-variable -save-temps -O2 -D YMP_STANDALONE -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -march=knl -D X64_16_KnightsLanding -D YMP_BUILD_RELEASE -o "y-cruncher/Binaries/x64 AVX512-CD"

Last fiddled with by Mysticial on 2016-09-27 at 01:57
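An aside on the 'cannot find -lm/-lrt/-lpthread/-lc' errors in the previous post: with -static in effect the linker looks only for the .a archives, so those errors usually mean the static C-library archives simply aren't installed. On a CentOS-type system (the shared-dev KNL runs CentOS, per the next post), the hypothetical fix would be along the lines of:

Code:
sudo yum install glibc-static

after which libm.a, librt.a, libpthread.a and libc.a should appear under /usr/lib64.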
#209
∂²ω=0
Sep 2002
República de California
10110111101011₂ Posts
Anyhoo, @xathor, I've attached a zipped copy of my shared-lib binary, in hopes that your setup is the same as, or similar enough to, the CentOS+GCC5.1 install on our shared-dev KNL system to allow it to run. Note this is the faster but slightly less accurate build I mentioned getting a 10% speedup from. Since I've not yet modified my self-test functions to use suitably smaller self-test exponents for such LOACC builds, most of the default self-tests will fail with fatal roundoff errors if you try them. So if the binary does run for you, just go ahead and propagate the following mlucas.cfg file (based on the higher-accuracy default build) to your various rundirs - I don't expect LOACC mode to affect the best-FFT-radix set, so no worries about suboptimal FFT params on that account. The per-iter timings here reflect 16-threaded run mode, but I found the same FFT params to be best at smaller thread counts as well:

Code:
16.0
2304  msec/iter =    8.38  ROE[avg,max] = [0.277738559, 0.375000000]  radices = 144 16 16 32  0  0  0  0  0  0
2560  msec/iter =    8.93  ROE[avg,max] = [0.275268696, 0.328125000]  radices = 160 16 16 32  0  0  0  0  0  0
2816  msec/iter =   10.25  ROE[avg,max] = [0.260132906, 0.343750000]  radices = 176 16 16 32  0  0  0  0  0  0
3072  msec/iter =   11.44  ROE[avg,max] = [0.269535088, 0.343750000]  radices = 192 16 16 32  0  0  0  0  0  0
3328  msec/iter =   11.97  ROE[avg,max] = [0.269532634, 0.343750000]  radices = 208 16 16 32  0  0  0  0  0  0
3584  msec/iter =   12.23  ROE[avg,max] = [0.259974261, 0.312500000]  radices = 224 16 16 32  0  0  0  0  0  0
3840  msec/iter =   12.57  ROE[avg,max] = [0.285017820, 0.375000000]  radices = 240 16 16 32  0  0  0  0  0  0
4096  msec/iter =   13.71  ROE[avg,max] = [0.285594508, 0.343750000]  radices = 256 16 16 32  0  0  0  0  0  0
4608  msec/iter =   14.53  ROE[avg,max] = [0.349673808, 0.437500000]  radices = 288 16 16 32  0  0  0  0  0  0
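Propagating the file to a multi-rundir setup like the four-job one described earlier in the thread is a one-liner (directory names here are hypothetical):

Code:
for d in run1 run2 run3 run4; do cp mlucas.cfg "$d"/; done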