mersenneforum.org Your help wanted - Let's buy GIMPS a KNL development system!

2016-09-25, 07:16   #199
ewmayer
∂²ω=0

Sep 2002
República de California

26753₈ Posts

Cut the runtime of my Mlucas build on KNL by 10% via tweaked compile options ... a few % came from adding -mavx2 to the compile flags, but most of the gain is from rebuilding the various fused final-iFFT-pass/carry/initial-fFFT-pass routines using the new LOACC flag I added support for this year. That triggers use of a carry macro streamlined by way of a chained-DWT-weights-multiply, instead of the default macro using a 2-table multiply to generate DWT weights and their inverses. One has to be careful to keep the multiply-chain length reasonably short via periodic high-accuracy weights-re-init, since roundoff errors increase in roughly geometric fashion in the chained algo. Using my current chain length settings, the max. exponent at each given FFT length is roughly 0.5% smaller using LOACC, which is well worth it, given the speedup.

The runtimes of my four side-by-side 16-threaded DC runs @2304K dropped from 9ms/iter to 8:

Code:
[Sep 25 01:32:29] M40****** Iter# = 29800000 [72.77% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: 3B5EE3838B8FD15F. AvgMaxErr = 0.040980597. MaxErr = 0.058593750.
[Sep 25 01:47:50] M40****** Iter# = 29900000 [73.01% complete] clocks = 00:00:00.000 [ 0.0092 sec/iter] Res64: 3BB36C01A1540D7A. AvgMaxErr = 0.040969252. MaxErr = 0.058593750.
[Sep 25 02:03:02] M40****** Iter# = 30000000 [73.25% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: C849F61FDD6E2180. AvgMaxErr = 0.040978765. MaxErr = 0.058593750.
[Sep 25 02:18:17] M40****** Iter# = 30100000 [73.50% complete] clocks = 00:00:00.000 [ 0.0091 sec/iter] Res64: 334342BC216510AE. AvgMaxErr = 0.040982743. MaxErr = 0.058593750.
Restarting M40****** at iteration = 30100000. Res64: 334342BC216510AE
M40******: using FFT length 2304K = 2359296 8-byte floats.
Using complex FFT radices 144 16 16 32
[Sep 25 02:36:38] M40****** Iter# = 30200000 [73.74% complete] clocks = 00:00:00.000 [ 0.0083 sec/iter] Res64: E3D39125EA274FA1. AvgMaxErr = 0.045169654. MaxErr = 0.070312500.
[Sep 25 02:50:22] M40****** Iter# = 30300000 [73.99% complete] clocks = 00:00:00.000 [ 0.0082 sec/iter] Res64: E8C9F31475196409. AvgMaxErr = 0.045193331. MaxErr = 0.070312500.

Even better, the LOACC carry macros will be much easier to port to AVX-512, since they dispense with the intricate array-index manipulations needed by the 2-table scheme. (Those are made intricate by the combination of a basic 2-table scheme and the fact that my implementation uses the *same* precomputed small table for both forward and inverse weights, by making use of the DWT-weights identity wt[j] * wt[n-j] = 2.)

Back to the TF code tomorrow ... I have ||ized all the various floating-double-based modpow routines (using various candidate-factor batch sizes, depending on the vector width of the SIMD build mode), but have to finish tracking down a memory-corruption bug in the 16-candidates-at-a-time routine which gives the best overall throughput for AVX/AVX2 builds.

Last fiddled with by ewmayer on 2016-09-25 at 07:16

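For readers following the LOACC discussion in post #199, here is a minimal Python sketch of the two ideas mentioned there: the identity wt[j] * wt[n-j] = 2, and generating weights by a chained multiply with periodic re-initialization. It assumes the usual IBDWT weight definition wt[j] = 2^(ceil(j*p/n) - j*p/n) with gcd(p, n) = 1. The function names, the toy values of p and n, and the chain_len parameter are illustrative only and are not taken from Mlucas, whose actual carry macros are not shown in the thread.

```python
def dwt_weights_direct(p, n):
    """Weights wt[j] = 2**(ceil(j*p/n) - j*p/n), each computed independently."""
    # (-j*p) % n is the exact integer numerator of ceil(j*p/n) - j*p/n.
    return [2.0 ** (((-j * p) % n) / n) for j in range(n)]


def dwt_weights_chained(p, n, chain_len):
    """Same weights via a running multiply, re-initialized every chain_len steps."""
    step_num = (-p) % n            # exponent numerator added per index step
    step = 2.0 ** (step_num / n)   # constant ratio between neighbouring weights
    weights = []
    w, num = 1.0, 0
    for j in range(n):
        if j % chain_len == 0:     # periodic high-accuracy re-init
            num = (-j * p) % n
            w = 2.0 ** (num / n)
        weights.append(w)
        # Advance the chain to index j+1.
        num += step_num
        w *= step
        if num >= n:               # fractional exponent wrapped past 1: halve
            num -= n
            w *= 0.5
    return weights


# Toy parameters (assumed for illustration, far smaller than a real run).
p, n = 127, 16
direct = dwt_weights_direct(p, n)

# The identity quoted in the post, checked for 0 < j < n.
identity_ok = all(abs(direct[j] * direct[n - j] - 2.0) < 1e-12 for j in range(1, n))
print("wt[j] * wt[n-j] == 2 for 0 < j < n:", identity_ok)

# Deviation of the chained weights from the direct ones for a few chain lengths.
for chain_len in (1, 4, 16):
    chained = dwt_weights_chained(p, n, chain_len)
    err = max(abs(a - b) / a for a, b in zip(direct, chained))
    print(f"chain_len={chain_len:2d}  max relative deviation = {err:.3e}")
```

The wrap test uses the integer numerator rather than the floating-point value, so accumulated roundoff in the chain cannot flip the decision about when to halve. With a toy n like this the deviations are tiny; the accuracy cost described in the post only becomes visible at real FFT lengths and chain settings.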
2016-09-26, 08:03   #200
ewmayer
∂²ω=0

Sep 2002
República de California

5·2,351 Posts

The 4 DCs I ran over the last few days on the KNL just completed - 3 of the results match the first-time test, one does not. Anybody feel like running a triple-check on 40953091?

2016-09-26, 09:38   #201
proxy2222

Jun 2016

19 Posts

Quote:
 Originally Posted by ewmayer The 4 DCs I ran over the last few days on the KNL just completed - 3 of the results match the first-time test, one does not. Anybody feel like running a triple-check on 40953091?
Ok, got it. ETA: 47 hours.

2016-09-26, 21:00   #202
xathor

Sep 2016

19 Posts

Quote:
 Originally Posted by proxy2222 Ok, got it. ETA: 47 hours.
Unfortunately I can't give you guys access to the boxes, but if ewmayer wants a long-term test run, you could send me a tar file with the application and I'll run it for up to a week or so. I can give you my email address if needed.

2016-09-26, 21:56   #203
ewmayer
∂²ω=0

Sep 2002
República de California

5×2,351 Posts

Quote:
 Originally Posted by proxy2222 Ok, got it. ETA: 47 hours.
Thanks - here is the primenet exponent status for that one, showing the 2 mismatching Res64s.

@xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code.

@anyone: For some reason I'm having trouble statically linking a binary - to rule out syntax fubars, where should -static go in the following link sequence?

gcc -o Mlucas *o -lm -lrt -lpthread

Last fiddled with by ewmayer on 2016-09-26 at 22:18

2016-09-27, 00:27   #204
airsquirrels

"David"
Jul 2015
Ohio

11·47 Posts

Quote:
 Originally Posted by ewmayer Thanks - here is the primenet exponent status for that one, showing the 2 mismatching Res64s. @xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code. @anyone: For some reason I'm having trouble statically linking a binary - to rule out syntax fubars, where should -static go in the following link sequence? gcc -o Mlucas *o -lm -lrt -lpthread
What library are you static linking? It is usually not necessary for the listed libs.

2016-09-27, 00:53   #205
Mysticial

Sep 2016

2·5·37 Posts

Quote:
 Originally Posted by ewmayer Thanks - here is the primenet exponent status for that one, showing the 2 mismatching Res64s. @xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code. @anyone: For some reason I'm having trouble statically linking a binary - to rule out syntax fubars, where should -static go in the following link sequence? gcc -o Mlucas *o -lm -lrt -lpthread
-lpthread has historically caused problems for me. While it may compile fine, it tends to crash at run-time if you don't do it right.

The solution that works for me was:

Code:
-static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive

2016-09-27, 01:20   #206
ewmayer
∂²ω=0

Sep 2002
República de California

11755₁₀ Posts

Quote:
 Originally Posted by airsquirrels What library are you static linking? It is usually not necessary for the listed libs.
I just want a binary I can shoot to xathor which will run even if his lib versions are not identical to the ones on our shared system.

========================

(Note that 36 is out because I see a rapid dropoff in || efficiency once I go over ~16 threads using my current code on this system.)

The smaller step needs a power-of-2 thread count, i.e. 16. With nothing else running, here are timings for the two options (the 'real' component of each 3-line Linux 'time' result reflects wall-clock time):

16+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real 0m8.316s
user 1m31.282s
sys 0m2.219s

18+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real 0m7.878s
user 1m31.222s
sys 0m2.338s

On a 64-core system, four runs at [18+16]-threads is out because the 18-thread phases of the four jobs end up competing for the same 'overlap pairs' of physical cores, e.g. job1 might use cpu 0:17 and job2 use 16:33, thus cores 16 and 17 are oversubscribed, and the result is slower than just running all four jobs using [16+16]-threads with no coreset overlap. On a 72-core system it would (will) be interesting to compare which is faster: four jobs using [18+16], or five jobs using 4x[16,16], 1x[8,8]-threads.
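The core-overlap argument above can be checked with a few lines of Python. This is only a sketch under stated assumptions: each job pins one thread per core on a contiguous range, "0:17" means cores 0 through 17 inclusive, and the helper name oversubscribed_cores and the job layouts are made up for illustration rather than taken from Mlucas.

```python
from collections import Counter


def oversubscribed_cores(jobs):
    """jobs: list of (first_core, n_threads). Returns cores claimed by 2+ jobs."""
    usage = Counter()
    for first_core, n_threads in jobs:
        # One thread per core on a contiguous range (an assumption).
        usage.update(range(first_core, first_core + n_threads))
    return sorted(core for core, count in usage.items() if count > 1)


# The two-job example from the post: job1 on 0:17, job2 on 16:33.
print(oversubscribed_cores([(0, 18), (16, 18)]))

# Four 16-thread jobs on a 64-core system: disjoint core sets.
print(oversubscribed_cores([(0, 16), (16, 16), (32, 16), (48, 16)]))

# Four 18-thread jobs spaced 18 apart, which needs 72 cores to fit.
print(oversubscribed_cores([(0, 18), (18, 18), (36, 18), (54, 18)]))
```

The same helper can be fed any candidate layout, for example the 4x16 + 1x8 split mentioned for the 72-core case, before committing to a long run.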

2016-09-27, 01:24   #207
ewmayer
∂²ω=0

Sep 2002
República de California

11755₁₀ Posts

Quote:
 Originally Posted by Mysticial -lpthread has historically caused problems for me. While it may compile fine, it tends to crash at run-time if you don't do it right. The solution that works for me was: Code: -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive
So the -static needs to be on a per-library basis? Just tried this:

[ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lrt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status

2016-09-27, 01:51   #208
Mysticial

Sep 2016

2×5×37 Posts

Quote:
 Originally Posted by ewmayer So the -static needs to be on a per-library basis? Just tried this: [ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread /usr/bin/ld: cannot find -lm /usr/bin/ld: cannot find -lrt /usr/bin/ld: cannot find -lpthread /usr/bin/ld: cannot find -lc collect2: error: ld returned 1 exit status
Not per library. Just pthread. I believe static is only specified once at the beginning. But I'm not 100% sure.

Here's the command line from one of my projects:
Code:
g++ ../Main.cpp -I ../Source -std=c++14 -fno-rtti -Wall -Wno-unused-function -Wno-unused-variable -save-temps -O2 -D YMP_STANDALONE -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -march=knl -D X64_16_KnightsLanding -D YMP_BUILD_RELEASE -o "y-cruncher/Binaries/x64 AVX512-CD"
Granted, there's a slight difference here in that I'm not linking separately and that -lpthread is the only thing I'm explicitly linking. -lm is pulled in automatically for C++. So I'm unsure of the multiple library case. This is what worked for me after hours of trial-and-error.

Last fiddled with by Mysticial on 2016-09-27 at 01:57

2016-09-27, 05:49   #209
ewmayer
∂²ω=0

Sep 2002
República de California

10110111101011₂ Posts

Quote:
 Originally Posted by ewmayer So the -static needs to be on a per-library basis? Just tried this: [ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread /usr/bin/ld: cannot find -lm /usr/bin/ld: cannot find -lrt /usr/bin/ld: cannot find -lpthread /usr/bin/ld: cannot find -lc collect2: error: ld returned 1 exit status
BTW, I get the same set of linker errors using the -static flag the way I always understood it, that is, just the once immediately following the *.o ('link all object-files in the local dir') and preceding the set of library references. David, is it possible this system does not support static linkage?

Anyhoo, @xathor, I've attached a zipped copy of my shared-lib binary, in hopes your setup is the same or similar enough to the CentOS+GCC5.1 install on our shared-dev KNL system to allow it to run. Note this is the faster but slightly less accurate build I mentioned getting a 10% speedup from. Since I've not yet modified my self-test functions to use suitably smaller self-test exponents for such LOACC builds, if you try the default self-tests, most will fail with fatal roundoff errors. So if the binary does run for you, just go ahead and propagate the following mlucas.cfg file (based on the higher-accuracy default build) to your various rundirs - I don't expect LOACC mode to affect the best-FFT-radix-set, so no worries about suboptimal FFT params on that account. The per-iter timings here reflect 16-threaded run mode, but I found the same FFT params to be best at smaller thread counts as well:

16.0
2304 msec/iter = 8.38 ROE[avg,max] = [0.277738559, 0.375000000] radices = 144 16 16 32 0 0 0 0 0 0
2560 msec/iter = 8.93 ROE[avg,max] = [0.275268696, 0.328125000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 10.25 ROE[avg,max] = [0.260132906, 0.343750000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 11.44 ROE[avg,max] = [0.269535088, 0.343750000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 11.97 ROE[avg,max] = [0.269532634, 0.343750000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 12.23 ROE[avg,max] = [0.259974261, 0.312500000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 12.57 ROE[avg,max] = [0.285017820, 0.375000000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 13.71 ROE[avg,max] = [0.285594508, 0.343750000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 14.53 ROE[avg,max] = [0.349673808, 0.437500000] radices = 288 16 16 32 0 0 0 0 0 0
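For anyone wanting to inspect a cfg file like the one above programmatically, here is a small Python sketch. The line layout is inferred only from the entries shown in this post; the regex, the function name parse_cfg and the dictionary keys are assumptions, not an official Mlucas format description. The sanity check relies on the relation visible in the thread's own numbers: twice the product of the nonzero complex radices equals the real FFT length (144·16·16·32·2 = 2359296 = 2304K).

```python
import math
import re

# Pattern inferred from the lines quoted in the post (an assumption).
CFG_LINE = re.compile(
    r"^(\d+)\s+msec/iter\s*=\s*([\d.]+)\s+"
    r"ROE\[avg,max\]\s*=\s*\[([\d.]+),\s*([\d.]+)\]\s+"
    r"radices\s*=\s*([\d ]+)$"
)


def parse_cfg(text):
    """Return one dict per recognized FFT-length line; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        m = CFG_LINE.match(line.strip())
        if not m:
            continue  # e.g. the leading version line "16.0"
        radices = [int(r) for r in m.group(5).split() if int(r) != 0]
        entries.append({
            "fft_k": int(m.group(1)),          # FFT length in K (1024 doubles)
            "msec_per_iter": float(m.group(2)),
            "roe_avg": float(m.group(3)),
            "roe_max": float(m.group(4)),
            "radices": radices,                # trailing zero padding dropped
        })
    return entries


sample = """16.0
2304 msec/iter = 8.38 ROE[avg,max] = [0.277738559, 0.375000000] radices = 144 16 16 32 0 0 0 0 0 0
2560 msec/iter = 8.93 ROE[avg,max] = [0.275268696, 0.328125000] radices = 160 16 16 32 0 0 0 0 0 0
"""

for entry in parse_cfg(sample):
    # Twice the product of the complex radices should match the real FFT length.
    consistent = 2 * math.prod(entry["radices"]) == entry["fft_k"] * 1024
    print(entry["fft_k"], entry["msec_per_iter"], entry["radices"], consistent)
```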
Attached Files
 Mlucas_loacc.bz2 (1.53 MB, 89 views)
