mersenneforum.org  

Old 2016-09-25, 07:16   #199
ewmayer
2ω=0
 
Sep 2002
República de California

26753₈ Posts

Cut the runtime of my Mlucas build on KNL by 10% via tweaked compile options ... a few % came from adding -mavx2 to the compile flags, but most of the gain is from rebuilding the various fused final-iFFT-pass/carry/initial-fFFT-pass routines using the new LOACC flag I added support for this year. That triggers use of a carry macro streamlined by way of a chained-DWT-weights-multiply, instead of the default macro using a 2-table multiply to generate DWT weights and their inverses. One has to be careful to keep the multiply-chain length reasonably short via periodic high-accuracy weights-re-init, since roundoff errors increase in roughly geometric fashion in the chained algo. Using my current chain length settings, the max. exponent at each given FFT length is roughly 0.5% smaller using LOACC, which is well worth it, given the speedup.
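
The chained-weights idea can be illustrated outside Mlucas. The sketch below is not taken from the Mlucas source: it assumes the standard Crandall-Fagin IBDWT weight definition wt[j] = 2^(ceil(j*p/n) - j*p/n), and the function names, the toy sizes (p = 1279, n = 64) and the chain length are all hypothetical. It builds each weight from the previous one with a single multiply, re-initializes from a direct evaluation every `chain_len` terms, and measures how far the chained values drift from the directly computed ones.

```python
def direct_weight(j, p, n):
    """Assumed Crandall-Fagin weight 2**(ceil(j*p/n) - j*p/n).

    The exponent is the fractional part of -j*p/n, obtained with exact
    integer arithmetic before the single floating-point power.
    """
    return 2.0 ** (((-j * p) % n) / n)


def chained_weights(p, n, chain_len):
    """Generate all n weights by repeated multiplication by wt[1].

    Every chain_len terms the running value is re-initialized from a
    direct evaluation, which bounds the accumulated roundoff.
    """
    w1 = direct_weight(1, p, n)
    weights = []
    w = 1.0
    for j in range(n):
        if j % chain_len == 0:
            w = direct_weight(j, p, n)  # periodic high-accuracy re-init
        weights.append(w)
        w *= w1          # chained multiply: exponent advances by frac(-p/n)
        if w >= 2.0:     # exponent wrapped past 1, so fold back into [1, 2)
            w *= 0.5
    return weights


def max_rel_error(p, n, chain_len):
    """Worst relative deviation of the chained weights from direct ones."""
    chained = chained_weights(p, n, chain_len)
    return max(
        abs(w - direct_weight(j, p, n)) / direct_weight(j, p, n)
        for j, w in enumerate(chained)
    )
```

Calling `max_rel_error(1279, 64, 8)` and then `max_rel_error(1279, 64, 64)` gives a feel for the trade-off: a shorter chain costs more direct evaluations but accumulates less roundoff. The sketch only covers weight generation, not the carry step itself.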

The runtimes of my four side-by-side 16-threaded DC runs @2304K dropped from 9ms/iter to 8:
Code:
[Sep 25 01:32:29] M40****** Iter# = 29800000 [72.77% complete] clocks = 00:00:00.000 [  0.0091 sec/iter] Res64: 3B5EE3838B8FD15F. AvgMaxErr = 0.040980597. MaxErr = 0.058593750.
[Sep 25 01:47:50] M40****** Iter# = 29900000 [73.01% complete] clocks = 00:00:00.000 [  0.0092 sec/iter] Res64: 3BB36C01A1540D7A. AvgMaxErr = 0.040969252. MaxErr = 0.058593750.
[Sep 25 02:03:02] M40****** Iter# = 30000000 [73.25% complete] clocks = 00:00:00.000 [  0.0091 sec/iter] Res64: C849F61FDD6E2180. AvgMaxErr = 0.040978765. MaxErr = 0.058593750.
[Sep 25 02:18:17] M40****** Iter# = 30100000 [73.50% complete] clocks = 00:00:00.000 [  0.0091 sec/iter] Res64: 334342BC216510AE. AvgMaxErr = 0.040982743. MaxErr = 0.058593750.
Restarting M40****** at iteration = 30100000. Res64: 334342BC216510AE
M40******: using FFT length 2304K = 2359296 8-byte floats.
Using complex FFT radices       144        16        16        32
[Sep 25 02:36:38] M40****** Iter# = 30200000 [73.74% complete] clocks = 00:00:00.000 [  0.0083 sec/iter] Res64: E3D39125EA274FA1. AvgMaxErr = 0.045169654. MaxErr = 0.070312500.
[Sep 25 02:50:22] M40****** Iter# = 30300000 [73.99% complete] clocks = 00:00:00.000 [  0.0082 sec/iter] Res64: E8C9F31475196409. AvgMaxErr = 0.045193331. MaxErr = 0.070312500.
Even better, the LOACC carry macros will be much easier to port to AVX-512, since they dispense with the intricate array-index manipulations needed by the 2-table scheme. (Those are made intricate by the combination of a basic 2-table scheme and the fact that my implementation uses the *same* precomputed small table for both forward and inverse weights, by making use of the DWT-weights identity wt[j] * wt[n-j] = 2.)
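
The identity wt[j] * wt[n-j] = 2 can be checked numerically. The following is a minimal sketch, assuming the standard Crandall-Fagin weights wt[j] = 2^(ceil(j*p/n) - j*p/n) rather than anything copied from Mlucas; the function names and the toy sizes are hypothetical. The identity holds for 0 < j < n whenever j*p/n is not an integer, which is the case when p is prime and larger than n.

```python
def dwt_weight(j, p, n):
    """Assumed forward DWT weight 2**(ceil(j*p/n) - j*p/n)."""
    return 2.0 ** (((-j * p) % n) / n)


def inverse_weight_from_forward(j, p, n):
    """Inverse weight read off the forward values via wt[j] * wt[n-j] = 2."""
    return 0.5 * dwt_weight(n - j, p, n)


def identity_holds(p, n, tol=1e-12):
    """Check wt[j] * wt[n-j] == 2 (to within tol) for every 0 < j < n."""
    return all(
        abs(dwt_weight(j, p, n) * dwt_weight(n - j, p, n) - 2.0) <= tol
        for j in range(1, n)
    )
```

This is why one table can serve both directions: the inverse weight 1/wt[j] is just wt[n-j]/2, at the price of the mirrored index n-j.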

Back to the TF code tomorrow ... I have ||ized all the various floating-double-based modpow routines (using various candidate-factor batch sizes, depending on the vector width of the SIMD build mode), but have to finish tracking down a memory-corruption bug in the 16-candidates-at-a-time routine which gives the best overall throughput for AVX/AVX2 builds.

Last fiddled with by ewmayer on 2016-09-25 at 07:16
Old 2016-09-26, 08:03   #200
ewmayer
2ω=0
 
Sep 2002
República de California

5·2,351 Posts

The 4 DCs I ran over the last few days on the KNL just completed - 3 of the results match the first-time test, one does not.

Anybody feel like running a triple-check on 40953091?
Old 2016-09-26, 09:38   #201
proxy2222
 
Jun 2016

19 Posts

Quote:
Originally Posted by ewmayer
The 4 DCs I ran over the last few days on the KNL just completed - 3 of the results match the first-time test, one does not.

Anybody feel like running a triple-check on 40953091?
Ok, got it. ETA: 47 hours.
Old 2016-09-26, 21:00   #202
xathor
 
Sep 2016

19 Posts

Quote:
Originally Posted by proxy2222
Ok, got it. ETA: 47 hours.
I unfortunately can't give you guys access to the boxes but if ewmayer wants a long term test run, you could send me a tar file with the application and I'll run it for up to a week or so. I can give you my email address if needed.
Old 2016-09-26, 21:56   #203
ewmayer
2ω=0
 
Sep 2002
República de California

5×2,351 Posts

Quote:
Originally Posted by proxy2222
Ok, got it. ETA: 47 hours.
Thanks - here is the primenet exponent status for that one, showing the 2 mismatching Res64s.

@xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code.

@anyone: For some reason I'm having trouble statically linking a binary - to rule out syntax fubars, where should -static go in the following link sequence?

gcc -o Mlucas *o -lm -lrt -lpthread

Last fiddled with by ewmayer on 2016-09-26 at 22:18
Old 2016-09-27, 00:27   #204
airsquirrels
 
"David"
Jul 2015
Ohio

11·47 Posts

Quote:
Originally Posted by ewmayer
Thanks - here is the primenet exponent status for that one, showing the 2 mismatching Res64s.

@xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code.

@anyone: For some reason I'm having trouble statically linking a binary - to rule out syntax fubars, where should -static go in the following link sequence?

gcc -o Mlucas *o -lm -lrt -lpthread
What library are you static linking? It is usually not necessary for the listed libs.
Old 2016-09-27, 00:53   #205
Mysticial
 
Sep 2016

2·5·37 Posts

Quote:
Originally Posted by ewmayer
Thanks - here is the primenet exponent status for that one, showing the 2 mismatching Res64s.

@xathor: Send me your e-mail address and I'll shoot you a binary of my dev-branch code.

@anyone: For some reason I'm having trouble statically linking a binary - to rule out syntax fubars, where should -static go in the following link sequence?

gcc -o Mlucas *o -lm -lrt -lpthread
-lpthread has historically caused problems for me. While it may compile fine, it tends to crash at run-time if you don't do it right.

The solution that works for me was:

Code:
-static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive
Old 2016-09-27, 01:20   #206
ewmayer
2ω=0
 
Sep 2002
República de California

11755₁₀ Posts

Quote:
Originally Posted by airsquirrels
What library are you static linking? It is usually not necessary for the listed libs.
I just want a binary I can shoot to xathor which will run even if his lib versions are not identical to the ones on our shared system.

========================

On a 64-core KNL, running four DC jobs, each 16-threaded, as I did makes sense ... I just played with some other threadcounts on the now-unloaded system, and see that a 72-core system would have been nice for the DCs I was running at FFT length 2304K = 18 * 128K. That's because my FFT code breaks each FFT-mul into 2 steps, each with a different memory access pattern and optimal threadcount. The bigger step (~2/3 of runtime) needs #threads to divide [leading FFT radix]/2 in order to keep all threads equally busy and no threads idling. In my case a leading FFT radix of 144 is best at 2304K, so we'd like to use 18 threads rather than 16, since the 144/2 = 72 chunks of work get done in 'waves' of parallel-executing threads, sized as follows for those 2 threadcounts:

16-thread: 16,16,16,16,8 (i.e. final wave uses only half the threads)
18-thread: 18,18,18,18

(36 is out because I see a rapid dropoff in || efficiency once I go over ~16 threads using my current code on this system.)
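
The wave counts listed above follow from a simple division. A minimal Python sketch (the function names are hypothetical, and it models only the scheduling arithmetic, not the FFT code itself):

```python
def thread_waves(work_units, n_threads):
    """Split work_units chunks into successive waves of at most n_threads."""
    full_waves, remainder = divmod(work_units, n_threads)
    return [n_threads] * full_waves + ([remainder] if remainder else [])


def thread_utilization(work_units, n_threads):
    """Fraction of thread-slots doing work, averaged over all waves."""
    waves = thread_waves(work_units, n_threads)
    return work_units / (len(waves) * n_threads)
```

For the leading radix 144, `thread_waves(144 // 2, 16)` reproduces the 16,16,16,16,8 pattern and `thread_waves(144 // 2, 18)` the 18,18,18,18 pattern; `thread_utilization` expresses the same thing as a fraction of busy thread-slots.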

The smaller step needs a power-of-2 threadcount, i.e. 16. With nothing else running, here are timings for the two options (the 'real' component of each 3-line linux 'time' result reflects wall-clock time):

16+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real 0m8.316s
user 1m31.282s
sys 0m2.219s

18+16:
1000 iterations of M44000003 with FFT length 2359296 = 2304 K
Res64: 4B92F6D969AE3B81. AvgMaxErr = 0.282141239. MaxErr = 0.375000000. Program: E16.0
real 0m7.878s
user 1m31.222s
sys 0m2.338s

On a 64-core system four runs at [18+16]-threads is out because the 18-thread phases of the four jobs end up competing for the same 'overlap pairs' of physical cores, e.g. job1 might use cpu 0:17, and job2 use 16:33, thus cores 16 and 17 are oversubscribed, and the result is slower than just running all four jobs using [16+16]-threads with no coreset overlap. On a 72-core system it would (will) be interesting to compare which is faster: four jobs using [18+16], or five jobs using 4x[16,16], 1x[8,8]-threads.
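
The core-overlap argument can be made concrete with a few lines of Python. This is only a sketch under stated assumptions: each job is taken to occupy a contiguous block of cores starting at a fixed stride, and placements that run past the last core are assumed to wrap around modulo the core count. Neither detail is confirmed by the post beyond the 0:17 / 16:33 example, and the function names are hypothetical.

```python
from collections import Counter


def core_sets(n_jobs, stride, threads_per_job, n_cores):
    """Assumed placement: job k uses threads_per_job consecutive cores
    starting at k*stride, wrapping modulo n_cores."""
    return [
        {(job * stride + t) % n_cores for t in range(threads_per_job)}
        for job in range(n_jobs)
    ]


def oversubscribed_cores(sets):
    """Cores claimed by more than one job."""
    counts = Counter(core for cores in sets for core in cores)
    return sorted(core for core, uses in counts.items() if uses > 1)
```

With two 18-thread jobs at a 16-core stride, `oversubscribed_cores(core_sets(2, 16, 18, 64))` flags cores 16 and 17, matching the example in the post; with a stride of 18 on 72 cores, four such jobs no longer collide.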
Old 2016-09-27, 01:24   #207
ewmayer
2ω=0
 
Sep 2002
República de California

11755₁₀ Posts

Quote:
Originally Posted by Mysticial
-lpthread has historically caused problems for me. While it may compile fine, it tends to crash at run-time if you don't do it right.

The solution that works for me was:

Code:
-static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive
So the -static needs to be on a per-library basis? Just tried this:

[ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lrt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
Old 2016-09-27, 01:51   #208
Mysticial
 
Sep 2016

2×5×37 Posts

Quote:
Originally Posted by ewmayer
So the -static needs to be on a per-library basis? Just tried this:

[ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lrt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
Not per library. Just pthread. I believe static is only specified once at the beginning. But I'm not 100% sure.

Here's the command line from one of my projects:
Code:
g++ ../Main.cpp -I ../Source -std=c++14 -fno-rtti -Wall -Wno-unused-function -Wno-unused-variable -save-temps -O2 -D YMP_STANDALONE -static -Wl,--whole-archive -lpthread -Wl,--no-whole-archive -march=knl -D X64_16_KnightsLanding -D YMP_BUILD_RELEASE -o "y-cruncher/Binaries/x64 AVX512-CD"
Granted, there's a slight difference here in that I'm not linking separately and that -lpthread is the only thing I'm explicitly linking. -lm is pulled in automatically for C++. So I'm unsure of the multiple library case. This is what worked for me after hours of trial-and-error.

Last fiddled with by Mysticial on 2016-09-27 at 01:57
Old 2016-09-27, 05:49   #209
ewmayer
2ω=0
 
Sep 2002
República de California

10110111101011₂ Posts

Quote:
Originally Posted by ewmayer
So the -static needs to be on a per-library basis? Just tried this:

[ewmayer@localhost obj_mlucas]$ gcc -o Mlucas.static *o -static -lm -static -lrt -static -lpthread
/usr/bin/ld: cannot find -lm
/usr/bin/ld: cannot find -lrt
/usr/bin/ld: cannot find -lpthread
/usr/bin/ld: cannot find -lc
collect2: error: ld returned 1 exit status
BTW, I get the same set of linker errors using the -static flag the way I always understood it, that is, just the once immediately following the *.o ('link all object-files in the local dir') and preceding the set of library references. David, is it possible this system does not support static linkage?
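
Whether a toolchain can link statically depends on whether the static archives (`libm.a`, `libc.a` and so on) are installed where the linker searches. One possible reading of the `cannot find -lc` errors, which nothing in this thread confirms, is that those archives are simply absent on the system. A minimal Python sketch of how one might check, using gcc's `-print-file-name` option; the helper names are hypothetical:

```python
import os
import subprocess


def find_static_archive(lib, compiler="gcc"):
    """Ask the compiler driver where lib<name>.a lives; None if not found."""
    name = f"lib{lib}.a"
    result = subprocess.run(
        [compiler, f"-print-file-name={name}"],
        capture_output=True, text=True, check=True,
    )
    path = result.stdout.strip()
    # When the search fails, gcc just echoes the bare file name back.
    return path if os.path.isabs(path) and os.path.exists(path) else None


def missing_static_archives(libs, resolver=find_static_archive):
    """Return the libraries for which no static archive could be located."""
    return [lib for lib in libs if resolver(lib) is None]

# Example (needs gcc on PATH):
#   missing_static_archives(["m", "rt", "pthread", "c"])
```

An empty result would suggest that the archives exist and the problem lies elsewhere, for example in the flag syntax; a non-empty one would point at missing static development packages.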

Anyhoo, @xathor, I've attached a zipped copy of my shared-lib binary, in hopes your setup is the same or similar enough to the CentOS+GCC5.1 install on our shared-dev KNL system to allow it to run. Note this is the faster but slightly less accurate build I mentioned getting a 10% speedup from. Since I've not yet modified my self-test functions to use suitably smaller self-test exponents for such LOACC builds, if you try the default self-tests, most will fail with fatal roundoff errors. So if the binary does run for you, just go ahead and propagate the following mlucas.cfg file (based on the higher-accuracy default build) to your various rundirs - I don't expect LOACC mode to affect the best-FFT-radix-set, so no worries about suboptimal FFT params on that account. The per-iter timings here reflect 16-threaded run mode, but I found the same FFT params to be best at smaller thread counts as well:

16.0
2304 msec/iter = 8.38 ROE[avg,max] = [0.277738559, 0.375000000] radices = 144 16 16 32 0 0 0 0 0 0
2560 msec/iter = 8.93 ROE[avg,max] = [0.275268696, 0.328125000] radices = 160 16 16 32 0 0 0 0 0 0
2816 msec/iter = 10.25 ROE[avg,max] = [0.260132906, 0.343750000] radices = 176 16 16 32 0 0 0 0 0 0
3072 msec/iter = 11.44 ROE[avg,max] = [0.269535088, 0.343750000] radices = 192 16 16 32 0 0 0 0 0 0
3328 msec/iter = 11.97 ROE[avg,max] = [0.269532634, 0.343750000] radices = 208 16 16 32 0 0 0 0 0 0
3584 msec/iter = 12.23 ROE[avg,max] = [0.259974261, 0.312500000] radices = 224 16 16 32 0 0 0 0 0 0
3840 msec/iter = 12.57 ROE[avg,max] = [0.285017820, 0.375000000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 13.71 ROE[avg,max] = [0.285594508, 0.343750000] radices = 256 16 16 32 0 0 0 0 0 0
4608 msec/iter = 14.53 ROE[avg,max] = [0.349673808, 0.437500000] radices = 288 16 16 32 0 0 0 0 0 0
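
For anyone scripting around such a file, the entries above are easy to parse. The sketch below is a hypothetical helper, not part of Mlucas: it assumes the line layout shown above, treats the trailing zeros as unused radix slots, and checks that twice the product of the complex radices equals the real FFT length in doubles (e.g. 2 * 144 * 16 * 16 * 32 = 2359296 = 2304K, the figure printed in the restart log earlier in the thread).

```python
import math
import re

# Layout assumed from the entries shown above.
CFG_LINE = re.compile(
    r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)\s+"
    r"ROE\[avg,max\]\s*=\s*\[\s*([\d.]+)\s*,\s*([\d.]+)\s*\]\s+"
    r"radices\s*=\s*([\d\s]+)$"
)


def parse_cfg_line(line):
    """Parse one timing entry; return None for lines that do not match
    (for instance the leading version line)."""
    m = CFG_LINE.match(line)
    if m is None:
        return None
    return {
        "fft_k": int(m.group(1)),           # FFT length in units of 1024 doubles
        "msec_per_iter": float(m.group(2)),
        "roe_avg": float(m.group(3)),
        "roe_max": float(m.group(4)),
        # Zero entries are taken to be unused radix slots.
        "radices": [int(r) for r in m.group(5).split() if int(r) != 0],
    }


def radices_match_length(entry):
    """True if 2 * product(complex radices) equals the real FFT length."""
    return 2 * math.prod(entry["radices"]) == entry["fft_k"] * 1024
```

Running `parse_cfg_line` over each line of the file and keeping the non-`None` results gives a list that can be sorted or filtered by `msec_per_iter`.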
Attached Files
File Type: bz2 Mlucas_loacc.bz2 (1.53 MB, 89 views)
All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
