mersenneforum.org  

2016-09-16, 16:38   #133
bsquared ("Ben")

Quote:
Originally Posted by xathor
Traditional hyperthreading is out of order and only used when needed. KNL hyperthreading is round robin in order.
KNC is round-robin, in-order. KNL allows back-to-back instructions from a single thread: "Thread count requirement reduced to ~70 (one thread per core) from 120 (2 threads per core)", etc. Unless the KNL guides from Colfax are completely false.

You must register, but it's otherwise free:
colfaxresearch.com/knl-webinar

Look at slide 26.

The Hot Chips paper/presentation is also a good read:
http://www.hotchips.org/wp-content/u...dani-Intel.pdf
http://ieeexplore.ieee.org/document/...number=7453080

2016-09-16, 17:16   #134
xathor

Quote:
Originally Posted by science_man_88
Forgive me if I am a little bit ignorant to this, but what command is he wanting me to run exactly? I have no problem doing it, I just want to make sure I get his request correct.


If there are any MPI capable builds, I can scale this up quite high for testing. I also have a UV2000 available if you guys are interested in seeing scaling up to 256 physical cores with OpenMP.

2016-09-16, 17:58   #135
henryzz ("David")

Quote:
Originally Posted by xathor
Forgive me if I am a little bit ignorant to this, but what command is he wanting me to run exactly? I have no problem doing it, I just want to make sure I get his request correct.


If there are any MPI capable builds, I can scale this up quite high for testing. I also have a UV2000 available if you guys are interested in seeing scaling up to 256 physical cores with OpenMP.
Trial division under mprime to see how it scales with threads.

2016-09-16, 18:23   #136
jasonp

Quote:
Originally Posted by xathor
If there are any MPI capable builds, I can scale this up quite high for testing. I also have a UV2000 available if you guys are interested in seeing scaling up to 256 physical cores with OpenMP.
AFAIK there are no Mersenne testing programs that use MPI. Building one and tuning it for many MPI processes, each ideally with a few threads, would be very challenging. But it could potentially buy a lot of performance, because MPI forces programmers to be explicit about what is shared between instances, and that is really critical on a highly NUMA system like this one.

For reference, I started the MPI port of Msieve back in 2010; because the multithreading scheme for the linear algebra was not very efficient back then, the MPI speedup was immediately better than multithreading alone, even when limited to a single (admittedly very nice) box. But it took a year or two before using MPI and multithreading together ran faster than either one alone on a single box. fivemack had to use black-magic arguments to pin MPI processes to the correct physical nodes, but with that in place his old 48-core AMD system absolutely flew.
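
For the curious, here is a minimal sketch of the hybrid layout described above -- a few OpenMP threads inside each MPI rank, with each rank owning its data explicitly. This is not Msieve's (or any GIMPS program's) actual code, just an illustration of the structure; build with something like mpicc -fopenmp and launch with mpirun.

Code:
/*
 * Hypothetical hybrid MPI + OpenMP skeleton -- not Msieve's actual code.
 * Each MPI rank spawns a few OpenMP threads; anything shared across ranks
 * must be communicated explicitly (here, a single MPI_Reduce), which is
 * what keeps memory traffic predictable on a NUMA box.
 * Build: mpicc -fopenmp hybrid.c -o hybrid ; run: mpirun -np 4 ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only each rank's main thread makes MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        /* each rank's threads touch only that rank's local data */
        local += 1.0;
    }

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks, up to %d threads each, total work units = %.0f\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}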

2016-09-16, 21:21   #137
ewmayer

Quote:
Originally Posted by xathor
I'm not sure what you guys mean by trial factoring.

Here is a Haswell (dual E5-2670v3 24c AVX2) for comparison:
1000 iterations of M77597293 with FFT length 4194304 = 4096 K
Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.292735259. MaxErr = 0.343750000. Program: E14.1
...
Clocks = 00:00:09.605

Here is an Ivy Bridge (dual E5-2670v2 20c AVX) for comparison:

1000 iterations of M77597293 with FFT length 4194304 = 4096 K
Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.249028471. MaxErr = 0.312500000. Program: E14.1
...
Clocks = 00:00:08.712
How many threads are those run with? Are you using the -nthread flag to control the thread count for those (and on your KNL)? Without that flag, Mlucas will use as many threads as the number of virtual cores it detects on the system. (This seems to be OS-dependent: on my Debian-running Haswell quad with HT enabled at boot that number is 4; on my dual-core Broadwell NUC under Ubuntu it is again 4, i.e. 2x the number of physical cores on the latter system.)
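
For reference, a minimal sketch -- not necessarily how Mlucas does its detection -- of querying the online logical-CPU count on Linux/POSIX, which is the number that gets doubled (or quadrupled on KNL) when HT is enabled:

Code:
/* Hypothetical sketch of one way to count online logical CPUs on Linux;
 * with HT enabled this is 2x (4x on KNL) the physical core count. */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online logical CPUs: %ld\n", ncpu);
    return 0;
}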

2016-09-16, 22:27   #138
Prime95

Quote:
Originally Posted by xathor
I'm not sure what you guys mean by trial factoring.
mprime/prime95 can also do trial factoring, but graphics cards are better suited to that job.

mprime/prime95 will have difficulty using a vast number of threads. As currently implemented, only one thread can do the sieving -- 63 TFing threads would easily outpace the one sieving thread. I'd suggest something like 8 workers using 16 threads each, or 16 workers using 8 threads each. I do suspect hyperthreading will be beneficial.
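
To illustrate why a single sieving thread caps throughput (a hypothetical sketch only -- this is not prime95's actual implementation), picture one producer feeding sieved candidates through a bounded queue to many trial-factoring consumers; once the consumers drain the queue faster than the producer refills it, extra TF threads just sit idle.

Code:
/* Hypothetical single-sieve / many-TF-threads sketch (not prime95 code).
 * Build with: cc -pthread sketch.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define QCAP  1024               /* bounded queue of sieved candidates     */
#define NTF   8                  /* imagine 63 TF threads on a 64-core KNL */
#define NCAND 1000000UL

static uint64_t q[QCAP];
static int head, tail, count, done;
static pthread_mutex_t mtx     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  can_put = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  can_get = PTHREAD_COND_INITIALIZER;

static void *sieve(void *arg)    /* the lone producer */
{
    (void)arg;
    for (uint64_t k = 1; k <= NCAND; k++) {
        /* ... sieve work deciding whether candidate k survives ... */
        pthread_mutex_lock(&mtx);
        while (count == QCAP) pthread_cond_wait(&can_put, &mtx);
        q[tail] = k; tail = (tail + 1) % QCAP; count++;
        pthread_cond_signal(&can_get);
        pthread_mutex_unlock(&mtx);
    }
    pthread_mutex_lock(&mtx);
    done = 1;
    pthread_cond_broadcast(&can_get);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

static void *tf(void *arg)       /* one of many consumers */
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (count == 0 && !done) pthread_cond_wait(&can_get, &mtx);
        if (count == 0 && done) { pthread_mutex_unlock(&mtx); return NULL; }
        uint64_t k = q[head]; head = (head + 1) % QCAP; count--;
        pthread_cond_signal(&can_put);
        pthread_mutex_unlock(&mtx);
        (void)k;                 /* ... powmod test of candidate k ... */
    }
}

int main(void)
{
    pthread_t s, t[NTF];
    pthread_create(&s, NULL, sieve, NULL);
    for (int i = 0; i < NTF; i++) pthread_create(&t[i], NULL, tf, NULL);
    pthread_join(s, NULL);
    for (int i = 0; i < NTF; i++) pthread_join(t[i], NULL);
    return 0;
}

(Presumably each mprime worker sieves for its own assignment, which would be why splitting the 64 threads across 8 or 16 workers sidesteps the single-sieve limit.)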

2016-09-17, 00:49   #139
ewmayer

First-look timings for a build of my dev-branch Mlucas code (no major performance differences over the current release, but it has enhanced thread-affinity-setting options). I need 8 cores to get a timing close to the 23 ms/iter xathor noted in post #123, which did not mention a thread count.

Build using gcc 5.1, default core-affinity [n threads ==> affinity set to cores 0:n-1], per-iter times in ms:
Code:
#thr iter(ms)
  1  197.72
  2  103.86
  4   48.26
  8   27.48
 16   13.71
 32    9.85
 64    9.95
Summary:
o Base 1-thread timing is dismal ... about the same as my aged Core2;
o Scaling is quite good up to 32 threads (197.72/9.85 ~ 20x speedup), then plateaus.

Here is the output of 'numactl -H' ... David, am I right in surmising that this means "no NUMA clustering"?

available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 0 size: 98178 MB
node 0 free: 94363 MB
node distances:
node 0
0: 10


Time to play with non-default thread affinity-setting, e.g. every-other-core and every-fourth-core. My new coreset-summary messaging at least is a definite improvement over the current release's verbose per-thread messaging:

Set affinity for the following 64 cores: 0.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.60.61.62.63.
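
For anyone playing along, the cpuset mechanics behind an "every-other-core" affinity on Linux look roughly like the sketch below -- a hypothetical illustration, not Mlucas's actual affinity code.

Code:
/* Hypothetical every-nth-core pinning sketch (not Mlucas's code).
 * Build with: cc -pthread pin.c */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to cores 0, stride, 2*stride, ... < ncores. */
static int pin_every_nth_core(int ncores, int stride)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < ncores; c += stride)
        CPU_SET(c, &set);
    /* returns 0 on success, an errno value on failure */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int rc = pin_every_nth_core(64, 2);   /* every-other-core on 64 cores */
    if (rc != 0)
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
    else
        puts("pinned to cores 0,2,4,...,62");
    return 0;
}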

2016-09-17, 00:55   #140
airsquirrels ("David")

Quote:
Originally Posted by ewmayer
...
Here is output of 'numactl -H' ... David, am I right in surmising that to mean "no NUMA clustering"?...
That's correct; currently the system is configured with MCDRAM as:

Mode: Cache
Clustering Mode: Quadrant

In theory that means the cores should be hitting MCDRAM for everything unless you happen to exceed a 16GB working set. After we've had some time to do benchmarking we will likely want to set this back to Flat and work with the Clustering modes. At that point "numactl -m 0 ./Mlucas" can be used to force DDR4 only and -m 1 can be used to force MCDRAM.

In the other clustering modes we would need to be pretty explicit about which cores are being used, to maximize locality to the MCDRAM bank. See http://colfaxresearch.com/knl-numa/ for more details.
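
Once we do flip to Flat mode, memory can also be bound per-allocation rather than per-process. A hypothetical libnuma sketch follows, assuming MCDRAM shows up as node 1 as in the numactl examples above -- not something our codes currently do.

Code:
/* Hypothetical flat-mode MCDRAM allocation via libnuma (link with -lnuma).
 * Assumes node 1 is the MCDRAM node; check "numactl -H" first. */
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "libnuma: NUMA not available\n");
        return 1;
    }
    size_t bytes = 1UL << 30;            /* 1 GiB working set */
    int mcdram_node = 1;                 /* assumption, not auto-detected */
    double *buf = numa_alloc_onnode(bytes, mcdram_node);
    if (!buf) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }
    buf[0] = 1.0;                        /* ... FFT data lives here ... */
    numa_free(buf, bytes);
    return 0;
}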

2016-09-17, 01:02   #141
kladner ("Kieren")

Both KNL threads are incredibly interesting and informative. I feel privileged to be able to audit them. All of the main participants are bloody amazing!

Thanks!

2016-09-17, 01:17   #142
airsquirrels ("David")

PSA: Those who PM'd or emailed me an SSH public key should now have active accounts. If you wanted an account and do not have one, please PM or email me an SSH public key and I will get that set up.

2016-09-17, 01:34   #143
airsquirrels ("David")

Quote:
Originally Posted by kladner
Both KNL threads are incredibly interesting and informative. I feel privileged to be able to audit them. All of the main participants are bloody amazing!

Thanks!
Glad we can provide some entertainment!

Here is another fun tidbit. Based on the sample output of numactl -H when configured in the sub-NUMA clustering modes, the reported distance within a quadrant is always 10 whether MCDRAM or DDR4 is used. Cross-quadrant, however, it is odd: DDR4 is always only 21 between quadrants, but MCDRAM is 41 between nodes. That suggests there is a tradeoff between bandwidth and latency when scaling up beyond 16 threads, and that the DDR4 is going to be a bit quicker for sparse access unless we are saturating the ~90 GB/s of DDR4 bandwidth...

Finally, the 1 MB L2 cache shared by each tile's pair of cores is available via the on-die mesh to other tiles, with reported distance 10 (same quadrant) or 21 (cross-quadrant). That suggests careful cache management could keep ~32 MB on die, ahead of the HBM.
