mersenneforum.org Your help wanted - Let's buy GIMPS a KNL development system!

2016-09-16, 16:38   #133
bsquared

"Ben"
Feb 2007

7·13·41 Posts

Quote:
 Originally Posted by xathor Traditional hyperthreading is out of order and only used when needed. KNL hyperthreading is round robin in order.
KNC is round-robin, in-order. KNL allows back-to-back instructions from a single thread: "Thread count requirement reduced to ~70 (one thread per core) from 120 (2 threads per core)", etc. Unless the KNL guides from Colfax are completely false.

must register, but otherwise free:
colfaxresearch.com/knl-webinar

look at slide 26.

The hot chips paper/presentation is also a good read:
http://www.hotchips.org/wp-content/u...dani-Intel.pdf
http://ieeexplore.ieee.org/document/...number=7453080

Last fiddled with by bsquared on 2016-09-16 at 16:43 Reason: links

2016-09-16, 17:16   #134
xathor

Sep 2016

19 Posts

Quote:
 Originally Posted by science_man_88 http://www.mersenne.org/various/math...rial_factoring
Forgive me if I am a little ignorant about this, but what command does he want me to run, exactly? I have no problem doing it; I just want to make sure I get his request right.

If there are any MPI capable builds, I can scale this up quite high for testing. I also have a UV2000 available if you guys are interested in seeing scaling up to 256 physical cores with OpenMP.

Last fiddled with by xathor on 2016-09-16 at 17:18

2016-09-16, 17:58   #135
henryzz
Just call me Henry

"David"
Sep 2007
Liverpool (GMT/BST)

37·163 Posts

Quote:
 Originally Posted by xathor Forgive me if I am a little bit ignorant to this, but what command is he wanting me to run exactly? I have no problem doing it, I just want to make sure I get his request correct. If there are any MPI capable builds, I can scale this up quite high for testing. I also have a UV2000 available if you guys are interested in seeing scaling up to 256 physical cores with OpenMP.
Trial division under mprime to see how it scales with threads.

2016-09-16, 18:23   #136
jasonp
Tribal Bullet

Oct 2004

32×5×79 Posts

Quote:
 Originally Posted by xathor If there are any MPI capable builds, I can scale this up quite high for testing. I also have a UV2000 available if you guys are interested in seeing scaling up to 256 physical cores with OpenMP.
AFAIK there are no Mersenne testing programs that use MPI. Building one and tuning it for many MPI processes, each ideally with a few threads, would be very challenging. But it could potentially buy a lot of performance, because MPI forces programmers to be explicit about what is shared between processes, and that's really critical on highly NUMA systems like this one.

For reference, I started the MPI port of Msieve back in 2010; because the multithreading scheme for the linear algebra was not that efficient back then, the MPI speedup was immediately better than multithreading alone, even when limited to a single (admittedly very nice) box. But it took a year or two before using MPI and multithreading together ran faster than either one alone on a single box. fivemack had to use black-magic arguments to pin MPI processes to the correct physical NUMA nodes, but with that in place his old 48-core AMD system absolutely flew.

2016-09-16, 21:21   #137
ewmayer
∂2ω=0

Sep 2002
República de California

5·2,351 Posts

Quote:
 Originally Posted by xathor I'm not sure what you guys mean by trial factoring. Here is a Haswell (dual E5-2670v3 24c AVX2) for comparison: 1000 iterations of M77597293 with FFT length 4194304 = 4096 K Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.292735259. MaxErr = 0.343750000. Program: E14.1 ... Clocks = 00:00:09.605 Here is a Ivy-Bridge (dual E5-2670v2 20c AVX) for comparison: 1000 iterations of M77597293 with FFT length 4194304 = 4096 K Res64: 5F87421FA9DD8F1F. AvgMaxErr = 0.249028471. MaxErr = 0.312500000. Program: E14.1 ... Clocks = 00:00:08.712
How many threads are those with? Are you using the -nthread flag to control threadcount for those (and on your KNL)? Without that flag, Mlucas will use as many threads as the number of virtual cores it detects on the system. (This seems to be OS-dependent: on my Debian-running Haswell quad with HT enabled at boot that number is 4, and on my dual-core Broadwell NUC under Ubuntu it is again 4, i.e. 2x the number of physical cores on the latter system.)
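As an aside, the "use every virtual core by default" detection ewmayer describes can be sketched in a couple of lines of Python; this is a generic illustration of logical-CPU counting, not Mlucas's actual probe (`os.sched_getaffinity` is a Linux-only call):

```python
import os

# Logical (virtual) CPU count -- what a default "one thread per detected
# core" policy would pick up; with Hyper-Threading enabled this is
# typically 2x the physical core count.
logical = os.cpu_count()

# CPUs this process is actually allowed to run on (Linux-only call);
# under taskset/numactl pinning this can be smaller than os.cpu_count().
allowed = len(os.sched_getaffinity(0))

print(f"logical CPUs: {logical}, usable by this process: {allowed}")
```

On an HT-enabled quad-core box this prints 8 logical CPUs, matching the 2x behavior noted above.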

2016-09-16, 22:27   #138
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

8150₁₀ Posts

Quote:
 Originally Posted by xathor I'm not sure what you guys mean by trial factoring.
mprime/prime95 can also do trial factoring, but graphics cards are better suited to that job.

mprime/prime95 will have difficulty using a vast number of threads. As currently implemented, only one thread can do the sieving -- 63 TFing threads would easily outpace the one sieving thread. I'd suggest something like 8 workers using 16 threads or 16 workers using 8 threads. I do suspect hyperthreading will be beneficial.
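The single-siever bottleneck Prime95 describes can be sketched with a toy throughput model. The rates below are made-up illustrative numbers, not mprime measurements; the point is only the shape of the curve:

```python
# Toy model of a one-siever / many-TF-threads worker.
# Rates are hypothetical illustrative numbers, NOT measurements.
SIEVE_RATE = 1000.0   # candidates/sec one sieving thread produces (assumed)
TF_RATE    = 25.0     # candidates/sec one TF thread tests (assumed)

def throughput(n_tf_threads):
    """Candidates actually tested per second: the TF pool can never
    consume faster than the single sieving thread produces."""
    return min(SIEVE_RATE, n_tf_threads * TF_RATE)

for n in (8, 16, 32, 40, 64, 128):
    print(f"{n:3d} TF threads -> {throughput(n):6.0f} candidates/s")
# Scaling is linear until n * TF_RATE reaches SIEVE_RATE (here n = 40);
# beyond that, extra TF threads idle -- hence the suggestion of several
# independent workers, each with its own sieving thread.
```

With these assumed rates, 64 and 128 TF threads test no more candidates than 40 do, which is the motivation for splitting the chip into 8 or 16 workers.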

2016-09-17, 00:49   #139
ewmayer
∂2ω=0

Sep 2002
República de California

5·2,351 Posts

First-look timings for a build of my dev-branch Mlucas code (no major performance differences over the current release, but it has enhanced thread-affinity-setting options). I need 8 cores to get a timing close to the 23 ms/iter xathor noted in post #123, which was posted without any mention of threadcount. Build using gcc 5.1, default core affinity [n threads ==> affinity set to cores 0:n-1], per-iter times in ms:

Code:
#thr  iter(ms)
  1   197.72
  2   103.86
  4    48.26
  8    27.48
 16    13.71
 32     9.85
 64     9.95
Summary:
o Base 1-thread timing is dismal ... about the same as my aged Core2;
o Scaling quite good up to 32 threads, plateaus there.

Here is output of 'numactl -H' ... David, am I right in surmising that to mean "no NUMA clustering"?

Code:
available: 1 nodes (0)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255
node 0 size: 98178 MB
node 0 free: 94363 MB
node distances:
node   0
  0:  10
Time to play with non-default thread-affinity settings, e.g. every-other-core and every-fourth-core.

My new coreset-summary messaging at least is a definite improvement over the current release's verbose per-thread messaging:

Code:
Set affinity for the following 64 cores: 0.1.2.3.4.5.6.7.8.9.10.11.12.13.14.15.16.17.18.19.20.21.22.23.24.25.26.27.28.29.30.31.32.33.34.35.36.37.38.39.40.41.42.43.44.45.46.47.48.49.50.51.52.53.54.55.56.57.58.59.60.61.62.63.
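For a quick read on those figures, the per-iteration timings quoted in the post translate into the following speedups and parallel efficiencies (plain arithmetic on the numbers above, nothing assumed):

```python
# Speedup and parallel efficiency from the per-iteration timings
# reported for the dev-branch Mlucas build (ms/iter vs. thread count).
times_ms = {1: 197.72, 2: 103.86, 4: 48.26, 8: 27.48,
            16: 13.71, 32: 9.85, 64: 9.95}

t1 = times_ms[1]
for n, t in times_ms.items():
    speedup = t1 / t            # relative to the 1-thread timing
    efficiency = speedup / n    # fraction of ideal linear scaling
    print(f"{n:2d} threads: {speedup:5.2f}x speedup, {efficiency:6.1%} efficiency")
# Efficiency holds up well through 16 threads, drops to ~63% at 32,
# and collapses at 64 -- matching the observed plateau.
```

The 32-thread run comes out to roughly a 20x speedup, and 64 threads are actually marginally slower than 32.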
2016-09-17, 00:55   #140
airsquirrels

"David"
Jul 2015
Ohio

1000000101₂ Posts

Quote:
 Originally Posted by ewmayer ... Here is output of 'numactl -H' ... David, am I right in surmising that to mean "no NUMA clustering"?...
That's correct, currently the system is configured with MCDRAM as:

Mode: Cache

In theory that means the cores should be hitting MCDRAM for everything unless you happen to exceed a 16GB working set. After we've had some time to do benchmarking we will likely want to set this back to Flat and work with the Clustering modes. At that point "numactl -m 0 ./Mlucas" can be used to force DDR4 only and -m 1 can be used to force MCDRAM.

In the other clustering modes we would need to be pretty explicit about which cores are being used, to maximize locality to the MCDRAM bank. See http://colfaxresearch.com/knl-numa/ for more details.

2016-09-17, 01:02   #141
kladner

"Kieren"
Jul 2011
In My Own Galaxy!

10158₁₀ Posts

Both KNL threads are incredibly interesting and informative. I feel privileged to be able to audit them. All of the main participants are bloody amazing! Thanks!
2016-09-17, 01:17   #142
airsquirrels

"David"
Jul 2015
Ohio

11×47 Posts

PSA: Those who PM'd or emailed me an SSH public key should now have active accounts. If you wanted an account and do not have one, please PM or email me an SSH public key and I will get that set up.
2016-09-17, 01:34   #143
airsquirrels

"David"
Jul 2015
Ohio

11×47 Posts

Quote:
 Originally Posted by kladner Both KNL threads are incredibly interesting and informative. I feel privileged to be able to audit them. All of the main participants are bloody amazing! Thanks!
Glad we can provide some entertainment!

Here is another fun tidbit. Based on the sample output of numactl -H when configured in the sub-NUMA clustering modes, latency within a quadrant is always 10 whether MCDRAM or DDR4 is used. However, cross-quadrant latency is odd: DDR4 is always 21 between quadrants, but MCDRAM has a latency of 41 between nodes. That suggests there is a tradeoff between bandwidth and latency when scaling up beyond 16 threads, and that the DDR4 is going to be a bit quicker for sparse access unless we are saturating its ~90 GB/s bandwidth...

Finally, the L2 cache for each tile pair is available via the cache grid to other tiles with latency 10 (same quadrant) or 21. That suggests that careful cache management could keep 32MB on die ahead of the HBM.

Last fiddled with by airsquirrels on 2016-09-17 at 01:39
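The latency-vs-bandwidth tradeoff described above can be made concrete with a toy crossover calculation. The numactl distances (10/21/41) are relative units rather than nanoseconds, so the absolute latencies and the MCDRAM bandwidth below are assumptions chosen for illustration; only the ~90 GB/s DDR4 figure comes from the discussion:

```python
# Toy crossover: for a transfer of `size` bytes, when does MCDRAM's
# bandwidth advantage outweigh its higher cross-quadrant latency?
# Latencies are ASSUMED ns figures (numactl distances are unitless);
# the MCDRAM bandwidth is an ASSUMED figure for KNL-class HBM.
DDR4_LAT_NS, DDR4_BW     = 130.0, 90e9    # ~90 GB/s per the discussion
MCDRAM_LAT_NS, MCDRAM_BW = 250.0, 400e9   # both values assumed

def xfer_ns(size_bytes, lat_ns, bw_bytes_per_s):
    """Simple latency + size/bandwidth cost model for one transfer."""
    return lat_ns + size_bytes / bw_bytes_per_s * 1e9

for size in (64, 4096, 65536, 1 << 20):
    ddr = xfer_ns(size, DDR4_LAT_NS, DDR4_BW)
    mcd = xfer_ns(size, MCDRAM_LAT_NS, MCDRAM_BW)
    winner = "DDR4" if ddr < mcd else "MCDRAM"
    print(f"{size:>8} B: DDR4 {ddr:8.0f} ns, MCDRAM {mcd:8.0f} ns -> {winner}")
# Cache-line-sized (sparse) accesses favor the lower-latency path;
# streaming transfers amortize the latency and favor MCDRAM bandwidth.
```

Under these assumed numbers the crossover falls in the tens-of-kilobytes range, consistent with the intuition that DDR4 wins for sparse access and MCDRAM wins once bandwidth is the limiter.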

