#188
"TF79LL86GIMPS96gpu17" | Mar 2017 | US midwest | 2×7×383 Posts
#189
"TF79LL86GIMPS96gpu17" | Mar 2017 | US midwest | 2×7×383 Posts
This Cray KNL slide set is interesting.

Cache mode: https://docs.nersc.gov/performance/knl/cache-mode/
Optimizing: https://www.nersc.gov/assets/Uploads...NL-swarren.pdf
MCDRAM on KNL tutorial: https://www.slideshare.net/IwantoutofVT/mcdram-tutorial
Capability models vs. measured performance: http://htor.inf.ethz.ch/publications...l-analysis.pdf

Last fiddled with by kriesel on 2021-06-13 at 17:31
#190
∂2ω=0 | Sep 2002 | República de California | 103×113 Posts
@kriesel: Thx, but the Cray docs refer to a particular KNL-node hardware configuration which they apparently used in some KNL-based supercomputers.

MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors: Developer's Guide | Colfax Research

Rebooted system, verified that the newly-added RAM is being run 6-way interleaved. But I see no boot menu options related to configuring the MCDRAM, which I suspect means it's defaulting to cache mode - hopefully the meminfo dump below will be informative. Cache mode is actually what I want, since all my current runs fit easily in the MCDRAM, with plenty of room to spare. So the question is, why isn't the OS just running them out of the MCDRAM?

Code:
MemTotal:       214147932 kB
MemFree:        210226340 kB
MemAvailable:   209402676 kB
Buffers:             4384 kB
Cached:            668728 kB
SwapCached:             0 kB
Active:           1677176 kB
Inactive:          510388 kB
Active(anon):     1517104 kB
Inactive(anon):     14100 kB
Active(file):      160072 kB
Inactive(file):    496288 kB
Unevictable:            0 kB
Mlocked:                0 kB
SwapTotal:        8183804 kB
SwapFree:         8183804 kB
Dirty:                 12 kB
Writeback:              0 kB
AnonPages:        1484528 kB
Mapped:            264436 kB
Shmem:              16676 kB
KReclaimable:      238516 kB
Slab:             1154360 kB
SReclaimable:      238516 kB
SUnreclaim:        915844 kB
KernelStack:        52512 kB
PageTables:         38948 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    115257768 kB
Committed_AS:     6245572 kB
VmallocTotal:   34359738367 kB
VmallocUsed:            0 kB
VmallocChunk:           0 kB
Percpu:            162112 kB
HardwareCorrupted:      0 kB
AnonHugePages:     892928 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB
DirectMap4k:       707124 kB
DirectMap2M:      7555072 kB
DirectMap1G:    211812352 kB
#191
∂2ω=0 | Sep 2002 | República de California | 103×113 Posts
Quote from the above-linked Colfax page:
Last fiddled with by ewmayer on 2021-06-13 at 22:19 |
#192
Sep 2002 | Database er0rr | 3,739 Posts
This is way out of my depth, but do you have a good reason for not using numactl in flat mode? Have you tried timings with it?
#193
∂2ω=0 | Sep 2002 | República de California | 103×113 Posts
1. I don't know which mode is currently being set for the MCDRAM, with regular RAM now installed, and found no related boot option;
2. 'numactl: command not found.'

If the latter can be remedied via installation of an optional software package, I'm all ears.
#194
"TF79LL86GIMPS96gpu17" | Mar 2017 | US midwest | 2·7·383 Posts
It sure doesn't help that the available SuperMicro documentation is silent on the memory model, and so is the K1SPE motherboard's BIOS display.

Does Linux report round binary powers as MemTotal, or the total after deducting BIOS ROM shadowing and other overheads? I'm assuming below that they really mean total, which should come out as a whole number of GiB. (Although this top tutorial shows a "total" that is not a round GiB number either.) A flat memory model would give 16 + 192 = 208 GiB; cache mode would give 192 GiB.

214147932 kB (kB assumed = kiB) - 6 × 32 GiB DIMMs ~ 12.23 GiB MCDRAM, a ~75/25 split addressable/cache;
214147932 kB (kB assumed = 1000 B) - 6 × 32 GiB DIMMs ~ 7.44 GiB MCDRAM, a ~50/50 split addressable/cache;

implying the MCDRAM is being divided between addressable and cache, as in what Colfax calls hybrid mode, or as shown on slide 10 of https://www.nersc.gov/assets/Uploads...rs-Feb2019.pdf

You get nothing for numactl --hardware now, because it's not installed? Should be interesting when available.

On Windows 10 Pro, systeminfo gave, for MCDRAM only, Total Physical Memory: 16,260 MB (~15.88 GiB); with a 64 GiB DIMM added, Total Physical Memory: 81,796 MB (~79.88 GiB), indicating Windows boots to a flat memory model. That, and the ~40:1 latency range in the last link of https://www.mersenneforum.org/showpo...&postcount=189, pretty well matches the dramatic drop in prime95 performance I saw with a DIMM installed.

Last fiddled with by kriesel on 2021-06-13 at 23:59
#195
Sep 2002 | Database er0rr | 3,739 Posts
2. Installation:
Code:
sudo yum install numactl
To list the NUMA hardware:
Code:
numactl -H
To run a program out of node 1's memory:
Code:
numactl -m 1 ./yr_prog

Last fiddled with by paulunderwood on 2021-06-13 at 23:30
#196
∂2ω=0 | Sep 2002 | República de California | 103·113 Posts
@Paul - thx. NUMA-ness must be enabled by default in BIOS, because after installing the package, 'numactl -H' gives
Code:
[ewmayer@localhost ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 193005 MB
node 0 free: 178568 MB
node 1 cpus:
node 1 size: 16123 MB
node 1 free: 15992 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10
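A quick cross-check on that listing: the two node sizes sum to the MemTotal from the earlier /proc/meminfo dump (214147932 kB ≈ 209128.8 MiB), and node 1's 16123 MB is nearly the full 16 GiB of MCDRAM, consistent with the MCDRAM being exposed as ordinary addressable memory rather than cache:

```shell
# Sum of node sizes from numactl -H vs. MemTotal converted from kB to MiB
awk 'BEGIN {
  printf "nodes: %d MiB; meminfo: %.1f MiB\n", 193005 + 16123, 214147932 / 1024
}'
```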
#197
∂2ω=0 | Sep 2002 | República de California | 103·113 Posts
After debug & test of the integrated GMP gcd code over the weekend, I fired up one trial p-1 run to bounds [b1,b2] = [1m,30m] on a known-factor case, to exercise the Mlucas v20 dev-code on a soup-to-nuts (complete stages 1 & 2) run. That shook out a couple of bugs, and it is now roughly midway through stage 2 on one of the known-factor cases described next.

For shakeout testing I assembled 55 exponents: all the cases in which my GPU runs of gpuowl over the last year+ found a factor via p-1. Around 10 cases can use a 5.5M FFT; the rest need 6M. Those runs used a mix of the foregoing bounds and the larger ones (typically [5.5m,165m]) used by the newer gpuowl versions. As it happens, all of said factors are findable with b1 = 1m, and 24 such need only stage 1 to that bound. Of the 31 remaining, just 2 need b2 > 30m, so I broke those into an opening run with [b1,b2] = [1m,30m] and as many subsequent deeper-stage-2-only runs, in 30m increments, as needed to discover the factor in question.

Then, in an attempt to roughly equalize the work per instance, I figured 1 work unit (WU) for stage 1 to 1m and 1 WU for 30m worth of stage 2, and divvied the exponents up across 17 worktodo files - this includes the already-started run above, which needs stage 2 to ~155m, thus 2+1+1+1+1+1 = 7 WUs, so that instance will work only on that one assignment - making sure each file has at least 1 assignment needing stage 2 only, and each of the 16 new-instance workfiles amounted to 5 WUs.

My ongoing run on cores 64:67 needed 44 ms/iter for stage 1 (b1 = 1m ==> ~1.44 Miters) and 58 ms/iter for stage 2, each iteration of which - at my memory settings for all these runs, 10% of available RAM, ~20 GB - processes ~1.5 stage 2 primes. (I hope to cut that 58/44 timing ratio closer to 1 in ongoing work.) This run was launched without using 'numactl' to force it to live in the 16 GB onboard MCDRAM (only ~10 GB of which is available for user stuff, in my experience), because stage 2 needs more than that.

So I fired up the 16 new jobs, each of which starts with stage 1 to 1m and thus has a tiny memory footprint, but again sans numactl, just for an apples-to-apples comparison with the ongoing run. The resulting 'top' was very satisfying:
Code:
[ewmayer@localhost obj_avx512]$ top
top - 16:18:40 up 4 days, 1:05, 2 users, load average: 37.88, 43.93, 45.72
Tasks: 2400 total, 2 running, 2396 sleeping, 2 stopped, 0 zombie
%Cpu(s): 23.5 us, 0.4 sy, 0.0 ni, 74.8 id, 0.0 wa, 1.1 hi, 0.1 si, 0.0 st
MiB Mem : 209128.8 total, 179472.7 free, 26840.6 used, 2815.5 buff/cache
MiB Swap: 7992.0 total, 7992.0 free, 0.0 used. 180465.7 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
86256 ewmayer 20 0 366692 211616 4116 S 383.3 0.1 1:52.26 Mlucas
86300 ewmayer 20 0 367040 211160 4184 S 382.7 0.1 1:09.98 Mlucas
86318 ewmayer 20 0 367164 211220 4136 S 382.2 0.1 1:08.98 Mlucas
83998 ewmayer 20 0 19.3g 19.2g 4100 S 379.9 9.4 554:08.26 Mlucas
86289 ewmayer 20 0 366700 212988 4116 S 379.9 0.1 1:09.01 Mlucas
86287 ewmayer 20 0 366696 214036 4136 S 379.3 0.1 1:09.53 Mlucas
86254 ewmayer 20 0 349776 200180 3908 S 378.8 0.1 1:50.79 Mlucas
86225 ewmayer 20 0 349048 199300 3924 S 377.1 0.1 5:18.41 Mlucas
86306 ewmayer 20 0 367112 210916 4088 S 376.8 0.1 1:08.94 Mlucas
86297 ewmayer 20 0 367004 211016 4064 S 376.5 0.1 1:08.58 Mlucas
86315 ewmayer 20 0 367140 211044 4100 S 375.9 0.1 1:08.72 Mlucas
86291 ewmayer 20 0 366916 213872 4172 S 374.2 0.1 1:09.21 Mlucas
86312 ewmayer 20 0 367132 213060 4136 S 374.2 0.1 1:08.51 Mlucas
86303 ewmayer 20 0 367044 215520 4136 S 373.4 0.1 1:08.79 Mlucas
86309 ewmayer 20 0 367128 211172 4100 S 372.2 0.1 1:09.30 Mlucas
86252 ewmayer 20 0 349524 198536 3896 S 371.7 0.1 1:50.21 Mlucas
86294 ewmayer 20 0 366980 214212 4184 S 364.0 0.1 1:07.36 Mlucas
86418 ewmayer 20 0 276924 7500 4156 R 13.3 0.0 0:03.20 top
10 root 20 0 0 0 0 I 0.8 0.0 4:20.04 rcu_sched
85938 root 20 0 0 0 0 I 0.8 0.0 0:02.60 kworker/u544:1-phy0
So, since the 16 in-stage-1 runs need little memory for now, I killed and restarted them, now *with* 'numactl -m 1'. The 5.5M-FFT jobs are down to 37 ms/iter, the 6M ones to 46 ms/iter, and the early-start run had its stage 2 timing drop back to 59 ms/iter. I know the KNL is an odd beast among CPU-based machines - its MCDRAM makes it very GPU-like in terms of memory management - but it would be nice on such NUMA hardware to somehow run stage 1 in fast RAM, then switch the above numactl setting back to "run in regular RAM" for stage 2.
#198
Sep 2002 | Database er0rr | 3,739 Posts
I can only think you'd program this as:
Code:
numactl -m 1 stage1; numactl -m 0 stage2
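Two other knobs from the numactl package may be relevant here, hedged since neither has been tried on this box: numactl's --preferred policy, which unlike -m/--membind falls back to other nodes instead of failing once the preferred node fills (so stage 1's small footprint would land in MCDRAM and stage 2's larger allocations would spill to DDR4 without restarting the job), and the companion tool migratepages(8), which moves an already-running process's resident pages between nodes. A rough sketch, with node numbers taken from the numactl -H output earlier; the Mlucas binary name and the stage-2 trigger are placeholders:

```shell
#!/bin/sh
# Prefer MCDRAM (node 1) for allocations, falling back to DDR4 (node 0) once
# the ~16 GB fills, instead of hard-binding with -m 1:
numactl --preferred=1 ./Mlucas &
PID=$!

# Or, to forcibly relocate the live process's resident pages at the stage 1 ->
# stage 2 boundary (the sleep is a placeholder; in practice, watch the log):
sleep 3600
migratepages "$PID" 1 0   # move pages from node 1 to node 0
wait "$PID"
```

Note that migratepages only moves pages already allocated; new allocations still follow the process's original policy, which is why the soft --preferred binding may be the simpler fit for a run whose stage 2 allocates its large buffers up front.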