mersenneforum.org Intel Xeon PHI?

2021-06-13, 13:24   #188
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

12465₈ Posts

Quote:
 Originally Posted by ewmayer 192GB RAM successfully installed and running @2400MHz per the BIOS. But -
Congrats.
What do available utilities say about the memory mode, cache layout etc.?
Is that still on CentOS v8.2?
How about a quick mprime benchmark on the same inputs?

2021-06-13, 17:22   #189
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

12465₈ Posts

This Cray KNL slide set is interesting.
Cache mode: https://docs.nersc.gov/performance/knl/cache-mode/
Optimizing: https://www.nersc.gov/assets/Uploads...NL-swarren.pdf
MCDRAM on KNL tutorial: https://www.slideshare.net/IwantoutofVT/mcdram-tutorial
Capability models vs. measured performance: http://htor.inf.ethz.ch/publications...l-analysis.pdf

Last fiddled with by kriesel on 2021-06-13 at 17:31
2021-06-13, 21:56   #190
ewmayer
∂2ω=0

Sep 2002
República de California

19·613 Posts

@kriesel: Thx, but the Cray docs refer to a particular KNL-node hardware configuration which they apparently used in some KNL-based supercomputers.

Quote:
 Originally Posted by paulunderwood 1, is bad news. The access to the DIMMs must be much slower than the MCDRAM -- an expensive lesson. With them, do you have a great 68-instances P-1 machine? See the table on this page. There must be a way to program using mostly MCDRAM much like using caches. More programming info on this page.
I revisited this similar tutorial from Colfax Research which I'd studied previously:

MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors: Developer’s Guide | Colfax Research

Rebooted system, verified that newly-added RAM is being run 6-way interleaved. But see no boot menu options related to configuring the MCDRAM, which I suspect means it's defaulting to cache mode - hopefully the meminfo dump below will be informative. Cache mode is actually what I want, since all my current runs fit easily in the MCDRAM, with plenty of room to spare. So the question is, why isn't the OS just running them out of the MCDRAM?
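For what it's worth, there is a userspace tell for the two modes: in flat (or hybrid) mode the MCDRAM is exposed as an extra NUMA node with memory but no CPUs, while in cache mode it is invisible to the OS. A minimal sketch - the here-doc is hypothetical flat-mode-style 'numactl -H' text, not output from this machine; on a real box pipe the actual 'numactl -H' output in instead:

```shell
# A CPU-less NUMA node in 'numactl -H' output indicates flat/hybrid mode;
# in cache mode no such node appears. The sample text below is hypothetical.
awk '/^node [0-9]+ cpus:/ && NF == 3 { print "node", $2, "is CPU-less (MCDRAM?)" }' <<'EOF'
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3
node 0 size: 193005 MB
node 1 cpus:
node 1 size: 16123 MB
EOF
```

The match is on 'node N cpus:' lines with nothing after the colon (NF == 3), i.e. a node that owns memory but no CPUs.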
Quote:
 2. You have to run cat /proc/cpuinfo. cpuinfo is not an executable.
Doh - forgot to prepend 'cat' or 'more'. Here is /proc/meminfo:
Code:
MemTotal:       214147932 kB
MemFree:        210226340 kB
MemAvailable:   209402676 kB
Buffers:            4384 kB
Cached:           668728 kB
SwapCached:            0 kB
Active:          1677176 kB
Inactive:         510388 kB
Active(anon):    1517104 kB
Inactive(anon):    14100 kB
Active(file):     160072 kB
Inactive(file):   496288 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8183804 kB
SwapFree:        8183804 kB
Dirty:                12 kB
Writeback:             0 kB
AnonPages:       1484528 kB
Mapped:           264436 kB
Shmem:             16676 kB
KReclaimable:     238516 kB
Slab:            1154360 kB
SReclaimable:     238516 kB
SUnreclaim:       915844 kB
KernelStack:       52512 kB
PageTables:        38948 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    115257768 kB
Committed_AS:    6245572 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:           162112 kB
HardwareCorrupted:     0 kB
AnonHugePages:    892928 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      707124 kB
DirectMap2M:     7555072 kB
DirectMap1G:    211812352 kB

2021-06-13, 22:18   #191
ewmayer
∂2ω=0

Sep 2002
República de California

19×613 Posts

Quote from the above-linked Colfax page:
Quote:
 Case 1: The entire application fits in HBM. This is the best case scenario. If your application fits in the HBM, then set the configuration mode to Flat mode and follow the numactl instructions in Section 3.1. This usage mode does not require any code modification, and works with any application (written in any language) provided that it does not have special allocators that specifically allocate elsewhere. Note that, although this procedure requires no source code changes, applications could still benefit from general memory optimization. For more on memory traffic optimization, refer to the various online references on optimization such as the HOW Series. If numactl cannot be used, then using Cache mode could be an alternative. Because the problem fits in the HBM cache, there will only be a few HBM cache misses. HBM cache misses are the primary factor in the performance difference between addressable memory HBM and cache HBM, so using the Cache mode could get close to or even match the performance with Flat mode. However, there is still some inherent overhead associated with using HBM as cache, so if numactl is an option, we recommend to use that method.
Again, though, I've not found a way to control the configuration mode: the only visible NUMA-related subitems in the SuperMicro boot menu are disabled, and 'which numactl' comes up empty.

Last fiddled with by ewmayer on 2021-06-13 at 22:19

2021-06-13, 22:35   #192
paulunderwood

Sep 2002
Database er0rr

2·3²·11·19 Posts

Quote:
 Originally Posted by ewmayer Again, though, I've not found a way to control the configuration mode, the only visible NUMA-related subitems in the SuperMicro boot menu are disabled, and 'which numactl' comes up empty.
This is way out of my depth, but do you have a good reason for not using numactl in flat-mode? Have you tried timings using it?

2021-06-13, 22:59   #193
ewmayer
∂2ω=0

Sep 2002
República de California

10110101111111₂ Posts

Quote:
 Originally Posted by paulunderwood This is way out of my depth, but do you have a good reason for not using numactl in flat-mode? Have you tried timings using it?
I have 2 good reasons for not using numactl:

1. I don't know which mode is currently being set for MCDRAM, with regular RAM now installed, and found no related boot option;

2. 'numactl: command not found.'

If the latter can be remedied via installation of an optional software package, I'm all ears.

2021-06-13, 23:00   #194
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

61·89 Posts

Quote:
 Originally Posted by ewmayer Thx but... Here is /proc/meminfo: Code: MemTotal: 214147932 kB... DirectMap4k: 707124 kB DirectMap2M: 7555072 kB DirectMap1G: 211812352 kB
I'm pretty sure Cray didn't change the 7250 design, or they'd be calling it something else. So I think it's safe to rely on Intel's specs for the 7250 regardless of the info source.
It sure doesn't help that available SuperMicro documentation is silent on the memory model, and so is the K1SPE motherboard's BIOS display.
Does Linux report round binary powers as MemTotal, or the total after deductions for BIOS ROM shadowing and the like? Below I'm assuming it really is the total, which should come out as a whole number of GiB. (Although this top tutorial shows a "total" that isn't a round GiB figure either.)

Flat memory model would give 16+192=208 GiB;
Cache would give 192 GiB.

214147932 kB (kB assumed = KiB) - 6 × 32 GiB DIMMs ≈ 12.23 GiB MCDRAM, a ~75/25 addressable/cache split;
214147932 kB (kB assumed = 1000 B) - 6 × 32 GiB DIMMs ≈ 7.44 GiB MCDRAM, a ~50/50 addressable/cache split;
implying the MCDRAM is being divided between addressable and cache as in what Colfax calls hybrid mode, or as shown on slide 10 of https://www.nersc.gov/assets/Uploads...rs-Feb2019.pdf
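That arithmetic can be cranked mechanically. A small sketch, using only the MemTotal figure from the /proc/meminfo dump above and the 6 × 32 GiB DIMM total; in either reading the excess falls well short of the full 16 GiB, pointing at a hybrid split:

```shell
# Check the two interpretations of MemTotal = 214147932 "kB" against the
# 192 GiB of installed DIMMs; the excess would be the addressable MCDRAM slice.
awk 'BEGIN {
    kb = 214147932; d = 192
    a = kb / 1048576            # "kB" read as KiB
    b = kb * 1000 / 1073741824  # "kB" read as 1000 bytes
    printf "as KiB: %.2f GiB total, %.2f GiB above DIMMs\n", a, a - d
    printf "as kB=1000B: %.2f GiB total, %.2f GiB above DIMMs\n", b, b - d
}'
```

This prints ~12.23 GiB for the KiB reading and ~7.44 GiB for the 1000-byte reading.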

You get nothing for numactl --hardware right now because it's not installed? Should be interesting once it's available.

On Windows 10 Pro, systeminfo gave for MCDRAM only, Total Physical Memory: 16,260 MB (~15.88 GiB);
with a 64GiB DIMM added, Total Physical Memory: 81,796 MB (~79.88 GiB), indicating Windows boots to a flat memory model. That and the ~40:1 latency range in the last link of https://www.mersenneforum.org/showpo...&postcount=189 pretty well matches the dramatic drop in prime95 performance I saw with a DIMM installed.

Last fiddled with by kriesel on 2021-06-13 at 23:59

2021-06-13, 23:09   #195
paulunderwood

Sep 2002
Database er0rr

EB2₁₆ Posts

Quote:
 Originally Posted by ewmayer I have 2 good reasons for not using numactl: 1. I don't know which mode is currently being set for MCDRAM, with regular RAM now installed, and found no related boot option; 2. 'numactl: command not found.' If the latter can be remedied via installation of an optional software package, I'm all ears.
1. Check that NUMA is on in the BIOS, whatever that is!

2.
Installation:
Code:
sudo yum install numactl
To see the memory map:
Code:
numactl -H
To run yr_prog bound entirely to MCDRAM:
Code:
numactl -m 1 ./yr_prog
Copying and pasting:
Quote:
 Preferring (-p) vs. requiring (-m) A big question when using MCDRAM is whether an allocation must use MCDRAM (numactl -m) or just prefer it (numactl –p). Generally, I find programmers want to require it and then debug their program if they have not precisely planned the layout perfectly. This seems like a good idea, but there are two complications. First, the operating system or libraries do not always clean up everything perfectly so 16Gb can become 15.9Gb, and a request for all 16Gb would not succeed. Gradually, we are figuring out how to make sure the system leaves all 16Gb for us – so requiring still can seem like the way to go. That brings us to the second complication: memory oversubscription only happens at page allocation time, not memory allocation time. That can be a bit hard to wrap our heads around – but it ultimately means that “requiring” memory, and then running out of it, generally causes Linux to abort the program. What is the operating system to do when it runs out of memory? I expect interfaces for this will continue to evolve, but for now requiring MCDRAM will abort a program long after the allocation when the oversubscription actually happens. In contrast, preferring MCDRAM will silently switch to DDR when MCDRAM is exhausted.
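Translated into concrete commands for this box (node numbers assume the flat-mode layout where node 1 is the MCDRAM, and 'yr_prog' is the placeholder binary name from above), the two policies in that quote look like:

```shell
# Require MCDRAM: fastest while it fits, but per the quote above, Linux
# aborts the program if node 1's ~16 GB is oversubscribed at page-allocation time.
numactl -m 1 ./yr_prog

# Prefer MCDRAM: identical placement while node 1 has free pages, but it
# silently spills to DDR4 (node 0) instead of aborting once MCDRAM is exhausted.
numactl -p 1 ./yr_prog
```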

Last fiddled with by paulunderwood on 2021-06-13 at 23:30

2021-06-13, 23:57   #196
ewmayer
∂2ω=0

Sep 2002
República de California

26577₈ Posts

@Paul - thx. NUMA-ness must be enabled by default in BIOS, because after installing the package, 'numactl -H' gives
Code:
[ewmayer@localhost ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 193005 MB
node 0 free: 178568 MB
node 1 cpus:
node 1 size: 16123 MB
node 1 free: 15992 MB
node distances:
node   0   1
  0:  10  31
  1:  31  10
and restarting my usual 2 jobs, now with 'numactl -m 1' prepended, timings revert to what they were pre-RAM install. Outstanding.

As I noted, in my case the jobs are not even close to needing 16GB, so there should be no worry about requiring MCDRAM usage. For bigger-mem p-1 work, clearly, that will be inappropriate. Still several days of code work needed before I'll be able to fire up multiple p-1 jobs and see what the resulting total throughput looks like.
2021-06-22, 03:21   #197
ewmayer
∂2ω=0

Sep 2002
República de California

19·613 Posts

After debug & test of the integrated GMP gcd code over the weekend, I fired up one trial p-1 run to bounds [b1,b2] = [1m,30m] on a known-factor case, to exercise the Mlucas v20 dev-code on a soup-to-nuts (complete stages 1 & 2) run. That shook out a couple of bugs, and is now ~midway through stage 2 on one of the known-factor cases described next.

For shakeout testing I assembled 55 exponents, all the cases in which my GPU runs of gpuowl over the last year+ found a factor in p-1. Around 10 cases can use a 5.5M FFT; the rest need 6M. Those runs used a mix of the foregoing bounds and the larger ones (typically [5.5m,165m]) used by the newer gpuowl versions. As it happens, all of said factors are findable with b1 = 1m, and 24 of them need only stage 1 to that bound. Of the 31 remaining ones, just 2 need b2 > 30m, so I broke those into an opening run with [b1,b2] = [1m,30m] and as many subsequent deeper-stage-2-only runs, in 30m increments, as needed to discover the factor in question.

Then, in an attempt to roughly equalize the work per instance, I figured 1 work unit (WU) for stage 1 to 1m and 1 WU for each 30m worth of stage 2, and divvied them up across 17 worktodo files - this includes the already-started run above, which needs stage 2 to ~155m, thus 2+1+1+1+1+1 = 7 WUs, and thus will work only on that assignment - making sure each has at least 1 assignment needing stage 2 only, and that each of the 16 new-instance workfiles amounted to 5 WUs.

My ongoing run on cores 64:67 needed 44 ms/iter for stage 1 (b1 = 1m ==> ~1.44 Miters) and 58 ms/iter for each stage 2 iteration, which at my memory settings for all these runs - 10% of available RAM, ~20GB - processes ~1.5 stage 2 primes per iteration. (I hope to cut that 58/44 timing ratio closer to 1 in ongoing work.)
This run was launched without using 'numactl' to force it to live in the 16GB onboard MCDRAM (only ~10GB of which is available for user stuff, in my experience), because stage 2 needs more than that. So I fired up the 16 new jobs, each of which starts with stage 1 to 1m and thus has a tiny memory footprint, but again sans numactl, just for an apples-to-apples comparison with the ongoing run. The resulting 'top' was very satisfying:
Code:
[ewmayer@localhost obj_avx512]$ top
top - 16:18:40 up 4 days, 1:05, 2 users, load average: 37.88, 43.93, 45.72
Tasks: 2400 total, 2 running, 2396 sleeping, 2 stopped, 0 zombie
%Cpu(s): 23.5 us, 0.4 sy, 0.0 ni, 74.8 id, 0.0 wa, 1.1 hi, 0.1 si, 0.0 st
MiB Mem : 209128.8 total, 179472.7 free, 26840.6 used, 2815.5 buff/cache
MiB Swap: 7992.0 total, 7992.0 free, 0.0 used. 180465.7 avail Mem

  PID USER     PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
86256 ewmayer  20   0  366692 211616  4116 S 383.3  0.1   1:52.26 Mlucas
86300 ewmayer  20   0  367040 211160  4184 S 382.7  0.1   1:09.98 Mlucas
86318 ewmayer  20   0  367164 211220  4136 S 382.2  0.1   1:08.98 Mlucas
83998 ewmayer  20   0   19.3g  19.2g  4100 S 379.9  9.4 554:08.26 Mlucas
86289 ewmayer  20   0  366700 212988  4116 S 379.9  0.1   1:09.01 Mlucas
86287 ewmayer  20   0  366696 214036  4136 S 379.3  0.1   1:09.53 Mlucas
86254 ewmayer  20   0  349776 200180  3908 S 378.8  0.1   1:50.79 Mlucas
86225 ewmayer  20   0  349048 199300  3924 S 377.1  0.1   5:18.41 Mlucas
86306 ewmayer  20   0  367112 210916  4088 S 376.8  0.1   1:08.94 Mlucas
86297 ewmayer  20   0  367004 211016  4064 S 376.5  0.1   1:08.58 Mlucas
86315 ewmayer  20   0  367140 211044  4100 S 375.9  0.1   1:08.72 Mlucas
86291 ewmayer  20   0  366916 213872  4172 S 374.2  0.1   1:09.21 Mlucas
86312 ewmayer  20   0  367132 213060  4136 S 374.2  0.1   1:08.51 Mlucas
86303 ewmayer  20   0  367044 215520  4136 S 373.4  0.1   1:08.79 Mlucas
86309 ewmayer  20   0  367128 211172  4100 S 372.2  0.1   1:09.30 Mlucas
86252 ewmayer  20   0  349524 198536  3896 S 371.7  0.1   1:50.21 Mlucas
86294 ewmayer  20   0  366980 214212  4184 S 364.0  0.1   1:07.36 Mlucas
86418 ewmayer  20   0  276924   7500  4156 R  13.3  0.0   0:03.20 top
   10 root     20   0       0      0     0 I   0.8  0.0   4:20.04 rcu_sched
85938 root     20   0       0      0     0 I   0.8  0.0   0:02.60 kworker/u544:1-phy0
...
...but the timings less so: the runs at 5.5M FFT needed 57 ms/iter, those @6M a whopping 78 ms/iter, nearly double what I got in stage 1 for the early-started run. *That*, meanwhile, had its stage 2 timings jump from 58 to 117 ms/iter. Whoa.

So, since the 16 in-stage-1 runs need little memory for now, I killed and restarted them, now *with* 'numactl -m 1'. The 5.5M-FFT jobs are down to 37 ms/iter, the 6M ones to 46 ms/iter, and the early-start run had its stage 2 timing drop back to 59 ms/iter.

Now, I know the KNL is an odd beast among CPU-based machines - its MCDRAM makes it very GPU-like in terms of memory management - but it would be nice on such NUMA hardware to somehow run stage 1 in fast RAM, then switch the above numactl setting back to "run in regular RAM" for stage 2.
2021-06-22, 05:12   #198
paulunderwood

Sep 2002
Database er0rr

2×3²×11×19 Posts

I can only think you program this as:

numactl -m 1 stage1; numactl -m 0 stage2
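That one-liner could be fleshed out as a small wrapper. Note the caveats: 'run_stage1'/'run_stage2' are hypothetical per-stage invocations (Mlucas has no such split entry points today), and stage 2 uses -p rather than -m 0 so that oversubscription falls back instead of aborting, per the prefer-vs-require discussion above. So this is the shape of the idea, not a working recipe:

```shell
#!/bin/sh
# Hypothetical per-stage NUMA binding: on this box node 1 = MCDRAM, node 0 = DDR4.
MCDRAM_NODE=1
DDR4_NODE=0

# Stage 1 fits easily in the ~16 GB MCDRAM, so hard-require it (-m).
numactl -m "$MCDRAM_NODE" ./run_stage1 || exit 1

# Stage 2 needs ~20 GB, so prefer (-p) the DDR4 node; exhaustion then
# spills rather than killing the run.
numactl -p "$DDR4_NODE" ./run_stage2
```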

