Old 2021-06-13, 13:24   #188
kriesel
 

Quote:
Originally Posted by ewmayer View Post
192GB RAM successfully installed and running @2400MHz per the BIOS. But -
Congrats.
What do available utilities say about the memory mode, cache layout etc.?
Is that still on CentOS v8.2?
How about a quick mprime benchmark on the same inputs?
Old 2021-06-13, 17:22   #189
kriesel
 

Some interesting KNL reading:

Cache mode: https://docs.nersc.gov/performance/knl/cache-mode/

Optimizing: https://www.nersc.gov/assets/Uploads...NL-swarren.pdf

MCDRAM on KNL tutorial: https://www.slideshare.net/IwantoutofVT/mcdram-tutorial

Capability models vs. measured performance: http://htor.inf.ethz.ch/publications...l-analysis.pdf

Old 2021-06-13, 21:56   #190
ewmayer

@kriesel: Thx, but the Cray docs refer to a particular KNL-node hardware configuration which they apparently used in some KNL-based supercomputers.

Quote:
Originally Posted by paulunderwood View Post
1. is bad news. The access to the DIMMs must be much slower than the MCDRAM -- an expensive lesson. With them, do you have a great 68-instance P-1 machine? See the table on this page. There must be a way to program using mostly MCDRAM, much like using caches. More programming info on this page.
I revisited this similar tutorial from Colfax Research which I'd studied previously:

MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing Processors: Developer’s Guide | Colfax Research

Rebooted the system and verified that the newly-added RAM is being run 6-way interleaved. But I see no boot-menu options related to configuring the MCDRAM, which I suspect means it's defaulting to cache mode - hopefully the meminfo dump below will be informative. Cache mode is actually what I want, since all my current runs fit easily in the MCDRAM, with plenty of room to spare. So the question is, why isn't the OS just running them out of the MCDRAM?
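
One way to tell which mode the MCDRAM booted into, short of a BIOS menu entry, is to look at the NUMA layout from Linux: in flat (or hybrid) mode the addressable MCDRAM shows up as one or more CPU-less NUMA nodes and is counted in MemTotal, while in cache mode it is invisible to the OS and only the DIMM capacity appears. A minimal check (a sketch; it assumes lscpu and numactl are available):
Code:
lscpu | grep -i numa            # NUMA node count and per-node CPU lists
numactl -H                      # node sizes; a ~16 GB node with no CPUs is the MCDRAM
grep MemTotal /proc/meminfo     # cache mode: ~DIMM capacity only; flat/hybrid: DIMMs + addressable MCDRAM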
Quote:
2. You have to run cat /proc/cpuinfo. cpuinfo is not an executable.
Doh - forgot to prepend 'cat' or 'more'. Here is /proc/meminfo:
Code:
MemTotal:       214147932 kB
MemFree:        210226340 kB
MemAvailable:   209402676 kB
Buffers:            4384 kB
Cached:           668728 kB
SwapCached:            0 kB
Active:          1677176 kB
Inactive:         510388 kB
Active(anon):    1517104 kB
Inactive(anon):    14100 kB
Active(file):     160072 kB
Inactive(file):   496288 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       8183804 kB
SwapFree:        8183804 kB
Dirty:                12 kB
Writeback:             0 kB
AnonPages:       1484528 kB
Mapped:           264436 kB
Shmem:             16676 kB
KReclaimable:     238516 kB
Slab:            1154360 kB
SReclaimable:     238516 kB
SUnreclaim:       915844 kB
KernelStack:       52512 kB
PageTables:        38948 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    115257768 kB
Committed_AS:    6245572 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:           162112 kB
HardwareCorrupted:     0 kB
AnonHugePages:    892928 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      707124 kB
DirectMap2M:     7555072 kB
DirectMap1G:    211812352 kB
Old 2021-06-13, 22:18   #191
ewmayer

Quote from the above-linked Colfax page:
Quote:
Case 1: The entire application fits in HBM.

This is the best case scenario. If your application fits in the HBM, then set the configuration mode to Flat mode and follow the numactl instructions in Section 3.1. This usage mode does not require any code modification, and works with any application (written in any language) provided that it does not have special allocators that specifically allocate elsewhere. Note that, although this procedure requires no source code changes, applications could still benefit from general memory optimization. For more on memory traffic optimization, refer to the various online references on optimization such as the HOW Series.

If numactl cannot be used, then using Cache mode could be an alternative. Because the problem fits in the HBM cache, there will only be a few HBM cache misses. HBM cache misses are the primary factor in the performance difference between addressable memory HBM and cache HBM, so using the Cache mode could get close to or even match the performance with Flat mode. However, there is still some inherent overhead associated with using HBM as cache, so if numactl is an option, we recommend to use that method.
Again, though, I've not found a way to control the configuration mode; the only visible NUMA-related subitems in the SuperMicro boot menu are disabled, and 'which numactl' comes up empty.

Old 2021-06-13, 22:35   #192
paulunderwood
 
paulunderwood's Avatar
 
Sep 2002
Database er0rr

3,761 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Again, though, I've not found a way to control the configuration mode; the only visible NUMA-related subitems in the SuperMicro boot menu are disabled, and 'which numactl' comes up empty.
This is way out of my depth, but do you have a good reason for not using numactl in flat-mode? Have you tried timings using it?
Old 2021-06-13, 22:59   #193
ewmayer

Quote:
Originally Posted by paulunderwood View Post
This is way out of my depth, but do you have a good reason for not using numactl in flat-mode? Have you tried timings using it?
I have 2 good reasons for not using numactl:

1. I don't know which mode is currently being set for MCDRAM, with regular RAM now installed, and found no related boot option;

2. 'numactl: command not found.'

If the latter can be remedied via installation of an optional software package, I'm all ears.
Old 2021-06-13, 23:00   #194
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

2·32·7·43 Posts
Default

Quote:
Originally Posted by ewmayer View Post
Thx but...
Here is /proc/meminfo:
Code:
MemTotal:       214147932 kB...
DirectMap4k:      707124 kB
DirectMap2M:     7555072 kB
DirectMap1G:   211812352 kB
I'm pretty sure Cray didn't change the 7250 design, or they'd be calling it something else. So I think it's safe to rely on Intel's specs for the 7250 holding, regardless of the info source.
It sure doesn't help that available SuperMicro documentation is silent on the memory model, and so is the K1SPE motherboard's BIOS display.
Does Linux report round binary powers as MemTotal, or a figure after deductions for BIOS ROM shadowing and the like? Below I'm assuming it really is the total, which should come out as a whole number of GiB. (Although this top tutorial shows a "total" that isn't a round GiB number either.)

Flat memory model would give 16+192=208 GiB;
Cache would give 192 GiB.

214147932 kB (kB assumed = KiB) - 6 * 32 GiB DIMMs ~ 12.23 GiB MCDRAM, a ~75/25 split addressable/cache;
214147932 kB (kB assumed = 1000 B) - 6 * 32 GiB DIMMs ~ 7.44 GiB MCDRAM, a ~50/50 split addressable/cache;
implying the MCDRAM is being split between addressable and cache, as in what Colfax calls hybrid mode, or as shown on slide 10 of https://www.nersc.gov/assets/Uploads...rs-Feb2019.pdf (the quick bc check below reproduces these numbers).
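
A quick bc reproduction of that arithmetic (a sketch; 192 GiB is the 6 * 32 GiB of DDR4):
Code:
# MemTotal taken as KiB: what's left for MCDRAM after the 192 GiB of DIMMs
echo 'scale=4; 214147932/1048576 - 192' | bc              # ~12.23 GiB
# MemTotal taken as 1000-byte kB
echo 'scale=4; 214147932*1000/1073741824 - 192' | bc      # ~7.44 GiB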

You get nothing for numactl --hardware now, because it's not installed? Should be interesting when available.

On Windows 10 Pro with MCDRAM only, systeminfo gave Total Physical Memory: 16,260 MB (~15.88 GiB);
with a 64GiB DIMM added, Total Physical Memory: 81,796 MB (~79.88 GiB), indicating Windows boots to a flat memory model. That, and the ~40:1 latency range in the last link of https://www.mersenneforum.org/showpo...&postcount=189, pretty well matches the dramatic drop in prime95 performance I saw with a DIMM installed.

Old 2021-06-13, 23:09   #195
paulunderwood
 
paulunderwood's Avatar
 
Sep 2002
Database er0rr

376110 Posts
Default

Quote:
Originally Posted by ewmayer View Post
I have 2 good reasons for not using numactl:

1. I don't know which mode is currently being set for MCDRAM, with regular RAM now installed, and found no related boot option;

2. 'numactl: command not found.'

If the latter can be remedied via installation of an optional software package, I'm all ears.
1. Check that NUMA is enabled in the BIOS, whatever that setting may be!

2.
Installation:
Code:
sudo yum install numactl
To see the memory map:
Code:
numactl -H
To run yr_prog bound entirely to MCDRAM:
Code:
numactl -m 1 ./yr_prog
Copying and pasting:
Quote:
Preferring (-p) vs. requiring (-m)

A big question when using MCDRAM is whether an allocation must use MCDRAM (numactl -m) or just prefer it (numactl –p). Generally, I find programmers want to require it and then debug their program if they have not precisely planned the layout perfectly. This seems like a good idea, but there are two complications. First, the operating system or libraries do not always clean up everything perfectly so 16Gb can become 15.9Gb, and a request for all 16Gb would not succeed. Gradually, we are figuring out how to make sure the system leaves all 16Gb for us – so requiring still can seem like the way to go. That brings us to the second complication: memory oversubscription only happens at page allocation time, not memory allocation time. That can be a bit hard to wrap our heads around – but it ultimately means that “requiring” memory, and then running out of it, generally causes Linux to abort the program. What is the operating system to do when it runs out of memory? I expect interfaces for this will continue to evolve, but for now requiring MCDRAM will abort a program long after the allocation when the oversubscription actually happens. In contrast, preferring MCDRAM will silently switch to DDR when MCDRAM is exhausted.
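
In other words (a minimal sketch, reusing the placeholder yr_prog name from above, with the MCDRAM as NUMA node 1):
Code:
numactl -m 1 ./yr_prog    # require MCDRAM: aborts if its ~16 GB gets oversubscribed
numactl -p 1 ./yr_prog    # prefer MCDRAM: silently spills over to DDR4 when it's exhausted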

Old 2021-06-13, 23:57   #196
ewmayer

@Paul - thx. NUMA-ness must be enabled by default in BIOS, because after installing the package, 'numactl -H' gives
Code:
[ewmayer@localhost ~]$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271
node 0 size: 193005 MB
node 0 free: 178568 MB
node 1 cpus:
node 1 size: 16123 MB
node 1 free: 15992 MB
node distances:
node   0   1 
  0:  10  31 
  1:  31  10
and after restarting my usual 2 jobs, now with 'numactl -m 1' prepended, the timings revert to what they were pre-RAM install. Outstanding. As I noted, in my case the jobs are not even close to needing 16GB, so there should be no worry about requiring MCDRAM usage. For bigger-memory p-1 work, clearly, that will be inappropriate. Still several days of code work are needed before I'll be able to fire up multiple p-1 jobs and see what the resulting total throughput looks like.
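
For concreteness, node 1 in the listing above (16123 MB, no CPUs) is the MCDRAM, so "prepended" means launch lines along these lines - a sketch, assuming Mlucas's usual -cpu core-range flag and with hypothetical core assignments:
Code:
numactl -m 1 ./Mlucas -cpu 0:3 &    # job 1, memory bound to the MCDRAM node
numactl -m 1 ./Mlucas -cpu 4:7 &    # job 2, likewise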
Old 2021-06-22, 03:21   #197
ewmayer

After debugging and testing the integrated GMP gcd code over the weekend, I fired up one trial p-1 run to bounds [b1,b2] = [1m,30m] on a known-factor case, to put the Mlucas v20 dev-code through a soup-to-nuts (complete stages 1 & 2) run. That shook out a couple of bugs, and is now ~midway through stage 2 on one of the known-factor cases described next.

For shakeout testing I assembled 55 exponents, all the cases in which my GPU runs of gpuowl over the last year+ found a factor in p-1. Around 10 cases can use 5.5M FFT; the rest need 6M. Those runs used a mix of the foregoing bounds and the larger ones (typically [5.5m,165m]) used by the newer gpuowl versions. As it happens, all of said factors are findable with b1 = 1m, and 24 such need only stage 1 to that bound. Of the 31 remaining ones, just 2 need b2 > 30m, so I broke those into an opening run with [b1,b2] = [1m,30m] and as many subsequent deeper-stage-2-only runs in 30m-increments as needed to discover the factor in question.

Then, in an attempt to roughly equalize the work per instance, I figured 1 work unit (WU) for stage 1 to 1m and 1 WU for each 30m worth of stage 2. Then I divvied them up across 17 worktodo files - this includes the already-started run above, which needs stage 2 to ~155m, thus 2+1+1+1+1+1 = 7 WUs, and so will work only on that assignment - making sure each instance has at least 1 assignment needing stage 2 only, and that each of the 16 new-instance workfiles amounted to 5 WUs.

My ongoing run on cores 64:67 needed 44ms/iter for stage 1 (b1 = 1m ==> ~1.44M iters) and 58ms/iter for each stage 2 iteration, which at my memory settings for all these runs - 10% of available RAM, ~20GB - processes ~1.5 stage 2 primes per iteration. (I hope to cut that 58/44 timing ratio closer to 1 in ongoing work.) This run was launched without using 'numactl' to force it to live in the 16GB onboard MCDRAM (only ~10GB of which is available for user stuff, in my experience), because stage 2 needs more than that.

So fired up the 16 new jobs, each of which will start with stage 1 to 1m and thus with a tiny memory footprint, but again sans numactl just for an apples-to-apples comparison with the ongoing run. The resulting 'top' was very satisfying:
Code:
[ewmayer@localhost obj_avx512]$ top

top - 16:18:40 up 4 days,  1:05,  2 users,  load average: 37.88, 43.93, 45.72
Tasks: 2400 total,   2 running, 2396 sleeping,   2 stopped,   0 zombie
%Cpu(s): 23.5 us,  0.4 sy,  0.0 ni, 74.8 id,  0.0 wa,  1.1 hi,  0.1 si,  0.0 st
MiB Mem : 209128.8 total, 179472.7 free,  26840.6 used,   2815.5 buff/cache
MiB Swap:   7992.0 total,   7992.0 free,      0.0 used. 180465.7 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                 
  86256 ewmayer   20   0  366692 211616   4116 S 383.3   0.1   1:52.26 Mlucas                                                  
  86300 ewmayer   20   0  367040 211160   4184 S 382.7   0.1   1:09.98 Mlucas                                                  
  86318 ewmayer   20   0  367164 211220   4136 S 382.2   0.1   1:08.98 Mlucas                                                  
  83998 ewmayer   20   0   19.3g  19.2g   4100 S 379.9   9.4 554:08.26 Mlucas                                                  
  86289 ewmayer   20   0  366700 212988   4116 S 379.9   0.1   1:09.01 Mlucas                                                  
  86287 ewmayer   20   0  366696 214036   4136 S 379.3   0.1   1:09.53 Mlucas                                                  
  86254 ewmayer   20   0  349776 200180   3908 S 378.8   0.1   1:50.79 Mlucas                                                  
  86225 ewmayer   20   0  349048 199300   3924 S 377.1   0.1   5:18.41 Mlucas                                                  
  86306 ewmayer   20   0  367112 210916   4088 S 376.8   0.1   1:08.94 Mlucas                                                  
  86297 ewmayer   20   0  367004 211016   4064 S 376.5   0.1   1:08.58 Mlucas                                                  
  86315 ewmayer   20   0  367140 211044   4100 S 375.9   0.1   1:08.72 Mlucas                                                  
  86291 ewmayer   20   0  366916 213872   4172 S 374.2   0.1   1:09.21 Mlucas                                                  
  86312 ewmayer   20   0  367132 213060   4136 S 374.2   0.1   1:08.51 Mlucas                                                  
  86303 ewmayer   20   0  367044 215520   4136 S 373.4   0.1   1:08.79 Mlucas                                                  
  86309 ewmayer   20   0  367128 211172   4100 S 372.2   0.1   1:09.30 Mlucas                                                  
  86252 ewmayer   20   0  349524 198536   3896 S 371.7   0.1   1:50.21 Mlucas                                                  
  86294 ewmayer   20   0  366980 214212   4184 S 364.0   0.1   1:07.36 Mlucas                                                  
  86418 ewmayer   20   0  276924   7500   4156 R  13.3   0.0   0:03.20 top                                                     
     10 root      20   0       0      0      0 I   0.8   0.0   4:20.04 rcu_sched                                               
  85938 root      20   0       0      0      0 I   0.8   0.0   0:02.60 kworker/u544:1-phy0
... but the timings less so: the runs at 5.5M FFT needed 57ms/iter, those @6M a whopping 78ms/iter, nearly double what I got in stage 1 for the early-started run. *That* run, meanwhile, had its stage 2 timings jump from 58 to 117ms/iter. Whoa.

So, since the 16 in-stage-1 runs need little memory for now, killed and restarted them, now *with* 'numactl -m 1'. The 5.5M-FFT jobs are down to 37ms/iter, the 6M ones to 46ms/iter, and the early-start run had its stage 2 timing drop back to 59 ms/iter.

Now, I know the KNL is an odd beast among CPU-based machines - its MCDRAM makes it very GPU-like in terms of memory management - but it would be nice on such NUMA hardware to somehow run stage 1 in fast RAM, then switch the above numactl setting back to "run in regular RAM" for stage 2.
Old 2021-06-22, 05:12   #198
paulunderwood
 
paulunderwood's Avatar
 
Sep 2002
Database er0rr

3,761 Posts
Default

I can only think you'd program this as:

numactl -m 1 stage1; numactl -m 0 stage2
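
Spelled out a bit (a hypothetical sketch; stage1 and stage2 stand for however the two stages end up being launched separately, and the node numbers follow the numactl -H output above):
Code:
numactl -m 1 ./stage1    # stage 1 fits in the 16 GB MCDRAM, so require node 1
numactl -p 0 ./stage2    # stage 2 needs ~20 GB, so prefer (or bind to) the DDR4 node 0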