2017-06-18, 00:33   #1   airsquirrels
Kaby Lake Memory Speed

I'm doing some optimization work on a non-GIMPS-related project, and I'm seeing some very unexpected results from a very basic benchmark across a number of different CPUs.

The code in question simply mallocs a 2GB buffer and then fills it in a tight, slightly unrolled loop of:

Code:
imulq %r13, %r13                # square the running value in %r13
movq %r13, (%r15, %rbx, 8)      # store to buf[rbx]
imulq %r13, %r13
movq %r13, 8(%r15, %rbx, 8)     # store to buf[rbx+1]
imulq %r13, %r13
movq %r13, 16(%r15, %rbx, 8)    # store to buf[rbx+2]
imulq %r13, %r13
movq %r13, 24(%r15, %rbx, 8)    # store to buf[rbx+3]
# test/cmp/jb normal loop code.
I time this with some simple gettimeofday calls before and after the loop.
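For context, the harness is roughly the following; this is a simplified C sketch rather than the real code, with the inner loop being the hand-unrolled ASM above, shown here as plain C:

Code:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define BUF_BYTES (2ULL << 30)          /* 2 GB */

static double now_sec(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void) {
    uint64_t n = BUF_BYTES / sizeof(uint64_t);
    uint64_t *buf = malloc(BUF_BYTES);
    if (!buf) return 1;

    uint64_t b = 0x9e3779b97f4a7c15ULL;  /* arbitrary odd seed */
    double t0 = now_sec();
    for (uint64_t i = 0; i < n; i++) {   /* compiles to roughly the imul/mov loop above */
        b *= b;
        buf[i] = b;
    }
    double t1 = now_sec();
    printf("fill: %.2f MB/s\n", (BUF_BYTES / 1e6) / (t1 - t0));
    free(buf);
    return 0;
}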

On an i3-7100 51W Kaby Lake (3.9 GHz) this yields a speed of 26924.74 MB/s with two sticks of 2133 MHz DDR4. Subsequent 1,000,000 rounds of random 64-byte reads come in at 1759.24 MB/s. The random reads are queued up in a worker thread pool sized at 2x the virtual core count (which seemed to give the best throughput on all chips).

My 2016 MacBook Pro (Touch Bar) has a 2.9 GHz quad-core Skylake with two 8GB DIMMs of 2133 MHz LPDDR3 and yields 7998.58 MB/s, with 1671 MB/s for subsequent 64-byte random reads.

Then things get weird.
A 5960X Haswell-E with quad-channel 2400 MHz DDR4 yields 10048.88 MB/s for the fill, and only 561.54 MB/s for the random reads.

A 6950X Broadwell-E with quad-channel 3200 MHz DDR4 yields 13907.02 MB/s for the fill, and only 701.57 MB/s for the random reads.

The dual Xeon 2698v3 with eight 2133 MHz DIMMs is even worse: 8325.30 MB/s for the fill and 412.09 MB/s for the random reads.

KNL is abysmally slow despite "16GB of HBM" sitting next to 64 cores/256 threads. Varying thread count from 10 up to 1024 didn't make much difference relative to the Kaby Lake system.

What gives? Does Skylake/Kaby Lake have much better memory dispatch/cache handling? Some of the difference can be attributed to single-threaded clock speed, especially during the write/fill portion; however, the laptop, slow as it is, still comes in much quicker than the faster 6950X, and the multi-threaded reads should be much quicker on the 4-8 core machines than on the 2-core Kaby Lake, especially the 3400 MHz Broadwell-E system.

Anyone have any theories/wisdom to impart?

Last fiddled with by airsquirrels on 2017-06-18 at 00:37
2017-06-18, 05:33   #2   ewmayer

Cache/prefetch effects may be responsible for much of your timing variation, especially for the random reads, but I suggest you first try to eliminate latency effects in your IMUL timing loop. On the CPUs you tried, IMUL latency ranges from 3 to 8 cycles, with KNL at 8, fully 2x worse than any of the others. Your test loop forces all the IMULs to execute in strict sequence, i.e. each write and its subsequent IMUL must wait until the preceding IMUL finishes.

Try a loop in which the body does 8 independent IMULs and ensuing writes-to-memory and post the resulting 'write loop' throughputs.
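I.e. something along these lines; this is just a C sketch with made-up names, which the compiler should turn into 8 interleaved IMUL chains:

Code:
#include <stdint.h>

/* Sketch: 8 independent squaring chains per iteration, so each store only
   waits on its own IMUL instead of the whole serialized chain. */
void fill_independent(uint64_t *buf, uint64_t n, uint64_t seed)
{
    uint64_t a0 = seed | 1, a1 = a0 + 2, a2 = a0 + 4,  a3 = a0 + 6;
    uint64_t a4 = a0 + 8,  a5 = a0 + 10, a6 = a0 + 12, a7 = a0 + 14;
    for (uint64_t i = 0; i + 8 <= n; i += 8) {
        a0 *= a0; buf[i+0] = a0;
        a1 *= a1; buf[i+1] = a1;
        a2 *= a2; buf[i+2] = a2;
        a3 *= a3; buf[i+3] = a3;
        a4 *= a4; buf[i+4] = a4;
        a5 *= a5; buf[i+5] = a5;
        a6 *= a6; buf[i+6] = a6;
        a7 *= a7; buf[i+7] = a7;
    }
}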

Also, is your random-read loop high-level code or does it also have an ASM core? I'm wondering whether vector gather-loads, despite their generally horrendous latency, might help, and if so, how much the resulting speedup factors vary between CPU flavors.

Last fiddled with by ewmayer on 2017-06-18 at 05:34
2017-06-18, 07:29   #3   Mark Rose

Besides the prefetching behaviour, it may also be possible that you're hitting the same memory bank in the Haswell and Broadwell chips. What happens if you swap RAM sticks? Have you tried using a large amount of allocated memory to do your random reads from, like 1 GB?

DDR4 sticks have either 2 or 4 bank groups, each with 4 sub-banks. Bank accesses to different bank groups are faster.

Reading this DDR4 design document from Micron:

Quote:
Look-up tables (LUTs) are a write-once, read-many application. From a worst-case modeling scenario, LUTs are typically considered to be a 100% random/read application because if a look-up table did an update once every few seconds, there would be billions of clock cycles between each update. This would allow for hundreds of millions of reads between each write.

System goal: Use 4Gb DDR4-2400 (tCK = 0.83ns) devices to support 300 million accesses per second (MAPS).

DDR4 architecture: DDR4 uses an 8n-prefetch (BL = 8) architecture. To avoid contention on the data bus, the minimum separation from one access to the next must be four clock cycles (BL/2). Using BC4 mode for this access pattern does not provide any timing benefit. This restricts the command bus to one access every four clock cycles, or 300 MAPS (tCK = 0.83ns). In this theoretical scenario, the data bus is 100% utilized.

DDR4 timing constraints: The worst-case scenario for a random read application is accessing data stored in the same bank on every access. Accessing the same bank requires tRC to be satisfied between accesses. DDR4-2400 (-083E) tRC is 45.32ns or 55 clocks at 0.83ns tCK. Doing one access every 55 clock cycles allows only 22 MAPS.
It's possible the desktop chips are being more aggressive with the RAM timings, too.
2017-06-18, 14:01   #4   airsquirrels

Quote:
Originally Posted by ewmayer
Cache/prefetch effects may be responsible for much of your timing variation, especially for the random reads, but I suggest you first try to eliminate latency effects in your IMUL timing loop. On the CPUs you tried, IMUL latency ranges from 3 to 8 cycles, with KNL at 8, fully 2x worse than any of the others. Your test loop forces all the IMULs to execute in strict sequence, i.e. each write and its subsequent IMUL must wait until the preceding IMUL finishes.

Try a loop in which the body does 8 independent IMULs and ensuing writes-to-memory and post the resulting 'write loop' throughputs.

Also, is your random-read loop high-level code or does it also have an ASM core? I'm wondering whether vector gather-loads, despite their generally horrendous latency, might help, and if so, how much the resulting speedup factors vary between CPU flavors.
I will certainly give these optimizations a try and post numbers, although I am well aware that I can speed up the ASM itself. Ultimately it is the random reads from this 2GB buffer that I need to be fast. I referenced this code because I found the difference between chips fascinating, and a similar difference shows up in the reads.

I’ve considered whether gather instructions might be of some use. I’m more worried about total throughput than sequential latency.

The problem is such that I can read as many 64-byte chunks in parallel as I can keep in cache, but I must do some work on each batch before I know the sparse addresses where the next batch of chunks resides. It would also be beneficial if I could keep the global reads from polluting the cache lines, since they are highly unlikely to be reused. The buffer is also read-only once I've populated it, if that helps.
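To make the access pattern concrete, here is a rough sketch of one batch, with an NTA prefetch hint thrown in as an attempt to limit cache pollution. The names (walk_batches, process_chunk, the address array) are placeholders, and whether the hint actually helps is exactly what I am unsure about:

Code:
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH 64

/* Hypothetical batch loop: prefetch this batch's 64-byte chunks with the NTA
   hint, process them, and only then compute the next batch's addresses.
   Whether PREFETCHNTA really keeps the lines out of the outer caches is
   microarchitecture-dependent. */
void walk_batches(const uint8_t *base, const size_t *addrs, size_t count,
                  void (*process_chunk)(const uint8_t *chunk))
{
    for (size_t i = 0; i < count; i += BATCH) {
        size_t n = (count - i < BATCH) ? (count - i) : BATCH;
        for (size_t j = 0; j < n; j++)
            _mm_prefetch((const char *)(base + addrs[i + j]), _MM_HINT_NTA);
        for (size_t j = 0; j < n; j++)
            process_chunk(base + addrs[i + j]);  /* work that yields the next addresses */
    }
}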

Mark's bank suggestions also raise a question that I've been mulling over and can't find a good answer to. If I do a single large allocation on a modern CPU/GPU, is it allocated contiguously across the modules, or split between channels automagically?

I have a similar bandwidth problem on a Fury X, where I see about half the expected bandwidth from the HBM; however, I'm using one large allocation that only covers about 8 of the 16 HBM stacks if it is being allocated contiguously. I can also do some experiments with performing several smaller allocations for my data structure instead of one large chunk, but I wasn't sure what the magic way is to force all available banks/channels to be used evenly.

This is a new area for me as I’m usually focused on code that is operating almost entirely in registers/cache, so sparse memory access isn’t an area I have developed much expertise in. I’m digging into the DDR4 doc now.

EDIT: I see what Mark was pointing to; the LUT scenario might be the answer. My question remains: in a modern CPU/GPU or OS, is there anything that distributes a large allocation so that it is non-contiguous?

Last fiddled with by airsquirrels on 2017-06-18 at 14:30
2017-06-18, 17:38   #5   Mark Rose

Quote:
Originally Posted by airsquirrels
My question remains: in a modern CPU/GPU or OS, is there anything that distributes a large allocation so that it is non-contiguous?
There is a malloc implementation called palloc that basically does the opposite: they wanted to isolate banks per core/process for real-time performance considerations.
2017-06-18, 19:02   #6   airsquirrels

Quote:
Originally Posted by Mark Rose
There is a malloc implementation called palloc that basically does the opposite: they wanted to isolate banks per core/process for real-time performance considerations.
I'll have to look into this. After a refresher on MMUs and the Linux malloc/allocator I was able to make some sense of the allocations. The default behavior does interleave pages across the available channels/banks. I tried making 10 copies of my 2GB allocation on the Broadwell-E system and assigning my thread pool across those copies; it had a negligible impact on performance. NUMA interleaving will likely make a bigger difference on KNL, where I could assign a different thread group to a 'local' HBM quadrant. That may require rebooting the KNL and changing its clustering mode, so I'll have to coordinate with ewmayer to experiment.
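If I go the per-quadrant route, something like this libnuma sketch is the shape I have in mind, assuming the quadrants show up as separate NUMA nodes and libnuma is installed (link with -lnuma):

Code:
#include <numa.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate one slice of the data per NUMA node so each thread group can be
   pinned to its node and read mostly-local memory. */
int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support\n");
        return 1;
    }
    int nodes = numa_max_node() + 1;
    size_t slice = (2ULL << 30) / nodes;          /* split the 2GB buffer */
    void **slices = malloc(nodes * sizeof(void *));
    for (int n = 0; n < nodes; n++) {
        slices[n] = numa_alloc_onnode(slice, n);  /* physically on node n */
        if (!slices[n]) { perror("numa_alloc_onnode"); return 1; }
    }
    /* ... pin one worker group per node (numa_run_on_node(n)) and hand it slices[n] ... */
    for (int n = 0; n < nodes; n++)
        numa_free(slices[n], slice);
    free(slices);
    return 0;
}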

The write performance seemed to be hampered by page faults due to lazy allocation. If I memset the buffer to 0 to pre-fault all the pages before filling, I see about 40-80 GB/s on the 3400 MHz quad-channel system. I was able to get random reads as high as 28 GB/s by adjusting various variables and playing with Clang's optimization settings. I'm going to write a comprehensive benchmark to iterate over all the configurations and compare the different systems.
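For reference, the pre-faulting can also be requested at map time instead of via the memset; a rough Linux-specific sketch, assuming an anonymous mapping is acceptable:

Code:
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BUF_BYTES (2ULL << 30)

int main(void)
{
    /* MAP_POPULATE pre-faults the pages so the fill loop doesn't pay for
       lazy allocation; memset(buf, 0, BUF_BYTES) achieves the same effect. */
    void *buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }
    /* ... timed fill / read benchmarks here ... */
    munmap(buf, BUF_BYTES);
    return 0;
}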
2017-06-18, 21:08   #7   ewmayer

One other thought re. the random reads occurs to me: let's assume that each of N physical cores is responsible for handling its own [total mem]/N-sized chunk of the read buffer. Would it make sense to compute on the fly which bin a given read hits and dynamically assign it to a thread running on the corresponding core? If doing that one read at a time is much too fine-grained (i.e. thread overhead overwhelms any gains on the memory-access side of the equation), perhaps collect a bunch of read addresses, bin them on the fly, and at regular intervals let the N threads do all the accumulated reads in their respective bins?
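Something along these lines, purely as a sketch with made-up sizes and names:

Code:
#include <stddef.h>

#define NCORES   8      /* assumed core count for the sketch */
#define BIN_CAP  4096

/* Each core owns a contiguous 1/NCORES slice of the buffer; bin a batch of
   read addresses by which slice they fall into, then let the worker pinned
   to that core drain its own bin. */
typedef struct {
    size_t addr[BIN_CAP];
    size_t count;
} bin_t;

void bin_addresses(const size_t *addrs, size_t n, size_t buf_bytes,
                   bin_t bins[NCORES])
{
    size_t slice = buf_bytes / NCORES;
    for (size_t i = 0; i < n; i++) {
        size_t core = addrs[i] / slice;
        if (core >= NCORES) core = NCORES - 1;   /* clamp the tail */
        bin_t *b = &bins[core];
        if (b->count < BIN_CAP)
            b->addr[b->count++] = addrs[i];
        /* when a bin fills (or on a regular tick), wake the worker pinned to
           that core and let it perform all the reads in its bin */
    }
}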
2017-06-19, 02:31   #8   airsquirrels

I'll add this other tidbit that may be relevant to GIMPS...
Linux kernel 3.16 vs 4.4 on the same hardware made a pretty large difference in random reads: 7.16 GB/s vs 10.3 GB/s. Sequential write speeds were the same, ~74 GB/s. Quad-channel DDR4-2133 on a 5930K @ 3.7 GHz, identical physical DIMMs.

My assumption is that the allocator / MMU does a better job in 4.4, to the tune of a 43.8% speedup!
2017-06-20, 07:08   #9   ldesnogu

I don't know how far apart your random reads are from each other, but you might also be experiencing TLB misses (the TLB is a cache that stores virtual-to-physical address translations).

The Linux kernel allocates pages of 4KB by default, but is able to merge them when transparent huge pages are enabled, thus reducing pressure on TLB.

Skylake has 64 4KB-page TLB entries at level 0, which means it can map at most 256 KB (and if the accesses really are random, no TLB hardware prefetcher will help). At level 1 there are 1536 entries, so that'd be 6 MB.

With 2MB pages, there are 32 entries at level 0 and 1536 at level 1.
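For example, roughly like this (Linux-specific; the MAP_HUGETLB path needs huge pages reserved via vm.nr_hugepages beforehand, otherwise the madvise fallback just asks THP to do the merging):

Code:
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define BUF_BYTES (2ULL << 30)

int main(void)
{
    /* Option 1: explicit 2MB pages from the hugetlbfs pool */
    void *buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED) {
        /* Option 2: normal pages, then ask transparent huge pages to merge them */
        buf = mmap(NULL, BUF_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }
        madvise(buf, BUF_BYTES, MADV_HUGEPAGE);
    }
    /* ... run the 2GB-buffer benchmarks here ... */
    munmap(buf, BUF_BYTES);
    return 0;
}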
2017-06-21, 21:36   #10   airsquirrels

Quote:
Originally Posted by ldesnogu
I don't know how far apart your random reads are from each other, but you might also be experiencing TLB misses (the TLB is a cache that stores virtual-to-physical address translations).

The Linux kernel allocates pages of 4KB by default, but is able to merge them when transparent huge pages are enabled, thus reducing pressure on TLB.

Skylake has 64 4KB-page TLB entries at level 0, which means it can map at most 256 KB (and if the accesses really are random, no TLB hardware prefetcher will help). At level 1 there are 1536 entries, so that'd be 6 MB.

With 2MB pages, there are 32 entries at level 0 and 1536 at level 1.
The random reads are all over the 2GB space; the only guarantee I have is that they are 32-byte aligned.

I went through and enabled huge pages, mmap'd the 2GB data buffer, and verified that /proc/meminfo showed those huge pages as in use. I also read out /proc/<x>/pagemap against dmidecode -20 to verify that my huge pages are nicely interleaved across the DIMMs/physical address space. All of this had very little effect on performance in practice. I also noticed very little change between the _mm256_stream_load and _mm256_load intrinsics. This seems to suggest that the TLB isn't what is mucking things up.
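The stream-load result makes some sense to me now. As far as I understand it, _mm256_stream_load_si256 (MOVNTDQA) only provides a real non-temporal benefit on write-combining memory; on an ordinary write-back buffer it behaves essentially like a normal aligned load, which would explain why I see no difference:

Code:
#include <immintrin.h>

/* Both require 32-byte aligned pointers. On ordinary write-back (malloc'd)
   memory the "stream" form is only a hint and typically acts like the plain
   load; the hint mainly matters for write-combining memory. */
static inline __m256i load_plain(void *p)
{
    return _mm256_load_si256((const __m256i *)p);
}

static inline __m256i load_stream(void *p)
{
    return _mm256_stream_load_si256((__m256i *)p);
}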

After all of my tuning, using huge pages, here are my benchmarks:

(I do an initial memset on all buffers before timed operations to remove initial fault/allocation latency)

I used the same binary, compiled on the Kaby Lake host using clang 3.8 with -O3 and -mavx2

My operations are:
memset buf1
fill buf1 by simple, non-AVX 64-bit writes of buf1[i] = (b = b*b);
memcpy buf1 -> buf2
read 20000 random 128-byte chunks (using AVX 256-bit reads; the worker is sketched below this list)
read and sum buf2 8 bytes at a time
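For the random-access step, the worker looks roughly like this; a simplified sketch with made-up names, not the exact benchmark code:

Code:
#include <immintrin.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define CHUNKS_PER_THREAD 20000
#define NTHREADS 64

typedef struct {
    const uint8_t *buf;       /* the 2GB buffer */
    size_t         bytes;
    unsigned int   seed;
    uint8_t        sink[32];  /* result dump so the loads aren't optimized away */
} reader_arg_t;

/* Each worker reads random 32-byte-aligned 128-byte chunks with four 256-bit
   AVX loads and xors them into an accumulator. */
static void *random_reader(void *p)
{
    reader_arg_t *a = p;
    __m256i acc = _mm256_setzero_si256();
    for (int i = 0; i < CHUNKS_PER_THREAD; i++) {
        size_t off = ((size_t)rand_r(&a->seed) * 32) % (a->bytes - 128);
        off &= ~(size_t)31;                          /* keep 32-byte alignment */
        const __m256i *c = (const __m256i *)(a->buf + off);
        acc = _mm256_xor_si256(acc, _mm256_load_si256(c + 0));
        acc = _mm256_xor_si256(acc, _mm256_load_si256(c + 1));
        acc = _mm256_xor_si256(acc, _mm256_load_si256(c + 2));
        acc = _mm256_xor_si256(acc, _mm256_load_si256(c + 3));
    }
    _mm256_storeu_si256((__m256i *)a->sink, acc);
    return NULL;
}

int main(void)
{
    size_t bytes = 2ULL << 30;
    uint8_t *buf = aligned_alloc(32, bytes);
    if (!buf) return 1;
    memset(buf, 0, bytes);                           /* pre-fault the pages */
    pthread_t tid[NTHREADS];
    reader_arg_t args[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        args[t] = (reader_arg_t){ buf, bytes, (unsigned)t + 1 };
        pthread_create(&tid[t], NULL, random_reader, &args[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    free(buf);
    return 0;
}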

Kaby Lake: (Dual 2133 DIMMs)
- Memset 2GB: 29130 MB/s
- FillWithExponation: 98476 MB/s (Why so fast?)
- Memcpy 2GB: 9343 MB/s
- Random Access (64 threads): 11091 MB/s
- ReadAndSum: 21531 MB/s

Broadwell-E (Quad 3200 DIMMs)
- Memset 2GB: 15311 MB/s
- FillWithExponation: 90033 MB/s
- Memcpy 2GB: 11021 MB/s
- Random Access (64 threads): 31529 MB/s
- Random Access (256 threads): 32417 MB/s
- ReadAndSum: 17026 MB/s

Ryzen 1800X (4x 3200 DIMMs)
- Memset 2GB: 14049 MB/s
- FillWithExponation: 90187 MB/s
- Memcpy 2GB: 15016 MB/s
- Random Access (64 threads): 27661 MB/s
- ReadAndSum: 26806 MB/s

Dual 2698v3 16-Core (8x 2133 DIMMs)
- Memset 2GB: 6079 MB/s
- FillWithExponation: 73520 MB/s
- Memcpy 2GB: 8603 MB/s
- Random Access (64 threads): 55548 MB/s
- Random Access (128 threads): 67172 MB/s
- Random Access (2048 threads): 75664 MB/s
- ReadAndSum: 9754 MB/s

KNL (HBM as cache, 6x 2133 DIMMs, 6-channel)
- Memset 2GB: 8987 MB/s
- FillWithExponation: 42370 MB/s
- Memcpy 2GB: 4859 MB/s
- Random Access (64 threads): 53312 MB/s
- Random Access (128 threads): 77621 MB/s
- Random Access (2048 threads): 107895 MB/s
- Random Access (8192 threads): 109861 MB/s
- ReadAndSum: 7296.86 MB/s

KNL with AVX512 enabled (march=knl and ZMMs everywhere)
(non-random access was not affected)
- Random Access (64 threads): 80473 MB/s
- Random Access (128 threads): 81615 MB/s
- Random Access (2048 threads): 122529 MB/s
- Random Access (8192 threads): 123639 MB/s

Looking at this data I have a few open questions/observations:
1. Is write-caching or some other magic allowing the "Fill" step to reach close to the RAM's maximum speed? Why is it so much faster than memset?
2. Memset/memcpy performance is really all over the place, but I notice a few trends:
- High-core-count/multi-socket systems show lower performance in these benchmarks. Is this because of memory-controller locality to the single thread?
- ReadAndSum shows a similarly low result on these systems.
- Kaby Lake shows an incredibly high memset rate vs. the other platforms. It also shows the highest ReadAndSum speed and is the only system where these speeds exceed the random-access speeds. Note that its random-access numbers could be limited because I only have 2 cores available.
3. The Ryzen system is quite competitive. Its single-threaded sequential 64-bit reads are about the same speed as the "flooded bus" random throughput.
4. Why am I able to get so much higher random-read throughput with so many threads? Is the memory controller coalescing the reads in flight, and does a huge number of threads give a better chance of locality?

Last fiddled with by airsquirrels on 2017-06-21 at 21:38
2017-06-21, 21:37   #11   airsquirrels

Quote:
Originally Posted by ewmayer
One other thought re. the random reads occurs to me: let's assume that each of N physical cores is responsible for handling its own [total mem]/N-sized chunk of the read buffer. Would it make sense to compute on the fly which bin a given read hits and dynamically assign it to a thread running on the corresponding core? If doing that one read at a time is much too fine-grained (i.e. thread overhead overwhelms any gains on the memory-access side of the equation), perhaps collect a bunch of read addresses, bin them on the fly, and at regular intervals let the N threads do all the accumulated reads in their respective bins?
This is what I'm going to experiment with next, although as mentioned above, it seems that if I have enough cores/resources to fill with enough threads, the CPU is already doing a bit of this itself.