#1
"David"
Jul 2015
Ohio
11×47 Posts
I'm doing some optimization work on a non-GIMPS-related project, and I'm seeing some very unexpected results from a very basic benchmark across a number of different CPUs.

The code in question simply mallocs and then fills a 2GB buffer in a tight, slightly unrolled loop of:

Code:
imulq %r13, %r13
movq %r13, (%r15, %rbx, 8)
imulq %r13, %r13
movq %r13, 8(%r15, %rbx, 8)
imulq %r13, %r13
movq %r13, 16(%r15, %rbx, 8)
imulq %r13, %r13
movq %r13, 24(%r15, %rbx, 8)
// test/cmp/jb normal loop code

On an i3-7100 51W Kaby Lake (3.9 GHz) this yields a fill speed of 26924.74 MB/s with two sticks of 2133 MHz DDR4. Subsequent 1,000,000 rounds of random 64-byte reads come in at 1759.24 MB/s. The random reads are queued up in a worker thread pool of size 2x the virtual core count (which seemed to give the best throughput on all chips).

My 2016 MacBook Pro (TouchBar) has a 2.9 GHz quad-core Skylake with two 8GB DIMMs of 2133 MHz LPDDR3 and yields 7998.58 MB/s for the fill, with 1671 MB/s for the subsequent 64-byte random reads.

Then things get weird. A 5960X Haswell-E with quad-channel 2400 MHz DDR4 yields 10048.88 MB/s for the fill, and only 561.54 MB/s for the random reads. A 6950X Broadwell-E with quad-channel 3200 MHz DDR4 yields 13907.02 MB/s for the fill, and only 701.57 MB/s for the random reads. The dual Xeon 2698v3 with 8 2133 MHz DIMMs is even worse: 8325.30 MB/s and 412.09 MB/s. KNL is abysmally slow despite "16GB of HBM" sitting next to 64 cores/256 threads. Varying the thread count from 10 up to 1024 didn't make much difference relative to the Kaby Lake system.

What gives? Do Skylake/Kaby Lake have a much better memory dispatcher/cache handling? Some of the difference can be attributed to single-threaded clock speed, especially during the write/fill portion, but the laptop's "slow" speed is still much quicker than the faster 6950X, and the multi-threaded reads should be much quicker on the 4-8 core machines than on the 2-core Kaby Lake, especially the 3400 MHz Broadwell-E system. Anyone have any theories/wisdom to impart?

Last fiddled with by airsquirrels on 2017-06-18 at 00:37
#2
∂²ω=0
Sep 2002
República de California
26754₈ Posts
Cache/prefetch effects may be responsible for much of your timing variations - especially for the random-reads - but I suggest you first try to obviate latency effects in your IMUL timing loop. On the CPUs you tried, IMUL latency ranges from 3-8 cycles, with the KNL being the 8, and as such fully 2x worse than any of the others. Your test loop forces all the IMULs to execute in strict sequence, i.e. each write and subsequent IMUL must wait until the preceding IMUL finishes.
Try a loop in which the body does 8 independent IMULs and ensuing writes-to-memory, and post the resulting 'write loop' throughputs. Also, is your random-read loop high-level code or does it also have an ASM core? I'm wondering whether vector gather-loads, despite their generally horrendous latency, might help, and if so, how much the resulting speedup factors vary between CPU flavors.

Last fiddled with by ewmayer on 2017-06-18 at 05:34
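In C, the suggested restructuring might look like this (illustrative sketch only; the seeds and function name are mine, and a real test would keep the ASM core to control codegen):

```c
#include <stddef.h>
#include <stdint.h>

/* Eight independent squaring chains instead of one serial chain.
 * Each accumulator depends only on its own previous value, so the
 * out-of-order core can keep several IMULs in flight per latency
 * window instead of serializing on a single register. */
void fill_independent(uint64_t *buf, size_t n_words)
{
    uint64_t a0 = 3, a1 = 5, a2 = 7, a3 = 11,
             a4 = 13, a5 = 17, a6 = 19, a7 = 23;
    for (size_t i = 0; i + 8 <= n_words; i += 8) {
        a0 *= a0; buf[i + 0] = a0;
        a1 *= a1; buf[i + 1] = a1;
        a2 *= a2; buf[i + 2] = a2;
        a3 *= a3; buf[i + 3] = a3;
        a4 *= a4; buf[i + 4] = a4;
        a5 *= a5; buf[i + 5] = a5;
        a6 *= a6; buf[i + 6] = a6;
        a7 *= a7; buf[i + 7] = a7;
    }
}
```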
#3
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts
Besides the prefetching behaviour, it may also be possible that you're hitting the same memory bank in the Haswell and Broadwell chips. What happens if you swap RAM sticks? Have you tried using a large amount of allocated memory to do your random reads from, like 1 GB?
DDR4 sticks have either 2 or 4 bank groups, each with 4 sub-banks. Back-to-back accesses to different bank groups are faster than consecutive accesses within the same group. Reading this DDR4 design document from Micron: Quote:
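As a toy illustration of why the address layout matters here (this is an assumed, simplified mapping; real memory controllers select and often hash different address bits in model-specific ways):

```c
#include <stdint.h>

/* Toy DDR4 address decode: assume the two bank-group bits sit just
 * above the 6-bit cache-line offset. Real controllers use different,
 * often hashed, bit selections; this only shows the idea that
 * consecutive cache lines can land in different bank groups, while a
 * stride of 256 bytes would hammer the same group every time. */
unsigned toy_bank_group(uint64_t phys_addr)
{
    return (unsigned)((phys_addr >> 6) & 0x3);  /* bits 6..7 */
}
```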
#4
"David"
Jul 2015
Ohio
11·47 Posts
Quote:
I’ve considered whether gather instructions might be of some use. I’m more worried about total throughput than sequential latency. The problem is such that I can read as many 64-byte chunks as I can keep in cache in parallel, but I must do some work on each batch before I know the sparse addresses where the next batch of chunks resides. It would also be beneficial if I could block the global reads from polluting the cache lines, since they are highly unlikely to be reused. The buffer is also read-only once I’ve populated it, if that helps.

Mark’s bank suggestions also raise a question that I’ve been mulling over and can’t find a good answer to. If I do a single large allocation on a modern CPU/GPU, is it allocated contiguously across the modules, or split between channels automagically? I have a similar bandwidth problem on a Fury X, where I see about half the expected bandwidth from the HBM; however, I’m using one large allocation that only covers about 8 of the 16 HBM stacks if it is being allocated contiguously. I can also do some experiments with performing several smaller allocations for my data structure instead of one large chunk, but I wasn’t sure what the magic way is to force all available banks/channels to be used evenly.

This is a new area for me, as I’m usually focused on code that operates almost entirely in registers/cache, so sparse memory access isn’t an area where I have developed much expertise. I’m digging into the DDR4 doc now.

EDIT: I see what Mark was pointing to; a LUT might be the answer. My question remains: in a modern CPU/GPU or OS, is there anything that distributes a large allocation to be non-contiguous?

Last fiddled with by airsquirrels on 2017-06-18 at 14:30
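On the cache-pollution point, one cheap experiment is a non-temporal prefetch hint on the read stream. A sketch (function name and look-ahead distance are mine; `__builtin_prefetch(addr, 0, 0)` is a GCC/clang builtin that maps to PREFETCHNTA on x86, and it is only a hint, so whether it actually reduces pollution has to be measured):

```c
#include <stddef.h>
#include <stdint.h>

/* Sum a chunk while prefetching ahead with a non-temporal hint.
 * Locality 0 asks the CPU to minimize cache pollution for data that
 * won't be reused. The hint never affects correctness, only timing,
 * so the 'ahead' distance can be tuned freely per machine. */
uint64_t sum_chunk_nta(const uint64_t *buf, size_t n_words, size_t ahead)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n_words; i++) {
        if (i + ahead < n_words)
            __builtin_prefetch(&buf[i + ahead], /*rw=*/0, /*locality=*/0);
        s += buf[i];
    }
    return s;
}
```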
#5
"/X\(‘-‘)/X\"
Jan 2013
https://pedan.tech/
2⁴×199 Posts
Quote:
#6
"David"
Jul 2015
Ohio
1000000101₂ Posts
Quote:
The write performance seemed to be hampered by MMU page faults due to lazy allocation. If I memset the buffer to 0 to pre-fault all the pages before filling, I see about 40-80 GB/s on the 3400 MHz quad-channel system. I was able to get random reads as high as 28 GB/s by adjusting various variables and playing with Clang's optimization settings. I'm going to write a comprehensive benchmark to iterate all the configurations and compare between the different systems.
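The pre-faulting trick, sketched (the helper name is mine; the point is that touching every page before timing moves the demand-paging cost out of the measured region):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Touch every page before the timed fill so the kernel's lazy
 * allocation (demand paging) doesn't charge page-fault cost to the
 * benchmark. The memset forces a physical page behind each virtual
 * page of the allocation. */
uint64_t *alloc_prefaulted(size_t bytes)
{
    uint64_t *buf = malloc(bytes);
    if (buf)
        memset(buf, 0, bytes);
    return buf;
}
```

On Linux, `mmap` with `MAP_ANONYMOUS | MAP_POPULATE` achieves the same thing at map time, without the extra write pass.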
#7
∂²ω=0
Sep 2002
República de California
2²×2,939 Posts
One other thought re. the random reads occurs to me: let's assume each of N physical cores is responsible for handling its own [total mem]/N-sized chunk of the read buffer. Would it make sense to compute on the fly which bin a given read hits and dynamically assign it to a thread running on the corresponding core? If doing that one read at a time is much too fine-grained (i.e. thread overhead overwhelms any gains on the memory-access side of the equation), perhaps collect a bunch of read addresses, bin them on the fly, and at regular intervals let the N threads do all the accumulated reads in their respective bins?
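The binning step of that batched scheme might look like this single-threaded sketch (names mine; the accumulated offsets in bin k would then be handed to the worker thread pinned near core k):

```c
#include <stddef.h>

/* Assign a read offset to the bin owning that slice of the buffer:
 * bin k covers [k*chunk, (k+1)*chunk), where chunk is the buffer size
 * divided (rounding up) among n_bins cores. The caller collects
 * offsets per bin and periodically flushes each bin to its thread. */
size_t owner_bin(size_t offset, size_t total_bytes, size_t n_bins)
{
    size_t chunk = (total_bytes + n_bins - 1) / n_bins;  /* ceil division */
    return offset / chunk;
}
```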
#8
"David"
Jul 2015
Ohio
1005₈ Posts
I'll add this other tidbit that may be relevant to GIMPS...

Linux kernel 3.16 vs. 4.4 on the same hardware made a pretty large difference in random reads: 7.16 GB/s vs. 10.3 GB/s. Sequential write speeds were the same, ~74 GB/s. Quad-channel DDR4 2133 on a 5930K @ 3.7 GHz, identical physical DIMMs. My assumption is that the allocator/MMU does a better job in 4.4, to the tune of a 43.8% speedup!
#9
Jan 2008
France
1125₈ Posts
I don't know how far apart your random reads are from each other, but you might also be experiencing TLB misses (the TLB is a cache that stores virtual-to-physical address translations).

The Linux kernel allocates pages of 4KB by default, but is able to merge them when transparent huge pages are enabled, reducing pressure on the TLB. Skylake has 64 4KB TLB entries at level 0, which means it can map at most 256 KB (and if accesses really are random, no TLB hardware prefetcher will help). At level 1 there are 1536 entries, so that'd be 6 MB. With 2MB pages, there are 32 entries at level 0 and 1536 at level 1.
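The TLB-reach arithmetic above, spelled out (entry counts are the Skylake figures quoted in the post; a random read that lands outside this reach pays a page walk on top of the memory access):

```c
#include <stddef.h>

/* TLB reach = number of entries x page size. Once the working set
 * exceeds this, random accesses start missing in the TLB and each
 * miss adds a page-table walk to the access latency. */
size_t tlb_reach_bytes(size_t entries, size_t page_bytes)
{
    return entries * page_bytes;
}
```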
#10
"David"
Jul 2015
Ohio
11×47 Posts
Quote:
I went through and enabled hugepages / mmap'd the 2GB data buffer and verified /proc/meminfo showed those hugepages as in use. I also read out /proc/<x>/pagemap against dmidecode -t 20 to verify that my huge pages are nicely interleaved across the DIMMs/physical address space. All of this had very little effect on the performance in reality. I also noticed very little change if I use the _mm256_stream_load_si256 vs. _mm256_load_si256 intrinsics. This seems to suggest that the TLB isn't what is mucking things up.

After all of my tuning, using huge pages, here are my benchmarks. (I do an initial memset on all buffers before the timed operations to remove initial fault/allocation latency.) I used the same binary, compiled on the Kaby Lake host using clang 3.8 with -O3 and -mavx2.

My operations are:
- memset buf1
- fill buf1 by simple, non-AVX 64-bit writes of buf1[i] = (b = b*b)
- memcpy buf1 -> buf2
- read 20000 random 128-byte chunks (using AVX 256-bit reads)
- read and sum buf2 8 bytes at a time

Kaby Lake (dual 2133 DIMMs):
- Memset 2GB: 29130 MB/s
- FillWithExponation: 98476 MB/s (why so fast?)
- Memcpy 2GB: 9343 MB/s
- Random Access (64 threads): 11091 MB/s
- ReadAndSum: 21531 MB/s

Broadwell-E (quad 3200 DIMMs):
- Memset 2GB: 15311 MB/s
- FillWithExponation: 90033 MB/s
- Memcpy 2GB: 11021 MB/s
- Random Access (64 threads): 31529 MB/s
- Random Access (256 threads): 32417 MB/s
- ReadAndSum: 17026 MB/s

Ryzen 1800X (4x 3200 DIMMs):
- Memset 2GB: 14049 MB/s
- FillWithExponation: 90187 MB/s
- Memcpy 2GB: 15016 MB/s
- Random Access (64 threads): 27661 MB/s
- ReadAndSum: 26806 MB/s

Dual 2698v3 16-core (8x 2133 DIMMs):
- Memset 2GB: 6079 MB/s
- FillWithExponation: 73520 MB/s
- Memcpy 2GB: 8603 MB/s
- Random Access (64 threads): 55548 MB/s
- Random Access (128 threads): 67172 MB/s
- Random Access (2048 threads): 75664 MB/s
- ReadAndSum: 9754 MB/s

KNL (HBM as cache, 6x 2133 DIMMs, 6-channel):
- Memset 2GB: 8987 MB/s
- FillWithExponation: 42370 MB/s
- Memcpy 2GB: 4859 MB/s
- Random Access (64 threads): 53312 MB/s
- Random Access (128 threads): 77621 MB/s
- Random Access (2048 threads): 107895 MB/s
- Random Access (8192 threads): 109861 MB/s
- ReadAndSum: 7296.86 MB/s

KNL with AVX512 enabled (-march=knl and ZMMs everywhere; non-random access was not affected):
- Random Access (64 threads): 80473 MB/s
- Random Access (128 threads): 81615 MB/s
- Random Access (2048 threads): 122529 MB/s
- Random Access (8192 threads): 123639 MB/s

Looking at this data I have a few open questions/observations:

1. Is write caching or some other magic allowing the "Fill" step to reach close to RAM max speeds? Why is it so much faster than memset?
2. Memset/memcpy performance is really all over the place, but I notice two trends:
   - High-core-count/multi-socket systems show lower performance in these benchmarks. Is this because of memory controller locality to the single thread? ReadAndSum shows a similarly low result on these systems.
   - Kaby Lake shows an incredibly high memset rate vs. the other platforms. It also shows the highest ReadAndSum speed and is the only system where these speeds exceed the random access speeds. Note that it could be limited because I only have 2 cores available.
3. The Ryzen system is quite competitive. Its single-threaded sequential 64-bit reads are the same speed as its "flooded bus" random throughput.
4. Why am I able to get such higher random read rates with so many threads? Is the memory controller coalescing the reads in flight, with a huge number of threads giving a better chance of locality?

Last fiddled with by airsquirrels on 2017-06-21 at 21:38
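One variable worth isolating for question 1 is whether the fill's stores go through the cache at all. A sketch of a non-temporal fill to compare against the regular fill (function name and seed are mine; on non-x86 targets it falls back to plain stores, and whether NT stores win is exactly the thing to measure, not assume):

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) && defined(__x86_64__)
#include <emmintrin.h>  /* _mm_stream_si64, _mm_sfence */
#endif

/* Fill a buffer with non-temporal 64-bit stores where available, so
 * the written lines bypass the cache instead of being allocated into
 * it the way ordinary stores (and typically memset) are. Comparing
 * this against the regular squaring fill probes how much write
 * caching contributes to the fill-vs-memset gap. */
void fill_nt(uint64_t *buf, size_t n_words, uint64_t seed)
{
    uint64_t x = seed | 1;               /* keep the chain nonzero */
    for (size_t i = 0; i < n_words; i++) {
        x *= x;
#if defined(__SSE2__) && defined(__x86_64__)
        _mm_stream_si64((long long *)&buf[i], (long long)x);
#else
        buf[i] = x;                      /* portable fallback */
#endif
    }
#if defined(__SSE2__) && defined(__x86_64__)
    _mm_sfence();                        /* make NT stores globally visible */
#endif
}
```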
#11
"David"
Jul 2015
Ohio
11·47 Posts
Quote:
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| Hyperthreading broken in Skylake and Kaby Lake? | GP2 | Hardware | 4 | 2017-06-26 02:08 |
| Kaby Lake / Asrock disappointment, RAM weirdness | Prime95 | Hardware | 17 | 2017-01-27 21:09 |
| Kaby Lake processors: bor-ing ! | tServo | Hardware | 11 | 2016-12-18 10:32 |
| Kaby Lake chip | Prime95 | Hardware | 0 | 2016-10-26 23:23 |
| 3LP sieving: memory and speed savings! | FactorEyes | Factoring | 36 | 2010-10-04 20:29 |