Skylake X prefetching -- paging Mysticial
FYI - I have many FFT sizes coded with AVX512 instructions. The low-level macros are mostly optimized. The extra registers make writing code much, much easier.

Skylake-X is screaming fast with one major caveat -- as long as the needed data is in the L1 cache. Today I started the two needed memory/caching optimizations. The first is to prefetch data from memory into either the L2 or L3 caches. The second optimization (later this week) is grouping low-level macros so they do as much work as possible while data is in the L1 cache (and maybe prefetching the next data into the L1 cache).

While doing the first optimization, I ran into some strange timings. I'm hoping someone with AVX512 expertise may be able to shed some insight. These are the timings from an early prefetching attempt, for a single core running pass 1 length 1280 (pass 2 was 7680):

timer 16: 21787032
timer 18: 6303010
timer 20: 6950758
timer 22: 7401330
timer 24: 9151008
timer 26: 11095368

There was a code bug whereby only half of the next pass 1 block's data was being prefetched, which explains the rather high clock count for timer 16. I fixed the bug so that all of the next pass 1 block is prefetched into the L2 cache:

timer 16: 7862900    down 14 million -- perfect
timer 18: 6976776    up 700K
timer 20: 10384728   up 3 mil
timer 22: 10651610   up 3 mil
timer 24: 15386736   up 6 mil
timer 26: 11061976

Timer 16 went down as expected -- its data is now read from the L2 cache rather than memory. The distressing thing is that timers 18/20/22 went up by an almost equal amount. This is where the prefetching is done. It certainly looks like the CPU is stalling waiting for prefetches to complete. It is my understanding that this is not how prefetching is supposed to work! BTW, timer 26 did not change, as that is where the sin/cos data for the next pass 1 block is prefetched.

I'll run some more experiments. Prefetching to L3 instead of L2 made no difference. I'll next try making the prefetches less "clumpy" -- right now there are 4 or 8 consecutive prefetch instructions. |
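[Editorial note: a minimal C sketch of the kind of prefetch placement described above -- issuing prefetcht1 hints so the next pass 1 block's cache lines land in L2 while the current block is being processed. All names here (`process_block`, `cur`, `next`) are illustrative, not gwnum internals, and the squaring loop is just a stand-in for the real FFT work.]

```c
#include <immintrin.h>
#include <stddef.h>

#define LINE_DOUBLES (64 / sizeof(double))  /* doubles per 64-byte cache line */

/* Illustrative only: process the current block while issuing one
 * prefetcht1 (hint into L2) per cache line of the NEXT block,
 * spread across the compute loop rather than clumped together. */
static double process_block(const double *cur, const double *next, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i % LINE_DOUBLES == 0)                       /* once per line */
            _mm_prefetch((const char *)&next[i], _MM_HINT_T1);
        sum += cur[i] * cur[i];   /* stand-in for the real FFT butterflies */
    }
    return sum;
}
```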
[QUOTE=Prime95;488795]While doing the first optimization, I ran into some strange timings. ... It certainly looks like the CPU is stalling waiting for prefetches to complete. It is my understanding that this is not how prefetching is supposed to work![/QUOTE]

Yeah, I've noticed this too.
If you throw it under VTune, you'll see that the prefetches do indeed take up all the time. The prefetches are working, but because there are so many of them and they take so long to process, you max out the concurrency.

After playing around with VTune and hardware counters, I suspect that the limiter is the off-core request queue*. All accesses (including prefetches) that go beyond the L2** must enter this queue. And when it's full, it starts holding up execution.

Why does the queue get full so easily on Skylake X? It's probably a combination of the cache redesign along with the insufficient Skylake memory bandwidth. I forgot where, but there's an Intel doc somewhere saying that off-core requests on Skylake Xeon stay in the off-core request queue longer than in previous processors. So if they didn't increase the size of the queue to compensate, it will get full very easily and will limit the number of in-flight prefetches/cache-misses.

I've fiddled around with this enough and it's a very frustrating (near) zero-sum game. It may not even be possible to get perfect compute/memory-access overlap. If you prefetch early enough, you'll have so many prefetches in flight that you max out the off-core queue. And if you don't prefetch, you overrun the usual OOE window.

*I don't know what the proper name for it is, but I'll just call it the "off-core request queue".

**I don't know if the L2 is the actual boundary; it could be the L1. But for the sake of discussion, I'll just call it the L2. |
Scattering prefetches rather than clumping made no difference. Prefetchwt1 is faster than prefetcht1:

timer 16: 7847946
timer 18: 6907506
timer 20: 9710640
timer 22: 9923936
timer 24: 14490238 |
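[Editorial note: per-section clock counts like the "timer N" numbers above can be gathered with the time-stamp counter. This is a hedged sketch only -- the timer IDs and helper names are made up here, not mprime's actual timing code.]

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc */

/* Accumulate elapsed reference cycles into per-section counters.
 * Illustrative only; real code would also worry about serialization
 * (e.g. lfence/rdtscp) and per-thread storage. */
static uint64_t timers[32];

static inline uint64_t timer_start(void)
{
    return __rdtsc();
}

static inline void timer_stop(int id, uint64_t start)
{
    timers[id] += __rdtsc() - start;
}
```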
[QUOTE=Mysticial;488798]I've fiddled around with this enough and it's a very frustrating (near) zero-sum game. It may not even be possible to get perfect compute/memory-access overlap. If you prefetch early enough, you'll have so many prefetches in flight that you max out the off-core queue. And if you don't prefetch, you overrun the usual OOE window.[/QUOTE]

Ugh. Worse yet, any time spent optimizing this for my setup will be less than optimal for users with different memory speeds and different CPU core speeds. I'm guessing the best I can do is place the prefetches as uniformly as possible throughout the code.

RAM should provide plenty of memory bandwidth to keep a single CPU core busy (the case I'm optimizing), so it is somewhat surprising that I'm seeing these stalls. Each CPU core must have its own request queue -- there would be intolerable delays if 8 CPU cores were filling a single queue! |
@Mysticial: Have you tried / do you think it would work to have a hyperthread do all the prefetching?
|
[QUOTE=Prime95;488800]Ugh. Worse yet. Any time spent optimizing this for my setup will be less than optimal for users with different speed memory and different CPU core speeds. I'm guessing the best I can do is place the prefetches as uniformly as possible throughout the code.[/QUOTE]

"As uniformly as possible" actually backfired for me, but I was targeting the hyperthreaded environment.

In my case, the consumption of the to-be-prefetched memory is very bursty. If I spread out the prefetches between bursts, there may be thousands of cycles from prefetch-to-use. This is long enough that the other hyperthread has a high probability of trashing the entire L1. For prefetchnta'ed lines, this is bad since they seem to get evicted all the way back to memory (skipping the L2 and L3). So when the data needs to be used for real, it has to be pulled all the way back from memory, thereby consuming double the bandwidth. prefetcht0 is better behaved. I've seen some weird things with prefetcht1/2. (If the off-core request queue is used for everything above the L1, then prefetcht1/2 would put double the pressure on the queue, since both the prefetch and the real access would need to go through it. But there's too much noise for me to do more than speculate.)

In my case, the code is also completely memory-bound once all cores are running. This further increases the prefetch-to-use latency to tens of thousands of cycles, which is probably enough for a hyperthread to trash the L2 as well. Once this happens, it starts consuming more bandwidth, leading to even higher prefetch-to-use latency -- IOW, a partial runaway effect that approaches a 2x slowdown.

So I don't spread out my prefetches. I sort of bunch them close to use, where the prefetch-to-use distance is no more than a few hundred cycles, so that they have a much smaller window in which to be evicted by the other hyperthread. This does a poor job of spreading out the bandwidth consumption for that particular core/thread, but with many threads running across the entire chip, it still manages to smooth itself out decently well.

[QUOTE=Prime95;488806]@Mysticial: Have you tried / do you think it would work to have a hyperthread do all the prefetching?[/QUOTE]

I haven't tried. The compute model that I work on is many random tasks running on many logical cores. No attempt is made to schedule things in specific places other than at the NUMA node level. |
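[Editorial note: the "bunch close to use" approach above can be sketched as prefetching a fixed, small distance ahead of the consuming loop, so each line is requested only a short time before it is read. The lookahead constant here is a guess for illustration, not a tuned value, and the summing loop is a stand-in for real work.]

```c
#include <immintrin.h>
#include <stddef.h>

#define AHEAD 8   /* cache lines of lookahead -- illustrative, not tuned */

/* Prefetch only a short distance ahead of use, so lines spend little
 * time sitting in L1 where the other hyperthread could evict them. */
static double sum_with_short_lookahead(const double *a, size_t n)
{
    const size_t line = 64 / sizeof(double);   /* 8 doubles per line */
    double sum = 0.0;
    for (size_t i = 0; i < n; i += line) {
        if (i + AHEAD * line < n)              /* stay in bounds */
            _mm_prefetch((const char *)&a[i + AHEAD * line], _MM_HINT_T0);
        for (size_t j = i; j < i + line && j < n; j++)
            sum += a[j];                       /* stand-in for real work */
    }
    return sum;
}
```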
[QUOTE=Prime95;488806]@Mysticial: Have you tried / do you think it would work to have a hyperthread do all the prefetching?[/QUOTE]

FYI: Having a hyperthread do the prefetching does not solve the problem. I believe (I have no access to advanced tools like VTune) that when the prefetching hyperthread stalls waiting for memory, the compute hyperthread stalls too. The result is that all the prefetching is done, then all the compute.

An interesting question: is there a way to "stall" a hyperthread for x clock cycles? We really want the hyperthread to steal the minimum number of resources from the compute hyperthread while having a hefty number of no-op clocks between prefetch instructions. [I]Edit: just found the PAUSE instruction.[/I]

BTW, my prefetching problem may be exacerbated by TLB misses. In pass 1 of the FFT, the data is spread across many 4KB pages, and my prefetching problem is definitely worse in pass 1. I'm not sure if my Linux installation or mprime is set up to use large pages. |
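[Editorial note: a hedged sketch of the helper-hyperthread idea being discussed -- one thread issues prefetcht1 hints with PAUSE between lines so it steals few execution resources from its sibling. Thread pinning (required to actually place both threads on one physical core) is omitted, and `pf_job`/`prefetch_worker` are invented names for illustration.]

```c
#include <immintrin.h>
#include <pthread.h>
#include <stddef.h>

struct pf_job { const double *data; size_t n; };

/* Helper thread: walk the data issuing one prefetcht1 per cache line,
 * with a PAUSE between lines (roughly tens of cycles, ~140 on Skylake)
 * to minimize resource theft from the compute hyperthread. */
static void *prefetch_worker(void *arg)
{
    struct pf_job *job = arg;
    const size_t line = 64 / sizeof(double);
    for (size_t i = 0; i < job->n; i += line) {
        _mm_prefetch((const char *)&job->data[i], _MM_HINT_T1);
        _mm_pause();
    }
    return NULL;
}

/* Compute thread: stand-in for the real FFT work. */
static double compute_sum(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```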
[QUOTE=Prime95;489738]Is there a way to "stall" a hyperthread for x clock cycles? We really want the hyperthread to steal the minimum number of resources from the compute hyperthread while having a hefty number of no-op clocks between prefetch instructions. [I]Edit: just found the PAUSE instruction[/I][/QUOTE]
At the C-code level, have you tried (in the non-Windows build, obviously) the POSIX nanosleep() call? |
[QUOTE=ewmayer;489779]At the C-code level, have you tried (in the non-Windows build, obviously) the POSIX nanosleep() call?[/QUOTE]

I have not. Reading the man page suggests this call will get the OS involved, with its thousands of clocks of task-switching overhead. The Intel optimization manual, [url]https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf[/url], indicates the PAUSE instruction is ideal. It was designed for spin loops, putting the thread in a low-power state for about 140 clocks on the Skylake architecture. |
[QUOTE=Prime95;489738]BTW, my prefetching problem may be exacerbated by TLB misses. In pass 1 of the FFT, the data is spread across many 4KB pages. My prefetching problem is definitely worse in pass 1. I'm not sure if my linux installation or mprime is set up to use large pages.[/QUOTE]
The kernel documentation has details on that: [url]https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt[/url] |
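[Editorial note: besides the sysctl/group setup in the kernel doc above, a program can request explicit huge pages directly via mmap with MAP_HUGETLB. This is a hedged sketch with a fallback to normal pages, not mprime's actual allocator; `alloc_maybe_huge` is an invented name.]

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to map `bytes` of explicit huge pages (backed by the
 * vm.nr_hugepages pool); fall back to normal 4KB pages if the pool
 * is empty or the caller lacks permission.  `bytes` should be a
 * multiple of the huge page size (typically 2MB) for the huge path. */
static void *alloc_maybe_huge(size_t bytes, int *got_huge)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED) {
        *got_huge = 1;
        return p;
    }
    *got_huge = 0;                 /* huge pages unavailable; fall back */
    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Note that shared-memory allocations (shmget with SHM_HUGETLB) are governed by the hugetlb_shm_group and shmmax settings discussed below, while MAP_HUGETLB mappings are not tied to the shm limits.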
[QUOTE=Mark Rose;489829]The kernel documentation has details on that: [url]https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt[/url][/QUOTE]
As usual, I'm a failure at Linux. After researching, I did the following:

[CODE]sudo apt install hugepages
sudo groupadd hugepages
sudo gpasswd -a george hugepages
id george    (to get the hugepages group number)

sudo nano /etc/sysctl.conf, adding the lines:
# Set number of huge pages and hugepages group
vm.nr_hugepages = 512
vm.hugetlb_shm_group = 1001

sudo nano /etc/security/limits.conf, adding the lines:
@hugepages soft memlock unlimited
@hugepages hard memlock unlimited[/CODE]

Rebooted... and mprime is not getting any huge pages. meminfo reports:

[CODE]cat /proc/meminfo | grep Huge
AnonHugePages:    112640 kB
HugePages_Total:     512
HugePages_Free:      512
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB[/CODE]

Hugeadm reports:

[CODE]hugeadm --explain
Total System Memory: 7676 MB

Mount Point          Options
/dev/hugepages       rw,relatime

Huge page pools:
      Size  Minimum  Current  Maximum  Default
   2097152      512      512      512        *
1073741824        0        0        0

Huge page sizes with configured pools:

A /proc/sys/kernel/shmmax value of 9223372036854775807 bytes may be sub-optimal. To maximise shared memory usage, this should be set to the size of the largest shared memory segment size you want to be able to use. Alternatively, set it to a size matching the maximum possible allocation size of all huge pages. This can be done automatically, using the --set-recommended-shmmax option.

The recommended shmmax for your currently allocated huge pages is 1073741824 bytes.
To make shmmax settings persistent, add the following line to /etc/sysctl.conf:
  kernel.shmmax = 1073741824

hugeadm:WARNING: There is no swap space configured, resizing hugepage pool may fail
hugeadm:WARNING: Use --add-temp-swap option to temporarily add swap during the resize

To make your hugetlb_shm_group settings persistent, add the following line to /etc/sysctl.conf:
  vm.hugetlb_shm_group = 1001

Note: Permanent swap space should be preferred when dynamic huge page pools are used.[/CODE]

ps says khugepaged is not using any CPU time:

[CODE]ps ax | grep huge
   32 ?        SN     0:00 [khugepaged]
 1177 pts/1    S+     0:00 grep --color=auto huge[/CODE]

Kernel logs are not reporting anything of interest. Ideas? |