Skylake X twizzler
I've run into a puzzling Skylake stall. The code below should take 12 clocks in a tight loop: 3 for the vshuff64x2, 8 for the two vaddpds, and 1 for the vshufpd. IACA agrees.
My Skylake machine takes 14 clocks. Any ideas what is causing a 2 clock stall?
[CODE]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b  ; 1-3
vshuff64x2 zmm2, zmm2, zmm3, 10001000b  ; 2-4
vshuff64x2 zmm4, zmm4, zmm5, 10001000b  ; 3-5
vshuff64x2 zmm6, zmm6, zmm7, 10001000b  ; 4-6
vaddpd     zmm0, zmm0, zmm0             ; 4-7
vaddpd     zmm2, zmm2, zmm3             ; 5-8
vaddpd     zmm4, zmm4, zmm5             ; 6-9
vaddpd     zmm6, zmm6, zmm7             ; 7-10
vshufpd    zmm0, zmm0, zmm1, 11111111b  ; 8
vshufpd    zmm2, zmm2, zmm3, 11111111b  ; 9
vshufpd    zmm4, zmm4, zmm5, 11111111b  ; 10
vshufpd    zmm6, zmm6, zmm7, 11111111b  ; 11
vaddpd     zmm0, zmm0, zmm0             ; 9-12
vaddpd     zmm2, zmm2, zmm3             ; 10-13
vaddpd     zmm4, zmm4, zmm5             ; 11-14
vaddpd     zmm6, zmm6, zmm7             ; 12-15
[/CODE]
Note that this code takes 12 clocks as expected:
[CODE]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b  ; 1-3
vaddpd     zmm0, zmm0, zmm0             ; 4-7
vshufpd    zmm0, zmm0, zmm1, 11111111b  ; 8
vaddpd     zmm0, zmm0, zmm0             ; 9-12
[/CODE] |
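For reference, the 12-clock expectation is just the serial dependency chain through zmm0. A minimal sketch, assuming the per-instruction latencies stated above (3 for vshuff64x2, 4 per vaddpd, 1 for vshufpd); this is an illustration only, not measured code:

```python
# Latency of the serial dependency chain through zmm0, using the
# per-instruction latencies assumed in the post above.
latencies = {"vshuff64x2": 3, "vaddpd": 4, "vshufpd": 1}
chain = ["vshuff64x2", "vaddpd", "vshufpd", "vaddpd"]

expected_clocks = sum(latencies[op] for op in chain)
print(expected_clocks)  # 12 -- yet the machine measures 14
```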
What happens if you change all the zmms to ymms?
|
Try benchmarking it with two or three rows at a time to see where the slowdown starts.
E.g.:
[code]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b    ; 1-3
vshuff64x2 zmm2, zmm2, zmm3, 10001000b    ; 2-4
; vshuff64x2 zmm4, zmm4, zmm5, 10001000b  ; 3-5
; vshuff64x2 zmm6, zmm6, zmm7, 10001000b  ; 4-6
vaddpd     zmm0, zmm0, zmm0               ; 4-7
vaddpd     zmm2, zmm2, zmm3               ; 5-8
; vaddpd   zmm4, zmm4, zmm5               ; 6-9
; vaddpd   zmm6, zmm6, zmm7               ; 7-10
vshufpd    zmm0, zmm0, zmm1, 11111111b    ; 8
vshufpd    zmm2, zmm2, zmm3, 11111111b    ; 9
; vshufpd  zmm4, zmm4, zmm5, 11111111b    ; 10
; vshufpd  zmm6, zmm6, zmm7, 11111111b    ; 11
vaddpd     zmm0, zmm0, zmm0               ; 9-12
vaddpd     zmm2, zmm2, zmm3               ; 10-13
; vaddpd   zmm4, zmm4, zmm5               ; 11-14
; vaddpd   zmm6, zmm6, zmm7               ; 12-15
[/code]
or:
[code]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b    ; 1-3
; vshuff64x2 zmm2, zmm2, zmm3, 10001000b  ; 2-4
vshuff64x2 zmm4, zmm4, zmm5, 10001000b    ; 3-5
; vshuff64x2 zmm6, zmm6, zmm7, 10001000b  ; 4-6
vaddpd     zmm0, zmm0, zmm0               ; 4-7
; vaddpd   zmm2, zmm2, zmm3               ; 5-8
vaddpd     zmm4, zmm4, zmm5               ; 6-9
; vaddpd   zmm6, zmm6, zmm7               ; 7-10
vshufpd    zmm0, zmm0, zmm1, 11111111b    ; 8
; vshufpd  zmm2, zmm2, zmm3, 11111111b    ; 9
vshufpd    zmm4, zmm4, zmm5, 11111111b    ; 10
; vshufpd  zmm6, zmm6, zmm7, 11111111b    ; 11
vaddpd     zmm0, zmm0, zmm0               ; 9-12
; vaddpd   zmm2, zmm2, zmm3               ; 10-13
vaddpd     zmm4, zmm4, zmm5               ; 11-14
; vaddpd   zmm6, zmm6, zmm7               ; 12-15
[/code]
or with just one set of rows removed for 3 at a time. Chris |
My working theory is that there is no delay if the result of an operation is consumed on the same port, but a one-clock delay if the result is consumed on a different port (Agner Fog has described similar behavior in past Intel CPUs).
This could lead to the clocks described below. At clock 4 the vaddpd is ready to go, but only on port 5. The scheduler has already assigned the vshuff64x2 instruction to port 5 on clock 4. Thus the vaddpd is instead scheduled for clock 5 on port 0. Similarly, at clock 9 the vshufpd is ready to go but the zmm0 data is on port 0 and vshufpd only runs on port 5. Thus, the vshufpd is scheduled for clock 10 instead. The surmised scheduling below exactly describes the observed results (next loop begins at clock 15). I'll try to work up a few more test cases to verify my theory.
[CODE]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b  ; 1-3
vshuff64x2 zmm2, zmm2, zmm3, 10001000b  ; 2-4
vshuff64x2 zmm4, zmm4, zmm5, 10001000b  ; 3-5
vshuff64x2 zmm6, zmm6, zmm7, 10001000b  ; 4-6
vaddpd     zmm0, zmm0, zmm0             ; 5-8
vaddpd     zmm2, zmm2, zmm3             ; 5-8
vaddpd     zmm4, zmm4, zmm5             ; 6-9
vaddpd     zmm6, zmm6, zmm7             ; 7-10
vshufpd    zmm0, zmm0, zmm1, 11111111b  ; 10
vshufpd    zmm2, zmm2, zmm3, 11111111b  ; 9
vshufpd    zmm4, zmm4, zmm5, 11111111b  ; 11
vaddpd     zmm0, zmm0, zmm0             ; 12-15
vshufpd    zmm6, zmm6, zmm7, 11111111b  ; 12
vaddpd     zmm2, zmm2, zmm3             ; 11-14
vaddpd     zmm4, zmm4, zmm5             ; 13-16
vaddpd     zmm6, zmm6, zmm7             ; 13-16
[/CODE] |
Scratch that theory (maybe). I can get a one clock delay without any port 0/5 conflicts.
This takes 7 clocks as expected:
[CODE]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b  ; 1-3
vaddpd     zmm0, zmm0, zmm0             ; 4-7
[/CODE]
This takes 8 clocks in a tight loop -- one more than expected:
[CODE]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b  ; 1-3
vshuff64x2 zmm2, zmm2, zmm3, 10001000b  ; 2-4
vaddpd     zmm0, zmm0, zmm0             ; 4-7
vaddpd     zmm2, zmm2, zmm3             ; 5-8
[/CODE]
To show that back-to-back shuffles aren't a problem, this takes the expected 6 clocks:
[CODE]
vshuff64x2 zmm0, zmm0, zmm1, 10001000b  ; 1-3
vshuff64x2 zmm2, zmm2, zmm3, 10001000b  ; 2-4
vshuff64x2 zmm4, zmm4, zmm5, 10001000b  ; 3-5
vshuff64x2 zmm0, zmm0, zmm1, 10001000b  ; 4-6
vshuff64x2 zmm2, zmm2, zmm3, 10001000b  ; 5-7
vshuff64x2 zmm4, zmm4, zmm5, 10001000b  ; 6-8
[/CODE]
Is there a way the "one clock penalty to transport data to another port" theory can be resurrected? Suppose the scheduler is blissfully unaware of this penalty and prefers in case 2 to alternate consecutive vaddpds to ports 0/5? Anyone else with a working theory? |
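To make that "resurrected" scenario concrete: if the scheduler alternates the two vaddpds across ports 0 and 5, the cross-port chain would pay the hypothesized one-clock transport penalty. A quick sketch of the arithmetic (the penalty model here is my assumption, not documented behavior):

```python
# Two independent chains: a vshuff64x2 (port 5, 3 clocks) feeding a
# vaddpd (4 clocks). Hypothesized model: +1 clock when a result is
# consumed on a different port than the one that produced it.
SHUF_LAT, ADD_LAT, CROSS_PORT_PENALTY = 3, 4, 1

# Scheduler sends one vaddpd to p5 (same port as its shuffle) and the
# other to p0 (cross-port); only the p0 chain pays the penalty.
same_port_chain = SHUF_LAT + ADD_LAT                        # 7 clocks
cross_port_chain = SHUF_LAT + CROSS_PORT_PENALTY + ADD_LAT  # 8 clocks

print(max(same_port_chain, cross_port_chain))  # 8 -- matches the measurement
```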
More evidence of a data forwarding delay.
I crafted vaddpds such that the inputs depend on earlier outputs of both ports 0 and 5. This takes 17+ clocks instead of 16.
[CODE]
vaddpd zmm0, zmm0, zmm0
vaddpd zmm1, zmm1, zmm1
vaddpd zmm2, zmm2, zmm2
vaddpd zmm3, zmm3, zmm3
vaddpd zmm4, zmm4, zmm4
vaddpd zmm5, zmm5, zmm5
vaddpd zmm6, zmm6, zmm6
vaddpd zmm7, zmm7, zmm7
vaddpd zmm8, zmm0, zmm1
vaddpd zmm9, zmm1, zmm0
vaddpd zmm10, zmm2, zmm3
vaddpd zmm11, zmm3, zmm2
vaddpd zmm12, zmm4, zmm5
vaddpd zmm13, zmm5, zmm4
vaddpd zmm14, zmm6, zmm7
vaddpd zmm15, zmm7, zmm6
vaddpd zmm8, zmm8, zmm8
vaddpd zmm9, zmm9, zmm9
vaddpd zmm10, zmm10, zmm10
vaddpd zmm11, zmm11, zmm11
vaddpd zmm12, zmm12, zmm12
vaddpd zmm13, zmm13, zmm13
vaddpd zmm14, zmm14, zmm14
vaddpd zmm15, zmm15, zmm15
vaddpd zmm0, zmm8, zmm9
vaddpd zmm1, zmm9, zmm8
vaddpd zmm2, zmm10, zmm11
vaddpd zmm3, zmm11, zmm10
vaddpd zmm4, zmm12, zmm13
vaddpd zmm5, zmm13, zmm12
vaddpd zmm6, zmm14, zmm15
vaddpd zmm7, zmm15, zmm14
[/CODE] |
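For reference, the expected 16 clocks in that test is where the throughput bound and the latency bound coincide. A sketch, assuming two FMA ports and a 4-clock vaddpd:

```python
# 32 vaddpds spread over two FMA ports (p0 and p5 on this SKX part).
n_adds, fma_ports, ADD_LAT = 32, 2, 4
throughput_bound = n_adds // fma_ports  # 16 clocks

# The longest dependency chain is 4 vaddpds deep
# (e.g. zmm0 -> zmm8 -> zmm8 -> zmm0 in the listing above).
latency_bound = 4 * ADD_LAT             # also 16 clocks

print(max(throughput_bound, latency_bound))  # 16 -- measured: 17+
```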
and evidence against a data forwarding delay.... This takes the expected 16 clocks:
[CODE]
vaddpd zmm0, zmm0, zmm0  ; 1-4
vaddpd zmm1, zmm1, zmm1  ; 1-4
vaddpd zmm2, zmm0, zmm1  ; 5-8
vaddpd zmm3, zmm0, zmm1  ; 5-8
vaddpd zmm4, zmm2, zmm3  ; 9-12
vaddpd zmm5, zmm2, zmm3  ; 9-12
vaddpd zmm0, zmm5, zmm4  ; 13-16
vaddpd zmm1, zmm4, zmm5  ; 13-16
[/CODE] |
[QUOTE=Prime95;500428]and evidence against a data forwarding delay.... This takes the expected 16 clocks:
[CODE]
vaddpd zmm0, zmm0, zmm0  ; 1-4
vaddpd zmm1, zmm1, zmm1  ; 1-4
vaddpd zmm2, zmm0, zmm1  ; 5-8
vaddpd zmm3, zmm0, zmm1  ; 5-8
vaddpd zmm4, zmm2, zmm3  ; 9-12
vaddpd zmm5, zmm2, zmm3  ; 9-12
vaddpd zmm0, zmm5, zmm4  ; 13-16
vaddpd zmm1, zmm4, zmm5  ; 13-16
[/CODE][/QUOTE]
This last one doesn't sound right. There is a known 1 cycle latency to get to and from the port 5 FMA. (Having trouble finding the source for this.) But I've been under the impression that it doesn't apply to the shuffle unit on p5. (The p5 FMA is physically very far away from the rest of the core.) |
[QUOTE=Mysticial;500429]This last one doesn't sound right. There is a known 1 cycle latency to get to and from the port5 FMA. (Having trouble finding the source for this.) But I've been under the impression that it doesn't apply to the shuffle unit on p5. (the p5 FMA is physically very far away from the rest of the core.)[/QUOTE]
I thought the original announcement from Intel said the latency for port 5 FMA was longer than port 0 and Intel later retracted. |
[QUOTE=Prime95;500433]I thought the original announcement from Intel said the latency for port 5 FMA was longer than port 0 and Intel later retracted.[/QUOTE]
I didn't know anything about that retraction, though digging up Intel's manual, it looks like it's more complicated than that. Section 15.17: [URL]https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf[/URL] To summarize, p5 FMA does have a 1 cycle latency to get to and from it. But only if the data is not coming from *any* FMA unit. IOW, the 1 cycle latency doesn't exist if you move from p01 FMA <-> p5 FMA. But it does if you go from p5 FMA to anything else - including the p5 shuffle. ----- Looking at the [URL="https://www.realworldtech.com/forum/?threadid=138897&curpostid=179046"]die shots for SKX[/URL], this kinda makes sense. The p5 FMA is sitting above the SKL core. And you see a virtually identical pattern right below inside the SKL core. It's probably safe to assume that those are the FMA units. Specifically, the top part (off the SKL core) is the p5 FMA, and the part below it is the p0+1 FMA. Speculation: Since the p01 and p5 FMAs are physically next to each other, they can send data to each other with no latency. But getting from the p5 FMA to the register file and the other execution units takes longer since it needs to traverse the area of the p01 FMA. |
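My reading of that summary, as a sketch. The unit names below are made up for illustration; the rule itself is my paraphrase of section 15.17, so treat it as an assumption:

```python
# Hypothetical encoding of the SKX bypass rule summarized above:
# data moving between any FMA units is free, but data crossing the
# p5-FMA boundary to or from a non-FMA unit costs one extra cycle.
FMA_UNITS = {"p0_fma", "p1_fma", "p5_fma"}  # illustrative names

def bypass_penalty(producer, consumer):
    crosses_p5_fma = "p5_fma" in (producer, consumer)
    both_fma = producer in FMA_UNITS and consumer in FMA_UNITS
    return 1 if crosses_p5_fma and not both_fma else 0

print(bypass_penalty("p5_fma", "p0_fma"))      # 0: FMA <-> FMA is free
print(bypass_penalty("p5_fma", "p5_shuffle"))  # 1: even the p5 shuffle pays
print(bypass_penalty("p0_fma", "p0_fma"))      # 0: p01 FMA unaffected
```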
[QUOTE=Mysticial;500435]I didn't know anything about that retraction, though digging up Intel's manual, it looks like it's more complicated than that.[/quote]
My bad. I was remembering the controversy over whether the 6 and 8-core Skylake X's had full AVX512 throughput. [quote]Section 15.17: [URL]https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf[/URL] To summarize, p5 FMA does have a 1 cycle latency to get to and from it. But only if the data is not coming from *any* FMA unit. IOW, the 1 cycle latency doesn't exist if you move from p01 FMA <-> p5 FMA. But it does if you go from p5 FMA to anything else - including the p5 shuffle.[/quote] Good info, I'll study it thoroughly. I need to see if this explains all my timings. |