#1
∂2ω=0
Sep 2002
República de California
103·113 Posts
(First, a bit of pedantic background for the interested but non-hardware-wonk reader who may be following this:)
The x86 CPU architecture family is notoriously register-poor - while things have gotten much better in 64-bit mode, it is still easy to run out of registers, especially if one is trying to manage instruction sequencing in order to maintain high throughput in the presence of higher-latency operations such as MUL. One major way the x86 ISA (instruction set architecture) mitigates the register-poorness is via the "implied load", that is, allowing one of the inputs to most common arithmetic instructions to be a memory reference. For example, in GCC 32-bit inline-assemblerese, say I want to take a pair of SSE2 data x and y stored in registers xmm0 and xmm1 and do a simple radix-2 butterfly, producing outputs x+y and x-y. The constraint? This butterfly is competing for registers with other in-flight operations, thus I need it to be in-place, i.e. using just the 2 input registers, preferably without incurring a register spill. So here's one way - again note that in GCC/AT&T syntax the source operand is leftmost and the destination rightmost:
Code:
subpd	xmm1,xmm0	// x -= y
addpd	xmm1,xmm1	// y *= 2
addpd	xmm0,xmm1	// y += x
Alternatively, the doubling can be done via an implied load, keeping the constant 2.0 in memory at the address held in eax. Seemingly, this version pays for a fresh memory load on every execution:
Code:
subpd	xmm1,xmm0	// x -= y
mulpd	(eax),xmm1	// y *= 2, via implied load of 2.0 from (eax)
addpd	xmm0,xmm1	// y += x
Or does it? Because if I were designing the microcode execution engine (MEE; not sure if there is a "standard CS initialism" for this chip functionality) for such an ISA, I would consider things like the following:
[a] The (full 4-fold) code sequence has 4 such MULs, each referencing the same memory location;
[b] A typical modern-day CPU architecture has many more "virtual" registers (both general-purpose and SIMD) than are visible to the coder; these are used for things like in-flight register renaming and speculative execution.
The obvious conclusion here is: the MEE should scan ahead for reused memory references in such implied loads and, whenever possible, load the memory datum into a virtual register, thus incurring just a single memory load.
My question is: do the MEEs in modern-day Intel and AMD chips do such a thing? And if so, what kinds of constraints (e.g. depth of lookahead, number of distinct memory refs which can be cached this way at any one time) govern them? Are there any Intel or AMD engineers on this board who know?
#2
P90 years forever!
Aug 2002
Yeehaw, FL
1110101100110₂ Posts
I don't know the answer.
Some observations:
1) Haswell has twice the L1 bandwidth of Sandy/Ivy. Thus, unless you are doing an enormous number of load/store ops, there should be little or no difference between the microcode reloading 2.0 every time vs. cleverly putting 2.0 in a virtual register.
2) I've ceased all development of 32-bit FFT code. With sixteen registers, I can often keep constants in registers.
3) With Skylake, due next year, we will have 32 registers and 8 doubles in an AVX register. All the more reason to abandon 32-bit, 8-register development. Alas, Microsoft's assembler does not support the new instructions yet.
#3
∂2ω=0
Sep 2002
República de California
103·113 Posts
Thanks, George - I similarly have been doing exclusively 64-bit work, except for a brief bout of "restore 32-bit buildability" to allow me to actually build and run SSE2 code on my replacement 2008-9 MacBook Core Duo. I only used 32-bit mode for purposes of my tiny example because anyone who has coded 32-bit assembly for x86 knows how register-starved that is. (And yet materially better than the pre-32-bit x86 CPUs - I know a couple of older guys who love to tell war stories about those, along the lines of the classic Dilbert "A: We had to code using just 0s and 1s. B: They gave you 1s? C: Ha! We didn't even have 0s - we had to use Os" cartoon.)
This is an issue that has nagged at me for a while; today I finally felt compelled to write about it. It would simply be nice to *know*, yes or no, because Intel's and AMD's being so coy about such fairly basic and coder-impacting aspects of their architectures leads to a huge amount of wasted "flying blind" time on the part of coders. Basically, I felt like whining. :)
#4
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
5880₁₀ Posts
Do you have a source for this other than Wikipedia?
#5
Tribal Bullet
Oct 2004
DD5₁₆ Posts
If memory serves (haha) the x86 memory model is fully coherent; writes to memory by one processor are automatically visible to all others in the system at the time the write happens. While that doesn't preclude optimizing away redundant reads like you propose, the view of memory in one thread has to incorporate side effects produced by other threads.
The redundant load can also execute earlier in the pipeline than the point in the instruction stream where it appears, so you would also need logic to compare the addresses of the loads and turn off the load removal if they happen to differ. Plus, if you really do two loads from the same address, the second will already be in cache and won't have a long latency. I would think it unlikely there would be a big payoff in implementing this kind of optimization: cache the load, yes, but don't remove it entirely.
#6
P90 years forever!
Aug 2002
Yeehaw, FL
2·53·71 Posts
IIRC, Intel has not officially promised this but it was leaked at a developer conference. So, more than speculation but less than guaranteed.
My speculation is that a single ADD (or MUL) will be implemented as two 256-bit uops, executed either on two different ports or in two different clock cycles. Therefore, total FPU throughput would be the same as Haswell's. AVX-512 will make programming easier and will allow Intel to deliver a true 512-bit-wide FPU at a later date, potentially doubling the throughput of existing AVX-512 programs.
#7
∂2ω=0
Sep 2002
República de California
103×113 Posts
#8
P90 years forever!
Aug 2002
Yeehaw, FL
2·53·71 Posts
But I believe the Pentium 4 took the baby-steps approach in the move to 128-bit: Core 2 was the first true 128-bit implementation, and Sandy the first true 256-bit. So Intel has a history of both approaches. If Intel does implement a true 512-bit-wide FPU, it will be interesting to see the cache improvements that will be required to feed such a beast.
#9
Jan 2008
France
226₁₆ Posts
There are so many fake slides floating around that it's hard to know what is official and what isn't.
Here is Intel's official word on it. They only talk about Xeon, which likely means we won't get AVX-512 on typical laptop/desktop CPUs in the Skylake generation.