2009-02-20, 19:17   #3
ewmayer
Sep 2002
República de California

2·5·17·67 Posts

Originally Posted by jasonp
Maybe PMULUDQ was designed for fixed-point operations where the result is expected to be rounded and shifted right to destroy the low bits. Splitting that across two registers would be very painful...
But in SSE4 they added a generate-4-low-halves-at-a-time version of dword mul - and I notice in AVX they finally got at least *something* right by going to a RISC-style 3-register format. So especially now there is no good reason not to support 4-way SIMD 32x32->64-bit multiply ... for instance they could do

pmuludq4 xmm0,xmm1,xmm2

where xmm0 and xmm1 are the inputs, and the instruction sticks the low halves of the 4 products in xmm1 and the high halves in xmm2. More elegantly, provide separate instructions to generate 4 lower and upper halves at a time, e.g.

pmulld xmm0,xmm1,xmm2
pmuludh xmm0,xmm1,xmm3

with the low halves output in xmm2 and the high halves in xmm3. This seems inefficient because it uses 2 instructions and the 2nd mul discards the lower halves, but one could add microcode support so that the hardware recognizes such paired lower-and-upper-half muls and fuses them into a single hardware operation, which splits the double-wide outputs into the 2 destination registers. All sorts of ways to do this.
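For comparison, here is a rough sketch of what it takes to get the same 4 low halves and 4 high halves out of plain SSE2 today, written with C intrinsics. The helper name mul32x4_wide is made up for illustration; the only real multiply involved is PMULUDQ (_mm_mul_epu32), which handles just dwords 0 and 2, so the odd dwords need a second multiply and the halves then have to be shuffled back into place.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not a real instruction or library call):
   4-way unsigned 32x32->64 multiply using SSE2 only.
   On return, *lo holds the 4 low 32-bit halves of the products
   and *hi holds the 4 high 32-bit halves. */
static void mul32x4_wide(__m128i a, __m128i b, __m128i *lo, __m128i *hi)
{
    /* PMULUDQ multiplies dwords 0 and 2 only: even = [lo0, hi0, lo2, hi2]. */
    __m128i even = _mm_mul_epu32(a, b);

    /* Shift dwords 1 and 3 down into slots 0 and 2, then multiply those:
       odd = [lo1, hi1, lo3, hi3]. */
    __m128i odd = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                _mm_srli_epi64(b, 32));

    /* Group lows and highs inside each register:
       e = [lo0, lo2, hi0, hi2], o = [lo1, lo3, hi1, hi3]. */
    __m128i e = _mm_shuffle_epi32(even, _MM_SHUFFLE(3, 1, 2, 0));
    __m128i o = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(3, 1, 2, 0));

    /* Interleave to restore element order. */
    *lo = _mm_unpacklo_epi32(e, o);  /* [lo0, lo1, lo2, lo3] */
    *hi = _mm_unpackhi_epi32(e, o);  /* [hi0, hi1, hi2, hi3] */
}
```

That is 2 multiplies, 2 shifts, 2 shuffles and 2 unpacks for what a pmulld/pmuludh-style pair would do in 2 instructions.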

I also think many of the quirks in SSE2 instructions boil down to a constrained instruction encoding space. The real problem is that multiple precision arithmetic was not on the agenda when this stuff was designed, so we'll just have to make do with what we have.
We don't necessarily need more instructions, we need more intelligently-thought-out instructions. As I noticed in the "efficient mod" thread, there are some instructions they added which are worse than useless - I used the movq2dq instruction as a particularly egregious example - they could have saved themselves both movq2dq and movdq2q by instead simply enhancing the existing movhp... and movlp... instructions to accept an mmx register as an operand.

Another example - the utterly idiotic lack of any support whatsoever for complex MUL in SSE and SSE2. With a little bit of thought they could have added just 2 or 3 instructions (or better, enhanced some of the ones they did give us) to permit an efficient CMUL. These are supposed to be some of the world's top CPU and ISA people here - no excuse for those kinds of oversights.
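To make the CMUL complaint concrete, here is a minimal sketch of one double-precision complex multiply in SSE2 intrinsics, assuming each complex number sits in one register as [re, im]. The helper name cmul_sse2 and the data layout are assumptions for illustration, and this is only one of several possible instruction sequences.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Hypothetical helper: x = [a, b] holds a + b*i, y = [c, d] holds c + d*i.
   Returns [a*c - b*d, b*c + a*d], i.e. the product x*y. */
static __m128d cmul_sse2(__m128d x, __m128d y)
{
    __m128d cc = _mm_unpacklo_pd(y, y);    /* [c, c] */
    __m128d dd = _mm_unpackhi_pd(y, y);    /* [d, d] */
    __m128d xs = _mm_shuffle_pd(x, x, 1);  /* [b, a] */

    __m128d t1 = _mm_mul_pd(x, cc);        /* [a*c, b*c] */
    __m128d t2 = _mm_mul_pd(xs, dd);       /* [b*d, a*d] */

    /* SSE2 has no mixed subtract/add, so flip the sign bit of the
       low lane by hand: t2 becomes [-b*d, a*d]. */
    t2 = _mm_xor_pd(t2, _mm_set_pd(0.0, -0.0));

    return _mm_add_pd(t1, t2);             /* [a*c - b*d, b*c + a*d] */
}
```

Only 2 of the 7 operations here are multiplies; the rest is data shuffling and sign fiddling, which is exactly the overhead a couple of well-chosen instructions would remove.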

Now, with AVX, floating-point support looks really good ... on the integer side they added some really nice crypto-related instructions, but they didn't do a goddamn thing to improve legacy generic-integer support (they didn't even widen the bandwidth from 128-bit to 256-bit as they did for floats), except for the aforementioned 3-operand format - which they should have done right from the start anyway, because with SSE they had in essence a blank slate to work with. You gave us a RISC-style register set, why not a set of RISC-style instructions to go along with it, which don't force us to do cycle-wasting register-operand copying at every turn? Now look how many hoops they have to jump through to retroactively do the right thing - a good fraction of their current and future ISA is in effect there only because its predecessors were so poorly thought out. I cite AltiVec as a basis for comparison Jason will be intimately familiar with. *There* was a reasonably well-thought-out SIMD ISA ... add double-float and 64-bit int support in later generations and it would have been killer.

Intel is great at shrinking a given architecture to incredibly small sizes and lowering power consumption, but good ISA designers, they are not. Not even close.

Last fiddled with by ewmayer on 2009-02-20 at 19:21