mersenneforum.org Useless SSE instructions

2009-02-20, 18:11   #1
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴·3²·5 Posts
Useless SSE instructions

Occasionally you'll come across a really cool way of doing something using SSE. Then you discover that it won't work, because the designers had a 50/50 or better chance of doing it right - and did it wrong.

pmuludq:
It doesn't get any wronger than this. If you want to be fast using SSE, the trick is usually figuring out how to do it with one 8/16/32/64-bit value and then using SSE to process 16/8/4/2 values in one go. Instead of providing an unsigned multiply that delivers the high 32 bits for dword operands or the low 64 bits for qword operands, we get two 32x32->64-bit results. So, to do anything useful with this you'll ALWAYS need shuffles and/or unpacking, as the upper 32-bit inputs are ignored and have to be processed somewhere else.

psll, psrl w/immediate:
Aw, c'mon guys. If you've ever used these instructions, you'd know that 90% of the time you need a move to preserve the inputs. Why doesn't this have a SRC, DST form?

pcmp:
There is no excuse for leaving out unsigned versions. Don't tell me that it requires real effort to include them: all compares have an immediate byte with unused bits, so for 50 extra transistors you could have xor'ed one bit with the top bit of the input.

pcmpestrm/pcmpistri:
Finally! Now the only missing instruction is: paddcpuidtoweekdayxorbit19iftuesdayandstartinternetexplorer

addsubpd:
a.k.a. mycomplextypeis128bitsoSSEistheanswerpd

addsubps:
a.k.a. ivectorizecodethewrongwayps
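For readers who haven't hit the pcmp problem: the usual workaround for the missing unsigned compares is exactly the top-bit xor mentioned above, done by hand on both operands before the signed compare. A minimal sketch in C with SSE2 intrinsics, assuming an x86 compiler that ships `<emmintrin.h>`; the helper names `cmpgt_epu32` and `cmpgt_epu32_mask` are made up for illustration and are not from any library:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unsigned 32-bit a > b, built from the signed pcmpgtd.
   XORing both inputs with 0x80000000 maps unsigned order onto signed order. */
static __m128i cmpgt_epu32(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias), _mm_xor_si128(b, bias));
}

/* Illustration-only wrapper: bit i of the result is set when a[i] > b[i]. */
static int cmpgt_epu32_mask(const uint32_t a[4], const uint32_t b[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* movmskps collects the top bit of each 32-bit lane */
    return _mm_movemask_ps(_mm_castsi128_ps(cmpgt_epu32(va, vb)));
}
```

That is two extra pxor instructions (plus a constant) per compare, which is the overhead the post is objecting to.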
2009-02-20, 18:33   #2
jasonp
Tribal Bullet

Oct 2004

3³×131 Posts

Maybe PMULUDQ was designed for fixed-point operations where the result is expected to be rounded and shifted right to destroy the low bits. Splitting that across two registers would be very painful...

I also think many of the quirks in SSE2 instructions boil down to a constrained instruction encoding space. The real problem is that multiple precision arithmetic was not on the agenda when this stuff was designed, so we'll just have to make do with what we have.

2009-02-20, 19:17   #3
ewmayer
2ω=0

Sep 2002
República de California

2D6A₁₆ Posts

Quote:
 Originally Posted by jasonp Maybe PMULUDQ was designed for fixed-point operations where the result is expected to be rounded and shifted right to destroy the low bits. Splitting that across two registers would be very painful...
But in SSE4 they added a generate-4-low-halves-at-a-time version of dword mul - and I notice in AVX they finally got at least *something* right by going to a RISC-style 3-register format. So especially now there is no good reason not to support a 4-way SIMD 32x32->64-bit multiply. For instance, they could do:

pmuludq4 xmm0,xmm1,xmm2

where xmm0 and xmm1 are the inputs, stick the low halves of the 4 products in xmm1 and the high halves in xmm2. More elegantly, provide separate instructions to generate 4 lower and upper halves at a time, e.g.

pmulld xmm0,xmm1,xmm2
pmuludh xmm0,xmm1,xmm3

with the low halves output in xmm2 and the high halves in xmm3. This seems inefficient because it uses 2 instructions and the 2nd mul discards the lower halves, but one could add microcode support so that the hardware recognizes such paired lower-and-upper-half muls and fuses them into a single hardware operation, which splits the double-wide outputs into the 2 destination registers. All sorts of ways to do this.
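
For comparison, here is roughly what it takes to get the same low/high split out of the existing pmuludq on plain SSE2. This is only a sketch in C with intrinsics, assuming an x86 compiler with `<emmintrin.h>`; the function name `mul4_u32_lohi` is hypothetical, and there are other shuffle sequences that do the same job:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: 4-way unsigned 32x32->64 multiply on SSE2.
   lo[i] receives the low 32 bits of a[i]*b[i], hi[i] the high 32 bits. */
static void mul4_u32_lohi(const uint32_t a[4], const uint32_t b[4],
                          uint32_t lo[4], uint32_t hi[4])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* pmuludq only reads dword lanes 0 and 2, giving two 64-bit products */
    __m128i even = _mm_mul_epu32(va, vb);
    /* shift lanes 1 and 3 down into lanes 0 and 2, then multiply again */
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(va, 32),
                                 _mm_srli_epi64(vb, 32));

    /* As dwords: even = lo0 hi0 lo2 hi2, odd = lo1 hi1 lo3 hi3 */
    __m128i t0  = _mm_unpacklo_epi32(even, odd);  /* lo0 lo1 hi0 hi1 */
    __m128i t1  = _mm_unpackhi_epi32(even, odd);  /* lo2 lo3 hi2 hi3 */
    __m128i vlo = _mm_unpacklo_epi64(t0, t1);     /* lo0 lo1 lo2 lo3 */
    __m128i vhi = _mm_unpackhi_epi64(t0, t1);     /* hi0 hi1 hi2 hi3 */

    _mm_storeu_si128((__m128i *)lo, vlo);
    _mm_storeu_si128((__m128i *)hi, vhi);
}
```

Two multiplies, two shifts and four unpacks for what the proposed instruction pair would do in two instructions.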

Quote:
 I also think many of the quirks in SSE2 instructions boil down to a constrained instruction encoding space. The real problem is that multiple precision arithmetic was not on the agenda when this stuff was designed, so we'll just have to make do with what we have.
We don't necessarily need more instructions, we need more intelligently-thought-out instructions. As I noted in the "efficient mod" thread, there are some instructions they added which are worse than useless - I used the movq2dq instruction as a particularly egregious example - they could have saved themselves both movq2dq and movdq2q by instead simply enhancing the existing movhp... and movlp... instructions to accept an mmx register as an operand.

Another example - the utterly idiotic lack of any support whatsoever for complex MUL in SSE and SSE2. With a little bit of thought they could have added just 2 or 3 instructions (or better, enhanced some of the ones they did give us) to permit an efficient CMUL. These are supposed to be some of the world's top CPU and ISA people here - no excuse for those kinds of oversights.

Now, with AVX, floating-point support looks really good ... on the integer side they added some really nice crypto-related instructions, but they didn't do a goddamn thing to improve legacy generic-integer support (they didn't even widen the bandwidth from 128-bit to 256-bit as they did for floats), except for the aforementioned 3-operand format - which they should have done right from the start anyway, because with SSE they had in essence a blank slate to work with. You gave us a RISC-style register set, why not a set of RISC-style instructions to go along with it, which don't force us to do cycle-wasting register-operand copying at every turn? Now look how many hoops they have to jump through to retroactively do the right thing - a good fraction of their current and future ISA is in effect there only because its predecessors were so poorly thought out. I cite AltiVec as a basis for comparison Jason will be intimately familiar with. *There* was a reasonably well-thought-out SIMD ISA ... add double-float and 64-bit int support in later generations and it would have been killer.

Intel is great at shrinking a given architecture to incredibly small sizes and lowering power consumption, but good ISA designers, they are not. Not even close.

Last fiddled with by ewmayer on 2009-02-20 at 19:21

2009-02-20, 21:23   #4
rogue

"Mark"
Apr 2003
Between here and the

7·29·31 Posts

Quote:
 Originally Posted by ewmayer
pmulld xmm0,xmm1,xmm2
pmuludh xmm0,xmm1,xmm3

with the low halves output in xmm2 and the high halves in xmm3. This seems inefficient because it uses 2 instructions and the 2nd mul discards the lower halves, but one could add microcode support so that the hardware recognizes such paired lower-and-upper-half muls and fuses them into a single hardware operation, which splits the double-wide outputs into the 2 destination registers. All sorts of ways to do this.
They must be learning from IBM as PowerPC does the same thing. You need two multiplies to get the 128-bit product (for 64x64 multiplies in 64-bit registers). They did it before the PowerPC line for some instructions, so why did they drop it on PowerPC? I suspect there is something about "pure RISC" that says using two registers for output (only one of which is specified on the instruction) is not to be done. Then again they set bits in control registers all the time based upon the results of different instructions...
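
To make the "two multiplies" point concrete: on 64-bit PowerPC the pair is mulld (low doubleword) and mulhdu (high doubleword, unsigned). Below is a portable C model of what each half returns. It is only a sketch of the arithmetic, not of how the hardware computes it, and the function names `mul64_lo` and `mul64_hi` are made up for illustration:

```c
#include <stdint.h>

/* Low 64 bits of a*b: what a "multiply low" instruction returns. */
static uint64_t mul64_lo(uint64_t a, uint64_t b)
{
    return a * b;   /* unsigned arithmetic wraps modulo 2^64 */
}

/* High 64 bits of a*b, built from four 32x32->64 partial products. */
static uint64_t mul64_hi(uint64_t a, uint64_t b)
{
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p00 = a_lo * b_lo;
    uint64_t p01 = a_lo * b_hi;
    uint64_t p10 = a_hi * b_lo;
    uint64_t p11 = a_hi * b_hi;

    /* Everything that lands in bits 32..63 of the full product;
       three values below 2^32 cannot overflow 64 bits. */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}
```

Calling both gives the full 128-bit product, at the cost of doing the multiply twice unless the hardware fuses the pair.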

2009-02-20, 21:56   #5
ldesnogu

Jan 2008
France

3×181 Posts

Quote:
 Originally Posted by rogue They must be learning from IBM as PowerPC does the same thing. You need two multiplies to get the 128-bit product (for 64x64 multiplies in 64-bit registers). They did it before the PowerPC line for some instructions, why did they drop it on PowerPC? I suspect there is some thing about "pure RISC" that using two registers for output (only one of which is specified on the instruction) is not to be done. Then again they set bits in control registers all the time based upon the results of different instructions...
The problem of having two outputs is that you need to increase the number of write ports in your register bank which makes it bigger and may add some critical paths. You also need to add data paths to route multiple outputs for forwarding paths, which can quickly become a problem.

As far as bits in control registers go, that's generally handled differently and causes less trouble because these are not as wide as registers and you can apply some tricks by only forwarding parts of these control regs.

So it's not a RISC issue or an encoding issue, it's a design constraint.

I hope I did not use too many technical terms and that I made my point clear...

2009-02-20, 22:07   #6
jasonp
Tribal Bullet

Oct 2004

3³·131 Posts

Anyone who is interested in these sorts of issues could look at some of the work of the F-CPU project, which is (was?) an attempt to design a general-purpose 64-bit CPU from scratch, with SIMD designed in from the ground up. Because the first implementations needed to be efficient on FPGAs, and these have a strict limit on the number of read and write ports to the logic you would use for register files, the instruction set has many instructions that write two registers (e.g. a sum and a carry out in register (X) and (X XOR 1), respectively). I don't envy whoever tried to modify gcc to emit code for it :)

2009-02-20, 23:11   #7
ewmayer
2ω=0

Sep 2002
República de California

26552₈ Posts
Also useless: MOVNTPD/S

Another example of "useless":

Quote:
 MOVNTPD -- Store Packed Double-Precision Floating-Point Values Using Non-Temporal Hint

Description
Moves the double quadword in the source operand (second operand) to the destination operand (first operand) using a non-temporal hint to minimize cache pollution during the write to memory. The source operand is an XMM register, which is assumed to contain two packed double-precision floating-point values. The destination operand is a 128-bit memory location.

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when writing the data to memory. Using this protocol, the processor does not write the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy. The memory type of the region being written to can override the non-temporal hint, if the memory address specified for the non-temporal store is in an uncacheable (UC) or write protected (WP) memory region. For more information on non-temporal stores, see “Caching of Temporal vs. Non-Temporal Data” in Chapter 10 in the IA-32 Intel Architecture Software Developer’s Manual, Volume 1.

Because the WC protocol uses a weakly-ordered memory consistency model, a fencing operation implemented with the SFENCE or MFENCE instruction should be used in conjunction with MOVNTPD instructions if multiple processors might use different memory types to read/write the destination memory locations.

Opcode        Instruction          Description
66 0F 2B /r   MOVNTPD m128, xmm    Move packed double-precision floating-point values from xmm to m128 using non-temporal hint.
Sounds pretty good, doesn't it? Sounds like "if you're done crunching datum <foo> whose value is currently stored in an xmm register and you know that <foo> won't be used (either in read or write mode) for a while, you should use this special MOV instruction to write it back to memory, because this bypasses the cache hierarchy and thus allows more soon-to-be-used data to enter the caches without risking being kicked out by <foo> on its way back to main memory."

So I tried using it for the write-outputs step at the end of the loop body of the radix-8 complex-DFT-pass loop in Mlucas last night, in full-optimized mode on my Win32/Core2Duo. And whaddya know - the runtime instantly more than doubled. Note I said "runtime", not "performance". My reaction was something along the lines of "You know, if I needed a way to get my CPU to run cooler, I'd just switch my system power options to max-battery-life mode or fill my assembly code with no-ops."

Another useless example: Since the various SSE mov--- instructions don't care about the data type (e.g. we can freely use movaps in place of movapd, which is in fact recommended because the former has a smaller opcode), why do we need separate MOVAPS, MOVAPD, MOVDQA instructions? (Similarly for the unaligned versions of these.)

Last fiddled with by ewmayer on 2009-02-20 at 23:12

2009-02-20, 23:22   #8
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴·3²·5 Posts

Quote:
 Originally Posted by ewmayer Another example - the utterly idiotic lack of any support whatsoever for complex MUL in SSE and SSE2.
You mean support for doing: (a+ib)(c+id)=ac-bd + i[(a+b)(c+d)-ac-bd] in 3 cycles?
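
Spelled out in scalar C, the three-multiply identity quoted above looks like this. It is just a sketch of the formula; the function name `cmul3` is made up, and in floating point the imaginary part can round differently from the straightforward ad + bc form:

```c
/* Complex product (a+ib)(c+id) using three real multiplies:
   re = ac - bd
   im = (a+b)(c+d) - ac - bd   (which equals ad + bc) */
static void cmul3(double a, double b, double c, double d,
                  double *re, double *im)
{
    double ac = a * c;
    double bd = b * d;
    *re = ac - bd;
    *im = (a + b) * (c + d) - ac - bd;
}
```

The trade is one multiply for three extra adds/subtracts, which only pays off when multiplies are much more expensive than adds.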

I'd rather have a status register, a 4-bit condition-code, free shifts & rotates in ALL instructions.

Quote:
 Originally Posted by ewmayer You gave us a RISC-style register set, why not a set of RISC-style instructions to go along with it, which don't force us to do cycle-wasting register-operand copying at every turn?
This probably has to do with the small number of registers and the way the decoder works. During the decode the processor starts renaming registers, so the cost of the copy is essentially only reduced decoding bandwidth. The register file has something like 96 entries, which I think acts more like a dual-ported level-0 cache.

I want 64 ARMs on one chip running at 2 GHz for my birthday. Oh, and 64 blocks of 64k dual-ported RAM on the same chip. Thank you.

Quote:
 Originally Posted by jasonp The real problem is that multiple precision arithmetic was not on the agenda when this stuff was designed, so we'll just have to make do with what we have.
My objection is that it didn't have to be on the agenda. Just sticking to SIMD philosophy would have been sufficient. pmuludq mixes 32 and 64-bit operands, so this can't be right if the general idea is to have 2/4/8/16 independent streams.

Last fiddled with by __HRB__ on 2009-02-20 at 23:28

2009-02-21, 03:56   #9
jasonp
Tribal Bullet

Oct 2004

3³·131 Posts

Quote:
 Originally Posted by ewmayer And whaddya know - the runtime instantly more than doubled. Note I said "runtime", not "performance"
IIRC George found something similar when optimizing Prime95 for the first Pentium 4 CPUs. However, he noticed that L2 bandwidth increased markedly when the stores were contiguous to some other memory region, and not back to the original addresses.

Sometimes I think MOVNTPD is designed only for memory copies; there's an AMD example whitepaper for that application where the MMX version of the instruction increases performance drastically because it allows write combining.
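
For reference, the copy-to-a-different-region pattern described above looks roughly like this with SSE2 intrinsics. This is a sketch only: the function name `stream_copy_pd` is hypothetical, it assumes `dst` is 16-byte aligned and `n` is even, and whether it helps at all depends heavily on the CPU and the access pattern, as the posts above show:

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_sfence */
#include <stddef.h>

/* Hypothetical helper: copy n doubles (n even) from src to a separate,
   16-byte-aligned dst using MOVNTPD, so the destination lines bypass
   the caches instead of evicting data that is still in use. */
static void stream_copy_pd(double *dst, const double *src, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);
        _mm_stream_pd(dst + i, v);  /* movntpd: dst + i must be 16-byte aligned */
    }
    _mm_sfence();  /* make the weakly-ordered WC stores visible before returning */
}
```

Writing whole cache lines contiguously is what lets the write-combining buffers do their job; scattering non-temporal stores back over the addresses just read, as in the DFT experiment, is the case where it hurt.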

2009-02-21, 04:36   #10
__HRB__

Dec 2008
Boycotting the Soapbox

2⁴·3²·5 Posts
punpck*

Code:
xorps xmm,xmm
punpck* mem, xmm

gives us: ...xh,0,xl,0 but with source and destination exchanged, we'd have a quick zero-extend of the inputs, fetching only what you need from memory.

I'd like to quote the docs, but copy&paste is "forbidden by drm". I'm too lazy to figure out how to circumvent it. Anyhoo, they actually mention that you can use punpck* for this purpose with a source operand of 0.

This greatly benefits applications which store data in registers and zero-extend results before writing them to memory. By writing to the same memory location, data compression (lossy) of almost 100% is possible, while still being able to correctly reconstruct 50% of the input.

Last fiddled with by __HRB__ on 2009-02-21 at 04:37

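For anyone following along, the zero-extend that the docs describe looks like this in C with SSE2 intrinsics. It is a sketch under the usual assumptions (x86 compiler with `<emmintrin.h>`); the function name `zext_u8_to_u16` is made up. Note that the data is the first (destination) operand and the zero register is the source, so the data has to be loaded into a register first, which is exactly the extra step the post is complaining about:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: zero-extend 8 bytes to 8 16-bit words. */
static void zext_u8_to_u16(const uint8_t in[8], uint16_t out[8])
{
    __m128i x    = _mm_loadl_epi64((const __m128i *)in); /* movq: load 8 bytes */
    __m128i zero = _mm_setzero_si128();
    /* punpcklbw interleaves as x0,0,x1,0,... in memory order, so each
       input byte becomes the low half of a little-endian 16-bit lane. */
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi8(x, zero));
}
```

Swapping the two operands of `_mm_unpacklo_epi8` puts the zeros in the low halves instead, i.e. the shifted-left ...xh,0,xl,0 layout from the post.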
2009-02-21, 11:48   #11
ldesnogu

Jan 2008
France

3×181 Posts

Quote:
 Originally Posted by jasonp Anyone who is interested in these sorts of issues could look at some of the work of the F-CPU project, which is (was?) an attempt to design a general-purpose 64-bit CPU from scratch, with SIMD designed in from the ground up. Because the first implementations needed to be efficient on FPGAs, and these have a strict limit on the number of read and write ports to the logic you would use for register files, the instruction set has many instructions that write two registers (e.g. a sum and a carry out in register (X) and (X XOR 1), respectively).
It looks like F-CPU has been dead for years. The fact that they had instructions that could write 2 registers is eased by the fact that they only committed one instruction per cycle. Another open project, OpenRISC, seemingly only has one output per instruction.

Section 5.1 of the book Embedded Computing: A VLIW Approach to Architecture, Compilers and Tools explains that the area of a register file increases as the square of the number of ports and that access time increases linearly with the number of ports. Section 5.4.2 also explains how large the forwarding network can grow.

Anyway, that doesn't explain why the various x86 SIMD instruction sets are so odd. I guess it's the result of adding a few instructions at each generation, instead of spending a few years in R&D thinking about what is really needed in the longer term.
