2009-02-20, 18:11   #1
__HRB__
 
Useless SSE instructions

Occasionally you'll come across a really cool way of doing something using SSE. Then you discover that it won't work, because the designers had a 50/50 or better chance of doing it right - and did it wrong.

pmuludq:

It doesn't get any wronger than this. If you want to be fast using SSE, the trick is usually figuring out how to do it with one 8/16/32/64-bit value and then using SSE to process 16/8/4/2 values in one go.

Instead of providing an unsigned multiply that delivers the high 32 bits for dword operands or the low 64 bits for qword operands, we get two 32x32->64-bit results. So, to do anything useful with this you'll ALWAYS need shuffles and/or unpacking, as the upper 32 bits of each 64-bit input lane are ignored and have to be processed somewhere else.
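
To make the shuffle overhead concrete, here is a rough sketch in C with SSE2 intrinsics of what it takes to get the low 32 bits of four dword products out of pmuludq (_mm_mul_epu32). The names mul_lo_u32x4 and mul_lo_u32x4_check are made up for illustration, and this is just one possible arrangement of the shifts and shuffles, not the only one.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Low 32 bits of four unsigned 32x32 products, SSE2 only.
   pmuludq multiplies dwords 0 and 2 of each input and ignores dwords 1 and 3,
   so the odd dwords are shifted down and multiplied in a second pass. */
static __m128i mul_lo_u32x4(__m128i a, __m128i b)
{
    __m128i even = _mm_mul_epu32(a, b);                  /* a0*b0, a2*b2 as 64-bit */
    __m128i odd  = _mm_mul_epu32(_mm_srli_epi64(a, 32),
                                 _mm_srli_epi64(b, 32)); /* a1*b1, a3*b3 as 64-bit */

    /* Move the low dword of each 64-bit product into lanes 0 and 1. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));

    /* Interleave back into the original lane order: p0, p1, p2, p3. */
    return _mm_unpacklo_epi32(even, odd);
}

/* Hypothetical self-check against plain scalar multiplication. */
static void mul_lo_u32x4_check(void)
{
    uint32_t a[4] = {3u, 0xFFFFFFFFu, 100000u, 7u};
    uint32_t b[4] = {5u, 2u, 100000u, 0x80000001u};
    uint32_t r[4];

    _mm_storeu_si128((__m128i *)r,
                     mul_lo_u32x4(_mm_loadu_si128((const __m128i *)a),
                                  _mm_loadu_si128((const __m128i *)b)));
    for (int i = 0; i < 4; i++)
        assert(r[i] == a[i] * b[i]);  /* uint32_t multiply wraps mod 2^32 */
}
```

That is two multiplies, two shifts, two shuffles and an unpack for what one packed dword multiply would do in a single instruction.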

psll, psrl w/immediate:

Aw, c'mon guys. If you've ever used these instructions, you'd know that 90% of the time you need a move to preserve the inputs. Why doesn't this have a SRC, DST form?
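
A typical case where the input has to survive the shift is a rotate, sketched here in C with SSE2 intrinsics. The intrinsics hide the register copy, but since both shifts read the same x and the legacy (non-VEX) pslld/psrld encodings overwrite their operand, a compiler targeting plain SSE2 will generally have to emit a movdqa somewhere. The names rotl7_u32x4 and rotl7_u32x4_check are made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate each 32-bit lane left by 7 bits.
   In assembly, pslld would destroy x, so x has to be copied first
   for psrld to still see the original value. */
static __m128i rotl7_u32x4(__m128i x)
{
    __m128i hi = _mm_slli_epi32(x, 7);   /* pslld xmm, 7  */
    __m128i lo = _mm_srli_epi32(x, 25);  /* psrld xmm, 25 (needs the original x) */
    return _mm_or_si128(hi, lo);
}

/* Hypothetical self-check against a scalar rotate. */
static void rotl7_u32x4_check(void)
{
    uint32_t v[4] = {0x80000001u, 0x12345678u, 0u, 0xDEADBEEFu};
    uint32_t r[4];

    _mm_storeu_si128((__m128i *)r,
                     rotl7_u32x4(_mm_loadu_si128((const __m128i *)v)));
    for (int i = 0; i < 4; i++)
        assert(r[i] == ((v[i] << 7) | (v[i] >> 25)));
}
```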

pcmp:

There is no excuse for leaving out unsigned versions. Don't tell me that it requires real effort to include them: all compares have an immediate byte with unused bits, so for 50 extra transistors you could have xor'ed one bit with the top bit of the input.
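
The sign-bit xor mentioned above is exactly what you end up doing by hand. A minimal sketch in C with SSE2 intrinsics, assuming 32-bit lanes; the names cmpgt_epu32 and cmpgt_epu32_check are made up for illustration and are not real intrinsics.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Unsigned a > b per 32-bit lane, built from the signed pcmpgtd.
   Flipping the top bit of both inputs maps unsigned order onto signed order. */
static __m128i cmpgt_epu32(__m128i a, __m128i b)
{
    const __m128i bias = _mm_set1_epi32(INT32_MIN);  /* 0x80000000 in every lane */
    return _mm_cmpgt_epi32(_mm_xor_si128(a, bias),
                           _mm_xor_si128(b, bias));
}

/* Hypothetical self-check against scalar unsigned comparison. */
static void cmpgt_epu32_check(void)
{
    uint32_t a[4] = {0xFFFFFFFFu, 1u, 0x80000000u, 5u};
    uint32_t b[4] = {1u, 0xFFFFFFFFu, 0x7FFFFFFFu, 5u};
    uint32_t m[4];

    _mm_storeu_si128((__m128i *)m,
                     cmpgt_epu32(_mm_loadu_si128((const __m128i *)a),
                                 _mm_loadu_si128((const __m128i *)b)));
    for (int i = 0; i < 4; i++)
        assert(m[i] == (a[i] > b[i] ? 0xFFFFFFFFu : 0u));
}
```

That costs two extra pxor per compare (or one, if one side is a constant you can pre-bias), plus a register for the bias.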

pcmpestrm/pcmpistri:

Finally! Now the only missing instruction is:
paddcpuidtoweekdayxorbit19iftuesdayandstartinternetexplorer

addsubpd:
a.k.a. mycomplextypeis128bitsoSSEistheanswerpd

addsubps:
a.k.a. ivectorizecodethewrongwayps