__HRB__, 2009-02-20, 18:11
Useless SSE instructions

Occasionally you'll come across a really cool way of doing something with SSE. Then you discover that it won't work, because the designers had a 50/50 or better chance of doing it right, and did it wrong.


pmuludq

It doesn't get any wronger than this. If you want to be fast with SSE, the trick is usually to figure out how to do it with one 8/16/32/64-bit value and then use SSE to process 16/8/4/2 values in one go.

Instead of providing an unsigned multiply that delivers the high 32 bits for dword operands or the low 64 bits for qword operands, we get two 32x32->64-bit results. So, to do anything useful with this, you'll ALWAYS need shuffles and/or unpacking, because the upper 32-bit input in each 64-bit lane is ignored and has to be processed somewhere else.
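To make the complaint concrete: assuming the instruction in question is PMULUDQ (the SSE2 intrinsic `_mm_mul_epu32`), this is roughly what it takes to get the "high 32 bits of an unsigned dword multiply" for all four lanes. One multiply covers elements 0 and 2; elements 1 and 3 have to be shifted into the low halves and multiplied separately, then the two results are masked and merged. This sketch uses shifts and masks rather than shuffles (either works), and `mulhi_epu32_sse2` is a hypothetical helper name, not a real intrinsic.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: high 32 bits of the unsigned 32x32 product in each of
   the four dword lanes. PMULUDQ (_mm_mul_epu32) only reads the low dword of
   each 64-bit lane (elements 0 and 2), so the odd elements need a second pass. */
static __m128i mulhi_epu32_sse2(__m128i a, __m128i b)
{
    /* 64-bit products of elements 0 and 2 */
    __m128i even = _mm_mul_epu32(a, b);
    /* Move elements 1 and 3 down into the low dwords, then multiply them too */
    __m128i odd = _mm_mul_epu32(_mm_srli_epi64(a, 32), _mm_srli_epi64(b, 32));
    /* High halves of the even products move down into dwords 0 and 2 */
    __m128i even_hi = _mm_srli_epi64(even, 32);
    /* High halves of the odd products already sit in dwords 1 and 3; keep only those */
    __m128i odd_hi = _mm_and_si128(odd, _mm_set_epi32(-1, 0, -1, 0));
    return _mm_or_si128(even_hi, odd_hi);
}
```

That is two multiplies plus five shift/logic operations and a mask constant for something a single instruction could have delivered.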

psll, psrl w/immediate

Aw, c'mon guys. If you've ever used these instructions, you'd know that 90% of the time you need a move to preserve the input. Why isn't there a form with separate SRC and DST operands?
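A small illustration, assuming SSE2 intrinsics in C: rotating each 32-bit lane needs the same input for two shifts. Because the legacy encodings of PSLLD/PSRLD overwrite their only register operand, a compiler typically has to copy the input (a `movdqa`) before the first shift. The helper name `rotl13_epi32` and the fixed count of 13 are arbitrary choices for this sketch.

```c
#include <emmintrin.h> /* SSE2 intrinsics */

/* Hypothetical helper: rotate each 32-bit lane left by 13 bits.
   x feeds both shifts, so with the destructive two-operand pslld/psrld
   the compiler usually emits one movdqa to keep a copy of x alive. */
static __m128i rotl13_epi32(__m128i x)
{
    __m128i left  = _mm_slli_epi32(x, 13);      /* needs a copy of x first */
    __m128i right = _mm_srli_epi32(x, 32 - 13); /* last use of x, can shift it in place */
    return _mm_or_si128(left, right);
}
```

For what it's worth, the VEX-encoded AVX forms (`vpslld xmm1, xmm2, imm8`) do have a separate destination, so building with AVX enabled lets the compiler drop that copy.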


There is no excuse for leaving out unsigned versions. Don't tell me that it requires real effort to include them: all compares have an immediate byte with unused bits, so for 50 extra transistors you could have xor'ed one bit with the top bit of the input.
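The bit flip described above is also the usual software workaround: XOR both operands with the sign bit (0x80000000 for dwords), and a signed compare then gives the unsigned result. Here is a sketch for 32-bit lanes using SSE2's `_mm_cmpgt_epi32` (PCMPGTD); the helper name `cmpgt_epu32` is hypothetical. It costs two extra XORs and a constant per compare, which is exactly the overhead a built-in unsigned variant would avoid.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>    /* INT32_MIN */

/* Hypothetical helper: unsigned a > b per 32-bit lane, all-ones where true.
   Flipping the top bit maps 0..0xFFFFFFFF onto INT32_MIN..INT32_MAX in order,
   so the signed compare PCMPGTD yields the unsigned ordering. */
static __m128i cmpgt_epu32(__m128i a, __m128i b)
{
    const __m128i sign = _mm_set1_epi32(INT32_MIN);
    return _mm_cmpgt_epi32(_mm_xor_si128(a, sign), _mm_xor_si128(b, sign));
}
```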


Finally! Now the only missing instruction is:

a.k.a. mycomplextypeis128bitsoSSEistheanswerpd

a.k.a. ivectorizecodethewrongwayps