2010-09-27, 20:54   #11
ewmayer
Sep 2002
República de California

2^2·5·11·53 Posts

Also, as I noted in a recent e-mail exchange with Paul/xilman, I think I finally hit my stride in writing efficient SSE (and other ASM) code, with some recent SSE2 code which will go into a near-future Mlucas release. Here is the recipe I use:

1> first code up all the key macros/code-sections in scalar mode (allowing complex subexpressions like z = a*x + b*y), and make sure that version works correctly;

2> recode to reduce the number of local variables to match the number of registers available in the targeted ISA. I use specially named local variables with hexadecimal index fields (__m0, __m1, ... , __ma, __mb) to stand for spills to memory; the hex indices are convenient for local-memory stores of 16-byte-wide SSE registers, and the naming makes the later conversion to inline assembler easy;

3> recode to mimic the instructions supported by the ISA, e.g. the destructive two-operand form of x86-style ISAs, where one of the two inputs gets overwritten by the result;

4> translate to assembler.

That makes it much easier to see the code flow and map out a quasi-optimal register-usage strategy. Of course this is exactly the kind of stuff optimizing compilers are supposed to do for one and have been promised to do for 50-some years now ... but one can wait and hope that the next version of compiler X will actually live up to such promises, or roll up one's sleeves.
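
As a rough illustration of steps 1 through 3 (a sketch only, not code from Mlucas), here is a complex multiply (a + ib)(c + id) taken through the three scalar stages in plain C. Everything in it is an assumption made for the example: the function names cmul_step1/2/3, the artificially small budget of two "registers" r0 and r1, and the spill slot spill_m0, which plays the role of the __m0-style variables described above (it is renamed here only because identifiers starting with a double underscore are reserved in ISO C). The instruction names in the step 3 comments are the scalar SSE2 instructions each line would roughly correspond to, not the packed forms real SSE2 code would use.

```c
#include <stdio.h>

/* Step 1: plain scalar reference version; compound subexpressions allowed. */
static void cmul_step1(double a, double b, double c, double d,
                       double *re, double *im)
{
    *re = a*c - b*d;
    *im = a*d + b*c;
}

/* Step 2: same arithmetic, but restricted to a hypothetical budget of two
 * "registers" (r0, r1). Anything that does not fit goes through an explicit
 * spill variable, so every memory round trip is visible in the source. */
static void cmul_step2(double a, double b, double c, double d,
                       double *re, double *im)
{
    double r0, r1;      /* stand-ins for two hardware registers */
    double spill_m0;    /* stand-in for a spill-to-memory slot  */

    r0 = a*c;
    r1 = b*d;
    r0 = r0 - r1;
    spill_m0 = r0;      /* real part parked in memory */

    r0 = a*d;
    r1 = b*c;
    r0 = r0 + r1;

    *im = r0;
    *re = spill_m0;
}

/* Step 3: every statement is now either a move or a destructive
 * two-operand operation (dst op= src), mimicking x86-style instructions. */
static void cmul_step3(double a, double b, double c, double d,
                       double *re, double *im)
{
    double r0, r1;
    double spill_m0;

    r0  = a;            /* roughly: movsd r0, [a]        */
    r0 *= c;            /* roughly: mulsd r0, [c]        */
    r1  = b;            /* roughly: movsd r1, [b]        */
    r1 *= d;            /* roughly: mulsd r1, [d]        */
    r0 -= r1;           /* roughly: subsd r0, r1         */
    spill_m0 = r0;      /* roughly: movsd [spill_m0], r0 */

    r0  = a;
    r0 *= d;
    r1  = b;
    r1 *= c;
    r0 += r1;           /* roughly: addsd r0, r1         */

    *im = r0;
    *re = spill_m0;
}
```

Calling any of the three with (a, b, c, d) = (1, 2, 3, 4) should give re = -5 and im = 10, since (1 + 2i)(3 + 4i) = -5 + 10i. Checking each stage against the previous one in this way is what makes the final translation to assembler (step 4) close to mechanical.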