performance improvement with assembly
What kinds of performance improvements have people seen when using hand-optimized assembly routines vs. a high-level language like C? For instance, is it possible for assembly to account for a 20% improvement, or 50% (twice as fast), or higher?
I know that this depends on the quality of implementation of both the C and the assembly code, so let's say that the C code has already been tweaked quite a bit and is performing pretty well. I've written my own arbitrary-precision integer math library, and it performs well for what I need it to do, but I'm trying to get a feel for whether it'd be worth the effort to learn some assembly to improve its performance. If the gain is likely to be 10% or so, then the priority is not very high... thanks for any advice you can offer, - ben. |
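The kind of inner loop where hand-written assembly traditionally pays off in an arbitrary-precision library is the carry chain. A minimal sketch in portable C (not from the thread; the function name and 32-bit limb size are assumptions for illustration) - a compiler must synthesize the carry from a widening add, whereas hand-written x86 can use the `adc` instruction directly:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bignum inner loop: add two n-limb numbers, least
 * significant limb first.  Portable C recomputes the carry via a
 * widened 64-bit sum; x86 asm could instead chain adc instructions. */
static uint32_t bignum_add(uint32_t *r, const uint32_t *a,
                           const uint32_t *b, size_t n)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = (uint64_t)a[i] + b[i] + carry; /* widen to capture carry */
        r[i]  = (uint32_t)s;
        carry = (uint32_t)(s >> 32);
    }
    return carry; /* carry out of the most significant limb */
}
```

Whether the compiler turns this into an adc chain on its own varies by compiler and version, which is exactly the uncertainty the question is about.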
Depends very much on the application, and how much time/skill you possess.
To use some Mersenne-related examples: the big-FFT code used in LL testing benefits far more from careful high-level coding and data-movement considerations than from ASM. (Note that ASM gave much more bang on the x86 in the early days of the project, when there were no really good HLL compilers for the x86.) I can get about 2/3 the performance of George's hand-tuned Prime95 code using a simple MSVC build of my C code, and the nice thing is, the C code is portable to other platforms. For smaller snippets of code, however, a whiff of ASM can often get you a lot of [url=http://www.mersenneforum.org/showthread.php?p=107427#post107427]bang for your buck[/url]. The general rule is: if you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), then ASM is worth playing with. |
Thanks for the example, that's exactly the kind of stuff I'm looking for. I did some more digging around in the gmp manual, and found this:
[URL]http://gmplib.org/manual/Assembler-Coding.html#Assembler-Coding[/URL], which also seems to show what is possible. If anyone else has any examples of how assembly significantly sped up a routine, I'd be interested to hear about it. [SIZE=1]p.s. would this thread be better off in the programming sub-forum?[/SIZE] |
See [url]http://www.agner.org/optimize/[/url]
Alex |
Thanks Alex. Wow, that's a great site. I'll start trying some experiments with assembly, to see if I can improve performance at all. It looks like even a small amount, applied in the right places, can make a big difference.
- ben. |
hi,
Today's processors get more and more complex, so the difference between hand-optimized assembly and very well-programmed C code is really big: a factor of 3+ at least. There is a good hardware example of this trend toward complexity. A number of years ago, automatically generated hardware designs could be clocked quite high compared to hand-optimized designs. If you make a Verilog hardware design now, some hardware engineers told me that at 65 nm, after only a bit of hand tuning, they can clock those designs to perhaps 300-333 MHz. However, the chips that AMD and Intel release, completely hand-optimized by a lot of people, get clocked to 3 GHz: nearly a factor of 10 difference. That gap only gets bigger, not smaller. The same is true for assembly on complex processors. Thanks, Vincent |
[QUOTE=ewmayer;110244]Depends very much on the application, and how much time/skill you possess.
To use some Mersenne-related examples: the big-FFT code used in LL testing benefits far more from careful high-level coding and data-movement considerations than from ASM. (Note that ASM gave much more bang on the x86 in the early days of the project, when there were no really good HLL compilers for the x86.) I can get about 2/3 the performance of George's hand-tuned Prime95 code using a simple MSVC build of my C code, and the nice thing is, the C code is portable to other platforms. Now, for smaller snippets of code, however, a whiff of ASM can often get you a lot of [url=http://www.mersenneforum.org/showthread.php?p=107427#post107427]bang for your buck[/url]. The general rule is: if you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), then ASM is worth playing with.[/QUOTE] In integer code the difference is much bigger, as the Core 2 nowadays can do 4 ops a cycle. The difference you report between your code and George's is influenced by two phenomena. a) Entire compiler teams get the order to do ANYTHING they can to make THAT code as fast as possible in assembly (and even then they lose a lot of speed). So very professional guys optimize code that you and I write over the weekend (as that's when I have time for stuff like this, if any). This point a) you really should take into account bigtime; it is the biggest influence. b) Right now the processors are relatively bad at floating point if you look simply at the number of instructions per cycle they can execute (which of course are vectors, making it blazingly faster than single instructions in integer). This will change of course. Not long from now, processors will run floating-point code as fast as integer code. That means there is more to gain there too with assembly. 
For my FFT in 64-bit integers, gcc produces code 50% slower than MSVC, and neither the gcc team nor the MSVC team has the code at this moment (though I'll most likely open-source it when it's a tad better). I have only OS X 64-bit on a Core 2, not Windows, so I don't know the difference there; probably more than 50%. Vincent |
[QUOTE=ewmayer;110244]The general rule is: if you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), then ASM is worth playing with.[/QUOTE]
The general rule above has since been amended to read: "If you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), [b]or your algorithm can take good advantage of hardware-level parallel-execution units (and the instruction sets needed to access them) such as MMX, SSE, AVX or GPU coding,[/b] then ASM is worth playing with." And yes, since I wrote the comment excerpted at the top of this message, I have gone over thoroughly to the dark side. In many ways, though, I'm glad I waited - SSE coding (most of what I use ASM for) is much more RISC-like than the 'standard' x86 ISA, and AVX (with its 3-register instruction-format support) is more RISC-like still. In other words, it's becoming less awkward to code in assembly all the time, especially relative to the resulting code speedups. But pre-SSE x86 floating-point ASM, with that awful stack-based register access? Glad I "missed out" on that. That legacy x86 floating-point stuff gets filed under "If you weren't already doing it for a living, it probably wasn't worth learning". |
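The SSE style discussed above is usually reached through compiler intrinsics before dropping to raw assembly. A minimal sketch (not from the thread; the function name `mul2` is made up) - the compiler allocates registers, but the generated code still reflects the two-operand `mulpd` vs. three-operand `vmulpd` distinction:

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Multiply two pairs of doubles elementwise with one packed instruction.
 * In SSE asm this becomes a destructive `mulpd xmm0, xmm1` (first operand
 * overwritten); under AVX the same intrinsic can compile to the
 * non-destructive `vmulpd xmm0, xmm1, xmm2` form. */
static void mul2(double *out, const double *x, const double *y)
{
    __m128d vx = _mm_loadu_pd(x);            /* load two doubles */
    __m128d vy = _mm_loadu_pd(y);
    _mm_storeu_pd(out, _mm_mul_pd(vx, vy));  /* out[i] = x[i] * y[i] */
}
```

This compiles anywhere SSE2 is baseline (all x86-64), which is part of why SSE, rather than legacy x87, became the practical entry point for this kind of coding.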
[QUOTE=ewmayer;231647]
"If you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), [b]or your algorithm can take good advantage of hardware-level parallel-execution units (and the instruction sets needed to access them) such as mmx, sse, avx or GPU coding[/b] then ASM is worth playing with. [/QUOTE] Good point. The OP was referring to my quadratic sieve implementation (then in progress, now fairly mature). I did end up writing some assembly code, and the only ASM that didn't actually hurt me by making things slower was SSE2 code. I ended up only getting about a 10% improvement from the SSE2 code, since it was only applicable in a few niche places in the algorithm. I believe there are some instructions in SSE4 which might also help me (a multi-up 32-bit multiply, keeping the low dwords), but I don't have a test platform for development using that instruction set. Interestingly, I also see about a 10% improvement using gcc 3.2.3 vs. 4.x, suggesting that large chunks of assembly based on emitted 3.2.3 code could make things faster for people who don't use my pre-compiled binaries. So far I've been unwilling to muster the effort it would take to do this for a relatively small gain. |
Using assembly is the only way to actually be limited by hardware bottlenecks. In [url]http://www.mersenneforum.org/showthread.php?t=13176[/url], for example, the compiled code runs at ~66% of the theoretical optimum, so there is a potential speed-up of 50% using assembly. I'm fiddling around with GPU stuff at the moment, but IIRC all loops of that code should actually be able to satisfy the (typical) <=3-instructions-per-clock, <=1-addpd-per-clock, <=1-mulpd-per-clock, <=1-load-per-clock, <=1-store-per-clock limitations.
In my experience, simply rearranging the order of independent individual instructions can result in runtime differences of up to 30%, but unfortunately there is no simple way to automate a brute-force search for a minimum. |
Also, as I noted in a recent e-mail exchange with Paul/xilman, I think I finally hit my stride as to efficient SSE (and other ASM coding) with some recent SSE2 code which will go into a near-future Mlucas release ... Here is the recipe I use:
1> First code up all the key macros/code-sections in scalar mode (allowing complex subexpressions like z = a*x + b*y), and make sure that works correctly;
2> Recode to reduce the number of local variables to match the number of available registers in the targeted ISA (I use special-named local variables with hexadecimal index fields - convenient for local memory-storage writes of 16-byte-wide SSE registers - to stand for spills-to-memory, like __m0, __m1, ..., __ma, __mb, which makes later conversion to inline assembler easy);
3> Recode to mimic the instructions supported by the ISA, e.g. destructive two-input-one-gets-overwritten for x86-style ISAs;
4> Translate to assembler.
That makes it much easier to see the code flow and map out a quasi-optimal register-usage strategy. Of course this is exactly the kind of stuff optimizing compilers are supposed to do for one, and have been promised to do for 50-some years now ... but one can wait and hope that the next version of compiler X will actually live up to such promises, or roll up one's sleeves. |
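Steps 1 and 3 of the recipe above can be sketched in C (a minimal illustration, not actual Mlucas code; the function names are made up). The point is that once every statement is a destructive two-operand update, each line maps one-to-one onto an x86 instruction:

```c
#include <assert.h>

/* Step 1 of the recipe: scalar code with free use of subexpressions. */
static double step1(double a, double x, double b, double y)
{
    return a * x + b * y;
}

/* Step 3: the same computation rewritten in destructive two-operand
 * style, mimicking x86 SSE scalar instructions (each op overwrites its
 * first operand), so the final translation to mulsd/addsd is mechanical. */
static double step3(double a, double x, double b, double y)
{
    double r0 = a;   /* movsd r0, a  */
    r0 *= x;         /* mulsd r0, x  */
    double r1 = b;   /* movsd r1, b  */
    r1 *= y;         /* mulsd r1, y  */
    r0 += r1;        /* addsd r0, r1 */
    return r0;
}
```

Both versions compute the same value; only the second exposes the register-usage strategy the final assembly will need.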
[QUOTE=ewmayer;231668]Also, as I noted in a recent e-mail exchange with Paul/xilman, I think I finally hit my stride as to efficient SSE (and other ASM coding) with some recent SSE2 code which will go into a near-future Mlucas release ... Here is the recipe I use:
1> First code up all the key macros/code-sections in scalar mode (allowing complex subexpressions like z = a*x + b*y), and make sure that works correctly;
2> Recode to reduce the number of local variables to match the number of available registers in the targeted ISA (I use special-named local variables with hexadecimal index fields - convenient for local memory-storage writes of 16-byte-wide SSE registers - to stand for spills-to-memory, like __m0, __m1, ..., __ma, __mb, which makes later conversion to inline assembler easy);
3> Recode to mimic the instructions supported by the ISA, e.g. destructive two-input-one-gets-overwritten for x86-style ISAs;
4> Translate to assembler.
That makes it much easier to see the code flow and map out a quasi-optimal register-usage strategy. Of course this is exactly the kind of stuff optimizing compilers are supposed to do for one, and have been promised to do for 50-some years now ... but one can wait and hope that the next version of compiler X will actually live up to such promises, or roll up one's sleeves.[/QUOTE] I don't think that's a good way to approach the problem, as SSE stuff is designed to 'stream' data through the processor. IMO it is much easier to achieve peak performance by using only a few registers (read-only constants should be memory operands) with only a few arithmetic instructions, but looping over medium-sized vectors. Loops should be unrolled to operate on cachelines per iteration, which maximizes L2->L1 bandwidth, aids efficient prefetching, and makes non-temporal stores effective. If one uses base+index addressing, increment-and-branch is only two instructions, which can often be completely hidden, and branches can be perfectly predicted if loops are short enough. |
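The cacheline-per-iteration unrolling described above might look like the following (a minimal sketch of my reading of the suggestion, not code from the thread; the function name is made up, and n is assumed to be a multiple of 8):

```c
#include <assert.h>
#include <stddef.h>

/* Stream over a vector of doubles, processing one 64-byte cacheline
 * (8 doubles) per loop iteration.  The unroll keeps the loop body busy
 * with arithmetic so the single increment-and-branch pair at the bottom
 * can be hidden behind the data operations. */
static void scale(double *v, double s, size_t n)  /* n % 8 == 0 assumed */
{
    for (size_t i = 0; i < n; i += 8) {  /* one cacheline per iteration */
        v[i+0] *= s; v[i+1] *= s; v[i+2] *= s; v[i+3] *= s;
        v[i+4] *= s; v[i+5] *= s; v[i+6] *= s; v[i+7] *= s;
    }
}
```

In assembly, the eight updates become consecutive loads/multiplies/stores off one base+index address, matching the two-instruction loop overhead described in the post.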
Your approach makes sense for short-loop code like the linear-algebra BLAS stuff and for code with small compute segments accessed via branches, but my code has large loop bodies and no branches of note (except the "done with loop?" variety) - think radix-8 and larger DFT-pass loops - so benefits little or not at all from unrolling. (In other words the loop body *is* the unroll, in a manner of speaking.)
And yes, I do a lot of accessing read-only operands via memory reads: I only prefer loading such an operand into a register if I'm going to use it more than twice in the same snippet of code, and of course if I have a register free. The cache-line optimizations are something I need to begin working on ... I plan to use the Intel C compiler to do some profiling there, once I get it installed. (Worth paying for now, since I also need to work on AVX code-dev on my 32-bit WinXP laptop, and my existing Visual Studio install there is obsolete in that regard.) |
[QUOTE=ewmayer;231688]Your approach makes sense for short-loop code like the linear-algebra BLAS stuff and for code with small compute segments accessed via branches, but my code has large loop bodies and no branches of note (except the "done with loop?" variety) - think radix-8 and larger DFT-pass loops - so benefits little or not at all from unrolling. (In other words the loop body *is* the unroll, in a manner of speaking.)[/QUOTE]
It also makes sense for FFTs: a split-radix butterfly has 16 additions and 8 multiplies, which (IIRC) can be done in 48 instructions including branching code. Then you loop over one FFT of length N/2 and two FFTs of length N/4. The compiler in [url]http://www.mersenneforum.org/showthread.php?t=13176[/url] obviously generates inefficient code in the main loops, which is why we don't see >0.9 addpd/clock. Hm... I really should do something about coding the 'outer FFT'... [QUOTE=ewmayer;231688]And yes, I do a lot of accessing read-only operands via memory reads: I only prefer loading such an operand into a register if I'm going to use it more than twice in the same snippet of code, and of course if I have a register free.[/QUOTE] If your code is not loading 1 operand/clock, then that strategy will likely waste instructions, and the reg-reg moves will likely block execution pipelines/ports that could be doing arithmetic. What is the point of using all available registers if that makes your program run slower? [QUOTE=ewmayer;231688]The cache-line optimizations are something I need to begin working on ... plan to use the Intel C compiler to do some profiling there, once I get it installed. (Worth paying for now, since I also need to work on AVX code-dev on my 32-bit WinXP laptop, and my existing Visual Studio install there is obsolete in that regard.)[/QUOTE] I'm pretty sure you'll wish you had *started* with cache-line optimizations. Life is easier if you make the basic data type one cacheline and align arrays to cacheline boundaries. I plan to use the same principle to get an NTT to run fast on a Radeon HD 5xx0. As they perform best on vectors of length 64, it appears that the basic datatype should be 64 dwords = 256 bytes = 8*256 bits = 8 (256-bit controller) or 16 (128-bit controller) * burst mode. 2^6 values per 16K local memory = radix 64. 
I see no obvious bottleneck, but as I haven't actually run anything yet on my shiny new [URL="http://www.newegg.com/Product/Product.aspx?Item=N82E16814102873"]Sapphire 5770 ($114.99 w/promo & MIR)[/URL] this is speculation. |
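The cacheline-as-basic-datatype idea can be sketched in C (a minimal illustration, assuming a 64-byte cacheline; the type and function names are made up). C11's aligned_alloc requires the size to be a multiple of the alignment, which the struct size guarantees here:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Make the basic data type exactly one 64-byte cacheline, so every
 * array element starts and ends on a cacheline boundary. */
typedef struct { double d[8]; } cacheline_t;  /* 8 doubles = 64 bytes */

/* Allocate an array of cachelines aligned to a cacheline boundary.
 * sizeof(cacheline_t) == 64, so the size is a multiple of the alignment,
 * as aligned_alloc requires. */
static cacheline_t *alloc_lines(size_t nlines)
{
    return aligned_alloc(64, nlines * sizeof(cacheline_t));
}
```

With this layout, a stride-X access pattern touches whole cachelines rather than straddling two, which is the prerequisite for the L2-to-L1 streaming behavior discussed above.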
[QUOTE=__HRB__;231692]If your code is not loading 1 operand/clock then that strategy will likely waste instructions and the reg-reg moves will likely block execution pipelines/ports that could be doing arithmetic. What is the point of using all available registers if that makes your program run slower?[/quote]
While I find it hard to believe that e.g.
[code]
movaps xmm0,[eax] // load constant multiplier
...
mulps xmm1,xmm0
mulps xmm2,xmm0
mulps xmm3,xmm0
mulps xmm4,xmm0
[/code]
with just one load-from-memory (but 5 instructions) would run slower than
[code]
mulps xmm1,[eax]
mulps xmm2,[eax]
mulps xmm3,[eax]
mulps xmm4,[eax]
[/code]
with 4 loads and 4 instructions, I'm sure it could occur - one more reason to transition from overall code structure and correctness (what I spent most of the past 2 years doing, in my less-than-plentiful spare time) to detailed profiling. [quote]I'm pretty sure you'll wish you had *started* with cache-line optimizations. Life is easier if you make the basic data-type one cacheline and align arrays to cacheline boundaries.[/QUOTE] You misunderstand - basic cache-line alignment of arrays has been there all along in my code ... I'm talking about detailed cache behavior of a typical DFT-pass routine, where we load a bunch of stride-X-separated data from the main array, load some auxiliary trigonometric or DWT data from a small table and/or a radix-specific local-data store, combine to effect the DFT, typically with some register spills and fills, then write the results back to the main array. Simple-sounding questions like "Is it better to spill register data back to the main array (presumably already in cache) or to use a contiguous local store?" can yield quite-surprising answers when it is actually time to run real-world code. (Which is why I prefer getting to the actual-running-code stage as soon as possible, rather than spending months or years in speculation as to what should work best, however well-informed such speculation might be.) |
[QUOTE=ewmayer;231798]You misunderstand - basic cache-line alignment of arrays has been there all along in my code ... I'm talking about detailed cache behavior of a typical DFT-pass routine, where we load a bunch of stride-X-separated data from the main array[...][/QUOTE]
The argument is that you will efficiently load stride-X-separated cachelines from L2 to L1 if you use cachelines as your data type. If loading a cacheline takes 4 cycles, and your code partially operates on 8 cachelines within a couple of cycles, then the reorder buffer would have to hold up to 32 (!) cycles' worth of instructions or execution will stall. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.