2007-07-10, 21:14   #1
bsquared

performance improvement with assembly

What kinds of performance improvements have people seen when using hand optimized assembly routines vs. a high level language like C? For instance, is it possible for assembly to account for a 20% improvement, or 50% (twice as fast) or higher?

I know this depends on the quality of both the C and the assembly implementations, so let's say the C code has already been tweaked quite a bit and is performing pretty well.

I've written my own arbitrary-precision integer math library, and it performs well for what I need it to do, but I'm trying to get a feel for whether it'd be worth the effort to learn some assembly to improve its performance. If the gain is likely to be 10% or so, then the priority is not very high...

thanks for any advice you can offer,
- ben.

2007-07-13, 01:02   #2
ewmayer

Depends very much on the application, and how much time/skill you possess.

To use some Mersenne-related examples: the big-FFT code used in LL testing benefits far more from careful high-level coding and data-movement considerations than from ASM. (Note that ASM gave much more bang on the x86 in the early days of the project, when there were no really good HLL compilers for the x86.) I can get about 2/3 the performance of George's hand-tuned Prime95 code using a simple MSVC build of my C code, and the nice thing is, the C code is portable to other platforms.

For smaller snippets of code, however, a whiff of ASM can often get you a lot of bang for your buck.

The general rule is: if you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), then ASM is worth playing with.
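
As a concrete illustration of that rule (a minimal sketch in plain C; the function name is hypothetical): in an arbitrary-precision library of the kind the OP describes, the schoolbook multiply typically spends nearly all its time in a small multiply-accumulate loop like the one below, which is exactly the sort of hot spot worth hand-tuning.

Code:
#include <stdint.h>
#include <stddef.h>

/* Hypothetical inner loop of a schoolbook bignum multiply: multiply the
 * n-limb number a[] by the single limb b and add the products into acc[].
 * In many arbitrary-precision libraries a loop like this dominates the
 * total runtime, making it the natural first target for hand-tuned ASM. */
static uint64_t mul_1_add(uint64_t *acc, const uint64_t *a, size_t n, uint64_t b)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128-bit product; gcc/clang provide __int128 on x86-64 */
        unsigned __int128 t = (unsigned __int128)a[i] * b + acc[i] + carry;
        acc[i] = (uint64_t)t;
        carry  = (uint64_t)(t >> 64);
    }
    return carry;   /* carry out of the top limb */
}

GMP's hand-written mpn_addmul_1 does essentially this job; the assembler versions map the wide product and the carry handling directly onto the MUL and ADC instructions.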

2007-07-13, 02:33   #3
bsquared

Thanks for the example, that's exactly the kind of stuff I'm looking for. I did some more digging around in the gmp manual, and found this:

http://gmplib.org/manual/Assembler-C...sembler-Coding

which also seems to show what is possible. If anyone else has examples of assembly significantly speeding up a routine, I'd be interested to hear about them.
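
For anyone who hasn't followed the link, the classic example there is multi-limb addition. A minimal portable-C sketch (the function name is made up; the corresponding GMP routine is mpn_add_n):

Code:
#include <stdint.h>
#include <stddef.h>

/* Portable C version of an n-limb add, r = a + b, returning the carry out.
 * Hand-written assembler can keep the running carry in the CPU carry flag
 * and use the ADC instruction, so the two explicit carry computations below
 * (and the dependency chain they create) disappear entirely. */
static uint64_t add_n(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        carry = (s < carry);     /* carry out of a[i] + carry */
        r[i] = s + b[i];
        carry += (r[i] < s);     /* carry out of s + b[i] */
    }
    return carry;
}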

p.s. would this thread be better off in the programming sub-forum?

2007-07-13, 11:16   #4
akruppa

See http://www.agner.org/optimize/

Alex

2007-07-13, 16:19   #5
bsquared

Thanks Alex. Wow, that's a great site. I'll start trying some experiments with assembly, to see if I can improve performance at all. It looks like even a small amount, applied in the right places, can make a big difference.

- ben.

2007-07-14, 09:51   #6
diep

Hi,

Today's processors get more and more complex. The difference between hand-optimized assembly and very well-written C code is therefore really big: a factor of 3+ at least.

There is a good example of this trend toward ever more complex hardware. A number of years ago, automatically generated hardware designs could be clocked quite high compared to hand-optimized designs.

If you make a Verilog hardware design now, some hardware engineers told me that at 65 nm they can clock those designs, after only a bit of hand tuning, to 300 MHz, perhaps 333 MHz.

However, the chips that AMD and Intel release, completely hand-optimized by a lot of people, get clocked to 3 GHz.

That's nearly a factor of 10 difference.

That gap only gets bigger and bigger, not smaller.

The same is true for assembly on complex processors.

Thanks,
Vincent


2007-07-14, 10:03   #7
diep

Quote:
Originally Posted by ewmayer
Depends very much on the application, and how much time/skill you possess.

To use some Mersenne-related examples: the big-FFT code used in LL testing benefits far more from careful high-level coding and data-movement considerations than from ASM. (Note that ASM gave much more bang on the x86 in the early days of the project, when there were no really good HLL compilers for the x86.) I can get about 2/3 the performance of George's hand-tuned Prime95 code using a simple MSVC build of my C code, and the nice thing is, the C code is portable to other platforms.

For smaller snippets of code, however, a whiff of ASM can often get you a lot of bang for your buck.

The general rule is: if you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), then ASM is worth playing with.
In integer code the difference is much bigger, as the Core 2 nowadays can do 4 ops per cycle. The difference you reported between your code and George's is influenced by two phenomena.

a) Entire compiler teams are ordered to do ANYTHING they can to get THAT code compiled to the fastest possible assembly (and even then it loses a lot of speed). So very professional people are optimizing code that you and I write over the weekend (as that's when I have time for stuff like this, if any).

You really should take point a) into account big time. It is the biggest influence.

b) Right now processors are relatively weak at floating point if you look simply at the number of instructions per cycle they can execute (which of course are vector instructions, making them blazingly faster than single integer instructions). This will change, of course. Not long from now, processors will be just as fast for floating-point code as for integer code. That means there is more to gain there too with assembly.

For my FFT in 64-bit integers, gcc produces code 50% slower than MSVC, and neither the gcc team nor the MSVC team has the code at this moment (though I'll most likely open-source it when it's a tad better). I have only OS X 64-bit on a Core 2, not Windows, so I don't know the difference there; probably more than 50%.

Vincent

2010-09-27, 17:55   #8
ewmayer

Quote:
Originally Posted by ewmayer
The general rule is: if you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), then ASM is worth playing with.
The general rule above has since been amended to read:

"If you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), or your algorithm can take good advantage of hardware-level parallel-execution units (and the instruction sets needed to access them) such as mmx, sse, avx or GPU coding then ASM is worth playing with.

And yes, since I wrote the comment excerpted at the top of this message, I have gone over thoroughly to the dark side. In many ways, though, I'm glad I waited: SSE coding (most of what I use ASM for) is much more RISC-like than the 'standard' x86 ISA, and AVX (with its 3-register instruction format) is more RISC-like still. In other words, it's becoming less awkward to code in assembly all the time, especially relative to the resulting code speedups. But pre-SSE x86 floating-point ASM, with that awful stack-based register access? Glad I "missed out" on that. That legacy x86 floating-point stuff gets filed under "If you weren't already doing it for a living, it probably wasn't worth learning".
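
To illustrate the 2-operand vs. 3-operand point, here is a minimal sketch using C intrinsics; the AT&T-syntax mnemonics in the comment are what the hand-coder actually writes.

Code:
#include <emmintrin.h>   /* SSE2 intrinsics; compile with -msse2 */
#include <stdio.h>

int main(void)
{
    __m128d x = _mm_set_pd(2.0, 1.0);   /* {1.0, 2.0}, low element first */
    __m128d y = _mm_set_pd(4.0, 3.0);   /* {3.0, 4.0} */

    /* The SSE2 form is destructive: "addpd %xmm1, %xmm0" overwrites xmm0,
     * so keeping an input alive costs an extra register-to-register copy.
     * The AVX form "vaddpd %xmm1, %xmm0, %xmm2" writes a third register,
     * which is what makes it feel more RISC-like to hand-code. */
    __m128d z = _mm_add_pd(x, y);   /* the compiler picks the encoding */

    double out[2];
    _mm_storeu_pd(out, z);
    printf("%f %f\n", out[0], out[1]);   /* 4.0 6.0 */
    return 0;
}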

2010-09-27, 18:44   #9
bsquared

Quote:
Originally Posted by ewmayer
"If you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), or your algorithm can take good advantage of hardware-level parallel-execution units (and the instruction sets needed to access them) such as MMX, SSE, AVX or GPU coding, then ASM is worth playing with."
Good point. The OP was referring to my quadratic sieve implementation (then in progress, now fairly mature). I did end up writing some assembly code, and the only ASM that didn't actually hurt me by making things slower was SSE2 code. I ended up getting only about a 10% improvement from the SSE2 code, since it was applicable in just a few niche places in the algorithm. I believe there are some instructions in SSE4 which might also help me (a multi-up 32-bit multiply that keeps the low dwords), but I don't have a test platform for development using that instruction set.
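
For reference, the SSE4 instruction in question is presumably PMULLD (exposed as _mm_mullo_epi32); a minimal sketch of what it does, untested against the sieve itself since, as noted, the lack of a test platform is the whole problem:

Code:
#include <smmintrin.h>   /* SSE4.1 intrinsics; compile with -msse4.1 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* PMULLD / _mm_mullo_epi32: four 32x32-bit multiplies at once,
     * keeping only the low 32 bits (the low dword) of each product. */
    __m128i a = _mm_set_epi32(40000, 30000, 20000, 10000);
    __m128i b = _mm_set_epi32(7, 5, 3, 2);
    __m128i p = _mm_mullo_epi32(a, b);

    int32_t out[4];
    _mm_storeu_si128((__m128i *)out, p);
    printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  /* 20000 60000 150000 280000 */
    return 0;
}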

Interestingly, I also see about a 10% improvement using gcc 3.2.3 vs. 4.x, suggesting that large chunks of assembly based on code emitted by 3.2.3 could make things faster for people who don't use my pre-compiled binaries. So far I've been unwilling to muster the effort it would take to do this for a relatively small gain.

2010-09-27, 20:43   #10
__HRB__

Using assembly is the only way to actually be limited by the hardware bottlenecks. In http://www.mersenneforum.org/showthread.php?t=13176, for example, the compiled code runs at ~66% of the theoretical optimum, so there is a potential speed-up of 50% from using assembly. I'm fiddling around with GPU stuff at the moment, but IIRC all loops of that code should actually be able to satisfy the (typical) limits of <=3 instructions per clock, <=1 addpd per clock, <=1 mulpd per clock, <=1 load per clock, and <=1 store per clock.

In my experience, simply rearranging the order of independent instructions can result in runtime differences of up to 30%, but unfortunately there is no simple way to automate a brute-force search for the minimum.
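
A toy illustration of why the ordering (and independence) of instructions matters so much, sketched in plain C rather than ASM: a single running sum serializes on the add latency, while independent accumulators let the core keep its adder busy. (With strict floating-point semantics a compiler is not allowed to turn the first form into the second on its own.)

Code:
#include <stddef.h>

/* One dependency chain: every add must wait for the previous one. */
double sum_serial(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Four independent chains: the out-of-order core can overlap the adds,
 * so throughput is limited by the add unit rather than the add latency. */
double sum_unrolled(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}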

2010-09-27, 20:54   #11
ewmayer

Also, as I noted in a recent e-mail exchange with Paul/xilman, I think I finally hit my stride as to efficient SSE (and other ASM) coding with some recent SSE2 code which will go into a near-future Mlucas release ... Here is the recipe I use:

1> first code up all the key macros/code-sections in scalar mode (allowing complex subexpressions like z = a*x + b*y), and make sure that works correctly;

2> recode to reduce the number of local variables to match the number of available registers in the targeted ISA (I use special-named local variables with hexadecimal index fields - convenient for local memory storage writes of 16-byte-wide SSE registers - to stand for spills-to-memory like __m0, __m1, ... , __ma, __mb, which makes later conversion to inline assembler easy);

3> recode to mimic the instructions supported by the target ISA, e.g. the destructive two-input form (one input gets overwritten) of x86-style ISAs;

4> translate to assembler.

That makes it much easier to see the code flow and map out a quasi-optimal register-usage strategy. Of course this is exactly the kind of stuff optimizing compilers are supposed to do for one, and have been promised to do for 50-some years now ... but one can either wait and hope that the next version of compiler X will actually live up to such promises, or roll up one's sleeves.
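
As a minimal sketch of what steps 1> and 3> might look like (hypothetical code, not taken from Mlucas; scalar doubles stand in for SSE registers), here is a radix-2 butterfly written both ways:

Code:
/* Step 1>: scalar code with free use of complex subexpressions.
 * Radix-2 butterfly: (a, b) -> (a + w*b, a - w*b), all complex. */
void bfly_scalar(double *ar, double *ai, double *br, double *bi,
                 double wr, double wi)
{
    double tr = wr * *br - wi * *bi;    /* t = w*b */
    double ti = wr * *bi + wi * *br;
    double xr = *ar + tr, xi = *ai + ti;
    double yr = *ar - tr, yi = *ai - ti;
    *ar = xr; *ai = xi;
    *br = yr; *bi = yi;
}

/* Step 3>: the same computation rewritten so every operation is a
 * destructive two-operand op (dst op= src), mimicking x86 SSE before AVX.
 * Each local stands for one register; translating this to assembler
 * is then essentially mechanical. */
void bfly_two_op(double *ar, double *ai, double *br, double *bi,
                 double wr, double wi)
{
    double r0, r1, r2, r3;
    r0 = *br;  r0 *= wr;      /* r0 = wr*br                 */
    r1 = *bi;  r1 *= wi;      /* r1 = wi*bi                 */
    r0 -= r1;                 /* r0 = tr = wr*br - wi*bi    */
    r1 = *bi;  r1 *= wr;      /* r1 = wr*bi                 */
    r2 = *br;  r2 *= wi;      /* r2 = wi*br                 */
    r1 += r2;                 /* r1 = ti = wr*bi + wi*br    */
    r2 = *ar;  r3 = r2;       /* copy ar                    */
    r2 += r0;  r3 -= r0;      /* r2 = ar+tr, r3 = ar-tr     */
    *ar = r2;  *br = r3;
    r2 = *ai;  r3 = r2;       /* copy ai                    */
    r2 += r1;  r3 -= r1;      /* r2 = ai+ti, r3 = ai-ti     */
    *ai = r2;  *bi = r3;
}

The explicit copies (r3 = r2) are exactly the register-to-register moves the destructive instruction format forces on you, and they are what AVX's three-register encoding removes.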