2010-09-27, 18:44   #9
bsquared
Feb 2007

2×1,789 Posts

Originally Posted by ewmayer
"If you have a relatively small amount of code which dominates your compute time (e.g. a critical inner-loop section or macro), or your algorithm can take good advantage of hardware-level parallel-execution units (and the instruction sets needed to access them) such as MMX, SSE, AVX or GPU coding, then ASM is worth playing with."
Good point. The OP was referring to my quadratic sieve implementation (then in progress, now fairly mature). I did end up writing some assembly code, and the only ASM that didn't actually make things slower was the SSE2 code. Even that gave only about a 10% improvement, since it was applicable in just a few niche places in the algorithm. I believe there are some instructions in SSE4 which might also help me (a multi-up, i.e. packed, 32-bit multiply that keeps the low dwords), but I don't have a test platform for development using that instruction set.
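For anyone wondering what that SSE4 multiply looks like in practice, here is a rough sketch. It assumes the instruction meant is PMULLD from SSE4.1 (intrinsic `_mm_mullo_epi32`). The helper name `mullo32x4` is made up for illustration and is not taken from the sieve code. Without SSE4.1 the same result can be pieced together from two SSE2 PMULUDQ multiplies plus shuffles, and there is a plain scalar fallback for everything else.

```c
#include <stdint.h>

#if defined(__SSE4_1__)
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 (PMULLD) */
#elif defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_mul_epu32 (PMULUDQ) */
#endif

/* Hypothetical helper: out[i] = low 32 bits of a[i] * b[i], for i = 0..3. */
static void mullo32x4(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
#if defined(__SSE4_1__)
    /* One instruction does all four lanes. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_mullo_epi32(va, vb));
#elif defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* PMULUDQ only multiplies lanes 0 and 2, giving two 64-bit products. */
    __m128i even = _mm_mul_epu32(va, vb);
    /* Shift lanes 1 and 3 down into positions 0 and 2, then multiply again. */
    __m128i odd = _mm_mul_epu32(_mm_srli_si128(va, 4), _mm_srli_si128(vb, 4));
    /* Keep the low dword of each product, then interleave back to lane order. */
    even = _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0));
    odd  = _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0));
    _mm_storeu_si128((__m128i *)out, _mm_unpacklo_epi32(even, odd));
#else
    /* Portable fallback: widen to 64 bits, truncate to the low dword. */
    for (int i = 0; i < 4; i++)
        out[i] = (uint32_t)((uint64_t)a[i] * b[i]);
#endif
}
```

The SSE2 path needs two multiplies and three shuffles where SSE4.1 needs a single instruction, which is presumably where any gain would come from. Whether it helps in a real sieve inner loop would have to be measured.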

Interestingly, I also see about a 10% improvement when compiling with gcc 3.2.3 rather than 4.x, suggesting that large chunks of assembly based on the code 3.2.3 emits could make things faster for people who don't use my pre-compiled binaries. So far I've been unwilling to muster the effort it would take to do this for a relatively small gain.