mersenneforum.org  

Old 2010-09-27, 21:43   #12
__HRB__
Dec 2008
Boycotting the Soapbox
2^4·3^2·5 Posts

Quote:
Originally Posted by ewmayer
Also, as I noted in a recent e-mail exchange with Paul/xilman, I think I finally hit my stride as to efficient SSE (and other ASM coding) with some recent SSE2 code which will go into a near-future Mlucas release ... Here is the recipe I use:

1> first code up all the key macros/code-sections in scalar mode (allowing complex subexpressions like z = a*x + b*y), and make sure that works correctly;

2> recode to reduce the number of local variables to match the number of available registers in the targeted ISA (I use special-named local variables with hexadecimal index fields - convenient for local memory storage writes of 16-byte-wide SSE registers - to stand for spills-to-memory like __m0, __m1, ... , __ma, __mb, which makes later conversion to inline assembler easy);

3> recode to mimic the instructions supported by the ISA, e.g. destructive 2-input-one-gets-overwritten for x86-style ISAs;

4> translate to assembler.

That makes it much easier to see the code flow and map out a quasi-optimal register-usage strategy. Of course this is exactly the kind of stuff optimizing compilers are supposed to do for one, and have been promised to do for 50-some years now ... but one can either wait and hope that the next version of compiler X will actually live up to such promises, or roll up one's sleeves.
I don't think that's a good way to approach the problem, as SSE stuff is designed to 'stream' data through the processor. IMO it is much easier to achieve peak performance by using only a few registers (read-only constants should be memory operands) and only a few arithmetic instructions, but looping over medium-sized vectors. Loops should be unrolled to operate on whole cachelines per iteration, which maximizes L2->L1 bandwidth, aids efficient prefetching, and makes non-temporal stores effective. If one uses base+index addressing, increment-and-branch is only two instructions, which can often be completely hidden, and branches can be perfectly predicted if the loops are short enough.
Old 2010-09-27, 23:00   #13
ewmayer
2ω=0
Sep 2002
República de California
2·3·29·67 Posts

Your approach makes sense for short-loop code like the linear-algebra BLAS stuff and for code with small compute segments accessed via branches, but my code has large loop bodies and no branches of note (except the "done with loop?" variety) - think radix-8 and larger DFT-pass loops - so benefits little or not at all from unrolling. (In other words the loop body *is* the unroll, in a manner of speaking.)

And yes, I access read-only operands via memory reads a lot: I only prefer loading such an operand into a register if I'm going to use it more than twice in the same snippet of code, and of course only if I have a register free.

The cache-line optimizations are something I need to begin working on ... I plan to use the Intel C compiler to do some profiling there, once I get it installed. (Worth paying for now, since I also need to work on AVX code-dev on my 32-bit WinXP laptop, and my existing Visual Studio install there is obsolete in that regard.)
Old 2010-09-28, 00:26   #14
__HRB__
Dec 2008
Boycotting the Soapbox
720_10 Posts

Quote:
Originally Posted by ewmayer
Your approach makes sense for short-loop code like the linear-algebra BLAS stuff and for code with small compute segments accessed via branches, but my code has large loop bodies and no branches of note (except the "done with loop?" variety) - think radix-8 and larger DFT-pass loops - so benefits little or not at all from unrolling. (In other words the loop body *is* the unroll, in a manner of speaking.)
It also makes sense for FFTs: a split-radix butterfly has 16 additions and 8 multiplies, which (IIRC) can be done in 48 instructions including the branching code. Then you loop over one FFT of length N/2 and two FFTs of length N/4. The compiler (see http://www.mersenneforum.org/showthread.php?t=13176) obviously generates inefficient code in the main loops, which is why we don't see >0.9 addpd/clock. Hm... I really should do something about coding the 'outer FFT'...

Quote:
Originally Posted by ewmayer
And yes, I do a lot accessing read-only operands via memory reads: I only prefer loading such an operand into a register if I'm going to use it more than twice in the same snippet of code and of course if I have a register free.
If your code is not loading 1 operand/clock, then that strategy will likely waste instructions, and the reg-reg moves will likely block execution pipelines/ports that could be doing arithmetic. What is the point of using all available registers, if that makes your program run slower?

Quote:
Originally Posted by ewmayer
The cache-line-optimizations are something I need to begin working on ... plan to use the Intel C compiler to do some profiling there, once I get it installed. (Worth paying for now since I need to also need to work on AVX code-dev on my 32-bit WinXP laptop, and my existing Visual Studio install there is obsolete in that regard).
I'm pretty sure you'll wish you had *started* with cache-line optimizations. Life is easier if you make the basic data-type one cacheline and align arrays to cacheline boundaries.

I plan to use the same principle for getting an NTT to run fast on a Radeon HD 5xx0. As they perform best on vectors of length 64, it appears that the basic datatype should be 64 dwords = 256 bytes = 8*256 bits = 8 (256-bit controller) or 16 (128-bit controller) * burst mode. 2^6 values per 16K local memory = radix 64. I see no obvious bottleneck, but as I haven't actually run anything yet on my shiny new Sapphire 5770 ($114.99 w/promo & MIR) this is speculation.
Old 2010-09-28, 18:10   #15
ewmayer
2ω=0
Sep 2002
República de California
2·3·29·67 Posts

Quote:
Originally Posted by __HRB__
If your code is not loading 1 operand/clock, then that strategy will likely waste instructions, and the reg-reg moves will likely block execution pipelines/ports that could be doing arithmetic. What is the point of using all available registers, if that makes your program run slower?
While I find it hard to believe that e.g.
movaps xmm0,[eax] // load constant multiplier
...
mulps xmm1,xmm0
mulps xmm2,xmm0
mulps xmm3,xmm0
mulps xmm4,xmm0

with just one load-from-memory (but 5 instructions) would run slower than

mulps xmm1,[eax]
mulps xmm2,[eax]
mulps xmm3,[eax]
mulps xmm4,[eax]

with 4 loads and 4 instructions, I'm sure it could occur - one more reason to transition from overall code structure and correctness (what I spent most of the past 2 years doing, in my less-than-plentiful spare time) to detailed profiling.

Quote:
I'm pretty sure you'll wish you had *started* with cache-line optimizations. Life is easier if you make the basic data-type one cacheline and align arrays to cacheline boundaries.
You misunderstand - basic cache-line alignment of arrays has been there all along in my code ... I'm talking about the detailed cache behavior of a typical DFT-pass routine, where we load a bunch of stride-X-separated data from the main array, load some auxiliary trigonometric or DWT data from a small table and/or a radix-specific local-data store, combine to effect the DFT, typically with some register spills and fills, then write the results back to the main array. Simple-sounding questions like "Is it better to spill register data back to the main array (presumably already in cache) or to use a contiguous local store?" can yield quite surprising answers when it actually comes time to run real-world code. (Which is why I prefer getting to the actual-running-code stage as soon as possible, rather than spending months or years in speculation as to what should work best, however well-informed such speculation might be.)
Old 2010-09-28, 19:00   #16
__HRB__
Dec 2008
Boycotting the Soapbox
1320_8 Posts

Quote:
Originally Posted by ewmayer
You misunderstand - basic cache-line alignment of arrays has been there all along in my code ... I'm talking about detailed cache behavior of a typical DFT-pass routine, where we load a bunch of stride-X-separated data from the main array[...]
The argument is that you will only load stride-X-separated cachelines efficiently from L2 to L1 if you use cachelines as your data type. If loading a cacheline takes 4 cycles, and your code partially operates on 8 cachelines within a couple of cycles, then the reorder buffer would have to hold up to 32 (!) cycles' worth of instructions, or execution will stall.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.