#12
Dec 2008
Boycotting the Soapbox
13208 Posts
Quote:
[quoted post not preserved]
#13
∂2ω=0
Sep 2002
República de California
103×113 Posts
Your approach makes sense for short-loop code like the linear-algebra BLAS stuff and for code with small compute segments accessed via branches, but my code has large loop bodies and no branches of note (except the "done with loop?" variety) - think radix-8 and larger DFT-pass loops - so benefits little or not at all from unrolling. (In other words the loop body *is* the unroll, in a manner of speaking.)
And yes, I do a lot of accessing read-only operands via memory reads: I only prefer loading such an operand into a register if I'm going to use it more than twice in the same snippet of code, and of course only if I have a register free. The cache-line optimizations are something I need to begin working on ... I plan to use the Intel C compiler to do some profiling there, once I get it installed. (Worth paying for now, since I also need to work on AVX code development on my 32-bit WinXP laptop, and my existing Visual Studio install there is obsolete in that regard.)
#14
Dec 2008
Boycotting the Soapbox
2⁴·3²·5 Posts
Quote:
coding the 'outer FFT'...
I plan to use the same principle for getting an NTT to run fast on a Radeon HD 5xx0. As those chips perform best on vectors of length 64, it appears that the basic datatype should be 64 dwords = 256 bytes = 2048 bits, i.e. 8 bursts on a 256-bit memory controller or 16 on a 128-bit one. That gives 2^6 such values per 16K of local memory, hence radix 64. I see no obvious bottleneck, but as I haven't actually run anything yet on my shiny new Sapphire 5770 ($114.99 w/promo & MIR), this is speculation.
#15
∂2ω=0
Sep 2002
República de California
103×113 Posts
Quote:
That

    movaps xmm0,[eax]   // load constant multiplier
    ...
    mulps  xmm1,xmm0
    mulps  xmm2,xmm0
    mulps  xmm3,xmm0
    mulps  xmm4,xmm0

with just one load-from-memory (but 5 instructions) would run slower than

    mulps  xmm1,[eax]
    mulps  xmm2,[eax]
    mulps  xmm3,[eax]
    mulps  xmm4,[eax]

with 4 loads and 4 instructions - I'm sure it could occur; one more reason to transition from overall code structure and correctness (what I spent most of the past 2 years doing, in my less-than-plentiful spare time) to detailed profiling.
#16
Dec 2008
Boycotting the Soapbox
2⁴×3²×5 Posts
The argument is that loads of stride-X-separated cachelines from L2 to L1 are efficient if you use the cacheline as your data type. If loading a cacheline takes 4 cycles, and your code partially operates on 8 cachelines within a couple of cycles, the reorder buffer would have to hold up to 32 (!) cycles' worth of instructions or execution will stall.