mersenneforum.org > Math Stuff > Computer Science & Computational Number Theory

Old 2014-10-10, 21:48   #23
ewmayer

Quote:
Originally Posted by Prime95 View Post
In the radix-16 code, have you tried prefetching the next radix-16 data into the L1 cache?
My current prefetch scheme there allows a compile-time-specified "fetch-ahead" distance. Based on experiments I found the best value of that to be ~1kB, very different from the "fetch the very next set of inputs" scheme you describe. But let me do a quick revert to the latter when I get home this evening and have access to the Haswell, and I will let you know the result.

One of my ongoing frustrations is that I have not been able to get anywhere near the big speedups you describe with prefetch - mine have been at best 2-3%, an order of magnitude lower than expected. Getting another 20-30% speedup effectively "for free" would of course be huge, but I have never seen anything near that with my prefetch experiments over the years.
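
For reference, here is a minimal sketch of the kind of compile-time-distance prefetch described above, using the x86 prefetch intrinsic with a t1 hint; PFETCH_DIST and the loop skeleton are illustrative placeholders, not the actual Mlucas macro:

Code:
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T1 */

#ifndef PFETCH_DIST
#define PFETCH_DIST 1024   /* fetch-ahead distance in bytes, settable at compile time */
#endif

/* Radix-16 pass skeleton: before working on the block starting at a[j], issue a
   t1-hint prefetch of the address PFETCH_DIST bytes beyond that starting address. */
static void radix16_pass_with_prefetch(double *a, int n, int stride)
{
    for (int j = 0; j < n; j += 16*stride) {
        _mm_prefetch((const char *)&a[j] + PFETCH_DIST, _MM_HINT_T1);
        /* ... radix-16 butterflies on a[j], a[j+stride], ..., a[j+15*stride] ... */
    }
}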
Old 2014-10-10, 22:50   #24
Prime95

Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass, operating on 128KB of data: during the processing of that 128KB I slowly prefetch the next 128KB of data into the L2 cache.

In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache.

How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs with reading and writing the FFT data only twice. If your code is doing more than that you will run into memory bandwidth limitations sooner.
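
As a rough illustration of that two-level scheme - slowly staging the next 128KB block into L2 while working, and pulling the next few lines of the current block into L1 just before use - here is a hedged C sketch with SSE prefetch intrinsics (prime95 itself is hand-written assembly; the block size, line size and 4-line lead are illustrative):

Code:
#include <xmmintrin.h>

#define BLOCK_BYTES (128*1024)          /* per-block working set in the first pass */
#define LINE        64                  /* cache-line size in bytes                */
#define NLINES      (BLOCK_BYTES/LINE)  /* 2048 cache lines per 128KB block        */

/* Process one 128KB block while (a) pulling the next few lines of this block into
   L1 just ahead of use (the v28-style prefetch) and (b) dribbling the *next* 128KB
   block into L2, one line per loop iteration (the v27-style prefetch).
   The last few near-term prefetches spill slightly past the block end; harmless here. */
static void process_block(const double *cur, const double *next)
{
    for (int i = 0; i < NLINES; i++) {
        _mm_prefetch((const char *)cur  + (i + 4) * LINE, _MM_HINT_T0); /* near-term, toward L1 */
        _mm_prefetch((const char *)next + i * LINE,       _MM_HINT_T1); /* long-term, toward L2 */
        /* ... FFT work on the cache line at byte offset i*LINE of cur ... */
    }
}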
Old 2014-10-11, 05:34   #25
ewmayer

I just re-played with the prefetch parameters in the forward radix-16 DFT macro, trying bytes-ahead address offsets ranging from 8 (just one double ahead) to 4096 ... again only a few % variation in the resulting timings, with 1kB roughly the best, though the effect is weak. All these current prefetches are of the t1 variety.

Quote:
Originally Posted by Prime95 View Post
Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass, operating on 128KB of data: during the processing of that 128KB I slowly prefetch the next 128KB of data into the L2 cache.

In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache.

How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs with reading and writing the FFT data only twice. If your code is doing more than that you will run into memory bandwidth limitations sooner.
Well, you have to store the not-currently-in-registers data somewhere, so where do you cache the data if not in the main array? In some contiguous chunk of local memory?

I'll illustrate my data move strategy using the above set of complex radices (240,16,16,16,16) I'm using for F29. Call the overall runlength in terms of doubles n:

[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic square step, and initial 4 iFFT passes (iFFT runs radices in reverse order, thus again 4 x radix-16 here) done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries;

Steps 1,3,4 (actual order is 3,4,1 here) are all fused into a single pass through the data. Step 2 involves multiple passes, but all on a contiguous block of data (and I process 240 such blocks in turn). The size of each of these blocks relates to the "1 MB chunk threshold" I described earlier, which is crucial for good performance on x86: 4 passes of complex radix-16 means 16^4 = 2^16 complex doubles, i.e. 2^17 doubles = exactly 1MB. For F30, if I again used leading radix 240, the resulting contiguous-data chunks would be 2MB, and I would expect (and have already confirmed) a timing deterioration. I already have in place a leading-radix-960 routine, which is slightly worse than 240 when used for F29, but better for F30, and better still for F31.
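
In loop form, the above strategy looks roughly like the following; this is purely a structural sketch, and the kernel names (fused_dit_carry_dif etc.) are hypothetical stubs, not the actual Mlucas routines:

Code:
#include <stddef.h>

/* Stub kernels -- names are hypothetical placeholders, not the real Mlucas functions. */
static void fused_dit_carry_dif   (double *a, int n, int R) { (void)a; (void)n; (void)R; }
static void forward_radix16_passes(double *blk, int len)    { (void)blk; (void)len; }
static void dyadic_square         (double *blk, int len)    { (void)blk; (void)len; }
static void inverse_radix16_passes(double *blk, int len)    { (void)blk; (void)len; }

/* One modular squaring as two passes over main memory:
   pass A = fused [3] final radix-R DIT + [4] carries + [1] initial radix-R DIF
            (per the "3,4,1" ordering above, this pass closes out the previous
            squaring and opens the next one), on stride-n/R-separated data;
   pass B = for each contiguous n/R-sized chunk: remaining fFFT passes, dyadic
            square, initial iFFT passes.  For the F29 radix set 240,16,16,16,16
            each chunk is 2^17 doubles = 1MB. */
static void fft_modsquare_structure(double *a, int n, int R)
{
    int chunk = n / R;

    fused_dit_carry_dif(a, n, R);               /* pass A over main memory */

    for (int k = 0; k < R; k++) {               /* pass B: one chunk at a time */
        double *blk = a + (size_t)k * chunk;
        forward_radix16_passes(blk, chunk);
        dyadic_square(blk, chunk);
        inverse_radix16_passes(blk, chunk);
    }
}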
Old 2014-10-11, 14:39   #26
Prime95

Quote:
Originally Posted by ewmayer View Post
[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic square step, and initial 4 iFFT passes (iFFT runs radices in reverse order, thus again 4 x radix-16 here) done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries;
This is what I call 2 passes over main memory -- exactly what prime95 does.

The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed. Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.

2) You likely break the 240 up into two steps, 16 by 15 (actually inverse 16, 15, carries, forward 15, 16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area, until the final forward radix-16 step writes it back to the FFT data area with horrible strides.
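
A hedged sketch of that gather-into-scratch pattern: copy the large-stride data into a small contiguous scratch buffer, issuing an L1-oriented prefetch for the corresponding address of the next chunk as you go. The function name, arguments and strides are illustrative, not prime95's actual code:

Code:
#include <xmmintrin.h>
#include <stddef.h>

/* Read R stride-separated complex doubles into a contiguous scratch area, issuing a
   t0-hint prefetch for each corresponding address of the *next* chunk as we go. */
static void gather_with_prefetch(const double *a, double *scratch,
                                 int R, size_t stride, size_t next_chunk_offset)
{
    for (int i = 0; i < R; i++) {
        const double *src = a + (size_t)i * stride;
        _mm_prefetch((const char *)(src + next_chunk_offset), _MM_HINT_T0);
        scratch[2*i]     = src[0];   /* re */
        scratch[2*i + 1] = src[1];   /* im */
    }
    /* ... the inverse-15, carry and forward-15 steps then work entirely on the
       small-stride scratch area, and a final scatter writes it back ... */
}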



What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except---

3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going -- that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.


Here's what I'd recommend trying:

In pass 1, go with 240 x 16 -- 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit.

In pass 2, go with 16x16x16 -- 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.
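
One way to "spread these prefetches out", sketched under the sizes above (60KB chunks, 64-byte lines); the function and its argument list are illustrative placeholders, not an actual prime95 or Mlucas routine:

Code:
#include <xmmintrin.h>
#include <stddef.h>

#define NEXT_CHUNK_BYTES (60*1024)   /* size of the next pass-1 chunk, per the post above */
#define LINE             64

/* Call once per sub-transform of the current chunk; spreads t1-hint (L2-oriented)
   prefetches of the next 60KB chunk evenly over the total_iters calls, so the
   whole next chunk has been touched by the time the current one is finished. */
static void prefetch_next_chunk_slice(const char *next_chunk, int iter, int total_iters)
{
    size_t lines_total = NEXT_CHUNK_BYTES / LINE;                       /* 960 lines */
    size_t per_call    = (lines_total + total_iters - 1) / total_iters;
    size_t start       = (size_t)iter * per_call;
    for (size_t l = start; l < start + per_call && l < lines_total; l++)
        _mm_prefetch(next_chunk + l * LINE, _MM_HINT_T1);
}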


That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.
Old 2014-10-12, 01:54   #27
ewmayer

George, many thanks for the thoughtful, detailed reply.

Quote:
Originally Posted by Prime95 View Post
This is what I call 2 passes over main memory -- exactly what prime95 does.
Holy convergent evolution, Bat-man! :) Actually, the key pieces of the scheme - the fused radix/carry step and a similar fusion of the final fFFT pass, dyadic-square step and initial iFFT pass (same radix as the final fFFT, and sharing the same twiddles) - were already in place in the Fortran90 code I used for the 1999 test of F24. The key pieces missing at that time (besides the transition to C and mass deployment of inline ASM) were, first, that the 1999 code still used an out-of-place "ping pong" FFT (which performed well on at least one chip family, MIPS, but sucked hard on most others, from Alpha to x86); and second, that instead of doing a fused step [2] - everything except the final-iFFT-radix/carry/initial-fFFT-radix step - on one contiguous main-array data chunk at a time, I did the intermediate-radix passes (= all but the initial and final radix of each FFT) on the whole array, so a set of 5 radices per FFT, as I am using for F29, meant 8 passes through the whole array. That predictably scales badly to large FFT lengths.

Quote:
The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed.
I use a complex FFT, so radix-240 means 240*(16,32,64) = (3.75,7.5,15)KB in scalar-double, SSE2 and AVX build modes, respectively. Even the AVX-mode 15KB should fit nicely in L1 on Haswell with its 32KB level-1 D-cache, though.

However, note that for radices like 240 which decompose into 2 coprime factors (240 = 15*16 = odd * power-of-2, but could also be the product of 2 coprime odds, or odd1 * (odd2 * power-of-2) with odd1,2 coprime), I typically use 2 such contiguous local stores, due to the index-permutation flexibility demanded by my twiddleless implementation of such coprime radix pairs. So e.g. pass 1 reads from large-stride-separated main-array locations and writes to store 1; pass 2 reads from store 1 and writes to store 2. But I still think the main-array step is the crucial one here, as you note below.

Quote:
Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.
Yes, that sounds like a promising "try this first" step - I have done limited prefetch experiments for the large-stride step, but IIRC my attempts were influenced by my experience with prefetch within the contiguous Step [2] data blocks, where fetch-ahead distances ~1KB seem best for me. I don't think I ever made a concerted effort to fetch just the next set of complex data-to-be-used into L1.

Quote:
2) You likely break the 240 up into two steps, 16 by 15 (actually inverse 16, 15, carries, forward 15, 16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area, until the final forward radix-16 step writes it back to the FFT data area with horrible strides.
Same here, but with the extra twist of 2 scratch areas for coprime radices.

The carry macros also operate on the last-used scratch area. Carries also need auxiliary data: DWT weights for Mersenne-mod, and complex roots of -1 for the "acyclic twisting" scheme used for moduli like Fermats with a "+" sign. The latter are similar to the FFT twiddles, but the twiddles are (n/2)th roots of +1, whereas the acyclic-twisting weights are (n/2)th roots of -1 (the same as the first (n/2), i.e. upper-complex-half-plane, (n)th roots of +1).

I use a 2-table multiply scheme for on-the-fly generation of both the FFT twiddles and the IBDWT weights. As I noted to you some years ago, for the IBDWT weights w_0, ..., w_{n-1} I make use of the identity

w_j * w_{n-j} = 2

to allow a single array to encode both the forward and inverse weights, at the price of some slightly-less-cache-friendly reverse-running indexing in the inverse-weights step.

My 2-table schemes have each pair of tables within a factor of 2 of each other in size; thus I have a pair of O(sqrt n)-sized tables of complex data for the FFT twiddles, another such pair for the acyclic-twisting weights in Fermat-mod mode, and a pair of similar-sized real-data tables for the DWT weights in Mersenne-mod mode. Note, though, that only one such pair of data tables competes for L1 cache with actual "live" transform (i.e. residue-related) data in any phase of the overall FFT-modmul scheme: the fused steps [1,3,4] need either IBDWT weights or acyclic-twisting weights, depending on the kind of modulus; step [2] needs just the 2 subtables used to generate the complex-FFT twiddles.
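
To make the 2-table idea and the weight identity concrete, here is a small illustrative sketch; the table splitting shown is the generic exp(2*pi*i*j/n) factorization, and the weight formula is the standard Crandall/Fagin form - not necessarily the exact Mlucas implementation:

Code:
#include <complex.h>
#include <math.h>
#include <stdio.h>

/* 2-table on-the-fly twiddle generation: writing j = a*B + b gives
   exp(2*pi*i*j/n) = T_hi[a] * T_lo[b], with T_hi[a] = exp(2*pi*i*a*B/n) and
   T_lo[b] = exp(2*pi*i*b/n), so two tables of combined size ~2*sqrt(n)
   stand in for a length-n twiddle table. */
static double complex twiddle(const double complex *T_hi,
                              const double complex *T_lo, int B, int j)
{
    return T_hi[j / B] * T_lo[j % B];
}

/* Numerical spot-check of the quoted identity w_j * w_{n-j} = 2 for IBDWT weights
   of the standard Crandall/Fagin form w_j = 2^(ceil(j*p/n) - j*p/n), 0 < j < n,
   gcd(p,n) = 1.  Small p,n only -- purely a sanity check. */
static void check_weight_identity(long p, long n)
{
    for (long j = 1; j < n; j++) {
        double wj  = pow(2.0, ceil((double)j*p/n)     - (double)j*p/n);
        double wnj = pow(2.0, ceil((double)(n-j)*p/n) - (double)(n-j)*p/n);
        if (fabs(wj * wnj - 2.0) > 1e-12)
            printf("identity violated at j = %ld\n", j);
    }
}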

Aside: For non-power-of-2 FFT lengths (this case was not considered in the now-famous 1994 Crandall/Fagin DWT paper) the Fermat-mod scheme also needs Mersenne-style IBDWT weights in addition to the acyclic-twisting weights, but the IBDWTs are memory-negligible here, because for an FFT length of form odd * 2^m the IBDWT weights cycle with sequence length (odd), and it makes little sense to use odd-radix components much larger than 15. (The largest such I have is 63, which will be used specifically for the Fermat numbers F32 and F33, where an FFT length of form 15 * 2^m will no longer suffice due to ROE growth - I will use FFT length 63 * 2^m for one of the 2 independent tests, and the power-of-2 length 2^(m+6) for the other.)

Quote:
What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except---

3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going -- that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.
So if I read you right, the evidence for the L3-bandwidth issue you raise would be that the prefetch-into-L3 scheme helps for (say) 1- or 2-threaded runs, but hurts performance (vs. no L3 prefetch) above some thread count in the 4-ish range?

Quote:
Here's what I'd recommend trying:

In pass 1, go with 240 x 16 -- 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit.
In "240 x 16", is the 16 a data-chunk byte count (as in your "combined steps 3,4,1 work on 240*16 = 4KB chunks" not above), or are you referring to radix-16, i.e. to prefetching data for the first 2 FFT passes into L2?

Quote:
In pass 2, go with 16x16x16 -- 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.
Considering your comments in light of my experience, I'm not even sure prefetch in pass [2] is worth spending much time on, since the whole point of properly choosing the leading radix (e.g. 240 in our example here) is that the system should have an easy time fetching the needed data from the higher-level caches ... but in any case I will first tackle the large-stride-related data-fetch issues for the fused steps [1,3,4] in my breakdown.

Quote:
That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.
Well, as you know, this is a long-term project for me. :) It's nice to have some promising directional hints, for sure.

For the coming month or so, I'll probably continue with my ongoing work of FMAizing the remaining key AVX code macros which need it - I don't expect huge gains from that on Haswell, but reliable gains, and hopefully better still on Broadwell when it comes out - and do some large-stride-related prefetch experiments as a side project. Will let you know what I find as results become available.
Old 2014-10-13, 02:13   #28
ewmayer

I just added some prefetcht1 calls to my AVX Fermat-mod carry macro, such that while the 240 (or, more generally, R, where R is the leading radix) carries are being processed - work which is done entirely in a small contiguous local storage area as described above, so the main data array is not otherwise in the picture - prefetches of the data at the R distinct n/R-stride-separated main-array locations get issued. The carry macros seem the logical place to do this prefetching: even though they do not otherwise access the main array in any way, they make it easy to propagate such prefetches to all the various leading-radix/carry routines I have.

For these experiments I added code such that the bytes-ahead offset for the prefetches is set via a compile-time command-line define, and played with various values: 64 is the obvious candidate in AVX mode, since an "AVX complex double" consists of two adjacent quartets of doubles (4 re, 4 im), thus "fetch the next complex datum at each array location" means fetch 64 bytes ahead of the current data addresses. In practice, though, 32 bytes ahead proves a smidge faster (I also tried other values, like 16 and 128, all of which were clearly worse). Instant 4% speedup (95 ms/iter --> 91 ms/iter for my ongoing F29 run) - the clearest timing "signal" I have ever seen in all my playing-with-prefetch. Thanks, George! Hopefully this is the start of a trend - still lots of other things to play with here, as you lay out above.
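
Schematically, the change amounts to something like the following inside the carry loop; the function name and argument list are placeholders for the actual carry-macro interface, and PFETCH_OFF is the compile-time bytes-ahead define being swept here:

Code:
#include <xmmintrin.h>
#include <stddef.h>

#ifndef PFETCH_OFF
#define PFETCH_OFF 32   /* bytes-ahead offset, set via a compile-time define; 32 tested best here */
#endif

/* During carry processing (done entirely in the contiguous local store), issue one
   t1-hint prefetch per leading-radix index at the corresponding n/R-stride-separated
   main-array address, PFETCH_OFF bytes ahead of the current data address. */
static void carries_with_prefetch(const double *main_arr, double *local_store,
                                  int R, size_t stride_doubles)
{
    for (int i = 0; i < R; i++) {
        _mm_prefetch((const char *)(main_arr + (size_t)i * stride_doubles) + PFETCH_OFF,
                     _MM_HINT_T1);
        /* ... process the i-th carry using local_store only ... */
        (void)local_store;
    }
}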
Old 2016-01-23, 01:52   #29
ewmayer

First Pépin test of F29 completed late Wednesday - it took slightly under 20 months at FFT length 30M. Start-of-run per-iteration time on my 3.3 GHz Haswell 4670 quad (running at stock) was ~115ms; several rounds of AVX2/FMA-oriented optimizations brought this down to ~90ms, which was representative of roughly the final 80% of the iterations.

On to run #2 (@ FFT length 32M) ... our own Mike/Xyzzy was kind enough to run the first ~25M iterations @ 32M for me over the last several months on his 2-core Haswell NUC, so I am picking up where that run left off.

[UPDATE] I was going to post the final checkpoint Res64 and the traditional Selfridge-Hurwitz residues, along with some notes on run #2, after letting the latter run for a day or two on the same quad-Haswell system, but the comparison has already turned up a 'bonk!' between iterations 25M and 26M. I am not yet 100% sure whether the error is in my new run or the old one, but I suspect the latter, because that entire run was 'fragile' and prone to crashes at random intervals ranging from an hour to several months. Said crashes were accompanied by either 'fatal roundoff error = 0.5' or 'nonzero exit carry' errors indicating some kind of data corruption, whereas the 32M-FFT run has been free of any such untowardness. I further note that the 30M run used a fast but 'preliminary' binary of my v14.1 code release ... when I compared that to the eventual stabilized release code, the preliminary binary was 2-3% faster at the FFT length in question, so I continued using it. Big mistake, as I now suspect the aforementioned 'noisy' fatal bad-data errors were likely accompanied by silent fails ... note the residue discrepancy between the 30M and 32M runs appears at a checkpoint where there are no signs of 'distress' from either run.

So, likely I am looking at a massive waste of compute effort ... lesson learned: never again do such a run-pair in serial fashion; always do the 2 side-by-side, on systems yielding speeds as similar as possible. To that end, I am looking for a victim, er, volunteer to go on a suicide mission, er, help me re-do the 30M run in parallel with the 32M one. Best would be someone with a Linux Haswell quad similar to my own ... should one prove slightly faster than the other, it will get to finish the 32M-FFT run, which needs ~6% more cycles than the 30M one. Or we could consider running @ 30M using just 3 cores on the faster system, leaving one core free for other tasks.

Note that should you choose to accept this mission, it will be a significant runtime commitment - around 18 months and 0 chance of a prime, although should you change your mind (or blow up your system) at some point along the way you can hand things off (even if your savefiles get nuked, since we will be cross-checking those vs mine) to a fresh recruit. But you will be contributing to a small piece of math history and of course getting a grateful 'thanks!' in any publication that may eventually result.
Old 2016-01-23, 05:09   #30
ixfd64

Just curious, did the system have ECC memory?
Old 2016-01-23, 05:19   #31
Xyzzy

None of his (or ours) has ECC memory.

Old 2016-01-23, 07:24   #32
retina

An 18-month test without ECC RAM!
Old 2016-01-23, 15:05   #33
ramshanker

I envy these algorithms!