#23
∂²ω=0
Sep 2002
República de California
5·2,351 Posts
Quote:
One of my ongoing frustrations is that I have not been able to get anywhere near the big speedups you describe with prefetch - mine have been at best 2-3%, an order of magnitude lower than expected. Getting another 20-30% speedup effectively "for free" would of course be huge, but I have never seen anything near that with my prefetch experiments over the years.
#24 |
P90 years forever!
Aug 2002
Yeehaw, FL
5·23·71 Posts
Prime95 v27 and before prefetched way ahead into the L2 cache. That is, say we're doing the first pass operating on 128KB of data: during the processing of that 128KB, I slowly prefetch the next 128KB of data into the L2 cache.
In v28, I added prefetching of the next few cache lines into the L1 cache. This moves the data from the L2 cache into the L1 cache. How many total passes over main memory does your FFT take? Prime95 does the forward and inverse FFTs reading and writing the FFT data only twice. If your code is doing more than that, you will run into memory-bandwidth limitations sooner.
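For readers who want a concrete picture, here is a minimal C sketch of that two-level prefetch pattern using the _mm_prefetch intrinsic. It is not Prime95's actual (hand-written assembly) code; the block size, prefetch distance, and function name are illustrative assumptions only.
[CODE]
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 / _MM_HINT_T1 */

#define BLOCK_BYTES (128*1024)                      /* 128KB first-pass work block */
#define CACHE_LINE  64
#define LINES_PER_BLOCK (BLOCK_BYTES/CACHE_LINE)

/* Stand-in for one first-pass FFT block: while working through the current
   128KB block, trickle the NEXT block into L2 (one T1 prefetch per loop
   iteration, v27-style) and pull the soon-to-be-used lines of the current
   block from L2 up into L1 (T0 prefetch a few lines ahead, v28-style).
   Prefetches never fault, so running a few lines past the block end is
   harmless.                                                               */
static void process_block(double *blk, const double *next_blk)
{
    for (int line = 0; line < LINES_PER_BLOCK; line++) {
        _mm_prefetch((const char *)next_blk + (size_t)line*CACHE_LINE,      _MM_HINT_T1);
        _mm_prefetch((const char *)blk      + (size_t)(line + 4)*CACHE_LINE, _MM_HINT_T0);

        /* ... real FFT butterflies on the 8 doubles in this cache line ... */
    }
}
[/CODE]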
#25
∂²ω=0
Sep 2002
República de California
5×2,351 Posts
I just re-played with the prefetch params in the forward-radix-16 DFT macro, trying bytes-ahead address offsets ranging from 8 (just one double ahead) to 4096 ... again only a few % variation in the resulting timings, with 1KB seeming roughly the best, though the effect is weak. All these current prefetches are of the t1 variety.
Quote:
I'll illustrate my data-move strategy using the above set of complex radices (240,16,16,16,16) I'm using for F29. Call the overall runlength in terms of doubles n:

[1] Initial forward radix-240 decimation-in-frequency (DIF) pass on n/240-stride-separated data;
[2] Remaining 4 fFFT passes, dyadic-square step, and initial 4 iFFT passes (the iFFT runs the radices in reverse order, thus again 4 x radix-16 here), done on one contiguous n/240-sized data chunk at a time;
[3] Final radix-240 decimation-in-time (DIT) pass on n/240-stride-separated data;
[4] Round and do carries.

Steps 1,3,4 (actual order is 3,4,1 here) are all fused into a single pass through the data. Step 2 involves multiple passes, but all on a contiguous block of data (and I process 240 such blocks in turn). The size of each of these blocks relates to the "1 MB chunk threshold" I described earlier, which is crucial for good performance on x86: 4 passes of complex radix 16 means 16^4 = 2^16 complex doubles, i.e. 2^17 doubles, exactly 1MB. For F30, if I again used leading radix 240, the resulting contiguous-data chunks would be 2MB, and I would expect (and have already confirmed) a timing deterioration. I already have in place a leading-radix-960 routine, which is slightly worse than 240 when used for F29, but is better for F30, and better still for F31.
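A rough C-level skeleton of that pass structure, in case it helps readers follow the above; the function names are placeholders of mine, not Mlucas's actual routines, and twiddles, weights and error handling are omitted.
[CODE]
/* Placeholder prototypes -- stand-ins for the real radix kernels. */
void radix240_dit_carry_dif(double *a, int n);        /* fused steps [3],[4],[1] */
void radix16_dif_pass (double *chunk, int len, int pass);
void radix16_dit_pass (double *chunk, int len, int pass);
void dyadic_square    (double *chunk, int len);

#define NBLOCKS 240                         /* leading radix */

void fermat_mod_square(double *a, int n)    /* n = # of complex doubles */
{
    int blk_len = n / NBLOCKS;              /* 16^4 complex doubles = 1MB for F29 */

    /* Fused steps [3],[4],[1]: final radix-240 DIT of the previous squaring,
       round+carry, then the initial radix-240 DIF of the next squaring --
       one pass over the whole array at stride n/240.                        */
    radix240_dit_carry_dif(a, n);

    /* Step [2]: everything else works on one contiguous chunk at a time,
       which stays cache-resident across all of these sub-passes.            */
    for (int blk = 0; blk < NBLOCKS; blk++) {
        double *chunk = a + 2*(size_t)blk*blk_len;   /* 2 doubles per complex */
        for (int pass = 0; pass < 4; pass++)         /* remaining fFFT passes  */
            radix16_dif_pass(chunk, blk_len, pass);
        dyadic_square(chunk, blk_len);               /* pointwise squaring     */
        for (int pass = 3; pass >= 0; pass--)        /* leading iFFT passes    */
            radix16_dit_pass(chunk, blk_len, pass);
    }
}
[/CODE]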
#26
P90 years forever!
Aug 2002
Yeehaw, FL
5·23·71 Posts
Quote:
The combined steps 3,4,1 work on 240*16 = 4KB chunks. These 4KB chunks easily fit in the L1 cache while being processed. Two comments on performance:

1) While processing the 4KB chunk in the L1 cache, you should be prefetching the next 4KB chunk into the L1 cache. My concern is that you are not doing enough FPU work to hide the latencies of reading the next 4KB chunk from main memory. If I'm right, you can improve the FPU code all you want but your actual timings won't change.

2) You likely break the radix-240 step up into two steps, 16 by 15 (actually inverse 16, 15, carries, forward 15, 16), and you are working with horribly large strides. During the inverse radix-16 step in my example, Prime95 would read the horrible-stride data and write it to a contiguous 4KB scratch area. The remaining steps then deal with the pleasant small-stride scratch area until the final forward radix-16 step writes it back to the FFT data area with horrible strides.

What I call pass 2 over main memory processes chunks of 16^4 * 16 bytes = 1MB. This pass operates out of the L3 cache. Some comments on performance:

1) The L3 cache is a lot slower than the L2 cache. The CPU will not be able to hide the latencies without prefetching into the L1 cache. You've found that prefetching 1KB ahead seems to be optimal for hiding this latency.

2) While you are processing the 1MB chunk, you can prefetch the next 1MB chunk into the L3 cache. You are doing plenty of FPU work to easily hide all main memory latency. Except---

3) I'm worried you may be exceeding the L3 cache's capabilities. If you operate on 1MB of data while prefetching the next 1MB of data and you have 4 cores going -- that's 8MB of data. It probably won't all fit. I also worry about the contention you've got going with 4 cores all placing very heavy read and write demands on the shared L3 cache. Is it possible to exceed the L3 cache's bandwidth capabilities? I don't know.

Here's what I'd recommend trying:

In pass 1, go with 240 x 16 -- 60KB. While processing, prefetch the next 60KB into the L2 cache (spread these prefetches out!). Total demand on the L2 cache is 180+KB (60KB FFT data, next 60KB FFT data, 60KB scratch area, some sin/cos and weights). A tight fit.

In pass 2, go with 16x16x16 -- 64KB. Should fit nicely in the L2 cache since you don't need a scratch area.

That's a lot of work. I hope I'm right and it will be worth the effort. No guarantees.
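To make the scratch-area idea in point 2) above concrete, here is a minimal C sketch of the gather/process/scatter pattern: strided reads into a small contiguous buffer, prefetching the next chunk's strided lines along the way. The layout, R value, and all names are illustrative assumptions, not code from either program.
[CODE]
#include <xmmintrin.h>

#define R 240                        /* leading radix (illustrative) */

/* One "combined 3,4,1" chunk: gather R stride-separated complex doubles
   into a small contiguous scratch area, work there, scatter back.       */
void process_strided_chunk(double *a, int n, int chunk, double *scratch)
{
    int stride = 2*(n / R);                  /* doubles between successive elements */
    double *base      = a + 2*(size_t)chunk;        /* this chunk's first element   */
    double *next_base = a + 2*(size_t)(chunk + 1);  /* next chunk, for prefetching  */

    /* Gather: strided reads from the main array, contiguous writes to scratch.
       Simultaneously prefetch the next chunk's strided cache lines so they
       are in-cache by the time that chunk is processed (prefetches past the
       end of the array are harmless).                                         */
    for (int j = 0; j < R; j++) {
        _mm_prefetch((const char *)(next_base + (size_t)j*stride), _MM_HINT_T0);
        scratch[2*j]     = base[(size_t)j*stride];
        scratch[2*j + 1] = base[(size_t)j*stride + 1];
    }

    /* ... inverse radix pass, carries, forward radix pass, all on the small
       contiguous scratch area, which stays resident in L1/L2 ...             */

    /* Scatter: contiguous reads, strided writes back to the main array.      */
    for (int j = 0; j < R; j++) {
        base[(size_t)j*stride]     = scratch[2*j];
        base[(size_t)j*stride + 1] = scratch[2*j + 1];
    }
}
[/CODE]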
#27
∂²ω=0
Sep 2002
República de California
10110111101011₂ Posts
George, many thanks for the thoughtful, detailed reply.
Quote:
Quote:
However, note that for radices like 240 which decompose into 2 coprime factors (240 = 15*16 = odd * power-of-2, but could also be the product of 2 coprime odds, or odd1 * (odd2 * power-of-2) with odd1, odd2 coprime), I typically use 2 such contiguous local stores, due to the index-permutation flexibility demanded by my twiddleless implementation of such coprime radix pairs. So e.g. pass 1 reads from large-stride-separated main-array locations and writes to store 1; pass 2 reads from store 1 and writes to store 2. But I still think the main-array step is the crucial one here, as you note below.
Quote:
Quote:
The carry macros also operate on the last-used scratch area. Carries also need auxiliary data - DWT weights for Mersenne-mod, and complex roots of (-1) for the "acyclic twisting" scheme used for moduli like Fermats with a "+" sign (similar to the FFT twiddles, but those are (n/2)th roots of +1, whereas the twisting needs the (n/2)th roots of -1, i.e. the first n/2 - upper-complex-half-plane - (n)th roots of +1).

I use a 2-table multiply scheme for on-the-fly generation of both FFT twiddles and IBDWT weights. As I noted to you some years ago, for the IBDWT weights w_0,...,w_(n-1) I make use of the identity w_j * w_(n-j) = 2 to allow a single array to encode both the forward and inverse weights, at the price of some slightly-less-cache-friendly reverse-running indexing for the inverse-weights step. My 2-table schemes have each pair of tables within a factor of 2 of each other in size; thus I have a pair of O(sqrt n)-sized tables of complex data for FFT twiddles, another such pair for acyclic-twisting weights in Fermat-mod mode, and a pair of similar-sized real-data tables for DWT weights in Mersenne-mod mode. Note that only one such pair of data tables competes for L1 cache with actual "live" transform (i.e. residue-related) data in any phase of the overall FFT-modmul scheme, though: the fused steps [1,3,4] need either IBDWT weights or acyclic-twisting weights depending on the kind of modulus; step [2] needs just the 2 subtables used to generate the complex-FFT twiddles.

Aside: For non-power-of-2 FFT lengths (this case was not considered in the now-famous 1994 Crandall/Fagin DWT paper) the Fermat-mod scheme also needs Mersenne-style IBDWT weights in addition to the acyclic-twisting weights, but the IBDWTs are memory-negligible here, because for an FFT length of form odd * 2^m the IBDWT weights cycle with sequence length (odd), and it makes little sense to use odd-radix components much larger than 15. (The largest such I have is 63, which will be used specifically for the Fermat numbers F32 and F33, where an FFT length of form 15 * 2^m will no longer suffice due to ROE growth - I will use FFT length 63 * 2^m for one of the 2 independent tests, and a power-of-2 length 2^(m+6) for the other.)
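As a rough illustration of the two ideas just described - building a twiddle from a pair of O(sqrt n) tables with one complex multiply, and reusing the forward IBDWT weight table for the inverse weights via w_j * w_(n-j) = 2 - here is a small C sketch. The table names, sizes and layout are my own illustrative assumptions, not Mlucas's actual data structures.
[CODE]
#include <stdlib.h>
#include <math.h>
#include <complex.h>

/* Two-table twiddle generation: write j = j_hi*T + j_lo with T ~ sqrt(n);
   then the j-th n-th root of unity is tw_hi[j_hi] * tw_lo[j_lo], so two
   tables of total size ~2*sqrt(n) replace an n-entry twiddle array.      */
static double complex *tw_hi, *tw_lo;
static int T;                                    /* low-table size */

void init_twiddle_tables(int n)
{
    T = 1; while (T*T < n) T <<= 1;              /* power of 2 near sqrt(n) */
    tw_lo = malloc((size_t)T         * sizeof *tw_lo);
    tw_hi = malloc((size_t)(n/T + 1) * sizeof *tw_hi);
    for (int j = 0; j < T; j++)    tw_lo[j] = cexp(-2.0*M_PI*I*j/n);
    for (int j = 0; j <= n/T; j++) tw_hi[j] = cexp(-2.0*M_PI*I*((double)j*T)/n);
}

static inline double complex twiddle(int j)      /* one complex mul per use */
{
    return tw_hi[j / T] * tw_lo[j % T];
}

/* IBDWT weights: with w_j * w_(n-j) = 2 (and w_0 = 1), the forward-weight
   table w[] also yields the inverse weights, read in reverse order:
       1/w_j = w_(n-j) / 2   for 0 < j < n.                                */
static inline double inv_weight(const double *w, int n, int j)
{
    return (j == 0) ? 1.0 : 0.5 * w[n - j];
}
[/CODE]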
Quote:
Quote:
Quote:
For the coming month or so, I'll probably continue with my ongoing work of FMA-izing the remaining key AVX code macros which need it - I don't expect huge gains from that on Haswell, but reliable gains, and hopefully even better on Broadwell when it comes out - and do some large-stride-related prefetch experiments as a side project. Will let you know what I find as results become available.
#28 |
∂²ω=0
Sep 2002
República de California
11755₁₀ Posts
I just added some prefetcht1 calls to my AVX Fermat-mod carry macro, such that during processing of the 240 (or more generally R, where R is the leading radix) carries within the segmented main data array - which is not directly in the picture during carry processing, as this is all done using a small contiguous local storage area as described above - prefetches-into-L1 of the data at the R distinct n/R-stride-separated main-array locations get done while the carries are processed. The carry macros seem the logical place to do this prefetching: even though they do not otherwise access the main array in any way, they make it easy to propagate such prefetches to all the various leading-radix/carry routines I have.
For these experiments I added code such that the bytes-ahead offset for the prefetches is set via a compile-time command-line define, and played with various values: 64 is the obvious candidate in AVX mode, since an "AVX complex double" consists of 2 adjacent quartets of doubles (re and im), thus "fetch the next complex datum at each array location" means fetch 64 bytes ahead of the current data addresses. In practice, though, 32 bytes ahead proves a smidge faster (I also tried other values, like 16 and 128, all of which were clearly worse). Instant 4% speedup (95 ms/iter --> 91 ms/iter for my ongoing F29 run), the clearest timing "signal" I have ever seen in all my playing-with-prefetch. Thanks, George! Hopefully this is the start of a trend - still lots of other things to play with here, as you lay out above.
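A bare-bones C sketch of this carry-step prefetch pattern, for the curious. The real code is an AVX assembly macro; PFETCH_DIST here models the compile-time define mentioned above, and all other names and strides are illustrative assumptions.
[CODE]
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T1 */

#ifndef PFETCH_DIST
#define PFETCH_DIST 32   /* bytes-ahead offset; 32 proved best in the AVX tests above */
#endif

/* Propagate the R carries using only the small contiguous local store,
   but issue one prefetcht1 per carry at the corresponding n/R-stride-
   separated main-array address, PFETCH_DIST bytes ahead of it.         */
void fermat_carry_pass(double *local_store, const double *a, int n, int R)
{
    const int stride = 2*(n / R);            /* doubles between chunk starts */

    for (int j = 0; j < R; j++) {
        _mm_prefetch((const char *)(a + (size_t)j*stride) + PFETCH_DIST, _MM_HINT_T1);

        /* ... propagate carry j through local_store[...] here; the main
           array a[] is never read or written inside this loop.          */
    }
}
[/CODE]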
#29 |
∂²ω=0
Sep 2002
República de California
10110111101011₂ Posts
First Pépin test of F29 completed late Wednesday - took slightly under 20 months at FFT length 30M. Start-of-run per-iteration time on my 3.3 GHz Haswell 4670 quad (running at stock) was ~115 ms; several rounds of AVX2/FMA-oriented optimizations brought this down to ~90 ms, which was representative of roughly the final 80% of the iterations.
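For readers following along, the Pépin test referred to here is the standard primality criterion for Fermat numbers:

$$F_m = 2^{2^m}+1 \text{ is prime} \iff 3^{(F_m-1)/2} \equiv -1 \pmod{F_m},$$

so for F29 the run amounts to $2^{29}-1$ (roughly 537 million) modular squarings of 3, each performed via the FFT-based multiply discussed above.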
On to run #2 (@ FFT length 32M) ... our own Mike/Xyzzy was kind enough to run the first ~25 Miter @ 32M for me over the last several months on his 2-core Haswell NUC, so I am proceeding from where that run left off.

[UPDATE] I was going to post the final checkpoint Res64 and the traditional Selfridge-Hurwitz residues, along with some notes on run #2, after letting the latter run for a day or two on the same quad-Haswell system, but it has already turned up a 'bonk!' between iterations 25M and 26M. Not yet 100% sure whether the error is in my new run or the old one, but I suspect the latter, because that entire run was 'fragile' and prone to crashes at random intervals ranging from an hour to several months. Said crashes were accompanied by either 'fatal roundoff error = 0.5' or 'nonzero exit carry' errors indicating some kind of data corruption, whereas the FFT length 32M run has been clean of any untowardness. I further note that the 30M run used a fast but 'preliminary' binary of my v14.1 code release ... when I compared that to the eventual stabilized release code it was 2-3% faster at the FFT length in question, so I continued using it. Big mistake, as I now suspect the aforementioned 'noisy' fatal bad-data errors were likely accompanied by silent fails ... note the residue discrepancy between the 30M and 32M runs appears at a checkpoint where there are no signs of 'distress' from either run.

So, likely I am looking at a massive waste of compute effort ... lesson learned: never again do such a run-pair in serial mode, always do the 2 side-by-side on systems yielding speeds as similar as possible. To that end, I am looking for a volunteer with a comparably fast system to take on the second run.

Note that should you choose to accept this mission, it will be a significant runtime commitment - around 18 months and 0 chance of a prime - although should you change your mind (or blow up your system) at some point along the way you can hand things off to a fresh recruit (even if your savefiles get nuked, since we will be cross-checking those vs mine). But you will be contributing to a small piece of math history, and of course getting a grateful 'thanks!' in any publication that may eventually result.
#30 |
Bemusing Prompter
"Danny"
Dec 2002
California
5·499 Posts
Just curious, did the system have ECC memory?
#31 |
Aug 2002
43×199 Posts
None of his (or ours) has ECC memory.
#32 |
Undefined
"The unspeakable one"
Jun 2006
My evil lair
2·3,343 Posts
An 18-month test without ECC RAM!
#33 |
"Ram Shanker"
May 2015
Delhi
2616 Posts
I envy these algorithms!