#12
Tribal Bullet
Oct 2004
3²×5×79 Posts
Nice to hear!
I just saw your previous post about the 1 MB limit to performance. Could it be a limit on the number of TLB entries on your machine? A single translation-lookaside-buffer entry speeds up access to a single 4 kB page, so if your machine's TLB has 128 entries and you pull in data from more than 128 widely separated addresses, you will probably see noticeable slowdowns. A TLB miss usually costs much less than an access from RAM, but if you would otherwise be running from L2 the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches; the caches are huge now, and the TLB is logic sitting in the critical path of every memory access.

The good news on Linux is that you can prove whether this is happening by allocating a RAM-backed partition of type 'hugetlbfs'. Accesses inside that partition are backed by large-page TLB entries, which on x86 cover 2 MB or 4 MB each. If your code speeds up drastically in that case, you know it's TLB thrashing. The bad news is that you can't do much about it: the TLB size is a fundamental machine limit, and the only way to speed up is to respect that limit. The worse news is that I have no idea how to do the test :)
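For concreteness, here is one way the suggested test could be set up on Linux without a hugetlbfs mount at all: a minimal sketch, assuming a kernel with MAP_HUGETLB support and enough 2 MB huge pages reserved (e.g. via /proc/sys/vm/nr_hugepages); it is not how Mlucas or Prime95 actually allocate memory. It times the same strided walk over a normal 4 kB-page buffer and over a huge-page-backed one; a large gap between the two would point to TLB thrashing.

```c
/* huge_tlb_test.c -- minimal sketch of a large-page vs. small-page TLB test.
 * Requires huge pages to be reserved first, e.g.:
 *   echo 128 > /proc/sys/vm/nr_hugepages
 * Build: gcc -O2 huge_tlb_test.c
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define BUFBYTES (256UL << 20)   /* 256 MB working set                         */
#define STRIDE   (1UL << 16)     /* 64 kB stride: every access hits a new 4 kB page */

static void touch(volatile double *p, size_t n, size_t stride)
{
    struct timespec t0, t1;
    double sum = 0.0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 100; rep++)
        for (size_t i = 0; i < n; i += stride / sizeof(double))
            sum += p[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("elapsed: %.3f s (checksum %g)\n",
           (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec), sum);
}

int main(void)
{
    size_t n = BUFBYTES / sizeof(double);

    /* Ordinary 4 kB pages. */
    double *small = mmap(NULL, BUFBYTES, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* 2 MB huge pages -- each TLB entry now covers 512x more data. */
    double *huge  = mmap(NULL, BUFBYTES, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (small == MAP_FAILED || huge == MAP_FAILED) { perror("mmap"); return 1; }

    memset(small, 1, BUFBYTES);
    memset(huge,  1, BUFBYTES);

    printf("4 kB pages:  "); touch(small, n, STRIDE);
    printf("2 MB pages:  "); touch(huge,  n, STRIDE);
    return 0;
}
```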
#13
∂²ω=0
Sep 2002
República de California
5·2,351 Posts
Thanks for the commentary and suggestions - I considered that the effect might be a TLB issue rather than L2 cache per se, but as you note, saw no easy way to test it. Perhaps George can provide some useful insight here, as I know he did much wrestling with TLB issues in days of yore.
From my perspective, since I now have a nice unified-code methodology for implementing large-radix leading-pass DFTs (the ones that wrap the carry phase, as noted above) - most of the code aside from the DFT cores is common across radices - absent a simpler direct-fiddle-with-TLB-usage fix I will simply add larger-radix DFTs as needed to keep the per-thread data chunk sizes below 1 MB. Note that on pre-Haswell chips this is closely tied to the compact-object-code issue, because previously the L1 I-cache overflows were negating the chunksize-beneficial effects of the larger radices.
#14
P90 years forever!
Aug 2002
Yeehaw, FL
1111111101000₂ Posts
Quote:
In pass 1, where we are using a large stride, Prime95 reads the scattered data from RAM and copies it to a scratch area; the various FFT steps and the normalization are done in that contiguous scratch area, where TLBs are not a problem, before the final results are copied back to the scattered RAM locations. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large-page FFTs on Windows and did not observe a measurable speed difference. Now, if Ernst is trying to do the entire large-stride pass 1 in place, then the TLB costs could be much more significant.
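A bare-bones illustration of the pass-1 pattern described here - the function names (pass1_block, process_block), stride and block shapes are invented for the sketch, not Prime95's actual code: gather the strided elements into a small contiguous scratch buffer, do all the FFT/normalization work there, then scatter the results back, so only the gather and scatter steps touch many distinct pages.

```c
#include <stddef.h>

/* Stand-in for the FFT + normalization steps; all work on contiguous data. */
static void process_block(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] *= 2.0;
}

void pass1_block(double *main_array, size_t stride, size_t blklen, double *scratch)
{
    /* Gather: one touch per widely separated location in the big array. */
    for (size_t i = 0; i < blklen; i++)
        scratch[i] = main_array[i * stride];

    /* All the heavy work stays inside the small contiguous scratch area,
     * which is covered by only a handful of TLB entries.                 */
    process_block(scratch, blklen);

    /* Scatter the results back to the strided locations. */
    for (size_t i = 0; i < blklen; i++)
        main_array[i * stride] = scratch[i];
}
```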
#15
∂²ω=0
Sep 2002
República de California
5·2,351 Posts
Quote:
I can see why the local-array aspect of the DFT/carry/DFT stage is TLB-insensitive, but what about the scattered-data reads and writes bookending things? One more note of possible relevance: for large radices I've found it expedient for the scalar-double code path to write things back to the scattered main-array locations rather than to local memory. Interestingly, I see no significant performance hit from doing so, contrary to my expectations. (This is not an option currently for the SIMD builds, since those use carry macros which assume the data are in contiguous memory.)
#16
Jan 2008
France
1124₈ Posts
Quote:
#17
"Marv"
May 2009
near the Tannhäuser Gate
2²·3·67 Posts
If one had the time, one could download Intel's excellent utility, Vtune, to determine what exactly is going on. It's free for 30 days. Quite a few years ago, for a project, I had the chance to use an early version of it and was VERY impressed; it was quite amazing to actually look inside the CPU to see what's going on and make changes to improve response rather than taking the trial-and-error approach. Although there are lots of features requiring some dedicated time to learn, I also remember it had some "quick start" modes to get results quickly. I believe there are some YouTube videos that demonstrate its features.
#18
∂²ω=0
Sep 2002
República de California
5×2,351 Posts
#19
"Marv"
May 2009
near the Tannhäuser Gate
1444₈ Posts
It says it doesn't. With Visual Studio being so prevalent on Windoze boxes, it makes sense that they wouldn't limit it.
In examining its new features, one I found VERY interesting (an improvement since the time I used it) is that you can collect data from remote machines by running a collection piece on the other machine, and that piece doesn't require a separate license. Previously, if you were targeting 3 different Intel CPU architectures, you needed a license for each machine or an expensive floating license. Now, it looks like 1 will do the trick. Very cool. BTW, its name is now Vtune Amplifier XE 2013. Sheesh, I'll never understand marketing types.
#20
∂²ω=0
Sep 2002
República de California
5×2,351 Posts
Second run of F28, at FFT length 16384 kdoubles, finished today, and the result matches the one at 15360k I posted last November in this thread. The timings in the attached checkpoint-summary-status file reflect the first part of the run being done using leading radix r0 = 64 (yielding a per-thread dataset size of ~2 MB), then switching to r0 = 128 (1 MB/thread, right at the 1 MB threshold which is critical on Intel x86) and later to r0 = 256 (~0.5 MB/thread) as the new code for those radices became available.
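As a back-of-the-envelope check (my own arithmetic, not a statement about how Mlucas actually partitions its arrays), the per-thread figures quoted above are consistent with a simple (N / r0) × 8-byte estimate, where N is the FFT length in doubles and r0 the leading radix:

```c
#include <stdio.h>

int main(void)
{
    const unsigned long N    = 16777216UL;          /* 16384K FFT length, in doubles */
    const unsigned long r0[] = { 64, 128, 256 };    /* leading radices tried for F28 */

    for (int i = 0; i < 3; i++)
        printf("r0 = %3lu -> ~%lu KB per-thread chunk\n",
               r0[i], N / r0[i] * 8 / 1024);        /* prints 2048, 1024, 512 KB */
    return 0;
}
```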
The summaries for the first few and last few checkpoints and the final Selfridge-Hurwitz residues are:

    F28: using FFT length 16384K = 16777216 8-byte floats.
    this gives an average 16.000000000000000 bits per digit
    Using complex FFT radices 64 16 16 16 32
    [Nov 26 09:00:00] F28 Iter# = 10000 clocks = 00:00:00.000 [ 0.0695 sec/iter] Res64: 0F38D26B35C6794C. AvgMaxErr = 0.010993669. MaxErr = 0.013671875
    [Nov 26 09:11:38] F28 Iter# = 20000 clocks = 00:00:00.000 [ 0.0696 sec/iter] Res64: 6E621D1C6A05BC46. AvgMaxErr = 0.011034805. MaxErr = 0.015625000
    ...
    Using complex FFT radices 128 16 16 16 16
    [Dec 14 14:50:10] F28 Iter# = 22690000 clocks = 00:00:00.000 [ 0.0580 sec/iter] Res64: 262012254B592782. AvgMaxErr = 0.010806998. MaxErr = 0.013671875
    [Dec 14 14:59:50] F28 Iter# = 22700000 clocks = 00:00:00.000 [ 0.0579 sec/iter] Res64: 0002AF208FE85365. AvgMaxErr = 0.010797848. MaxErr = 0.013671875
    ...
    Using complex FFT radices 256 8 16 16 16
    [Jan 09 14:55:42] F28 Iter# = 61130000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: F3699600BCBFECE0. AvgMaxErr = 0.011603018. MaxErr = 0.015625000
    [Jan 09 15:05:10] F28 Iter# = 61140000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: 2DD786952E04FC17. AvgMaxErr = 0.011608487. MaxErr = 0.015625000
    ...
    [May 27 04:12:06] F28 Iter# = 268390000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: E529CBABAE1D8B17. AvgMaxErr = 0.011612073. MaxErr = 0.015625000
    [May 27 04:21:42] F28 Iter# = 268400000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 7D195E1B0DD4A93E. AvgMaxErr = 0.011611240. MaxErr = 0.015625000
    [May 27 04:31:17] F28 Iter# = 268410000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 840DBB498F57E870. AvgMaxErr = 0.011617642. MaxErr = 0.015625000
    [May 27 04:40:53] F28 Iter# = 268420000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 64BE2132426CDD1C. AvgMaxErr = 0.011605388. MaxErr = 0.015625000
    [May 27 04:50:28] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.011610274. MaxErr = 0.015625000
    [May 27 04:55:43] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.006339752. MaxErr = 0.015625000
    F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x
    R28 mod 2^36     = 16759471759
    R28 mod 2^35 - 1 = 30476727665
    R28 mod 2^36 - 1 = 9104636564

A tar archive of the 2 checkpoint-summary files and the final bytewise residue file (identical for both runs, thus just 1 copy) in the previously-described format is here (33 MB, so please do not download frivolously - the run-status file alone is in the much smaller attachment below, if you just want the summary data). The MD5 checksum on the extracted bytewise residue file is

    md5sum f28 = 1554face46e633eb2fd773df2647cd2a

F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my non-OCed Haswell 4670K; I'd really like to get that down to ~86 ms (i.e. 1 Miter/day), which would still need 18 months for the complete run, so it may be time to consider

Last fiddled with by ewmayer on 2017-11-15 at 00:21 Reason: Last para: F28 -> R28
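For readers curious how Selfridge-Hurwitz-style residues like those quoted above can be recomputed from a bytewise residue file: a minimal sketch using GMP, assuming the file is a raw little-endian dump of the full residue (that file layout is an assumption for the sketch, not Mlucas' documented savefile format). It reduces the residue modulo 2^36, 2^35-1 and 2^36-1.

```c
/* sh_residues.c -- hedged sketch: compute Selfridge-Hurwitz-style residues
 * of a full residue stored as a raw little-endian bytewise file.
 * Build: gcc -O2 sh_residues.c -lgmp
 */
#include <stdio.h>
#include <stdlib.h>
#include <gmp.h>

int main(int argc, char **argv)
{
    if (argc != 2) { fprintf(stderr, "usage: %s <residue_file>\n", argv[0]); return 1; }

    /* Slurp the raw bytewise residue. */
    FILE *fp = fopen(argv[1], "rb");
    if (!fp) { perror("fopen"); return 1; }
    fseek(fp, 0, SEEK_END);
    long nbytes = ftell(fp);
    rewind(fp);
    unsigned char *buf = malloc(nbytes);
    if (fread(buf, 1, nbytes, fp) != (size_t)nbytes) { perror("fread"); return 1; }
    fclose(fp);

    /* Interpret the bytes as one big unsigned integer, least-significant byte first. */
    mpz_t r, m, t;
    mpz_inits(r, m, t, NULL);
    mpz_import(r, (size_t)nbytes, -1, 1, 0, 0, buf);

    /* r mod 2^36 is just the low 36 bits. */
    mpz_fdiv_r_2exp(t, r, 36);
    gmp_printf("R mod 2^36     = %Zd\n", t);

    /* r mod (2^35 - 1) and r mod (2^36 - 1) via ordinary modular reduction. */
    mpz_ui_pow_ui(m, 2, 35); mpz_sub_ui(m, m, 1);
    mpz_mod(t, r, m);
    gmp_printf("R mod 2^35 - 1 = %Zd\n", t);

    mpz_ui_pow_ui(m, 2, 36); mpz_sub_ui(m, m, 1);
    mpz_mod(t, r, m);
    gmp_printf("R mod 2^36 - 1 = %Zd\n", t);

    mpz_clears(r, m, t, NULL);
    free(buf);
    return 0;
}
```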
#21
∂²ω=0
Sep 2002
República de California
5×2,351 Posts
Quote:
Fwd: 174 FMA [breakdown: 87 FMA, 2 FMS, 85 FNMA, 0 FNMS]
Inv: 174 FMA [a whopping 112 of which have a trivial unity multiplicand], 34 MUL.

Upshot: I probably get the desired FMA speedup in the forward FFT routines restructured thusly, but get little speedup in the iFFT due to the above issues. Partly as a result of the effort expended to achieve this mediocre gain, I did not consider widening my FMA-related optimizations beyond the above 2 "heavy lifting" routines until recently. However, I will henceforth be focusing on more-localized "peephole"-style FMA code opts, which can be done incrementally. For starters I FMAized the other code macros (radix-15 and twiddleless radix-16 assembled into radix-240, and the fused radix-16 final-fwd-FFT/dyadic-square/initial-inv-FFT pass) used in the above set of radices for my ongoing F29 run. That and a half-dozen other miscellaneous 1%-level tweaks have cut the per-iteration timing @30 Mdoubles to 95 msec, which is within 10% of my "stretch goal" of 86 msec. (Still running at stock - if I added a ginormo-cooler and OCed up the wazoo I could probably hit 80 msec with the current code, but where's the challenge in that? :)
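As an aside on what a "peephole"-style FMA opt looks like at the instruction level - a generic sketch in AVX2/FMA3 intrinsics, not Mlucas' actual radix macros: wherever a twiddle multiply feeds straight into a butterfly add or subtract, the mul+add pair collapses into a single FMA, and the subtract case maps onto FNMA.

```c
#include <immintrin.h>

/* One twiddle-multiply + butterfly leg, written two ways.
 * Compile with -mfma; names and shapes are illustrative only. */

/* Pre-FMA form: separate multiply and add, two instructions. */
static inline __m256d twiddle_add_mul(__m256d a, __m256d w, __m256d c)
{
    return _mm256_add_pd(_mm256_mul_pd(a, w), c);   /* c + a*w */
}

/* Peephole-fused form: one fused multiply-add. */
static inline __m256d twiddle_add_fma(__m256d a, __m256d w, __m256d c)
{
    return _mm256_fmadd_pd(a, w, c);                /* c + a*w */
}

/* The subtract leg becomes a fused negative-multiply-add (FNMA). */
static inline __m256d twiddle_sub_fma(__m256d a, __m256d w, __m256d c)
{
    return _mm256_fnmadd_pd(a, w, c);               /* c - a*w */
}
```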
#22
P90 years forever!
Aug 2002
Yeehaw, FL
8168₁₀ Posts
In the radix-16 code, have you tried prefetching the next radix-16 data into the L1 cache?
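A generic sketch of that idea - the block layout, prefetch distance and the do_radix16_dft stand-in are placeholders, not Prime95's or Mlucas' actual scheme: while block j is being processed, issue software prefetches for block j+1 so its data is already in L1 when the butterflies need it.

```c
#include <stddef.h>
#include <immintrin.h>

/* Stand-in for the real radix-16 butterfly macro. */
static void do_radix16_dft(double *x, size_t n)
{
    for (size_t i = 0; i < n; i++) x[i] *= 2.0;
}

void radix16_pass(double *x, size_t nblocks, size_t blkdoubles)
{
    for (size_t j = 0; j < nblocks; j++) {
        double *cur  = x + j * blkdoubles;
        double *next = x + (j + 1) * blkdoubles;

        /* Prefetch the next block, one cache line (8 doubles) at a time. */
        if (j + 1 < nblocks)
            for (size_t k = 0; k < blkdoubles; k += 8)
                _mm_prefetch((const char *)(next + k), _MM_HINT_T0);

        do_radix16_dft(cur, blkdoubles);
    }
}
```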