mersenneforum.org > Math Stuff > Computer Science & Computational Number Theory
2014-03-15, 15:11   #12
jasonp
Nice to hear!

I just saw your previous post about the 1 MB limit to performance. Could it be a limit on the number of TLB entries on your machine? A single translation lookaside buffer (TLB) entry speeds up access to a single 4 kB page, so if your machine has a TLB with 128 entries and you pull in data from more than 128 widely separated addresses, you will probably see noticeable slowdowns (128 entries x 4 kB is only 512 kB of coverage). Usually a TLB miss costs much less time than an access from RAM, but if you would otherwise be running from L2 then the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches; those caches are huge now, and the TLB is logic that sits in the critical path of every memory access.

The good news on Linux is that you can prove whether this is happening by allocating a partition somewhere of type 'hugetlbfs' that is backed by RAM. Accesses inside this partition are backed by large-page TLB entries, which on x86 cover 2 MB or 4 MB each. If your code speeds up drastically in this case, you know it's TLB thrashing. The bad news is that you can't do much about it; the TLB size is a fundamental machine limit, and the only way to speed up is to respect that limit.

The worse news is that I have no idea how to do the test :)
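For concreteness, here is one way such a test might look - a minimal sketch, assuming a Linux box where the admin has mounted a hugetlbfs filesystem and reserved some huge pages beforehand; the mount point, file name, and 2 MB page size are all illustrative assumptions, not a known-working setup:

Code:
/* Sketch of the hugetlbfs experiment described above. Assumes root
 * has previously done something like:
 *   mkdir -p /mnt/huge
 *   mount -t hugetlbfs none /mnt/huge
 *   echo 128 > /proc/sys/vm/nr_hugepages
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGE_PAGE (2UL*1024*1024)   /* assume 2 MB huge pages */

int main(void)
{
    size_t len = 64*HUGE_PAGE;      /* 128 MB working set */
    int fd = open("/mnt/huge/fftbuf", O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* A mapping of a hugetlbfs file is backed by huge pages, so this
       region needs len/2MB TLB entries instead of len/4kB. */
    double *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... re-run the FFT timing with its arrays placed in buf; a big
       speedup vs. ordinary malloc'd memory implicates the TLB ... */

    munmap(buf, len);
    close(fd);
    unlink("/mnt/huge/fftbuf");
    return 0;
}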
2014-03-15, 21:39   #13
ewmayer

Thanks for the commentary and suggestions - I considered that the effect might be a TLB issue rather than L2 cache per se, but as you note, saw no easy way to test it. Perhaps George can provide some useful insight here, as I know he did much wrestling with TLB issues in days of yore.

From my perspective, as I now have a nice unified-code methodology [most code aside from the DFT cores is common across different radices] for implementing large-radix leading-pass DFTs (the ones that wrap the carry phase, as noted above), absent a simpler direct-fiddle-with-TLB-usage fix I will simply add larger-radix DFTs as needed to keep the per-thread data chunk sizes < 1 MB. Note that on pre-Haswell chips this is closely tied to the compact-object-code issue, because on those chips L1 I-cache overflow was negating the chunksize-beneficial effects of the larger radices.
2014-03-16, 03:55   #14
Prime95

Quote:
Originally Posted by jasonp
Could it be a limit on the number of TLB entries on your machine? A single translation lookaside buffer (TLB) entry speeds up access to a single 4 kB page, so if your machine has a TLB with 128 entries and you pull in data from more than 128 widely separated addresses, you will probably see noticeable slowdowns. Usually a TLB miss costs much less time than an access from RAM, but if you would otherwise be running from L2 then the miss penalty should be noticeable. Yes, this means modern x86 machines don't have enough TLB entries to cover their own caches;
I'm not convinced TLB accesses are a problem.

In pass 1, where we are using a large stride, Prime95 reads scattered data from RAM and copies it to a scratch area; the various FFT steps and normalization are done in that contiguous scratch area, where TLBs are not a problem, before the final results are copied back to scattered RAM. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large-page FFTs in Windows and did not observe a measurable speed difference.

Now if Ernst is trying to do the entire large stride pass 1 in-place, then the TLB costs could be much more significant.
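For concreteness, a minimal sketch of that gather/work/scatter pattern - illustrative C only, not Prime95's actual code; the block size and the worker function are made-up stand-ins:

Code:
#include <stddef.h>

#define BLOCK 512   /* elements per pass-1 block; illustrative value */

/* hypothetical stand-in for the FFT steps + normalization */
static void do_fft_steps(double *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] *= 2.0;                   /* placeholder computation */
}

/* Gather BLOCK strided elements into contiguous scratch, do all the
 * heavy work there, then scatter back. The scattered accesses touch
 * BLOCK different pages, but only twice per block; the work itself
 * stays on the few pages covered by scratch[], which remain
 * TLB-resident throughout. */
void pass1_block(double *x, size_t stride, double scratch[BLOCK])
{
    for (size_t i = 0; i < BLOCK; i++)   /* gather: pay the TLB misses once */
        scratch[i] = x[i*stride];

    do_fft_steps(scratch, BLOCK);        /* contiguous, TLB-friendly */

    for (size_t i = 0; i < BLOCK; i++)   /* scatter: pay them once more */
        x[i*stride] = scratch[i];
}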
2014-03-16, 06:07   #15
ewmayer

Quote:
Originally Posted by Prime95
I'm not convinced TLB accesses are a problem.

In pass 1, where we are using a large stride, Prime95 reads scattered data from RAM and copies it to a scratch area; the various FFT steps and normalization are done in that contiguous scratch area, where TLBs are not a problem, before the final results are copied back to scattered RAM. Yes, there are some TLB misses at the start and end, but the cost appears to be insignificant. I've implemented large-page FFTs in Windows and did not observe a measurable speed difference.

Now if Ernst is trying to do the entire large stride pass 1 in-place, then the TLB costs could be much more significant.
No - the way I do things is very much as you describe (holy convergent evolution, Batman): large-regular-stride main-array data go into the radix-R DFT preceding the carry step; outputs of the final-iFFT DFT go into a small block of local storage; after the carry step we read from that, do the initial-fFFT radix-R DFT, and write the results back to the large-stride main-array locations. I've been working to increase the pass 1 radix (R in my notation) to get the ensuing pass 2 (remaining fFFT passes / dyadic squaring / all-but-last iFFT pass) data block sizes below the L2 size, and eventually to something which will fit in L1.

I can see why the local-array aspect of the DFT/carry/DFT stage is TLB-insensitive, but what about the scattered-data reads and writes bookending things?

One more note of possible relevance: for large radices I've found it expedient for the scalar-double code path to write things back to the scattered main-array locations rather than to local memory. Interestingly, I see no significant performance hit from doing so, contrary to my expectations. (This is not currently an option for the SIMD builds, since those use carry macros which assume the data are in contiguous memory.)
2014-03-17, 12:28   #16
ldesnogu

Quote:
Originally Posted by ewmayer
The one annoyance: I get no speedup from the object-code-shrinkage on my Haswell quad. But my code already got a much greater per-cycle boost than p95 on Haswell, which leads me to believe the various hardware improvements [doubled L1 cache sizes, doubled L2-to-L1 bandwidth] on Haswell mitigated the I-cache-overflow issue for me.
Another hypothesis: a much better hardware prefetcher and branch predictor on Haswell.
2014-03-17, 23:49   #17
tServo

Intel VTune

If one had the time, one could download Intel's excellent utility, VTune, to determine exactly what is going on. It's free for 30 days. Quite a few years ago I had the chance to use an early version of it for a project and was VERY impressed; it was quite amazing to actually look inside the CPU to see what's going on and make changes to improve performance, rather than taking the trial-and-error approach. Although it has lots of features requiring some dedicated time to learn, I also remember it had some "quick start" modes to get results quickly. I believe there are some YouTube videos that demonstrate its features.
2014-03-18, 03:39   #18
ewmayer

Quote:
Originally Posted by tServo
If one had the time, one could download Intel's excellent utility, VTune, to determine exactly what is going on.
Does that require an Intel compiler install underneath?
2014-03-18, 13:01   #19
tServo

Quote:
Originally Posted by ewmayer
Does that require an Intel compiler install underneath?
It says it doesn't. With Visual Studio being so prevalent on Windoze boxes, it makes sense that they wouldn't limit it.
In examining its new features, one improvement since the time I used it that I found VERY interesting is that you can collect data from remote machines by running a collection piece on the other machine, with no separate license required for it. Previously, if you had 3 different Intel CPU architectures you were targeting, you needed a license for each machine or an expensive floating license. Now it looks like 1 will do the trick. Very cool.
BTW, its name is now VTune Amplifier XE 2013. Sheesh, I'll never understand marketing types.
2014-05-28, 07:09   #20
ewmayer

The second run of F28, at FFT length 16384 kdoubles, finished today, and the result matches the one for 15360K I posted last November in this thread. The timings in the attached checkpoint-summary-status file reflect the first part of the run being done using leading radix r0 = 64 (a per-thread dataset size of 16777216 doubles x 8 bytes / 64 = 2 MB), then switching to r0 = 128 (1 MB/thread, right at the 1 MB threshold which is critical on Intel x86) and later to r0 = 256 (0.5 MB/thread) as the new code for those radices became available.

The summaries for the first few and last few checkpoints and the final Selfridge-Hurwitz residues are

F28: using FFT length 16384K = 16777216 8-byte floats.
this gives an average 16.000000000000000 bits per digit
Using complex FFT radices 64 16 16 16 32
[Nov 26 09:00:00] F28 Iter# = 10000 clocks = 00:00:00.000 [ 0.0695 sec/iter] Res64: 0F38D26B35C6794C. AvgMaxErr = 0.010993669. MaxErr = 0.013671875
[Nov 26 09:11:38] F28 Iter# = 20000 clocks = 00:00:00.000 [ 0.0696 sec/iter] Res64: 6E621D1C6A05BC46. AvgMaxErr = 0.011034805. MaxErr = 0.015625000
...
Using complex FFT radices 128 16 16 16 16
[Dec 14 14:50:10] F28 Iter# = 22690000 clocks = 00:00:00.000 [ 0.0580 sec/iter] Res64: 262012254B592782. AvgMaxErr = 0.010806998. MaxErr = 0.013671875
[Dec 14 14:59:50] F28 Iter# = 22700000 clocks = 00:00:00.000 [ 0.0579 sec/iter] Res64: 0002AF208FE85365. AvgMaxErr = 0.010797848. MaxErr = 0.013671875
...
Using complex FFT radices 256 8 16 16 16
[Jan 09 14:55:42] F28 Iter# = 61130000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: F3699600BCBFECE0. AvgMaxErr = 0.011603018. MaxErr = 0.015625000
[Jan 09 15:05:10] F28 Iter# = 61140000 clocks = 00:00:00.000 [ 0.0567 sec/iter] Res64: 2DD786952E04FC17. AvgMaxErr = 0.011608487. MaxErr = 0.015625000
...
[May 27 04:12:06] F28 Iter# = 268390000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: E529CBABAE1D8B17. AvgMaxErr = 0.011612073. MaxErr = 0.015625000
[May 27 04:21:42] F28 Iter# = 268400000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 7D195E1B0DD4A93E. AvgMaxErr = 0.011611240. MaxErr = 0.015625000
[May 27 04:31:17] F28 Iter# = 268410000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 840DBB498F57E870. AvgMaxErr = 0.011617642. MaxErr = 0.015625000
[May 27 04:40:53] F28 Iter# = 268420000 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 64BE2132426CDD1C. AvgMaxErr = 0.011605388. MaxErr = 0.015625000
[May 27 04:50:28] F28 Iter# = 268430000 clocks = 00:00:00.000 [ 0.0575 sec/iter] Res64: 360A9E1308A3165A. AvgMaxErr = 0.011610274. MaxErr = 0.015625000
[May 27 04:55:43] F28 Iter# = 268435455 clocks = 00:00:00.000 [ 0.0574 sec/iter] Res64: 468731C3E6F13E8F. AvgMaxErr = 0.006339752. MaxErr = 0.015625000
F28 is not prime. Res64: 468731C3E6F13E8F. Program: E3.0x
R28 mod 2^36 = 16759471759
R28 mod 2^35 - 1 = 30476727665
R28 mod 2^36 - 1 = 9104636564
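Aside, for readers unfamiliar with Selfridge-Hurwitz residues: they are just the full Pepin-test residue reduced mod 2^36, mod 2^35 - 1 and mod 2^36 - 1, giving a cheap cross-check between independent programs. A minimal sketch of computing them from a bytewise residue file follows, assuming (an assumption on my part) the little-endian byte order of the previously-described format; the 8-byte buffer below is a toy stand-in, not the real residue:

Code:
#include <stdint.h>
#include <stdio.h>

/* Fold a little-endian byte string into its value mod m (m < 2^36),
 * via Horner's rule from the most-significant byte downward. */
static uint64_t mod_bytes(const uint8_t *b, size_t n, uint64_t m)
{
    uint64_t r = 0;
    while (n--)
        r = ((r << 8) | b[n]) % m;   /* r = r*256 + byte, mod m */
    return r;
}

int main(void)
{
    /* toy stand-in: just the final Res64 above, little-endian; a real
       check would read the whole bytewise residue file into buf[] */
    uint8_t buf[8] = {0x8F,0x3E,0xF1,0xE6,0xC3,0x31,0x87,0x46};
    size_t n = sizeof buf;
    printf("mod 2^36     = %llu\n",
           (unsigned long long)mod_bytes(buf, n, 1ULL << 36));
    printf("mod 2^35 - 1 = %llu\n",
           (unsigned long long)mod_bytes(buf, n, (1ULL << 35) - 1));
    printf("mod 2^36 - 1 = %llu\n",
           (unsigned long long)mod_bytes(buf, n, (1ULL << 36) - 1));
    return 0;
}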


A tar archive of the 2 checkpoint-summary files and the final bytewise residue file (identical for both runs, thus just 1 copy) in the previously-described format is here (33 MB, so please do not download frivolously - the run-status file alone is in the much smaller attachment below, if you just want the summary data). The MD5 checksum of the extracted bytewise residue file is

md5sum f28 = 1554face46e633eb2fd773df2647cd2a


F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my non-OCed Haswell 4670K; I'd really like to get that down to ~86 ms (i.e. 1 Miter/day), which would still need 18 months for the complete run, so it may be time to consider performance-enhancing drugs, er, overclocking the hardware.
Attached Files
f28_16384k.stat.gz (663.3 KB)

2014-10-10, 04:07   #21
ewmayer

Quote:
Originally Posted by ewmayer
F29 (@30720K) is taking ~110 ms/iter on all 4 cores of my non-OCed Haswell 4670K; I'd really like to get that down to ~86 ms (i.e. 1 Miter/day), which would still need 18 months for the complete run, so it may be time to consider performance-enhancing drugs, er, overclocking the hardware.
My initial stab (summer 2013) at FMA-related speedups was restricted to the intermediate-FFT-pass radix-16 routines, which do roughly half the work in a run like my ongoing F29 one, where I am using complex FFT radices 240,16,16,16,16 (product = 15M, or 30 Mdoubles in real-data terms); the four radix-16 passes are the previously-FMAized ones. That effort was somewhat disappointing, since it yielded an overall speedup of only 4-5% over non-FMA AVX, i.e. perhaps 8-10% for the FMA-optimized code macros (I was hoping for something like 15% or better). Part of the disappointment is due to a very basic aspect of my FFT implementation, namely that I use a post-twiddles complex-multiply scheme for the various passes of the inverse FFT, in which the complex twiddles are applied to the outputs of the DFT used in each iFFT pass. This is nice from a dataflow-symmetry and auxiliary-data-indexing perspective, but is not favorable for an FMA-based speedup: instead of the nice "tangent structure" of the forward, pre-twiddle passes, in which the twiddle MULs can be absorbed into the ADD/SUBs, one ends up in the iFFT with a raft of "trivial" FMAs having one of the two multiplicands = 1 (or one simply retains the original ADD/SUBs those replace), and at the end one must do a batch of pure MULs (wasting the fused mul/add aspect of FMA) as part of the post-twiddling. Here are the resulting opcounts for the forward and inverse radix-16 complex FFT passes:

Fwd: 174 FMA [Breakdown: 87 FMA, 2 FMS, 85 FNMA, 0 FNMS.]
Inv: 174 FMA [a whopping 112 of which have a trivial unity multiplicand], 34 MUL.

Upshot: I probably get the desired FMA-speedup in the forward FFT routines restructured thusly, but get little speedup in the iFFT due to the above issues.
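To make the contrast concrete, here is a radix-2 toy version of the two schemes - illustrative code, not Mlucas source, with made-up names. Writing the twiddle in "tangent" form w = c*(1 + i*t), t = s/c, the pre-twiddle butterfly is FMAs throughout, while the post-twiddle one is left with plain ADD/SUBs plus naked MULs:

Code:
#include <math.h>   /* fma() */

/* pre-twiddle: y0 = x0 + w*x1, y1 = x0 - w*x1, with w = c*(1 + i*t).
 * Every arithmetic op below is a fused multiply-add. */
void bfly_pretwiddle(double *x0r, double *x0i,
                     double *x1r, double *x1i, double c, double t)
{
    double ur  = fma(-t, *x1i, *x1r);            /* Re(x1*(1 + i*t)) */
    double ui  = fma( t, *x1r, *x1i);            /* Im(x1*(1 + i*t)) */
    double y0r = fma( c, ur, *x0r), y0i = fma( c, ui, *x0i);
    double y1r = fma(-c, ur, *x0r), y1i = fma(-c, ui, *x0i);
    *x0r = y0r; *x0i = y0i; *x1r = y1r; *x1i = y1i;
}

/* post-twiddle: y0 = x0 + x1, y1 = w*(x0 - x1), with w = c + i*s.
 * The add/sub half offers nothing to fuse, and the final complex
 * multiply strands one pure MUL per component. */
void bfly_posttwiddle(double *x0r, double *x0i,
                      double *x1r, double *x1i, double c, double s)
{
    double ur = *x0r - *x1r, ui = *x0i - *x1i;   /* plain SUBs */
    *x0r += *x1r; *x0i += *x1i;                  /* plain ADDs */
    *x1r = fma(c, ur, -s*ui);                    /* -s*ui is a naked MUL */
    *x1i = fma(c, ui,  s*ur);                    /*  s*ur is a naked MUL */
}

The same dichotomy, scaled up to radix-16, is where the 112 trivial-multiplicand FMAs and 34 pure MULs in the inverse-pass opcount above come from.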

Partly as a result of the effort expended to achieve this mediocre gain, I did not consider widening my FMA-related optimizations beyond the above 2 "heavy lifting" routines until recently. However, I will henceforth be focusing on more-localized "peephole"-style FMA code opts, which can be done incrementally and in localized fashion. For starters I FMAized the other code macros (radix-15 and twiddleless radix-16, assembled into radix-240, and the fused radix-16 final-fwd-FFT/dyadic-square/initial-inv-FFT pass) used in the above set of radices for my ongoing F29 run. That and a half-dozen other miscellaneous 1%-level tweaks have cut the per-iteration timing @30 Mdoubles to 95 msec, which is within 10% of my "stretch goal" of 86 msec.

(Still running at stock - if I added a ginormo-cooler and OCed up the wazoo I could probably hit 80 msec with the current code, but where's the challenge in that? :)
2014-10-10, 14:45   #22
Prime95

In the radix-16 code, have you tried prefetching the next radix-16 data into the L1 cache?
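For reference, a minimal sketch of that kind of software prefetch - illustrative only; the block layout, prefetch distance, and DFT worker are all assumptions (_MM_HINT_T0 asks for the line in all cache levels, including L1):

Code:
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

/* trivial stub standing in for the real radix-16 DFT on one block */
static void radix16_dft(double *blk, size_t n)
{
    for (size_t i = 0; i < n; i++)
        blk[i] += 1.0;
}

/* Process blocks in order, hinting block b+1 toward the caches while
 * the FPU is busy with block b; one prefetch per 64-byte line. Real
 * code would interleave the prefetches with the DFT arithmetic. */
void process_all(double *x, size_t nblocks, size_t blk_doubles)
{
    for (size_t b = 0; b < nblocks; b++) {
        double *cur = x + b*blk_doubles;

        if (b + 1 < nblocks) {
            const char *next = (const char *)(cur + blk_doubles);
            for (size_t i = 0; i < blk_doubles*sizeof(double); i += 64)
                _mm_prefetch(next + i, _MM_HINT_T0);
        }
        radix16_dft(cur, blk_doubles);
    }
}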