#25
storm5510 (Random Account)
Aug 2009
Not U. + S.A.
3·953 Posts

L1 = 10 × 32K instruction, 10 × 32K data. L2 = 10 × 1MB. L3 = 1 × 13.75MB. I got this from PCPartPicker.

Last fiddled with by storm5510 on 2017-06-25 at 06:16
|
|
|
|
|
#26
ewmayer (∂²ω=0)
Sep 2002
República de California
2²·2,939 Posts

Pro: The extra bits used for the modular transform yield a disproportionate reduction in overall transform length. E.g. a 60-bit prime modulus gives 30 more bits per input word, so 128 bits per hybrid transform element (half for the floating-double-FFT word, half for the modular-transform word) increases our allowable input size from ~18 bits to 48 bits; thus our total memory is (128/64)·(18/48) = 0.75 that of the pure floating-double-FFT approach.

Con: The modular transform needs extra work to keep the intermediate data modded, and the carry step is more complex due to the cost of the hybrid-data error correction.

Con: There is no vector-SIMD architecture supporting 64×64 → 128-bit integer MUL at present, so even using the latest AVX-512 extensions we are limited to a 52-bit modulus, which - assuming we can find a suitable one to support the roots of 1 needed by the FFT and the roots of 2 needed for the IBDWT - increases our allowable input size from ~18 bits to 44 bits; thus our total memory is (128/64)·(18/44) = 0.82 that of the pure floating-double-FFT approach.

(I apologize for getting rather far out into the weeds, as it were.)
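A minimal Python sketch of that memory arithmetic, using the bit counts from the post (the function name and parameters are illustrative, not from any existing code):

[code]
def hybrid_memory_ratio(bits_float=18, bits_hybrid=48,
                        bytes_float=8, bytes_hybrid=16):
    """Memory of the hybrid float/int transform relative to a pure
    double-precision FFT: each hybrid element is twice as large but
    carries more input bits, so the transform needs fewer elements."""
    return (bytes_hybrid / bytes_float) * (bits_float / bits_hybrid)

print(hybrid_memory_ratio())                 # 60-bit modulus: 0.75
print(hybrid_memory_ratio(bits_hybrid=44))   # 52-bit modulus: ~0.82
[/code]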
|
|
|
|
|
|
#27
Mysticial
Sep 2016
380₁₀ Posts

With double-precision you're more or less limited to ~17 bits/point. That comes out to a memory efficiency of 17/64 = 27%.

With double-double, you have 107 bits of precision, which allows for maybe ~43 bits/point. That comes out to a memory efficiency of 43/128 = 34%. This makes it 25% faster than simple double-precision FFTs, assuming the exact same memory access pattern, in an environment where memory is such a large bottleneck that computation is negligible.

This is getting into the same territory as the y-cruncher project in the trillion-digit range, where you're running off of disk and the fastest algorithm is the one with the smallest memory footprint, regardless of how computationally slow it is. henryzz and I are not saying we are at that point yet for LL testing.

Double-double arithmetic is still around 7-10x slower than standard double-precision arithmetic. So we'd probably need the compute/memory gap to increase by another factor of 4x or so before we cross that threshold. And this is where it's worth researching other "slow but small" approaches like your DP-FFT/M61-NTT hybrid algorithm. Since you mention that 64-bit multiply is slow, it's not obvious which approach is actually faster.

Last fiddled with by Mysticial on 2017-06-25 at 06:55
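For context, double-double arithmetic is built from error-free transformations of ordinary doubles; here is a minimal Python sketch of the standard Knuth two-sum and Dekker two-product building blocks (illustrative only, not y-cruncher's implementation):

[code]
def two_sum(a, b):
    # Knuth's error-free addition: a + b == s + e exactly
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def split(a):
    # Dekker's splitting: a == hi + lo, each half fitting in ~26 bits
    t = 134217729.0 * a  # multiplier is 2^27 + 1
    hi = t - (t - a)
    return hi, a - hi

def two_prod(a, b):
    # Error-free multiplication: a * b == p + e exactly
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

# A double-double value is an unevaluated (hi, lo) pair; even a single
# multiply expands into many float ops, which is where the 7-10x comes from.
p, e = two_prod(2.0**27 + 1.0, 2.0**27 - 1.0)  # exact product is 2^54 - 1
print(p, e)  # 1.8014398509481984e+16 -1.0: e recovers the dropped bit
[/code]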
|
|
|
|
|
|
#28
henryzz (Just call me Henry, "David")
Sep 2007
Liverpool (GMT/BST)
3×23×89 Posts

Assuming the speed of an FFT is O(n) rather than O(n log n), we already have potential for 2.53x (due to the reduction in FFT length) × 2x (due to AVX-512 rather than AVX2), plus the fact that it doesn't take a top CPU for the AVX2 code to be memory-limited (probably another 1.5x or more with Skylake-X or server chips?). That hand-wavy arithmetic comes to around a 7.5x potential increase in CPU throughput, which is close to the 7-10x. I hope I am not double-counting anything.

@Mysticial: Where did you get your 43-bit value from? The calculations change quite a bit if this changes. It does ring a bell that there is a formula for this.

edit: Had an idea while writing this. Apologies for it being a bit of a jumble to read.

We are suggesting a larger base for our FFT. This is expensive at small sizes, due to the faster algorithms requiring large numbers for the runtime order to cancel out the constant slowdown, but increasing the size of the base is cheap at larger sizes, since the increase in the number of bits reduces the number of words. Once the base gets to FFT-length size, doubling the size does little more than double the cost, due to the O(n log n) runtime. However, we gained more than 2x bits per word by doubling our word size above (as we keep doing this, I think the number of bits for a word of twice the length would get closer to doubling). We may gain some more memory efficiency this way.

For clarity, what I am suggesting here is a 1K-length FFT made up of words of length 1K. Arithmetic on the words would also be done using an FFT. Is this what is meant by the two passes in Prime95's FFT (1K and 4K make a 4M FFT)? I suppose this only makes sense as long as the sub-FFT length remains large enough. I assume 3-pass is some way off being effective.

Last fiddled with by henryzz on 2017-06-25 at 08:52
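On the "two passes" question: a length N = N1·N2 FFT can indeed be computed as N1 FFTs of length N2, a twiddle multiplication, then N2 FFTs of length N1 (Bailey's four-step scheme), which has the same shape as the 1K×4K split mentioned above. A minimal numpy sketch of that decomposition (purely illustrative; Prime95's actual passes also involve cache-blocking details):

[code]
import numpy as np

def four_step_fft(x, N1, N2):
    """DFT of length N1*N2 via N1 row FFTs of length N2, a twiddle
    multiply, then N2 column FFTs of length N1."""
    A = np.asarray(x, dtype=complex).reshape(N2, N1).T  # A[n1, n2] = x[n1 + N1*n2]
    A = np.fft.fft(A, axis=1)                           # pass 1: FFTs over n2
    n1 = np.arange(N1).reshape(N1, 1)
    k2 = np.arange(N2)
    A *= np.exp(-2j * np.pi * n1 * k2 / (N1 * N2))      # twiddle factors
    A = np.fft.fft(A, axis=0)                           # pass 2: FFTs over n1
    return A.reshape(N1 * N2)                           # X[k2 + N2*k1] = A[k1, k2]

# Toy check; a 1K x 4K split of a 4M-point FFT has exactly this structure.
x = np.random.rand(8 * 16)
assert np.allclose(four_step_fft(x, 8, 16), np.fft.fft(x))
[/code]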
|
|
|
|
|
|
#29
Mysticial
Sep 2016
380₁₀ Posts

For length 2^24 and 17 bits, assuming destructive cancellation, we get: 17·2 + 24/2 = 46 bits, or about 7 bits of safety for round-off and "unlucky" large coefficients.

Using 43 bits, also with length 2^24, we get: 43·2 + 24/2 = 98 bits, or about 9 bits of safety against the 107 bits available via double-double.

------

More memory efficiency comparisons:
Last fiddled with by Mysticial on 2017-06-25 at 08:57
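A quick Python sketch of that margin arithmetic, using the rule of thumb from the post (each product doubles the bits per point, and the convolution adds roughly half of log2(N) bits of growth; the function name is mine):

[code]
def fft_safety_bits(bits_per_point, log2_length, precision_bits):
    """Round-off safety margin for a length-2^log2_length FFT multiply:
    precision minus (2 * bits per product + log2(N)/2 bits of growth)."""
    return precision_bits - (2 * bits_per_point + log2_length / 2)

print(fft_safety_bits(17, 24, 53))    # double:        7.0 bits
print(fft_safety_bits(43, 24, 107))   # double-double: 9.0 bits
[/code]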
|
|
|
|
|
|
|
|
|
|
|
|
#31
storm5510 (Random Account)
Aug 2009
Not U. + S.A.
B2B₁₆ Posts

All of this has made for a lot of interesting reading. The first thing I think of when I see something like this is how I will use it. It seems clear that the majority of these new releases target high-end servers for large businesses supporting hundreds, or even thousands, of workstations.

It has been my very recent experience that Prime95, at its most efficient, uses half of what is available. Use less, and it becomes a waste; use more, and it creates a bottleneck. Iteration time doubles in either situation. The only one in the group that I would be interested in is the Kaby Lake-X: 4 cores/8 threads running at 4.5 GHz. Anything beyond that, not likely.

Now, if you have an application that can run multiple instances, then that changes the story - each instance in a worker-helper scenario, for example. On the Skylake-X, this would be 18 instances, if one wanted to run full-bore. I suspect that huge amounts of RAM would be needed. In the context of this project, I am not aware of any Windows GUI or console programs that can be set up to run this way. Linux may be a different case, as I have had little experience with it.

With these processors releasing tomorrow, some here might be more interested in how they are being applied than anything else: what is your application, and how do you have it configured? Things like this.
|
|
|
|
|
|
#32
Bamboozled!
"𒉺𒌌𒇷𒆷𒀭"
May 2003
Down not across
2×17×347 Posts

Indeed.
I'm currently reading a book called Massively Parallel Computing, written 30 years ago. Exactly the same discussion about memory bandwidth and latency, floating-point precision, general usability of processors, etc. occurs there. The only difference is that the microprocessors of the day, exemplified by the 68020 and 80386, were in the few-MIPS range (few GIPS today) and the vector processors were in the few-GIPS range (few TIPS today).

Plus ça change.
|
|
|
|
|
#33
ewmayer (∂²ω=0)
Sep 2002
República de California
2²·2,939 Posts

Quite possibly - the point I was trying to make was that the hybrid float/int approach gets you as many or more extra bits per datum as double-double, but with far less arithmetic penalty - though of course it requires good integer MUL support. The 52×52 → 104-bit vector MUL (actually FMA) in the AVX-512 IFMA extensions is getting us close to extreme interestingness, though it still falls a couple of product bytes short of what is needed for mod-M61 arithmetic. If said extensions are supported by the i9 series (and if so, also by the new AVX-512-capable i7s?), it might be worth spending some time in search of a suitable 50-52-bit modulus.
Last fiddled with by ewmayer on 2017-06-25 at 21:01 |
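A toy Python scan for candidate moduli of the kind that last sentence suggests: primes p = c·2^k + 1 just below 2^52 support NTTs of any power-of-two length up to 2^k, since 2^k divides p - 1 and the required roots of unity exist. (A sketch only - it assumes sympy for primality testing, and whether suitable roots of 2 for the IBDWT also exist needs a separate per-prime check.)

[code]
from sympy import isprime  # assumed dependency for primality testing

def find_candidate_moduli(bits=52, k=32, count=5):
    """Scan downward for primes p = c*2^k + 1 below 2^bits.
    Since 2^k divides p - 1, NTTs of any power-of-two length up
    to 2^k have the roots of unity they need."""
    found = []
    c = (1 << (bits - k)) - 1  # largest multiplier keeping p < 2^bits
    while c > 0 and len(found) < count:
        p = c * (1 << k) + 1
        if isprime(p):
            found.append(p)
        c -= 1
    return found

print([hex(p) for p in find_candidate_moduli()])
[/code]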
|
|
|