mersenneforum.org  

Old 2017-06-25, 03:44   #23
Mysticial
 
Quote:
Originally Posted by ewmayer View Post
Don't hold your breath for hardware-supported 128-bit float ('X-float') ... market for that is exceedingly small, and transistor count roughly 4x that of floating-double, based on the needed hardware multiplier.

Re. memory bandwidth ... what are the L2/3 caches on the i9s? I'm wondering if the total cache shared by (say) 10 i9 hardware cores is large enough to fit the ~32MB needed for an LL-test residue vector @4M FFT length.
I think he's talking about double-double arithmetic with 107 bits of precision, not quad-precision floats with 113 bits. The latter isn't happening for the foreseeable future. But the former is doable efficiently with current hardware.
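
For anyone unfamiliar with the technique: double-double represents a value as the unevaluated sum of two doubles, giving roughly 107 bits of precision. Below is a minimal sketch of the product step using the standard FMA-based error-free transforms; this is illustrative C, not code from Prime95 or Mlucas.

Code:
#include <math.h>
#include <stdio.h>

/* two_prod: exact product of two doubles as hi + lo, using FMA. */
static void two_prod(double a, double b, double *hi, double *lo) {
    *hi = a * b;
    *lo = fma(a, b, -*hi);              /* rounding error of a*b */
}

/* two_sum: exact sum of two doubles as hi + lo (Knuth's branch-free TwoSum). */
static void two_sum(double a, double b, double *hi, double *lo) {
    double s  = a + b;
    double bb = s - a;
    *lo = (a - (s - bb)) + (b - bb);
    *hi = s;
}

/* dd_mul: (ah + al) * (bh + bl), keeping ~107 bits of the result.
   The tiny al*bl term is dropped, as is usual in fast double-double code. */
static void dd_mul(double ah, double al, double bh, double bl,
                   double *ch, double *cl) {
    double p, e;
    two_prod(ah, bh, &p, &e);
    e += ah * bl + al * bh;             /* low-order cross terms */
    two_sum(p, e, ch, cl);
}

int main(void) {
    double hi, lo;
    dd_mul(1.0 / 3.0, 0.0, 1.0 / 3.0, 0.0, &hi, &lo);
    printf("hi = %.17g, lo = %.17g\n", hi, lo);
    return 0;
}

The 7-10x cost ratio quoted later in the thread comes largely from chains of such two_prod/two_sum steps standing in for what would otherwise be single multiplies and adds.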
Old 2017-06-25, 06:13   #24
fivemack

Quote:
Originally Posted by ewmayer View Post
Don't hold your breath for hardware-supported 128-bit float ('X-float') ... market for that is exceedingly small, and transistor count roughly 4x that of floating-double, based on the needed hardware multiplier.

Re. memory bandwidth ... what are the L2/3 caches on the i9s? I'm wondering if the total cache shared by (say) 10 i9 hardware cores is large enough to fit the ~32MB needed for an LL-test residue vector @4M FFT length.
L2 is 1 MB per core and L3 is 1.375 MB (11/8 MB) per core, so that's 23.75 MB total on the i9-7900X, not quite the 32 MB needed. And I think a load made mostly of snoops into other cores' L2 caches will not run that efficiently on the 18-core model.

Old 2017-06-25, 06:14   #25
storm5510

Quote:
Originally Posted by ewmayer View Post
... what are the L2/3 caches on the i9s?
L1 = 10 x 32K Instruction, 10 x 32K Data.
L2 = 10 x 1MB.
L3 = 1 x 13.75MB.

I got this from PCPartPicker.

Old 2017-06-25, 06:22   #26
ewmayer

Quote:
Originally Posted by Mysticial View Post
I think he's talking about double-double arithmetic with 107 bits of precision, not quad-precision floats with 113 bits. The latter isn't happening for the foreseeable future. But the former is doable efficiently with current hardware.
Don't see the point of that - half the number of 'residue words' but now 128-bit words (i.e. no memory savings), and arithmetic on each double-double still costs more than simple double-precision arithmetic.

That's quite a different proposition than, say, doing a hybrid double/int64 transform with a 64-bit prime modulus for the latter, where the 2 side-by-side transforms only need to be recorrelated during the normalize-and-carry step, and the modular transform adds ~32 bits (half the number of bits of the modulus) to the maximum input wordsize of the floating-double FFT, by acting as a 64-bit error correction on the floating-double convolution outputs. Of course, for that kind of scheme to work we need efficient vector 64x64 --> 128-bit integer MUL, which means hoping for same in some future iteration of the x86 vector instruction set. (The version of AVX-512 shipping in the Intel i9 has efficient 64x64-bit low-half MUL, but only 52x52 --> 104-bit integer MUL ... hmm, I suppose if we could find a suitable 50ish-bit modulus, which has both the needed roots of unity and of 2 ...)

Even such a hybrid is a dubious proposition for improvement; here are the main pros and cons:

Pro: The extra bits used for the modular transform yield a disproportionate reduction in overall transform length. E.g. a 60-bit prime modulus gives 30 more bits per input word, so 128 bits per hybrid transform element (half for the floating-double-FFT word, half for the modular-transform word) increases our allowable input size from ~18 bits to 48 bits; thus our total memory is (128/64)*(18/48) = 0.75 times that of the pure floating-double-FFT approach.

Con: The modular transform needs extra work to keep the intermediate data reduced mod the prime, and the carry step is more complex due to the cost of the hybrid-data error correction.

Con: There is no vector-SIMD architecture supporting 64x64 --> 128-bit integer MUL at present, so even using the latest AVX-512 extensions we are limited to a 52-bit modulus, which - assuming we can find a suitable one that supports the roots of unity needed by the FFT and the roots of 2 needed for the IBDWT - increases our allowable input size from ~18 bits to 44 bits; thus our total memory is (128/64)*(18/44) = 0.82 times that of the pure floating-double-FFT approach.

(I apologize for getting rather far out into the weeds, as it were.)
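
To make the integer-MUL requirement concrete: here is a minimal sketch of a modular multiply for the Mersenne prime M61 = 2^61 - 1, the sort of modulus the NTT half of such a hybrid would use (M61 comes up again below). It leans on the compiler's 128-bit integer type to stand in for the vector 64x64 --> 128-bit MUL being asked for; illustrative C only, not code from any existing LL-test program.

Code:
#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;
#define M61 ((1ULL << 61) - 1)

/* Multiply a*b mod M61 = 2^61 - 1, for inputs already reduced below M61.
   The 122-bit product is folded back using 2^61 == 1 (mod M61). */
static uint64_t mul_mod_m61(uint64_t a, uint64_t b) {
    u128 p = (u128)a * b;               /* this is the 64x64 -> 128-bit MUL */
    uint64_t lo = (uint64_t)p & M61;    /* low 61 bits of the product       */
    uint64_t hi = (uint64_t)(p >> 61);  /* high 61 bits of the product      */
    uint64_t r  = lo + hi;              /* <= 2*M61, still fits in 64 bits  */
    r = (r & M61) + (r >> 61);          /* fold the possible carry back in  */
    return r >= M61 ? r - M61 : r;
}

int main(void) {
    uint64_t a = 123456789012345ULL % M61;
    uint64_t b = 987654321098765ULL % M61;
    printf("a*b mod M61 = %llu\n", (unsigned long long)mul_mod_m61(a, b));
    return 0;
}

The reduction itself is just shifts and adds; the cost question hinges on whether the full-width product is available as a vector operation, which is exactly the missing piece described above.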
Old 2017-06-25, 06:54   #27
Mysticial
 
Quote:
Originally Posted by ewmayer View Post
Don't see the point of that - half the number of 'residue words' but now 128-bit words (i.e. no memory savings), and arithmetic on each double-double still costs more than simple double-precision arithmetic. [...]
I think we're talking past each other. The point is that we imagine we live in a world where computation is free and memory is the only thing that matters.

With double-precision you're more or less limited to ~17 bits/point. That comes out to a memory efficiency of 17/64 = 27%.
With double-double, you have 107 bits of precision which allows for maybe ~43 bits/point. That comes out to a memory efficiency of 43/128 = 34%.

This makes it 25% faster than simple double-precision FFTs assuming the exact same memory access pattern in an environment where memory is such a large bottleneck that computation is negligible. This is getting into the same territory as the y-cruncher project in the trillion digit range where you're running off of disk and the fastest algorithm is the one with the smallest memory footprint regardless of how computationally slow it is.

henryzz and I are not saying we are at that point yet for LL testing. Double-double arithmetic is still around 7 - 10x slower than standard double-precision arithmetic. So we'd probably need the compute/memory gap to increase by another factor of 4x or so before we cross that threshold.

And this is where it's worth researching other "slow but small" approaches like your DP-FFT/M61-NTT hybrid algorithm. And since you mention that 64-bit multiply is slow, it's not obvious which approach is actually faster.
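
To put numbers on the memory-bound argument, here is a toy calculation using the bits-per-point figures from the post; the pure memory-bound model (computation free, throughput proportional to payload bits per stored bit) is of course an idealization.

Code:
#include <stdio.h>

int main(void) {
    /* bits of payload per point, and bits of storage per point */
    const double dp_bits = 17.0, dp_word = 64.0;    /* double precision */
    const double dd_bits = 43.0, dd_word = 128.0;   /* double-double    */

    double e_dp = dp_bits / dp_word;                /* ~27% */
    double e_dd = dd_bits / dd_word;                /* ~34% */

    printf("DP efficiency: %.0f%%\n", 100.0 * e_dp);
    printf("DD efficiency: %.0f%%\n", 100.0 * e_dd);
    printf("DD/DP speedup if purely memory-bound: %.2fx\n", e_dd / e_dp);
    return 0;
}

The ratio (43/128)/(17/64) is about 1.26, which is where the roughly 25% figure above comes from.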

Old 2017-06-25, 08:13   #28
henryzz

Quote:
Originally Posted by Mysticial View Post
I think we're talking past each other. The point is that we imagine we live in a world where computation is free and memory is the only thing that matters. [...]
That is exactly what I was talking about. We are actually closer than you say: I think the 7-10x figure assumes the same number of double-double multiplies as double multiplies, but the number of multiplies would be smaller because the FFT length shrinks by a factor of 43/17 = 2.53. It is only the memory usage that is ~1.25x less.
Assuming the cost of an FFT is O(n) rather than O(n log n), we already have potential for 2.53x (from the reduction in FFT length) * 2x (from AVX-512 rather than AVX2), plus the fact that it doesn't take a top CPU for the AVX2 code to be memory-limited (probably another 1.5x or more with Skylake-X or server chips?).
That hand-wavy arithmetic comes to around a 7.5x potential increase in CPU throughput, which is close to the 7-10x. I hope I am not double-counting anything.

@Mysticial Where did you get your 43-bit value from? The calculations change quite a bit if this changes. It does ring a bell that there is a formula for this.

edit: Had an idea while writing this. Apologies for it being a bit of a jumble to read.
We are suggesting a larger base for our FFT. This is expensive at small sizes, because the faster multiplication algorithms need large operands before their better runtime order cancels out their constant-factor slowdown, but increasing the size of the base is cheap at larger sizes, since the increase in bits per word reduces the number of words. Once the base gets to FFT-length size, doubling the size does little more than double the cost, due to the O(n log n) runtime. However, we gained more than 2x bits per word by doubling our word size above (as we keep doing this, I think the number of bits for a word of twice the length would get closer to doubling). We may gain some more memory efficiency this way.
For clarity, what I am suggesting here is a 1K-length FFT made up of words of length 1K, with the arithmetic on the words also done using an FFT. Is this what is meant by the two passes in Prime95's FFT (1K and 4K make a 4M FFT)? I suppose this only makes sense as long as the sub-FFT length remains large enough. I assume a 3-pass approach is some way off being effective.

Old 2017-06-25, 08:43   #29
Mysticial
 
Quote:
Originally Posted by henryzz View Post
@Mysticial Where did you get your 43-bit value from? The calculations change quite a bit if this changes.
A hand-wavy estimate based on the current safety margins.

For length 2^24 and 17 bits, assuming destructive cancellation, we get: 17*2 + 24/2 = 46 bits.
Or about 7 bits of safety for round-off and "unlucky" large coefficients.

Using 43 bits, also with length 2^24, we get: 43*2 + 24/2 = 98 bits.
Or about 9 bits of safety against the 107 bits available via double-double.
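
The same estimate written out as a small calculation, with the 53- and 107-bit mantissa widths filled in; this is a sketch of the hand-wavy sizing rule above, not a rigorous round-off bound.

Code:
#include <stdio.h>

/* With b bits per point and FFT length 2^k, convolution outputs grow to
   roughly 2*b + k/2 bits (the k/2 term assumes random-walk "destructive
   cancellation" rather than the worst case).  Safety margin is the mantissa
   precision minus that. */
static double safety_bits(double b, double k, double precision) {
    return precision - (2.0 * b + k / 2.0);
}

int main(void) {
    printf("DP, 17 bits/point @ 2^24: %.0f bits of safety\n", safety_bits(17, 24, 53));
    printf("DD, 43 bits/point @ 2^24: %.0f bits of safety\n", safety_bits(43, 24, 107));
    return 0;
}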

------

More memory efficiency comparisons:
  • 17 bits/DP is 17/64 = 27%. (Assuming destructive cancellation.)
  • 43 bits/double-double is 43/128 = 34%. (Assuming destructive cancellation.)
  • y-cruncher's best algorithm gets 256/576 for 44% efficiency. This is done using a 9-prime 64-bit NTT and is provably correct, with no reliance on destructive cancellation. It's possible to squeeze a few more bits out if you do go into the destructive-cancellation range.
  • The Schönhage–Strassen Algorithm gets you arbitrarily close to 50%. But there are other issues with super-large ring algorithms which make them less useful.

Old 2017-06-25, 09:06   #30
henryzz

Quote:
Originally Posted by Mysticial View Post
A hand-wavy estimate based on the current safety margins. [...]
It looks like there is potential for much better than 17/64. We just need to balance that with cost properly.
Old 2017-06-25, 18:11   #31
storm5510

All of this has made for a lot of interesting reading. The first thing I think of when I see something like this is how I will use it. It seems clear that the majority of these new releases target high-end servers for large businesses supporting hundreds, or even thousands, of workstations.

It has been my very recent experience that Prime95, at its most efficient, uses half of what is available. Use less and it becomes a waste; use more and it creates a bottleneck. Iteration time doubles in either situation.

The only one in the group that I would be interested in is the Kaby Lake-X: 4 cores/8 threads running at 4.5 GHz. Anything beyond that, not likely. Now, if you have an application that can run multiple instances, then that changes the story: each instance in a worker-helper scenario, for example. On the Skylake-X, this would be 18 instances, if one wanted to run full-bore. I suspect that huge amounts of RAM would be needed.

In the context of this project, I am not aware of any Windows GUI or console programs that can be set up to run this way. Linux may be a different case, as I have had little experience with it.

With these processors releasing tomorrow, some here might be more interested in how they are being applied than in anything else: what is your application, and how do you have it configured? Things like that.

Old 2017-06-25, 20:03   #32
xilman

Quote:
Originally Posted by storm5510 View Post
All of this has made for a lot of interesting reading.
Indeed.

I'm currently reading a book called Massively Parallel Computing, written 30 years ago. Exactly the same discussion about memory bandwidth and latency, floating-point precision, general usability of processors, etc., occurs there. The only difference is that microprocessors of the day, exemplified by the 68020 and 80386, were in the few-MIPS range (a few GIPS today) and vector processors were in the few-GIPS range (a few TIPS today).

Plus ça change.
Old 2017-06-25, 21:00   #33
ewmayer

Quote:
Originally Posted by Mysticial View Post
I think we're talking past each other. The point is that we imagine we live in a world where computation is free and memory is the only thing that matters.
Quite possibly - the point I was trying to make was that the hybrid float/int approach gets you as many extra bits per datum as double-double, or more, but with far less arithmetic penalty - though of course it requires good integer MUL support. The 52x52 --> 104-bit vector MUL (actually FMA) in the AVX-512 IFMA extensions is getting us close to extreme interestingness, though it still falls a couple of product bytes short of what is needed for mod-M61 arithmetic. If said extensions are supported by the i9 series (and, if so, also by the new AVX-512-capable i7s?), it might be worth spending some time in search of a suitable 50-52-bit modulus.
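
As a rough illustration of what such a search could look like, here is a sketch that hunts for primes of the form p = k*2^n + 1 just below 2^52, so that p - 1 has a large power-of-2 factor and the field contains the power-of-2 roots of unity an NTT needs. The choice n = 32, the stop-after-5 cutoff, and the use of the compiler's 128-bit type in place of AVX-512 IFMA are all assumptions of the sketch, and whether 2 also has the roots required for an IBDWT would still need to be checked separately; this is not code from Mlucas or Prime95.

Code:
#include <stdint.h>
#include <stdio.h>

typedef unsigned __int128 u128;

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
    return (uint64_t)((u128)a * b % m);
}

static uint64_t powmod(uint64_t a, uint64_t e, uint64_t m) {
    uint64_t r = 1 % m;
    for (a %= m; e; e >>= 1) {
        if (e & 1) r = mulmod(r, a, m);
        a = mulmod(a, a, m);
    }
    return r;
}

/* Deterministic Miller-Rabin for 64-bit n: the first 12 primes as witnesses
   are sufficient far beyond the 2^52 range searched here. */
static int is_prime(uint64_t n) {
    static const uint64_t w[12] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37};
    if (n < 2) return 0;
    for (int i = 0; i < 12; i++)
        if (n % w[i] == 0) return n == w[i];
    uint64_t d = n - 1;
    int s = 0;
    while ((d & 1) == 0) { d >>= 1; s++; }
    for (int i = 0; i < 12; i++) {
        uint64_t x = powmod(w[i], d, n);
        if (x == 1 || x == n - 1) continue;
        int composite = 1;
        for (int r = 1; r < s && composite; r++) {
            x = mulmod(x, x, n);
            if (x == n - 1) composite = 0;
        }
        if (composite) return 0;
    }
    return 1;
}

int main(void) {
    const int n = 32;   /* demand 2^32-th roots of unity, i.e. 2^32 | p - 1 */
    int found = 0;
    for (uint64_t k = ((1ULL << 52) >> n) - 1; k >= 1 && found < 5; k--) {
        uint64_t p = (k << n) + 1;      /* candidate just below 2^52 */
        if (is_prime(p)) {
            printf("candidate modulus: %llu = %llu * 2^%d + 1\n",
                   (unsigned long long)p, (unsigned long long)k, n);
            found++;
        }
    }
    return 0;
}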
