[QUOTE=Prime95;441759]Intel says the 16GB HBM memory has 4x the bandwidth of the 3-channel DDR4 RAM. FFT data will easily fit in 16GB, so the good news is we should be running entirely out of HBM memory at all times. Comparing KNL to a 4-core Skylake with 2-channel DDR4 RAM, the KNL system will have 4x (HBM vs DDR4) times 1.5x (3-channel vs 2-channel), or 6x the memory bandwidth. Unfortunately, we have 6x memory bandwidth feeding 16x the number of cores!
A Skylake system is hurting on memory bandwidth; the KNL is going to be downright starving. We're looking at roughly 33% FPU utilization. I do not have any good ideas on reducing memory bandwidth requirements any further. The only option may be to run 64 cores of TF hyperthreaded alongside a 64-core FFT.[/QUOTE] You may well be right, but I prefer to remain optimistic until cruel reality smacks me upside the head. :) Two added points to consider:

1. We have 32 SIMD registers to work with, which should mean somewhat reduced memory traffic when properly used;

2. There are just 2 points in each FFT-mul where the various processor threads need to share data, i.e. go back to main memory.

IIRC KNL cores are paired, with each pair sharing a 2MB L2 cache. If we have one core of each such pair doing low-bandwidth TF work, each LL-test thread has 2 MB of L2 to itself. Moreover, these multiple L2s can communicate directly with each other, or via main memory - from the Colfax "Clustering Modes in Knights Landing Processors" whitepaper:

[i]In KNL (see Figure 1, bottom), each of its ≤ 72 cores has an L1 cache, pairs of cores are organized into tiles with a slice of the L2 cache symmetrically shared between the two cores, and the L2 caches are connected to each other with a mesh. All caches are kept coherent by the mesh with the MESIF protocol (this is an acronym for Modified/Exclusive/Shared/Invalid/Forward states of cache lines). In the mesh, each vertical and horizontal link is a bidirectional ring.[/i]

The whitepaper alas does not reveal the L2-to-CPU or L2-mesh bandwidths.
[QUOTE=Mark Rose;441764]The chip we're getting probably runs at 1.3 GHz. The CPU we're getting probably only has 64 cores enabled.[/QUOTE]
Ah, you are correct. I did not factor that in. So we have 16x the cores but running at about half the clock speed (compared to my standard-issue i5-6500s). So we have somewhere in the neighborhood of 6x the memory bandwidth and 8x the FPU firepower. Not good, but not as terrible as I estimated earlier. Edit: According to [url]http://www.hardwareunboxed.com/forum/viewtopic.php?t=1570[/url] and [url]http://www.asrock.com/news/index.asp?id=3043[/url] my Skylakes are getting just shy of 30 GB/s. Intel says HBM is 400 GB/s, so my 6x bandwidth estimate was off by a factor of 2 as well! Sorry for the false alarm. Those back-of-the-envelope calculations can be dangerous. If we can keep the HBM memory fully busy, KNL may be a winner!
That's why having actual hardware is going to be good. Newer hardware is complex enough that calculating the exact performance of the whole pipeline isn't even possible; it's done by statistical analysis. We have to leave our armchairs to actually know.
Or someone can unearth a new FFT/multiplication algorithm with drastically less required memory and change the whole game...
[QUOTE=Mark Rose;441666]It causes spikes in the [url=http://www.mersenne.org/primenet/graphs.php]graphs[/url].
They're done up to 2.06M, actually. I usually run a few hundred to test new hardware.[/QUOTE] I tried to make sure my own (more recent) totally unnecessary triple-checks were excluded from the throughput graphs. The SQL query itself has a "where (user != madpoo or exponent > 3e6)" type of thing (a little more complicated than that, but you get the idea). But that only works for me. LOL I guess I could tell it to exclude any LL tests from anyone below XX size... In the case of any custom builds to test things out, the server wouldn't accept those results anyway.
[QUOTE=airsquirrels;441780]Or someone can unearth a new FFT/multiplication algorithm with drastically less required memory and change the whole game...[/QUOTE]
That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best. I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity [I]and [/I]roots-of-two. But I'm saying that without any expertise in the field.
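For readers unfamiliar with why Mersenne moduli keep coming up in this discussion: with a modulus like M31 = 2^31 - 1, reduction after a multiply needs no division at all, just a shift and an add, because 2^31 ≡ 1 (mod M31). A minimal scalar sketch in C (a standard textbook trick, not code from any project discussed here):

```c
#include <stdint.h>

#define M31 0x7FFFFFFFu   /* 2^31 - 1 */

/* Multiply two residues mod M31 = 2^31 - 1.
   The 62-bit product splits into low and high 31-bit halves; since
   2^31 ≡ 1 (mod M31), the high half is simply added back in.
   The sum fits in 32 bits and needs at most one correction. */
static uint32_t mulmod_m31(uint32_t a, uint32_t b) {
    uint64_t t = (uint64_t)a * b;
    uint32_t r = (uint32_t)(t & M31) + (uint32_t)(t >> 31);
    if (r >= M31) r -= M31;
    return r;
}
```

The 32x32 --> 64-bit product here is exactly the integer-multiply shape that x86 SIMD supports, which is part of why M31 is attractive in this context.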
[QUOTE=Mysticial;441792]That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best.
I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity [I]and [/I]roots-of-two. But I'm saying that without any expertise in the field.[/QUOTE] [url=http://www.mersenneforum.org/showthread.php?t=118]See here[/url] for an example of such a modulus, in this case a complex (Gaussian-integer) one. Since x86 SIMD only supports 32x32 --> 64-bit integer multiply, M31 rather than M61 would be the more promising modulus in that context. Such a hybrid float64/int32 transform would add ~31/2 = 15.5 bits to the allowable per-digit input size, which represents slightly less than a doubling of that (i.e. slightly less than a halving of the transform length). But each transform 'word' is now 1.5x larger (96 bits versus 64), so the overall bandwidth reduction is maybe ~20%. No free lunch in view here, I'm afraid.
[QUOTE=Mysticial;441792]That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best.
I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity [I]and [/I]roots-of-two. But I'm saying that without any expertise in the field.[/QUOTE] Another way would be to use (software-emulated) quad precision FP. It will improve the compute:memory ratio significantly. But still probably won't be a net win due to the software overhead :-(
[QUOTE=ewmayer;441795][URL="http://www.mersenneforum.org/showthread.php?t=118"]See here[/URL] for an example of such a modulus, in this case a complex (Gaussian-integer) one. Since x86 SIMD only supports 32x32 --> 64-bit integer multiply, M31 rather than M61 would be the more promising modulus in that context. Such a hybrid float64/int32 transform would add ~31/2 = 15.5 bits to the allowable per-digit input size, which represents slightly less than a doubling of that (i.e. slightly less than a halving of the transform length). But each transform 'word' is now 1.5x larger (96 bits versus 64), so the overall bandwidth reduction is maybe ~20%. No free lunch in view here, I'm afraid.[/QUOTE]
That is an interesting idea: doing both NTT+FFT and using the NTT to reconstruct the bottom (lost) parts of the coefficients. I was thinking more along the lines of going 100% NTT. The memory reduction should be a lot more than just 20%. For a double-precision FFT using 16 bits/point, the memory efficiency is 0.25. And for library writers who prefer not to rely on destructive cancellation, we're talking more like only 8 bits/point if we want it to work at 1 billion+ bits. That's 0.125. At the other extreme, the Schönhage–Strassen NTT gets you asymptotically close to 0.50. The multi-prime algorithms will get you somewhere in between: 9 primes will get you ~0.44, which is still a lot better than even the FFT with destructive cancellation. [QUOTE=axn;441796]Another way would be to use (software-emulated) quad precision FP. It will improve the compute:memory ratio significantly. But still probably won't be a net win due to the software overhead :-( [/QUOTE]That's actually not a terrible idea. If you use double-double arithmetic: [LIST][*]Addition is 8 word-sized additions.[*]Multiplication is 1 multiplication and 3 FMAs.[/LIST]Double-double is 107 bits. That's probably large enough to place 40+ bits per point, IOW more than 2x over simple double-precision. The cost is somewhere between 4-8x for each operation. So computationally, you're going up by around a factor of 3-4x over the standard DP implementation for maybe a 30-50% reduction in bandwidth. It doesn't look like a win at first, but it might be worth investigating. I'm sure there are corners that can be cut when doing an optimized butterfly with double-double arithmetic.
[QUOTE=Prime95;441777]Those back-of-the-envelope calculations can be dangerous[/QUOTE]
As ldesnogu pointed out, I forgot one other factor of 2, this time not in our favor. A KNL core needs twice the bandwidth of a Skylake core because AVX-512 is twice as wide as AVX-256. So summarizing KNL vs. my Skylake system:

400 GB/s vs. 30 GB/s, or 13.33x more bandwidth in KNL
1.3 GHz vs. 2.5 GHz, or Skylake will need 1.92x more bandwidth per core
64 cores vs. 4 cores, or KNL will need 16x more bandwidth
AVX-512 vs. AVX-256, or KNL will need 2x the bandwidth

Net result is KNL should be a little more memory bound than a typical 4-core Skylake.
I have a Colfax KNL development system sitting idle in my office. I had originally planned to purchase several hundred KNL nodes but have abandoned that for Broadwell after doing extensive testing.
I'll be more than happy to run some benchmarks for you guys, but I'm afraid you'll find out it behaves more like a 128-core machine running at half the clock speed due to the hardware forcing in-order threading.
[QUOTE=xathor;442606]I have a Colfax KNL development system sitting idle in my office. I had originally planned to purchase several hundred KNL nodes but have abandoned that for Broadwell after doing extensive testing.
I'll be more than happy to run some benchmarks for you guys, but I'm afraid you'll find out it behaves more like a 128-core machine running at half the clock speed due to the hardware forcing in-order threading.[/QUOTE] One of the key advantages in KNL though is the AVX-512 support, and I'm pretty sure the developers are interested to get their hands dirty with that. It's also an interesting opportunity to get the current codebases tuned better for multi-threading. There are certain challenges involved, for sure. At the very least, if a system like this could run 64 (or 128) simultaneous single-core workers using AVX-512, and the fast memory can keep all the pipes flowing, then it should be able to provide amazing throughput at a smaller price point than a similarly kitted Broadwell. Dual Broadwell systems aren't cheap, and they don't have AVX-512, 6-channel memory, HBM... they do have faster clock speeds though. :smile: But even then, when you're looking at the top 22-core Broadwell, it's only running at 2.2 GHz with a turbo boost to 2.8 or something. So yeah, twice as fast as the 7210P, 44 cores on a dual CPU setup, but maybe 2-3 times the price.