|
|
#177 | |
|
∂²ω=0
Sep 2002
República de California
19×613 Posts |
Quote:
1. We have 32 SIMD registers to work with, which should mean somewhat reduced memory traffic, when properly used;
2. There are just 2 points in each FFT-mul where the various processor threads need to share data, i.e. go back to main memory.

IIRC KNL cores are paired, with each pair sharing a 2 MB L2 cache. If we have one of each such pair doing low-bandwidth TF work, each LL-test thread has 2 MB L2. Moreover, these multiple L2s can communicate directly with each other, or via main memory. From the Colfax "Clustering Modes in Knights Landing Processors" whitepaper:

In KNL (see Figure 1, bottom), each of its ≤ 72 cores has an L1 cache, pairs of cores are organized into tiles with a slice of the L2 cache symmetrically shared between the two cores, and the L2 caches are connected to each other with a mesh. All caches are kept coherent by the mesh with the MESIF protocol (this is an acronym for Modified/Exclusive/Shared/Invalid/Forward states of cache lines). In the mesh, each vertical and horizontal link is a bidirectional ring.

The whitepaper alas does not reveal the L2-to-CPU or L2-mesh bandwidths. |
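To put rough numbers on the register and cache sizes mentioned above, here is a minimal sketch. It assumes 512-bit AVX-512 registers and takes the post's recalled 2 MB L2 figure at face value (it was given with an "IIRC"); the function names are illustrative only.

```python
BYTES_PER_DOUBLE = 8


def doubles_in_registers(num_registers: int = 32, register_bits: int = 512) -> int:
    """Number of double-precision values the SIMD register file can hold."""
    return num_registers * register_bits // (BYTES_PER_DOUBLE * 8)


def doubles_in_cache(cache_bytes: int) -> int:
    """Number of double-precision values that fit in a cache of the given size."""
    return cache_bytes // BYTES_PER_DOUBLE


if __name__ == "__main__":
    # 32 registers of 512 bits each (assumed AVX-512 register file)
    print(doubles_in_registers())
    # 2 MB L2, the figure recalled in the post (an assumption, not verified)
    print(doubles_in_cache(2 * 1024 * 1024))
```

So the register file holds a few hundred doubles at a time, while an L2 of the recalled size would hold a few hundred thousand, before accounting for anything else sharing the cache.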
|
|
|
|
|
|
#178 | |
|
P90 years forever!
Aug 2002
Yeehaw, FL
19×397 Posts |
Quote:
So, we have somewhere in the neighborhood of 6x the memory bandwidth and 8x the FPU firepower. Not good, but not as terrible as I estimated earlier.
Edit: According to http://www.hardwareunboxed.com/forum...pic.php?t=1570 and http://www.asrock.com/news/index.asp?id=3043 my Skylakes are getting just shy of 30 GB/s. Intel says HBM is 400 GB/s, so my 6x bandwidth estimate was off by a factor of 2 as well! Sorry for the false alarm. Those back-of-the-envelope calculations can be dangerous. If we can keep HBM memory fully busy, KNL may be a winner!
Last fiddled with by Prime95 on 2016-09-06 at 22:31 |
|
|
|
|
|
|
#179 |
|
"David"
Jul 2015
Ohio
11×47 Posts |
That's why having actual hardware is going to be good. Newer hardware is complex enough that calculating the exact performance of the whole pipeline isn't even possible and is instead estimated by statistical analysis, so we have to leave our armchairs to actually know.
Or someone can unearth a new FFT/multiplication algorithm with drastically less required memory and change the whole game... |
|
|
|
|
|
#180 | |
|
Serpentine Vermin Jar
Jul 2014
110011110001₂ Posts |
Quote:
But that only works for me. LOL I guess I could tell it to exclude any LL tests from anyone below XX size... In the case of any custom builds to test things out, the server wouldn't accept those results anyway. |
|
|
|
|
|
|
#181 | |
|
Sep 2016
516₈ Posts |
Quote:
I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity and roots-of-two. But I'm saying that without any expertise in the field. |
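The two requirements mentioned above can be checked directly for a candidate prime. This is a minimal sketch, assuming an odd prime modulus `p` and a transform length `n`; the function names are illustrative and not taken from any existing library. A length-`n` NTT needs a primitive `n`-th root of unity mod `p`, which exists exactly when `n` divides `p - 1`. An IBDWT-style weighting would additionally need an `n`-th root of 2 mod `p`, which exists exactly when `2^((p-1)/gcd(n, p-1)) ≡ 1 (mod p)`.

```python
from math import gcd


def has_root_of_unity(p: int, n: int) -> bool:
    """True if the prime p admits a primitive n-th root of unity."""
    # The multiplicative group mod p is cyclic of order p - 1,
    # so an element of order n exists exactly when n divides p - 1.
    return (p - 1) % n == 0


def has_root_of_two(p: int, n: int) -> bool:
    """True if x**n == 2 (mod p) has a solution for the odd prime p."""
    # In a cyclic group of order p - 1, a is an n-th power exactly when
    # a**((p - 1) / gcd(n, p - 1)) == 1.
    g = gcd(n, p - 1)
    return pow(2, (p - 1) // g, p) == 1


if __name__ == "__main__":
    for p in (17, 2**61 - 1):
        for n in (2, 4, 16):
            print(p, n, has_root_of_unity(p, n), has_root_of_two(p, n))
```

The Mersenne prime 2^61 - 1 shows the tension: it has roots of two at every power-of-two length, but its power-of-two roots of unity stop at order 2.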
|
|
|
|
|
|
#182 | |
|
∂²ω=0
Sep 2002
República de California
19×613 Posts |
Quote:
|
|
|
|
|
|
|
#183 | |
|
Jun 2003
3²·5·113 Posts |
Quote:
|
|
|
|
|
|
|
#184 | ||
|
Sep 2016
2×167 Posts |
Quote:
For a double-precision FFT using 16 bits/point, the memory efficiency is 0.25. And for library writers who prefer not to rely on destructive cancellation, we're talking more like only 8 bits/point if we want it to work at 1 billion+ bits. That's 0.125. At the other extreme, the Schönhage–Strassen NTT gets you asymptotically close to 0.50. The multi-prime algorithms will get you somewhere in between. 9 primes will get you ~0.44, which is still a lot better than even the FFT with destructive cancellation.
Quote:
It doesn't look like a win at first. But it might be worth investigating. I'm sure there are corners that can be cut when doing an optimized butterfly with double-double arithmetic. Last fiddled with by Mysticial on 2016-09-07 at 03:33 |
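The FFT efficiency figures in this post follow from dividing the payload bits per point by the storage bits per point. A minimal sketch, assuming each point is stored as a 64-bit double; `memory_efficiency` is an illustrative name, and the NTT figures are not reproduced here because they depend on details the post does not give.

```python
def memory_efficiency(payload_bits: float, storage_bits: float = 64.0) -> float:
    """Useful bits carried per stored bit of a transform element."""
    return payload_bits / storage_bits


if __name__ == "__main__":
    # 16 payload bits packed into each 64-bit double
    print(memory_efficiency(16))
    # 8 payload bits per double, the more conservative packing
    print(memory_efficiency(8))
```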
||
|
|
|
|
|
#185 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
19×397 Posts |
As ldesnogu pointed out, I forgot one other factor of 2, this time not in our favor. A KNL core needs twice the bandwidth of a Skylake core because AVX-512 is twice as wide as AVX-256.
So summarizing KNL vs. my Skylake system:
400 GB/s vs. 30 GB/s, or 13.33x more bandwidth in KNL
1.3 GHz vs. 2.5 GHz, or Skylake will need 1.92x more bandwidth
64 cores vs. 4 cores, or KNL will need 16x more bandwidth
AVX-512 vs. AVX-256, or KNL will need 2x the bandwidth
Net result is KNL should be a little more memory bound than a typical 4-core Skylake. |
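The summary above can be multiplied out as a quick check. This is a minimal sketch, assuming (as the post implicitly does) that bandwidth demand scales linearly with clock speed, core count, and vector width; the function name is illustrative.

```python
def knl_vs_skylake_memory_pressure() -> float:
    """Ratio of KNL's bandwidth demand to its bandwidth supply, relative to Skylake."""
    supply = 400.0 / 30.0   # KNL has ~13.33x the memory bandwidth
    clock = 1.3 / 2.5       # KNL runs at a lower clock, reducing demand
    cores = 64 / 4          # 16x the cores to feed
    width = 512 / 256       # vectors twice as wide
    demand = clock * cores * width
    return demand / supply


if __name__ == "__main__":
    # A value above 1 means KNL is more memory bound than the Skylake system.
    print(round(knl_vs_skylake_memory_pressure(), 3))
```

Under these assumptions the ratio comes out to about 1.25, which matches the "a little more memory bound" conclusion.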
|
|
|
|
|
#186 |
|
Sep 2016
13₁₆ Posts |
I have a Colfax KNL development system sitting idle in my office. I had originally planned to purchase several hundred KNL nodes but have abandoned that for Broadwell after doing extensive testing.
I'll be more than happy to run some benchmarks for you guys, but I'm afraid you'll find out it behaves more like a 128-core machine running at half the clock speed, because the hardware forces in-order threading. |
|
|
|
|
|
#187 | |
|
Serpentine Vermin Jar
Jul 2014
3,313 Posts |
Quote:
It's also an interesting opportunity to get the current codebases tuned better for multi-threading. There are certain challenges involved for sure. At the very least, if a system like this could run 64 (or 128) simultaneous single-core workers using AVX-512, and the fast memory can keep all the pipes flowing, then it should be able to provide amazing throughput at a lower price point than a similarly kitted Broadwell. Dual Broadwell systems aren't cheap, and they don't have AVX-512, 6-channel memory, or HBM... they do have faster clock speeds though. But even then, when you're looking at the top 22-core Broadwell, it's only running at 2.2 GHz with a turbo boost to 2.8 or something. So yeah, twice as fast as the 7210P, 44 cores on a dual CPU setup, but maybe 2-3 times the price.
Last fiddled with by Madpoo on 2016-09-15 at 16:42 |
|
|
|
|