mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
2016-09-06, 21:42   #177
ewmayer ("2ω=0") · Sep 2002 · República de California · 19×613 Posts
Quote:
Originally Posted by Prime95
Intel says the 16GB HBM memory has 4x the bandwidth of the 3-channel DDR4 RAM. FFT data will easily fit in 16GB, so the good news is we should be running entirely out of HBM memory at all times. Comparing KNL to a 4-core Skylake with 2-channel DDR4 RAM, the KNL system will have 4x (HBM vs. DDR4) times 1.5x (3-channel vs. 2-channel), or 6x, the memory bandwidth. Unfortunately, we have 6x the memory bandwidth feeding 16x the number of cores!

A Skylake system is hurting on memory bandwidth; the KNL is going to be downright starving. We're looking at roughly 33% FPU utilization.

I do not have any good ideas on reducing memory bandwidth requirements any further. The only option may be to run 64 cores of TF hyperthreaded alongside a 64-core FFT.
You may well be right, but I prefer to remain optimistic until cruel reality smacks me upside the head. :) Two added points to consider:

1. We have 32 SIMD registers to work with, which should mean somewhat reduced memory traffic, when properly used;

2. There are just 2 points in each FFT-mul where the various processor threads need to share data, i.e. go back to main memory. IIRC KNL cores are paired, with each pair sharing a 2MB L2 cache. If one core of each such pair does low-bandwidth TF work, each LL-test thread effectively has 2 MB of L2 to itself. Moreover, these multiple L2s can communicate directly with each other, rather than only via main memory - from the Colfax "Clustering Modes in Knights Landing Processors" whitepaper:

In KNL (see Figure 1, bottom), each of its ≤ 72 cores has an L1 cache, pairs of cores are organized into tiles with a slice of the L2 cache symmetrically shared between the two cores, and the L2 caches are connected to each other with a mesh. All caches are kept coherent by the mesh with the MESIF protocol (this is an acronym for Modified/Exclusive/Shared/Invalid/Forward states of cache lines). In the mesh, each vertical and horizontal link is a bidirectional ring.

The whitepaper alas does not reveal the L2-to-CPU or L2-mesh bandwidths.
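Point 2 can be sanity-checked with quick arithmetic. A minimal sketch (my numbers: a hypothetical 4M-point double-precision FFT split evenly across 32 tiles; the post doesn't specify a transform length or thread layout):

```python
# Back-of-the-envelope check (assumed numbers, not measurements):
# does a per-tile slice of the FFT working set fit in KNL's 2 MB shared L2?

def slice_fits_l2(fft_len, bytes_per_point=8, tiles=32, l2_bytes=2 * 2**20):
    """Return (bytes per tile, fits?) for an fft_len-point double FFT
    split evenly across the given number of tiles."""
    per_tile = fft_len * bytes_per_point // tiles
    return per_tile, per_tile <= l2_bytes

# A 4M-point double-precision FFT is 32 MB of data; across 32 tiles
# that is 1 MB per tile, which fits in a 2 MB L2 slice.
print(slice_fits_l2(4 * 2**20))   # (1048576, True)

# A 16M-point FFT (128 MB) no longer fits: 4 MB per tile.
print(slice_fits_l2(16 * 2**20))  # (4194304, False)
```

Of course this ignores twiddle factors, scratch space, and the second thread sharing each tile, so the real cutoff is lower.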
2016-09-06, 22:20   #178
Prime95 ("P90 years forever!") · Aug 2002 · Yeehaw, FL · 19×397 Posts

Quote:
Originally Posted by Mark Rose
The chip we're getting probably runs at 1.3 GHz. The CPU we're getting probably only has 64 cores enabled.
Ah, you are correct. I did not factor that in. So we have 16x the cores, but running at half speed (compared to my standard-issue i5-6500s).

So, we have somewhere in the neighborhood of 6x the memory bandwidth and 8x the FPU firepower. Not good, but not as terrible as I estimated earlier.

Edit: According to http://www.hardwareunboxed.com/forum...pic.php?t=1570 and http://www.asrock.com/news/index.asp?id=3043, my Skylakes are getting just shy of 30 GB/s. Intel says HBM is 400 GB/s, so my 6x bandwidth estimate was off by a factor of 2 as well!

Sorry for the false alarm. Those back-of-the-envelope calculations can be dangerous.

If we can keep HBM memory fully busy, KNL may be a winner!

Last fiddled with by Prime95 on 2016-09-06 at 22:31
2016-09-06, 23:24   #179
airsquirrels ("David") · Jul 2015 · Ohio · 11×47 Posts

That's why having actual hardware is going to be good. Newer hardware is so complex that calculating the exact performance of the whole pipeline isn't even possible; it's estimated by statistical analysis. We have to leave our armchairs to actually know.

Or someone can unearth a new FFT/multiplication algorithm with drastically less required memory and change the whole game...
2016-09-07, 01:21   #180
Madpoo ("Serpentine Vermin Jar") · Jul 2014 · 110011110001₂ Posts

Quote:
Originally Posted by Mark Rose
It causes spikes in the graphs.

They're done up to 2.06M, actually. I usually run a few hundred to test new hardware.
I tried to make sure my own (more recent) totally unnecessary triple-checks were excluded from the throughput graphs. The SQL query itself has a "where (user != madpoo or exponent > 3e6)" type of thing (a little more complicated than that, but you get the idea).

But that only works for me. LOL I guess I could tell it to exclude any LL tests from anyone below XX size...

In the case of any custom builds to test things out, the server wouldn't accept those results anyway.
2016-09-07, 02:29   #181
Mysticial · Sep 2016 · 516₈ Posts

Quote:
Originally Posted by airsquirrels
Or someone can unearth a new FFT/multiplication algorithm with drastically less required memory and change the whole game...
That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best.

I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity and roots-of-two. But I'm saying that without any expertise in the field.
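For what it's worth, the depth question can be checked mechanically. A rough sketch (mine, not from the post): for power-of-two transform lengths an NTT prime p needs a large 2-adic valuation of p - 1, and an IBDWT-style weighting additionally wants roots of 2; the second prime below, 2^64 - 2^32 + 1, is just a well-known example used in NTT work, not one anyone in the thread proposed:

```python
# Check two properties of a candidate NTT prime p:
#  - how deep its power-of-two roots of unity go (2-adic valuation of p-1)
#  - whether sqrt(2) exists mod p (Euler's criterion), a first step
#    toward the roots of 2 an IBDWT-style weighting would need.

def v2(n):
    """2-adic valuation: the largest k with 2**k dividing n."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return k

def two_is_square(p):
    """Euler's criterion: 2 is a quadratic residue mod odd prime p
    iff 2^((p-1)/2) == 1 (mod p)."""
    return pow(2, (p - 1) // 2, p) == 1

for p in [(1 << 31) - 1,               # Mersenne prime M31
          (1 << 64) - (1 << 32) + 1]:  # a 64-bit prime popular for NTTs
    print(hex(p), "max 2^k NTT length:", v2(p - 1),
          "sqrt(2) exists:", two_is_square(p))

# M31's group order 2^31 - 2 is divisible by 2 only once, so the prime
# field F_M31 supports no power-of-two NTT of useful length; the 64-bit
# prime supports lengths up to 2^32.
```

So the gut feeling checks out at least for M31: the depth problem is real, even though roots of 2 happen to exist there (2^32 ≡ 2 mod M31, so 2^16 is a square root of 2).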
2016-09-07, 03:01   #182
ewmayer ("2ω=0") · Sep 2002 · República de California · 19×613 Posts

Quote:
Originally Posted by Mysticial
That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best.

I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity and roots-of-two. But I'm saying that without any expertise in the field.
See here for an example of such a modulus, in this case a complex (Gaussian-integer) one. Since x86 SIMD only supports a 32x32 --> 64-bit integer multiply, M31 rather than M61 would be the more promising modulus in that context. Such a hybrid float64/int32 transform would add ~31/2 = 15.5 bits to the allowable per-digit input size, which represents slightly less than a doubling of that (i.e., slightly less than a halving of the transform length). But each transform 'word' is now 1.5x larger (96 bits versus 64), so the overall bandwidth reduction is maybe ~20%. No free lunch in view here, I'm afraid.
2016-09-07, 03:05   #183
axn · Jun 2003 · 3²·5·113 Posts

Quote:
Originally Posted by Mysticial
That would be the NTTs. They'll save you a factor of 2-3x for memory and bandwidth. But they're also at least 3x slower at best.

I'm also unsure how well the IBDWT can be used with the NTT. My gut feeling tells me it might be difficult to find a modulus that has both suitably deep roots-of-unity and roots-of-two. But I'm saying that without any expertise in the field.
Another way would be to use (software-emulated) quad-precision FP. It would improve the compute-to-memory ratio significantly, but it still probably won't be a net win, due to the software overhead :-(
2016-09-07, 03:28   #184
Mysticial · Sep 2016 · 2×167 Posts

Quote:
Originally Posted by ewmayer
See here for an example of such a modulus, in this case a complex (Gaussian-integer) one. Since x86 SIMD only supports 32x32 --> 64-bit integer multiply, M31 rather than M61 would be a more promising modulus in that context. A hybrid float64/int32 such transform would add ~31/2 = 15.5 bits to the allowable per-digit input size, which represents slightly less than a doubling of that (i.e. halving of the transform length). But each transform 'word' is now 1.5x larger (96 bits versus 64), so the overall bandwidth reduction is maybe ~20%. No free lunch in view here, I'm afraid.
That is an interesting idea: doing both NTT+FFT and using the NTT to reconstruct the bottom (lost) parts of the coefficients. I was thinking more along the lines of going 100% NTT. The memory reduction should be a lot more than just 20%.

For a double-precision FFT using 16 bits/point, the memory efficiency is 0.25. And for library writers who prefer not to rely on destructive cancellation, we're talking more like only 8 bits/point if we want it to work at 1 billion+ bits; that's 0.125. At the other extreme, the Schönhage–Strassen NTT gets you asymptotically close to 0.50. The multi-prime algorithms land somewhere in between: 9 primes will get you ~0.44, which is still a lot better than even the FFT with destructive cancellation.
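Those figures follow from a simple payload-bits-per-stored-bits ratio. A sketch (the framing and the 9-prime parameters - ~62 usable bits per 64-bit residue, a ~2^30-point transform - are my assumptions for illustration, not numbers from the post):

```python
# Memory efficiency = useful payload bits per stored bit.

def efficiency(payload_bits, stored_bits):
    return payload_bits / stored_bits

# Floating-point FFT: each point is stored as a 64-bit double.
print(efficiency(16, 64))   # 0.25  -- 16 bits/point
print(efficiency(8, 64))    # 0.125 -- conservative 8 bits/point

# 9-prime NTT: 9 residues stored as 64-bit words, ~62 usable bits each.
# Product coefficients need ~2b + log2(n) bits, so with n ~ 2^30 the
# payload per point is roughly b = (9*62 - 30) / 2 bits.
b = (9 * 62 - 30) // 2
print(efficiency(b, 9 * 64))  # ~0.458, in the ballpark of the ~0.44 quoted
```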


Quote:
Originally Posted by axn
Another way would be to use (software-emulated) quad precision FP. It will improve the compute:memory ratio significantly. But still probably won't be a net win due to the software overhead :-(
That's actually not a terrible idea. If you use double-double arithmetic:
  • Addition is 8 word-sized additions.
  • Multiplication is 1 multiplication and 3 FMAs.
A double-double is 107 bits. That's probably large enough to place 40+ bits per point, IOW more than 2x over simple double precision. The cost is somewhere between 4 - 8x per operation. So computationally, you're going up by around a factor of 3-4x over the standard DP implementation, for a 30 - 50%(?) reduction in bandwidth.

It doesn't look like a win at first, but it might be worth investigating. I'm sure there are corners that can be cut when doing an optimized butterfly with double-double arithmetic.
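The primitives behind double-double arithmetic are the classic error-free transformations. A minimal sketch (textbook Knuth/Dekker, not code from the thread; without an FMA, two_prod needs Dekker's split, whereas with an FMA the error term collapses to a single fma(a, b, -p)):

```python
# Error-free transformations: the building blocks of double-double math.

def two_sum(a, b):
    """Return (s, e) with s = fl(a+b) and a + b = s + e exactly (Knuth)."""
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def split(a):
    """Dekker's split: break a double into high/low halves of ~26 bits."""
    c = 134217729.0 * a          # 2^27 + 1
    hi = c - (c - a)
    return hi, a - hi

def two_prod(a, b):
    """Return (p, e) with p = fl(a*b) and a * b = p + e exactly."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

# The rounding error of (1 + 2^-27)^2 = 1 + 2^-26 + 2^-54 is recovered
# exactly: the product rounds to 1 + 2^-26 and the error term is 2^-54.
p, e = two_prod(1.0 + 2.0**-27, 1.0 + 2.0**-27)
print(p == 1.0 + 2.0**-26, e == 2.0**-54)   # True True
```

Counting operations in two_prod (1 mul + Dekker overhead) against the FMA version (1 mul + 1 FMA) shows why hardware FMA matters so much for this scheme.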

Last fiddled with by Mysticial on 2016-09-07 at 03:33
2016-09-07, 18:57   #185
Prime95 ("P90 years forever!") · Aug 2002 · Yeehaw, FL · 19×397 Posts

Quote:
Originally Posted by Prime95
Those back-of-the-envelope calculations can be dangerous
As ldesnogu pointed out, I forgot one other factor of 2, this time not in our favor. A KNL core needs twice the bandwidth of a Skylake core because AVX-512 is twice as wide as 256-bit AVX.

So, summarizing KNL vs. my Skylake system:

400 GB/s vs. 30 GB/s: 13.33x more bandwidth for KNL
1.3 GHz vs. 2.5 GHz: each Skylake core needs 1.92x more bandwidth
64 cores vs. 4 cores: KNL needs 16x more bandwidth
AVX-512 vs. 256-bit AVX: KNL needs 2x the bandwidth

Net result: KNL should be a little more memory-bound than a typical 4-core Skylake.
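The balance sheet above reduces to two ratios, using only the numbers quoted in the post:

```python
# KNL vs. 4-core Skylake: bandwidth supply vs. bandwidth demand.
hbm_bw, skl_bw = 400.0, 30.0          # GB/s, as quoted
bw_advantage = hbm_bw / skl_bw        # ~13.3x more bandwidth for KNL

knl_ghz, skl_ghz = 1.3, 2.5
knl_cores, skl_cores = 64, 4
simd_ratio = 2.0                      # AVX-512 vs. 256-bit AVX

# Bandwidth demand scales with cores * clock * SIMD width.
demand_ratio = (knl_cores / skl_cores) * (knl_ghz / skl_ghz) * simd_ratio

print(round(bw_advantage, 1), round(demand_ratio, 1))  # 13.3 16.6

# Supply grows ~13.3x while demand grows ~16.6x, so KNL ends up with
# ~80% of the Skylake's bandwidth per unit of FPU demand -- i.e., a
# little more memory-bound, matching the conclusion above.
print(round(bw_advantage / demand_ratio, 2))  # 0.8
```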
2016-09-15, 05:55   #186
xathor · Sep 2016 · 13₁₆ Posts

I have a Colfax KNL development system sitting idle in my office. I had originally planned to purchase several hundred KNL nodes but have abandoned that for Broadwell after doing extensive testing.

I'll be more than happy to run some benchmarks for you guys, but I'm afraid you'll find it behaves more like a 128-core machine running at half the clock speed, due to the hardware forcing in-order threading.
2016-09-15, 16:41   #187
Madpoo ("Serpentine Vermin Jar") · Jul 2014 · 3,313 Posts

Quote:
Originally Posted by xathor
I have a Colfax KNL development system sitting idle in my office. I had originally planned to purchase several hundred KNL nodes but have abandoned that for Broadwell after doing extensive testing.

I'll be more than happy to run some benchmarks for you guys, but I'm afraid you'll find it behaves more like a 128-core machine running at half the clock speed, due to the hardware forcing in-order threading.
One of the key advantages of KNL, though, is the AVX-512 support, and I'm pretty sure the developers are interested in getting their hands dirty with that.

It's also an interesting opportunity to get the current codebases tuned better for multi-threading. There are certain challenges involved for sure.

At the very least, if a system like this could run 64 (or 128) simultaneous single-core workers using AVX-512, and the fast memory can keep all the pipes flowing, it should provide amazing throughput at a smaller price point than a similarly kitted-out Broadwell. Dual Broadwell systems aren't cheap, and they don't have AVX-512, 6-channel memory, or HBM... they do have faster clock speeds, though. But even then, the top 22-core Broadwell only runs at 2.2 GHz with a turbo boost to 2.8 or something. So yeah, twice as fast as the 7210P, and 44 cores in a dual-CPU setup, but maybe 2-3 times the price.

Last fiddled with by Madpoo on 2016-09-15 at 16:42
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.