mersenneforum.org

Knights Landing reservations (https://www.mersenneforum.org/showthread.php?t=20152)

Madpoo 2016-09-04 16:49

[QUOTE=airsquirrels;441532]I believe I read that KNL does not support virtualization instructions.[/QUOTE]

Hmm... that would be troubling...

I did a quick Google and found this:
[URL="http://colfaxresearch.com/knl-avx512/"]http://colfaxresearch.com/knl-avx512/[/URL]

There's a spot on there that lists the output of "cat /proc/cpuinfo", and "vmx" is listed. (That link itself might prove useful, since it talks about the 512-bit AVX instructions.)

On the other hand though, I went to the official Intel spec page, and you're right, it says no virtualization support. Boo!
[URL="http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core"]http://ark.intel.com/products/94033/Intel-Xeon-Phi-Processor-7210-16GB-1_30-GHz-64-core[/URL]

That's disappointing since it means I wouldn't be able to use any of these for work (virtual hosting). Bummer.
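If you want to check a box yourself, here is a minimal Python sketch that pulls the flag list out of `/proc/cpuinfo`-style text and reports `vmx` (Intel VT-x) plus the AVX-512 flags Linux uses for KNL (`avx512f`, `avx512pf`, `avx512er`, `avx512cd`). The helper name `cpu_flags` is made up for illustration, and as this thread shows, a flag appearing in cpuinfo doesn't by itself settle what Intel's spec page officially supports.

```python
from __future__ import annotations

import os


def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Exact key match, so a separate "vmx flags" line (newer kernels) is not picked up.
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Only try the real file where it exists (Linux).
if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        flags = cpu_flags(f.read())
    for name in ("vmx", "avx512f", "avx512pf", "avx512er", "avx512cd"):
        print(f"{name:9s} {'yes' if name in flags else 'no'}")
```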

GP2 2016-09-04 18:23

[QUOTE=Madpoo;441542]On the other hand though, I went to the official Intel spec page, and you're right, it says no virtualization support.[/QUOTE]

So it sounds like this chip is specifically for the HPC niche market, and some other processor will be intended for servers. Few of us will actually use one, in home computing farms or cloud computing.

ewmayer 2016-09-04 20:39

[QUOTE=GP2;441546]So it sounds like this chip is specifically for the HPC niche market, and some other processor will be intended for servers. Few of us will actually use one, in home computing farms or cloud computing.[/QUOTE]

Which is why the instruction set carrying forward to future, wider releases, including the PC market, was an absolute must for justifying raising money for early adoption of such a system; previous Intel 'Xeon Phi specials' proved evolutionary dead ends in this regard.

As the funding page states, this system is aimed at the folks doing HPC code development for number theory. There are probably fewer than a dozen such people who regularly post here, but that is the inherent nature of distributed-computing (DC) projects like GIMPS.

airsquirrels 2016-09-04 20:54

Thanks to the incredibly attentive HJWASM developers and a very small additional patch, I was able to get the latest version of HJWASM to successfully assemble gwnum.a for Linux.

I am still working out some nuances in the rest of the packaging/build process for Prime95 on my Linux dev box, but things do seem to be moving forward. There is now a way to build Prime95 with support for AVX-512 instructions.

Madpoo 2016-09-05 20:18

[QUOTE=airsquirrels;441562]Thanks to the incredibly attentive HJWASM developers and a very small additional patch, I was able to get the latest version of HJWASM to successfully assemble gwnum.a for Linux.

I am still working out some nuances in the rest of the packaging/build process for Prime95 on my Linux dev box, but things do seem to be moving forward. There is now a way to build Prime95 with support for AVX-512 instructions.[/QUOTE]

If you get to a point where you want to test that build to make sure it's spitting out the same results as mprime would (sans any AVX-512 stuff, of course... just to make sure it works equally well), I'd suggest trying it out by doing triple-checks of some small exponents... why not. :smile:

I've done triple-checks on all exponents below 2M (that didn't already have 3), and it's not a bad way to compare results of a new build to some tried & true residues from the past.

I can generate a list of suitable worktodo lines for you if you'd like.

Note that a custom build won't have the checksum or security code or whatever it is that an official "George" build would have, so it wouldn't be accepted by the server.
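The suggestion boils down to: rerun exponents whose 64-bit residues (res64) are already known and check that the new build agrees. Below is a minimal sketch of that comparison, assuming the res64 is the low 64 bits of the final Lucas-Lehmer term (the usual GIMPS convention). `lucas_lehmer_res64` and `mismatches` are hypothetical helpers, and plain Python big-integer arithmetic is only practical for tiny exponents, nowhere near the 2M range discussed here.

```python
from __future__ import annotations


def lucas_lehmer_res64(p: int) -> int:
    """Low 64 bits of the final Lucas-Lehmer term for M_p = 2**p - 1.

    p must be an odd prime; the result is 0 exactly when M_p is prime.
    """
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s & 0xFFFF_FFFF_FFFF_FFFF


def mismatches(new_build: dict[int, int], reference: dict[int, int]) -> list[int]:
    """Exponents whose res64 from the new build is missing from, or disagrees with, the reference."""
    return sorted(p for p, r in new_build.items() if reference.get(p) != r)


if __name__ == "__main__":
    # Stand-in "reference" residues computed locally; in practice these would be
    # the already-verified residues from earlier first and double checks.
    reference = {p: lucas_lehmer_res64(p) for p in (11, 13, 17, 19, 23, 29, 31)}
    new_build = dict(reference)
    print(mismatches(new_build, reference))  # []
```

Any exponent returned by `mismatches` is one where the new build disagrees with the trusted result and needs a closer look.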

Mark Rose 2016-09-05 20:29

[QUOTE=Madpoo;441664]If you get to a point where you want to test that build to make sure it's spitting out the same results as mprime would (sans any AVX-512 stuff, of course... just to make sure it works equally well), I'd suggest trying it out by doing triple-checks of some small exponents... why not. :smile:[/quote]

It causes spikes in the [url=http://www.mersenne.org/primenet/graphs.php]graphs[/url].

[quote]I've done triple-checks on all exponents below 2M (that didn't already have 3), and it's not a bad way to compare results of a new build to some tried & true residues from the past.

I can generate a list of suitable worktodo lines for you if you'd like.[/QUOTE]

They're done up to 2.06M, actually. I usually run a few hundred to test new hardware.

airsquirrels 2016-09-05 20:42

[QUOTE=Madpoo;441664]If you get to a point where you want to test that build to make sure it's spitting out the same results as mprime would (sans any AVX-512 stuff, of course... just to make sure it works equally well), I'd suggest trying it out by doing triple-checks of some small exponents... why not. :smile:

I've done triple-checks on all exponents below 2M (that didn't already have 3), and it's not a bad way to compare results of a new build to some tried & true residues from the past.

I can generate a list of suitable worktodo lines for you if you'd like.

Note that a custom build won't have the checksum or security code or whatever it is that an official "George" build would have, so it wouldn't be accepted by the server.[/QUOTE]

That's not a bad thought; it's an easy enough way to validate things.

I was aware of the security code issue. I suppose if I were malicious I could just make that work...

I don't expect to submit anything to PrimeNet unless it is promoted to a working build that has been put through all its paces. Right now my goal is just to make things buildable on a KNL system so that I, George, or whoever could poke at AVX-512.

Prime95 2016-09-06 19:12

Preliminary KNL analysis
 
Intel says the 16GB HBM memory has 4x the bandwidth of the 3-channel DDR4 RAM. FFT data will easily fit in 16GB, so the good news is we should be running entirely from HBM at all times. Comparing KNL to a 4-core Skylake with 2-channel DDR4 RAM, the KNL system will have 4x (HBM vs. DDR4) times 1.5x (3-channel vs. 2-channel), or 6x the memory bandwidth. Unfortunately, we have 6x the memory bandwidth feeding 16x the number of cores!

If a Skylake system is hurting for memory bandwidth, the KNL is going to be downright starving. We're looking at roughly 33% FPU utilization.

I do not have any good ideas for reducing memory bandwidth requirements any further. The only option may be to run TF (trial factoring) on all 64 cores as hyperthreads alongside a 64-core FFT.
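A back-of-envelope version of that estimate, counting cores as the post does. If the Skylake box were fully fed, 6x the bandwidth for 16x the cores would cap utilization at 6/16 = 37.5%; since the Skylake is itself short on bandwidth, the real figure would be lower, the same ballpark as the "roughly 33%" quoted. The model and the function name are assumptions for illustration; the inputs (3-channel vs. 6-channel, per-core speed) are exactly what gets disputed in the replies.

```python
def bandwidth_bound_utilization(bw_ratio: float, core_ratio: float,
                                baseline_util: float = 1.0) -> float:
    """Crude FPU-utilization estimate when bandwidth scales by bw_ratio but cores by core_ratio.

    Assumes each core needs the same bandwidth it did on the baseline machine,
    which was running at baseline_util (1.0 = not bandwidth-limited at all).
    """
    return min(1.0, baseline_util * bw_ratio / core_ratio)


hbm_vs_ddr4 = 4.0                        # Intel's figure, as quoted in the post
channel_ratio = 3 / 2                    # 3-channel vs 2-channel DDR4, as assumed in the post
bw_ratio = hbm_vs_ddr4 * channel_ratio   # 6x
core_ratio = 64 / 4                      # 16x
print(bandwidth_bound_utilization(bw_ratio, core_ratio))  # 0.375
```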

Mark Rose 2016-09-06 19:59

I believe KNL is 6-channel, not 3-channel.

The cores are based on Atom (Silvermont) and are clocked slower, but have 4 hyperthreads each. The chip we're getting probably runs at 1.3 GHz and probably has only 64 cores enabled. Edit: Each core will have double the per-cycle FP throughput of a Skylake core.

Apparently the onboard 16 GB can deliver over 400 GB/s, versus the 42 GB/s a Skylake with 2-channel DDR4-3200 gets.

If we consider a 4 GHz Skylake, KNL has roughly 10.4 times the compute, but also about 10 times the bandwidth.

So it might not be so bad.

Edit: Updated to reflect comments from ldesnogu
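Mark's point can be restated as machine balance (bytes of memory bandwidth per peak FLOP): if KNL's ratio is close to Skylake's, an FFT that is bandwidth-limited on Skylake should be about equally limited on KNL, not several times worse. A minimal sketch using the numbers from this thread (the 256 and 2662 GFLOP/s peaks are ldesnogu's); the helper name is made up, and peak figures ignore everything else that limits real throughput.

```python
def bytes_per_flop(bandwidth_gb_s: float, peak_gflop_s: float) -> float:
    """Machine balance: bytes of memory traffic available per peak floating-point operation."""
    return bandwidth_gb_s / peak_gflop_s


# Figures from this thread: 42 GB/s and 256 GFLOP/s for a 4-core 4 GHz Skylake,
# "over 400 GB/s" of onboard memory and about 2662 GFLOP/s for a 64-core 1.3 GHz KNL.
skylake = bytes_per_flop(42.0, 256.0)
knl = bytes_per_flop(400.0, 2662.4)
print(f"Skylake {skylake:.3f} B/FLOP, KNL {knl:.3f} B/FLOP, ratio {knl / skylake:.2f}")
```

With these inputs KNL comes out at a bit over 90% of Skylake's bytes per FLOP, which is the "it might not be so bad" conclusion in numeric form.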

ldesnogu 2016-09-06 20:10

Skylake can do 16 DP FLOPs/cycle. So 4 cores at 4 GHz will give 256 GFLOP/s.

KNL can do 32 DP FLOPs/cycle. So 64 cores at 1.3 GHz will give 2662 GFLOP/s. That's more than 10 times a 4-core Skylake, not 5.
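The same arithmetic as a short Python sketch. The FLOPs-per-cycle breakdown in the comments (two FMA units per core, an FMA counted as two FLOPs) is an assumption about where the 16 and 32 come from; the core counts and clocks are the ones used above.

```python
def peak_gflops(cores: int, ghz: float, dp_flops_per_cycle: int) -> float:
    """Theoretical double-precision peak in GFLOP/s: cores x clock x FLOPs per cycle per core."""
    return cores * ghz * dp_flops_per_cycle


# Per-core DP FLOPs/cycle, assuming two FMA units per core and counting an FMA as 2 FLOPs:
#   Skylake (AVX2, 4 doubles per vector):  2 * 4 * 2 = 16
#   KNL (AVX-512, 8 doubles per vector):   2 * 8 * 2 = 32
skylake = peak_gflops(4, 4.0, 16)   # 256.0
knl = peak_gflops(64, 1.3, 32)      # about 2662.4
print(f"KNL / Skylake = {knl / skylake:.1f}x")
```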

henryzz 2016-09-06 20:20

[QUOTE=Prime95;441759]Intel says the 16GB HBM memory has 4x the bandwidth of the 3-channel DDR4 RAM. FFT data will easily fit in 16GB, so the good news is we should be running entirely from HBM at all times. Comparing KNL to a 4-core Skylake with 2-channel DDR4 RAM, the KNL system will have 4x (HBM vs. DDR4) times 1.5x (3-channel vs. 2-channel), or 6x the memory bandwidth. Unfortunately, we have 6x the memory bandwidth feeding 16x the number of cores!

If a Skylake system is hurting for memory bandwidth, the KNL is going to be downright starving. We're looking at roughly 33% FPU utilization.

I do not have any good ideas for reducing memory bandwidth requirements any further. The only option may be to run TF (trial factoring) on all 64 cores as hyperthreads alongside a 64-core FFT.[/QUOTE]

16x the cores, but each will be a fair bit slower, at least before AVX-512. Is there any point in developing AVX-512 code while the current systems are so memory-bound? Can we actually expect any improvement?

Didn't you mention a while back that it might be possible to reduce memory bandwidth by using integer FFTs rather than floating-point ones? I may be misremembering.
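For readers unfamiliar with the idea: an "integer FFT" usually means a number-theoretic transform (NTT), the same butterfly structure as an FFT but done modulo a prime, so there is no floating-point rounding. Below is a minimal sketch of squaring a number with an NTT, assuming the common NTT-friendly prime 998244353 (119 * 2^23 + 1) with primitive root 3, and base-10 digits purely for readability. It is a textbook illustration, not how Prime95/gwnum works (that code uses floating-point FFTs), and it says nothing about whether an integer transform would actually save bandwidth on KNL.

```python
from __future__ import annotations

MOD = 998244353   # 119 * 2**23 + 1, so power-of-two lengths up to 2**23 work
ROOT = 3          # primitive root modulo MOD


def ntt(values: list[int], invert: bool = False) -> list[int]:
    """Iterative radix-2 number-theoretic transform; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    # Bit-reversal permutation so the butterflies can run in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        # w: primitive length-th root of unity mod MOD (its inverse for the inverse transform).
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            wk = 1
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * wk % MOD
                a[start + k] = (u + v) % MOD
                a[start + k + half] = (u - v) % MOD
                wk = wk * w % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def square_digits(digits: list[int], base: int = 10) -> list[int]:
    """Square a little-endian digit list: forward NTT, pointwise square, inverse NTT, carries.

    Exact while every convolution coefficient stays below MOD (fine for base-10
    inputs up to millions of digits).
    """
    n = 1
    while n < 2 * len(digits):
        n <<= 1
    spectrum = ntt(digits + [0] * (n - len(digits)))
    coeffs = ntt([x * x % MOD for x in spectrum], invert=True)
    out, carry = [], 0
    for c in coeffs:
        carry += c
        out.append(carry % base)
        carry //= base
    while carry:
        out.append(carry % base)
        carry //= base
    while len(out) > 1 and out[-1] == 0:
        out.pop()
    return out


def to_digits(x: int, base: int = 10) -> list[int]:
    """Little-endian digits of a non-negative integer."""
    digits = []
    while True:
        x, d = divmod(x, base)
        digits.append(d)
        if x == 0:
            return digits


def from_digits(digits: list[int], base: int = 10) -> int:
    return sum(d * base**i for i, d in enumerate(digits))


print(from_digits(square_digits(to_digits(12345))))  # 152399025
```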


All times are UTC.