mersenneforum.org  

Old 2017-05-29, 18:57   #1
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

13×227 Posts
Intel Processor Speculations

It seems Intel is going up to 18 cores on HEDT.

http://wccftech.com/intel-core-x-sky...ore-36-thread/

Still only 4 memory channels though, so hopefully it lowers the prices of the lower core count chips.
Old 2017-05-29, 22:52   #2
ewmayer
2ω=0
 
Sep 2002
República de California

3⁴×5×29 Posts

Those sound like real monsters - the 18-core flagship i9 looks roughly the same as the soon-to-arrive Xeon Skylake server chips (which will power AWS' new C5 instances, among other things), but presumably at a much lower price point than a server CPU. (Albeit still very high relative to run-of-the-mill Skylake quads and such.)

Let's do an Intel vs AMD head-to-head-compare based on what both companies have revealed to date:

o AVX-512 (in the i9s) vs AVX-256;
o True AVX-512 support vs each AVX-256 instruction broken into two 128-bit uops;
o Similar core counts and clock speeds.

But I worry about the i9 memory subsystems' ability to keep those data-hungry vector units fed, even with those yuuge L2/3 caches. And the second bullet point above actually plays out less detrimentally for AMD than one might think, because breaking a wide vector-op into two half-width uops helps hide latency. E.g. say I have 8 independent AVX-256 vector MULs I need to do, and assume 2 uops can start per cycle with a 5-cycle latency. Intel: 2 MULs start on each of clocks 0-3, but then we idle until cycle 5 waiting for the first results to become available. AMD: 2 half-width MULs start on each of clocks 0-7, and ensuing instructions can start using the early-issued-MUL results before the late-issued ones have even begun. I'm seeing this play out in my Mlucas runs on Ryzen, where I get better than 50% of the per-cycle throughput of my Haswell, i.e. better total throughput for the Ryzen 8-core than for the Intel quad.
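
To make the issue-timing argument above concrete, here is a toy scheduling sketch. It is only a model of the example in this post (independent ops, 2 uops issued per cycle, 5-cycle latency), not a description of either vendor's real pipeline; the function names `schedule` and `bubble` are made up for illustration.

```python
def schedule(n_ops, uops_per_op, issue_width=2, latency=5):
    """Return a list of (issue_cycle, ready_cycle) per independent op.

    Toy model: uops issue in program order, `issue_width` per cycle, and an
    op's result is ready `latency` cycles after its last uop issues.
    """
    out = []
    for op in range(n_ops):
        first_uop = op * uops_per_op
        last_uop = first_uop + uops_per_op - 1
        issue = first_uop // issue_width
        ready = last_uop // issue_width + latency
        out.append((issue, ready))
    return out


def bubble(sched):
    """Idle cycles between the last issue and the first available result."""
    last_issue = max(issue for issue, _ in sched)
    first_ready = min(ready for _, ready in sched)
    return max(0, first_ready - (last_issue + 1))


full_width = schedule(8, uops_per_op=1)  # one uop per 256-bit MUL
half_width = schedule(8, uops_per_op=2)  # two 128-bit uops per 256-bit MUL

print("full-width:", full_width, "bubble =", bubble(full_width))
print("half-width:", half_width, "bubble =", bubble(half_width))
```

In this model the full-width case finishes issuing at cycle 3 and has one dead cycle before the first result shows up at cycle 5, while the half-width case is still issuing when the first results arrive. Note that the half-width case still takes longer overall to retire all 8 MULs; the point is only that its issue slots stay busy.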

It'll be interesting to do similar head-to-head compares - not just of total throughput but also of FLOPS-per-watt-and-hardware-dollar - once both vendors' new CPUs hit market, that's for sure.
Old 2017-05-30, 03:25   #3
Mysticial
 
Sep 2016

2×181 Posts

Not to spoil anything, but there are conflicting rumors that the HEDT Skylake X processors will not have true AVX512 but rather double-cycled 256-bit execution units.

Of the leaked benchmarks that I've seen so far:
  • A 6-core Skylake X benchmarked alongside a Skylake Platinum 28-core Xeon with an AVX512-enabled benchmark. If you run the numbers, the desktop Skylake has only half the per-cycle throughput of the Platinum Xeon.
  • A 10-core i9 7900X showing what appears to be full throughput AVX512.

It's already known that not all the server Xeons will have full throughput AVX512. The question is which (if any) of the HEDT Skylakes will have it.

If we assume that the AVX512 units take up a significant amount of die area as well as a lot of TDP, it makes sense for Intel to selectively disable them to improve yields. The resulting market segmentation probably plays in their favor if they want to milk people for more money to get the full AVX512.
Old 2017-05-30, 09:03   #4
mackerel
 
Feb 2016
UK

446₁₀ Posts

http://www.anandtech.com/show/11464/...umers-for-1999

Interesting times. I've been reading the article at the link above but haven't fully digested it yet.

Anyone care to discuss what the new cache arrangement might mean for performance? 1MB/core L2 and 1.375MB/core non-inclusive L3 is quite a change.
Old 2017-05-30, 10:38   #5
mackerel
 
Feb 2016
UK

2·223 Posts

Ian Cutress from Anandtech has confirmed with Intel that each core will have an AVX512 unit.
Old 2017-05-30, 22:37   #6
ewmayer
2ω=0
 
Sep 2002
República de California

26741₈ Posts

Quote:
Originally Posted by mackerel
Anyone care to discuss what the new cache arrangement might mean for performance? 1MB/core L2 and 1.375MB/core non-inclusive L3 is quite a change.
The very large L2/3 caches raise the interesting possibility that a single multithreaded job which "can live in the cache" might get better throughput than the current paradigm of one-job-per-thread.
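
As a rough back-of-envelope check on the "can live in the cache" idea, the sketch below adds up the per-core cache figures quoted above (1MB L2, 1.375MB L3) and compares them with the footprint of an FFT data array of doubles. It assumes the non-inclusive L3 lets the L2 and L3 capacities add (a best case), ignores twiddle tables and other working data, and uses the 18-core count from the first post; the helper names are hypothetical.

```python
L2_PER_CORE_MB = 1.0     # from the figures quoted in this thread
L3_PER_CORE_MB = 1.375   # non-inclusive L3, per core


def cache_mb(cores):
    """Return (total L2, total L3) in MB for a chip with `cores` cores."""
    return cores * L2_PER_CORE_MB, cores * L3_PER_CORE_MB


def fft_footprint_mb(n_doubles):
    """Size in MB of an array of `n_doubles` 8-byte floating-point values."""
    return n_doubles * 8 / 2**20


def fits(n_doubles, cores):
    """Best-case test: does the array fit in combined L2 + L3?"""
    l2, l3 = cache_mb(cores)
    return fft_footprint_mb(n_doubles) <= l2 + l3


for n in (2 * 2**20, 4 * 2**20, 8 * 2**20):
    print(f"{n >> 20}M doubles: {fft_footprint_mb(n):.0f} MB, "
          f"fits on 18 cores: {fits(n, 18)}")
```

Under those assumptions an 18-core part has 18 MB of L2 plus 24.75 MB of L3, so a 4M-double array (32 MB) would fit and an 8M-double one (64 MB) would not. Whether real code sees that benefit depends on access patterns and how the L3 actually behaves.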

Quote:
Originally Posted by mackerel
Ian Cutress from Anandtech has confirmed with Intel that each core will have an AVX512 unit.
If what Mysticial notes above is correct, the crucial question is whether the AVX-512 support is genuine or emulated (via pairs of 256-bit uops). However as I noted based on my AMD Ryzen timings, even the latter might be appreciably faster than the current AVX-256 due to the latency-hiding effects of the emulation. IOW we may end up with the following 3-tier Intel processor landscape, in descending order of per-cycle throughput, with AMD offerings forming the fourth tier:

o genuine AVX-512 [high-end i9]
o emulated AVX-512 [low-end i9]
o genuine AVX-256 [old and new i7]
o emulated AVX-256 [AMD]

Hopefully it won't be too long before we have actual i9 hardware to play on.
Old 2017-05-30, 23:10   #7
Mysticial
 
Sep 2016

101101010₂ Posts

Quote:
Originally Posted by ewmayer
The very large L2/3 caches raise the interesting possibility that a single multithreaded job which "can live in the cache" might get better throughput than the current paradigm of one-job-per-thread.



If what Mysticial notes above is correct, the crucial question is whether the AVX-512 support is genuine or emulated (via pairs of 256-bit uops). However as I noted based on my AMD Ryzen timings, even the latter might be appreciably faster than the current AVX-256 due to the latency-hiding effects of the emulation. IOW we may end up with the following 3-tier Intel processor landscape, in descending order of per-cycle throughput, with AMD offerings forming the fourth tier:

o genuine AVX-512 [high-end i9]
o emulated AVX-512 [low-end i9]
o genuine AVX-256 [old and new i7]
o emulated AVX-256 [AMD]

Hopefully it won't be too long before we have actual i9 hardware to play on.
FWIW, I'm more worried about the bandwidth problem.

I have benchmarks from a 40-core Skylake Gold system. (Which I can't really disclose since the source doesn't even know if he's under NDA)
Based on the small-data scaling, I'm about 90% sure that model has the full-throughput AVX512. However, the AVX2 -> AVX512 scaling for large-data is so hilariously bad that it makes Knights Landing look good.

Part of the problem is likely due to NUMA, since the source said he has no access to the BIOS to enable node-interleaving, nor did he mention anything about "numactl --interleave=all".

Last fiddled with by Mysticial on 2017-05-30 at 23:10
Old 2017-06-19, 15:38   #8
Mysticial
 
Sep 2016

2·181 Posts

NDAs lift today. According to this: http://www.anandtech.com/show/11550/...7800x-tested/3

Quote:
Nominally the FMAs on ports 0 and 1 are 256-bit, so in order to drive towards the AVX-512-F these two ports are fused together, similar to how AVX-512-F is implemented in Knights Landing. The six-core and eight-core Skylake-X parts support one fused FMA for AVX-512-F, although the 10-core will support dual 512-bit AVX-512-F ports, which seems to be located on port 5. This means that the 10-core i9-7900X can support 64 SP or 32 DP calculations per cycle, whereas the 8-core/6-core parts can support 32 SP or 16 DP per cycle.
The first 512-bit FMA unit is done by combining the port0 and port1 256-bit pipes. The 2nd 512-bit FMA is added onto the port5 shuffle pipe and is fused off in the lower SKUs.
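
The per-cycle figures in the quote follow from simple lane arithmetic, sketched below. This only reproduces the article's numbers under the usual convention that one FMA counts as two floating-point operations per lane; `flops_per_cycle` is a made-up helper name, not anything from Intel or Anandtech.

```python
def flops_per_cycle(fma_units, vector_bits=512, element_bits=32):
    """Peak floating-point ops per cycle per core.

    Each FMA unit processes vector_bits // element_bits lanes per cycle,
    and each fused multiply-add counts as 2 operations (multiply + add).
    """
    lanes = vector_bits // element_bits
    return fma_units * lanes * 2


# 10-core i9-7900X per the article: two 512-bit FMA units
print("2 FMA units:", flops_per_cycle(2, element_bits=32), "SP,",
      flops_per_cycle(2, element_bits=64), "DP")

# 6-core and 8-core parts per the article: one fused 512-bit FMA unit
print("1 FMA unit: ", flops_per_cycle(1, element_bits=32), "SP,",
      flops_per_cycle(1, element_bits=64), "DP")
```

That gives 64 SP / 32 DP with two units and 32 SP / 16 DP with one, matching the quoted numbers. These are peak issue rates only and say nothing about whether memory can keep the units fed.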

This raises a bunch of questions:
  • Dual-issue 512-bit FMA will be using 3 ports? Seems a bit asymmetric.
  • When a 512-bit FMA goes into port0/port1, is it 1 or 2 uops? If only 1 uop, does it completely block the other port from doing anything else? Or does it only block it from doing FPU?
  • It's not possible to dual-issue 512-bit FMA and a shuffle.
  • How wide is the port5 shuffle unit? Does it have 1-cycle throughput shuffle? And is it the same on the half-throughput AVX512 vs. the full-throughput AVX512?
  • Is it not possible to 3-issue 512-bit integer SIMD? (You can 3-issue 256-bit integer SIMD on Skylake desktop.)

Agner Fog is gonna have some fun with these.
Old 2017-06-19, 20:55   #9
Mark Rose
 
"/X\(‘-‘)/X\"
Jan 2013

B87₁₆ Posts

Also interesting:

Quote:
The new mesh topology for the Skylake-SP core was perhaps more of a requirement for consistency than an option over the older ring bus system, which starts to outgrow its usefulness as more cores are added. Intel has already had success with mesh architectures with the Xeon Phi chips, so this isn’t entirely new, but essentially makes the chip a big 2D-node array for driving data around the core. As with the ring bus, core-to-core latency will vary based on the locality of the cores, and those nearest the DRAM controllers will get the best benefit for memory accesses. As Intel grows its core-count, it will be interesting to see how the mesh scales.
So some cores could be faster for Prime95.
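
A toy way to picture why some cores might be faster: count mesh hops from each tile to its nearest memory controller. Everything concrete here is an assumption for illustration (a 4x5 grid, controllers at two edge positions, one hop per adjacent tile); it is not the actual Skylake-SP floorplan or routing.

```python
def hops(a, b):
    """Manhattan distance between two (row, col) tiles on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_controller_hops(rows, cols, controllers):
    """Map each tile to its hop count to the closest memory controller."""
    return {
        (r, c): min(hops((r, c), m) for m in controllers)
        for r in range(rows)
        for c in range(cols)
    }


# Hypothetical layout: 4 rows x 5 columns, controllers on the left and right edges.
grid = nearest_controller_hops(4, 5, controllers=[(1, 0), (1, 4)])

for r in range(4):
    print(" ".join(str(grid[(r, c)]) for c in range(5)))
```

In this made-up layout the tiles next to a controller are 0 or 1 hops away while the bottom-center tile is 4 hops away, which is the kind of spread that could show up as per-core timing differences.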
Old 2017-06-19, 21:15   #10
Mysticial
 
Sep 2016

362₁₀ Posts

Originally, I had assumed that all of the LCC Skylake X chips would have only half-throughput AVX512.

So my plan was to get the 8-core one for development and do correctness testing on all the AVX512 code that I've accumulated since 2013. Then, come October, trade it up for the 16- or 18-core one for proper performance tuning (especially around the anticipated memory bottleneck).

Since the full-throughput chip is coming out now, I'll get that so I can start early. But I'm not sure if I still want to double-dip on another high-end chip just 4 months from now.
Old 2017-06-20, 03:02   #11
ewmayer
2ω=0
 
Sep 2002
República de California

2DE1₁₆ Posts

Quote:
Originally Posted by Mysticial
NDAs lift today. According to this: http://www.anandtech.com/show/11550/...7800x-tested/3

The first 512-bit FMA unit is done by combining the port0 and port1 256-bit pipes. The 2nd 512-bit FMA is added onto the port5 shuffle pipe and is fused off in the lower SKUs.
Thanks for the link! Any hints as to ship date for these CPUs?

The article only mentioned the two-half-width-uops implementation for 512-bit FMA ... that surely also includes pure-FMUL, but are they also lumping FADD in with that? If vector add were able to execute 2-per-cycle at full 512-bit width that would give a nice boost to FFT arithmetic, which is add-dominated.
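
For a feel of what "add-dominated" means, here is a textbook operation count for a plain radix-2 complex FFT, with no FMA fusion and no skipping of trivial twiddles. This is a generic sketch, not how Mlucas or Prime95 actually structure their transforms; `radix2_real_ops` is a hypothetical helper.

```python
def radix2_real_ops(n):
    """Real (muls, adds) for a length-n radix-2 complex FFT, n a power of 2.

    Each butterfly does one complex twiddle multiply (4 real muls + 2 real
    adds) plus one complex add and one complex subtract (2 real adds each).
    """
    stages = n.bit_length() - 1          # log2(n) for a power of two
    butterflies = (n // 2) * stages
    muls = 4 * butterflies
    adds = 6 * butterflies
    return muls, adds


muls, adds = radix2_real_ops(1024)
print(f"n=1024: {muls} real muls, {adds} real adds, ratio {adds / muls:.2f}")
```

Even in this simplest form there are 1.5 real adds per real multiply, and higher-radix butterflies generally push the ratio further toward adds, so a second full-width ADD-capable port would matter more than the raw FMA count suggests.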

Last fiddled with by ewmayer on 2017-06-20 at 03:03

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
