mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2019-10-05, 20:06   #1
M344587487
 
 
"Composite as Heck"
Oct 2017

2·331 Posts
Default Zen 3 speculation

Zen 3 is the 7nm+ generation of AMD chiplets that follows the current 7nm Zen 2 generation. It's early to start discussing Zen 3, as not all Zen 2 SKUs have been released yet, but I'm going to be compiling data about Zen 3 anyway, so it might as well be public. Anything written as absolute is confirmed (I'll try my best anyway, hopefully with references); the rest is speculation based on logic and a big dollop of wishful thinking.

  • Late 2020 release
  • Will be used in Epyc 3 (Milan) on the SP3 socket and Ryzen 4000 on the AM4 socket
  • Probably used in Threadripper 4 on the TR4 socket
  • The 7nm+ process has reportedly 20% more density and 10% power reduction relative to 7nm: https://en.wikichip.org/wiki/amd/mic...ges_from_Zen_2
  • L3 cache is no longer split into 2x16MB chunks per chiplet, instead all 8 cores have equal access to the 32MB+ of L3 cache on the chiplet: https://www.overclock3d.net/news/cpu..._genoa_plans/1
  • Note 32MB+ not 32MB, there's potential for more than 32MB of L3 cache per chiplet. L3 cache may be the main beneficiary of the higher density on 7nm+, as I believe it currently takes up over half the die space on a Zen 2 chiplet
  • I'm not sure how useful a unified cache is for us unless the CCX topology changes. In theory it should passively increase throughput of one-worker-per-die workloads compared to Zen 2. It may also decrease throughput of one-worker-per-CCX workloads if contention is an issue, but it could just as easily increase it by smoothing out access between the workers. For general use, lightly threaded workloads may greatly benefit from (at least) a doubling of cache
  • There's potential (but it's unlikely IMO) to move to a single 8-core CCX per chiplet instead of two quad-core CCXs per chiplet. It would involve using a mesh, ring bus or some other non-direct topology within the CCX, which adds some complexity, but would optimise for up-to-8-core workloads and could still be used in a modular way
  • There's potential for a subset of AVX-512 to be implemented, maybe in a double-cycle way (like Zen/Zen+ did for AVX2), so that AVX-512 is supported on paper and optional AVX-512 subsets can also be implemented

Last fiddled with by M344587487 on 2019-10-05 at 20:16 Reason: Conference video removed from youtube
Old 2019-10-05, 23:48   #2
nomead
 
 
"Sam Laur"
Dec 2018
Turku, Finland

23·41 Posts
Default

AMD confirmed in a recent presentation that the CCX will indeed be 8 cores as well, bringing the whole chiplet together.

https://wccftech.com/amd-zen-3-epyc-...-cpu-detailed/
Old 2019-10-06, 09:11   #3
M344587487
 
 
"Composite as Heck"
Oct 2017

1010010110₂ Posts
Default

That's the presentation from which I and all the news sites got our information; there's no confirmation that a CCX will be 8 cores. I watched it before it was removed from YouTube and it was presented in a confusing manner (with slides detailing different generations intermingled). If anyone reports an 8-core CCX as confirmed they are mistaken, and most (all?) sites are reporting it as speculation.


Just the cache has been confirmed as unified. This has benefits on its own:
  • Low-threaded workloads on an under-utilised processor (like most games and consumer applications) naively have double the cache to play with
  • Workloads spanning two quad-core CCXs may get a passive boost by sometimes not having to use the IF
  • In situations where there's one or more workloads per CCX with unbalanced cache utilisation, a unified cache can be better utilised
Old 2019-10-06, 10:49   #4
nomead
 
 
"Sam Laur"
Dec 2018
Turku, Finland

328₁₀ Posts
Default

This slide has it...

edit: now that I think about it, perhaps not; it just says that the L3 cache is unified, but would that make sense without making the CCX 8 cores as well?
Attached Thumbnails: AMD-EPYC-Milan-Zen-3-Server-CPU-1030x579.png (343.1 KB)

Last fiddled with by nomead on 2019-10-06 at 10:50
Old 2019-10-06, 11:34   #5
M344587487
 
 
"Composite as Heck"
Oct 2017

2×331 Posts
Default

For the reasons I outlined above, I think so. 8-core CCXs will likely come at some point, but I'm skeptical it's going to be so soon. Changing topology seems like a big change and everything points to Zen 3 being a more incremental step. Place your bets now.
Old 2019-10-08, 18:04   #6
aurashift
 
Jan 2015

11×23 Posts
Default

For the Epyc 7003 release, they should (had better) get the memory controller off of 14nm. The uniform access to all the memory on the socket is nice, especially for big VMs, but the 150ns access latency is a weakness.


Other than that... I can't think of what the next bottleneck is going to be, besides the obvious cores per CCX.


Edit: Clock speed is a weakness. I'm eyeing the 7543 32c SKU for the clock speed, and the 64c whatever-it-is for density/cost.
They announced a 7H12 special SKU that has 64c @ 2.7GHz but, like the Intel AP series, requires 300W+ and water cooling.

Last fiddled with by aurashift on 2019-10-08 at 18:08
Old 2019-10-09, 12:03   #7
mackerel
 
 
Feb 2016
UK

2·3·5·13 Posts
Default

How did I miss this thread until now... my biggest wished-for improvement over Zen 2 would be a unified L3 cache, and if there is nothing new to get in the way, it would make these CPUs even better.

As always, I don't know how the P95-like code works, but I assume core-to-core communication isn't required and it is more a shared data set that is needed. If that thinking is correct, then a split CCX with a unified L3 would still be of great benefit. Right now, for maximum throughput, I have to keep tasks on one CCX as there is quite a drop in throughput once you cross CCXs.

AVX-512 isn't so exciting if they implement something comparable to the single-FMA-unit Intel CPUs: no throughput benefit over AVX2/FMA. If they do VNNI, there might still be something in it for low-precision uses, but I don't believe that is useful to us.
Old 2019-10-09, 17:45   #8
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

3²·7²·13 Posts
Default

We want them to include AVX-512 as soon as possible, as it shows progress towards what we actually want. AMD virtually always implements SIMD extensions that are slow initially but catch up after a generation or two. AVX-512 has so many extra bits that there will quite possibly be some useful ones even at the same throughput as FMA3. I believe that the doubling in the number of registers should help if nothing else.
Old 2019-10-09, 18:47   #9
mackerel
 
 
Feb 2016
UK

2·3·5·13 Posts
Default

Thinking more, I'm wondering how much benefit there could be from AVX-512 in Zen. Reason for saying this is that I feel they're power and thermal limited already even at lower core counts. We got the expected FP performance increase over previous generations in Zen 2, and some improvement from 7nm, but they still use a lot more energy running this type of code than otherwise.

I think we're at the point where power efficiency is more important than more raw performance.
Old 2019-10-09, 19:09   #10
M344587487
 
 
"Composite as Heck"
Oct 2017

2·331 Posts
Default

Quote:
Originally Posted by aurashift View Post
For the Epyc 7003 release, they should (had better) get the memory controller off of 14nm. The uniform access to all the memory on the socket is nice, especially for big VMs, but the 150ns access latency is a weakness.
...
I can't see them moving away from using a central IO die; it makes too much sense to separate the cores off and be able to scale the IO die as necessary. They had an on-core-die memory controller with Zen/Zen+, which necessitated NUMA and over-complicated userspace. In a perfect world programs would be NUMA-aware and scale nicely, but in reality it's too much of a burden for non-specialist programs to implement unless the market is saturated with NUMA architectures. The current Ryzen 3000 series is using the 12nm node for the IO die, presumably so that it can better accommodate the higher memory/IF clocks preferred by desktops, but AFAIK it doesn't do much for latency. What workloads are severely impacted by latency?

An interesting question is what Intel will do. They likely have to go for an MCM approach in the future (at least for servers) to scale as well as AMD have, and they'll have a similar choice to make when it comes to memory access. Their 56-core vapourware is in essence two 28C 8180s smushed together with the same NUMA as a two-processor server, just in one socket (2x6-channel not 1x12-channel, separate cache), so it's not representative of their intentions. iGPUs are another element to consider for both companies; we've yet to see how an MCM iGPU performs (unless you count the NUCs, I don't know if you can), and if Intel intends to switch to MCM for consumers they'll have that influencing the design too.

Quote:
Originally Posted by aurashift View Post
Other than that... I can't think of what the next bottlneck is going to be, besides the obvious cores per ccx.
Always memory IMO, which adding ever more cache somewhat alleviates (at the expense of a little latency at the L3 level). DDR5 will help in a few years, but we're likely talking Zen 4 with the SP5 socket change (and PCIe 5). If in doubt, add more channels? It depends how long a socket is meant to last, I guess, but if they get in DDR5 and PCIe 5 early, the SP5 socket could last a long time if they future-proof the channel count.

Quote:
Originally Posted by aurashift View Post
...
Edit: Clock speed is a weakness. I'm eying the 7543 32c SKU for the clock speed, and the 64c whatever it is for density/cost.
They announced a 7H12 special SKU that has 64c @ 2.7GHz, but, like the intel AP series, requires 300w+ and water cooling.
Is it a weakness for server parts? The 7542 32C has a base of 2.9GHz with a 225W TDP; the Xeon 8180 28C has a 2.5GHz base at a 205W TDP. I know there's a lot wrong with comparing like this (TDPs are calculated differently, the Xeon has AVX-512, actual all-core depends on workload and is somewhere between base and boost), but naively it looks like AVX-512 is the only thing in Intel's favour.

Quote:
Originally Posted by mackerel View Post
Thinking more, I'm wondering how much benefit there could be from AVX-512 in Zen. Reason for saying this is that I feel they're power and thermal limited already even at lower core counts. We got the expected FP performance increase over previous generations in Zen 2, and some improvement from 7nm, but they still use a lot more energy running this type of code than otherwise.

I think we're at the point where power efficiency is more important than more raw performance.
You have a point. AVX-512 was a way (IMO) for Intel to improve performance without necessarily improving core count or core frequency (in lieu of competition in those areas for a decade). AVX-512 and its functional benefits are out of my wheelhouse, but I'm led to believe that the extra bits are at least as interesting as the doubling of width. AMD has a history of implementing an AVX instruction set only once it's been established, and as a half-measure initially. Maybe a half-measure works as the final form, to get access to the extra bits and for the slight instruction-cache benefit (one 512-bit instruction vs two 256-bit instructions), if that even matters.
Old 2019-10-09, 21:45   #11
Mysticial
 
 
Sep 2016

329₁₀ Posts
Default

There's enough stuff in AVX-512 not related to the 512-bit width that AMD can't sit on it forever. So they'll either need to adopt AVX-512 as is, or build a competing version (highly unlikely).

When I spoke to David Suggs (lead chip architect for AMD) at Hot Chips back in August, I asked him if there's anything in AVX-512 that's fundamentally difficult (under the premise that AMD isn't already working on it).

I specifically asked about things that (I believe) would complicate a design that doesn't natively support it:
- Mask registers
- 4-input dependencies
- Decoding

He said none. No difficulties. It's just "work to be done" - without implying that they're already working on it or not.

So my guess is that AMD won't add a competing 256-bit ISA, but will wait on AVX-512 for as long as economically possible, which could be quite a few generations from now given that AMD now has Intel on its heels.

Last fiddled with by Mysticial on 2019-10-09 at 21:48