mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Zen 2 details announced (https://www.mersenneforum.org/showthread.php?t=23783)

M344587487 2018-11-07 00:10

Zen 2 details announced
 
Zen 2 details were announced today. Highlights include:
[LIST]
[*]MCM design, with a central 14nm I/O die connected to multiple 7nm chiplets
[*]Each chiplet has up to 8 cores, probably two four-core CCXs per die as is the case now, but this is unconfirmed
[*]Up to 64C/128T per socket by having up to 8 chiplets
[*]FPU upgraded to 256-bit, meaning AVX2 parity with Intel offerings
[*]Vague improvements over Zen+ (improved branch predictor, better instruction pre-fetching, re-optimised instruction cache, larger op cache)
[/LIST]
There are more details about Zen 2, 7nm GPUs and a few other things if you want to sift through the talk:
[URL]https://www.youtube.com/watch?v=GwX13bo0RDQ&t=3686s[/URL]

They also announced gen 1 Epyc AWS instances: [URL]https://www.mersenneforum.org/showthread.php?t=23782[/URL]

For our purposes the biggest news is the upgraded FPU. With it each core will be able to utilise more memory bandwidth, meaning the lower core count parts will be the sweet spot for saturating available memory bandwidth. A quad core will probably be enough to saturate dual channel on 3rd gen Ryzen, similar to Intel's sweet spot now. Better power efficiency with the 7nm node shrink is another no-brainer. They didn't mention cache beyond the vague bullet point above.

Speculation: Having a dedicated I/O die may allow for a more performant memory controller. It may allow for UMA. It makes sense that they will create two 14nm dies (a smaller one for Ryzen and a bigger one for Epyc and Threadripper) but use the same 7nm chiplet throughout. They've said nothing of an iGPU chiplet, as the event was all about servers. I wouldn't be surprised if the iGPU chiplet was a Zen 2 version of what we have now in the 2400G (quad core CPU + iGPU on a single chip). This would allow Ryzen to have up to 12 cores and an iGPU, and scales down nicely to quad and dual core + iGPU for the low end. I don't think they'll dedicate an entire chiplet to the iGPU, as the one in the 2400G is limited by memory bandwidth as it is, unless they've decided against a 12 core Ryzen and make the iGPU chiplet much smaller than the 8 core chiplet.

mackerel 2018-11-07 08:40

Yup, most excited about the FPU upgrade as it means, if I want a farm of LLR crunchers, quad core offerings on dual channel RAM will probably hit a sweet spot of price/performance/power. Question is, how long will it be before consumer versions? I'm hoping, at worst, they'll keep the cadence they had up to now, with Ryzen 1000 and 2000 both being launched around April from memory.

M344587487 2018-11-07 10:49

[QUOTE=mackerel;499796]Yup, most excited about the FPU upgrade as it means, if I want a farm of LLR crunchers, quad core offerings on dual channel RAM will probably hit a sweet spot of price/performance/power. Question is, how long will it be before consumer versions? I'm hoping, at worst, they'll keep the cadence they had up to now, with Ryzen 1000 and 2000 both being launched around April from memory.[/QUOTE]
I think that's wishful thinking, but perhaps that'll be the case. To hit Intel where it hurts they'll probably prioritise server. The power efficiency is nice for consumer, particularly mobile, but it's a game changer for servers. A likely clock of around 4.5 GHz plus IPC improvements is big news for enthusiasts, but unfortunately they'll have to get in line. I'm expecting something along the lines of:
[LIST]
[*]Q4 2018 7nm compute GPUs
[*]Q1 2019 Epyc 2 for validation
[*]Early Q2 2019 perhaps a trickle of Ryzen 3700 8C/16T and Ryzen 3600 6C/12T while waiting for Epyc 2 validation and catering to the enthusiasts who are willing to pay more
[*]Q2 2019 Epyc 2 full steam ahead. Possibly Threadripper
[*]Q3 2019 the rest of the Ryzen lineup that had to wait for the right kind of defective dies to get stockpiled to have any meaningful supply. Ramping up of the top end Ryzen. Threadripper. Possibly some secret sauce
[/LIST]
If it's not too taxing on the I/O die they may allow for two chiplets on Ryzen. That could allow them to keep 10 and 12 core chips with an iGPU (3800/3900?) in the back pocket in case Intel makes an unexpected move next year. Not the most useful of parts, but it would allow enthusiasts to have their cake and eat it too, and would answer a lot of headline-grabbing moves Intel could make.

M344587487 2018-11-07 14:57

The official video is now up with much better audio and seemingly extra details: [url]https://www.youtube.com/watch?v=kC3ny3LBfi4[/url]

Mysticial 2018-11-07 17:14

If I understood the presentation and slides correctly, the Zen 2 FPU is the same as the old one, but double in size. That makes it (in theory) better than the non-AVX512 Intels.

On Zen1, you could do 2 x 128-bit FMA + 2 x 128-bit FADD. (though not sustainably)
If it's the same or better on Zen2, you can probably do 2 x 256-bit FMA + 2 x 256-bit FADD. (also unlikely to be sustainable)

So you get FMA parity with non-AVX512 Intel. And you have a couple extra FADD units to pick at the FADDs lying around. (since most code is not going to be 100% FMA and will have FADDs as well)
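
As a rough sketch of the arithmetic above, the peak double-precision throughput per core can be tallied from the pipe counts and widths. The Zen 2 row is the hypothetical configuration from this post, not a confirmed spec, and the function and dictionary names are made up for illustration. These are upper bounds that ignore issue-width and load/store limits, which is why the combined figure is not expected to be sustainable.

```python
# Peak double-precision FLOPs per cycle per core, from pipe counts and widths.
# The Zen 2 entry is an assumption taken from the discussion, not a confirmed spec.

def peak_dp_flops_per_cycle(vector_bits, fma_pipes, fadd_pipes):
    lanes = vector_bits // 64        # 64-bit doubles per vector
    fma = fma_pipes * lanes * 2      # an FMA counts as a multiply plus an add
    fadd = fadd_pipes * lanes        # an FADD is one operation per lane
    return fma + fadd

configs = {
    "Zen 1 (2x128 FMA + 2x128 FADD)": (128, 2, 2),
    "Zen 2, assumed (2x256 FMA + 2x256 FADD)": (256, 2, 2),
    "Intel without AVX-512 (2x256 FMA)": (256, 2, 0),
}

for name, cfg in configs.items():
    print(f"{name}: {peak_dp_flops_per_cycle(*cfg)} DP FLOPs/cycle")
```

On the FMA pipes alone, the assumed Zen 2 configuration matches the non-AVX512 Intel figure. The extra FADD pipes only add to the total when the instruction mix actually contains standalone adds.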

mackerel 2018-11-07 18:31

[QUOTE=M344587487;499798][*]Early Q2 2019 perhaps a trickle of Ryzen 3700 8C/16T and Ryzen 3600 6C/12T while waiting for Epyc 2 validation and catering to the enthusiasts who are willing to pay more[/QUOTE]

My original thinking was that Intel had been releasing refreshed consumer desktop CPUs roughly once a year, at least as far back as Haswell. We only have limited data points for Ryzen, but the two generational launches so far have been in April, a year apart.

You make a good point, that this may be skewed to higher end parts. We don't have Zen+ quad cores, do we? The 2000 series APUs are still Zen.

I need to catch up on the full video tonight.

[QUOTE=Mysticial;499818]If I understood the presentation and slides correctly, the Zen 2 FPU is the same as the old one, but double in size. That makes it (in theory) better than the non-AVX512 Intels.[/QUOTE]

Assuming that is correct, would you dare put a figure on how much faster it could be in FFT implementations? Not looking for a detailed analysis, but are we talking ballpark 1% or 10%, for example?

It feels like my world is imploding. Having bought Intels for so long due to their FPU performance, my next system(s) may switch to Zen 2. I will probably still get a Skylake-X refresh at some point for AVX-512.

Mysticial 2018-11-07 18:38

[QUOTE=mackerel;499824]Assuming that is correct, would you dare put a figure on how much faster it could be in FFT implementations? Not looking for a detailed analysis, but are we talking ballpark 1% or 10%, for example?

It feels like my world is imploding. Having bought Intels for so long due to their FPU performance, my next system(s) may switch to Zen 2. I will probably still get a Skylake-X refresh at some point for AVX-512.[/QUOTE]

I'm not even gonna try. Especially if they keep upping the core count without adding more memory channels.

mackerel 2018-11-07 19:17

Fortunately for my prime interests, many of them are still searchable at FFT sizes small enough to run entirely from cache, so peak CPU performance without RAM limiting is still possible. It does limit the value somewhat... I also wonder if the combined effective cache is big enough to take RAM out of the equation, at least on higher core count parts.
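
A back-of-the-envelope sketch of the "does it fit in cache" question, assuming an FFT length quoted in K means that many times 1024 double-precision words of 8 bytes each. It counts only the main data array and ignores twiddle tables and other auxiliary data, so real footprints will be somewhat larger. The cache size used is a placeholder to be swapped for the part in question, and the function names are made up for illustration.

```python
# Rough FFT working-set estimate versus an assumed cache size.
# Only the main data array is counted; auxiliary tables are ignored.

def fft_working_set_mib(fft_len_k):
    # FFT length in K = fft_len_k * 1024 doubles, 8 bytes each
    return fft_len_k * 1024 * 8 / 2**20

def fits_in_cache(fft_len_k, cache_mib):
    return fft_working_set_mib(fft_len_k) <= cache_mib

ASSUMED_CACHE_MIB = 16  # hypothetical figure, adjust for the CPU being considered

for k in (256, 1024, 2048, 4608):
    size = fft_working_set_mib(k)
    verdict = "fits" if fits_in_cache(k, ASSUMED_CACHE_MIB) else "spills to RAM"
    print(f"{k}K FFT: {size:.1f} MiB, {verdict}")
```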

ewmayer 2018-11-09 20:47

Just by way of reference, for the current Ryzen (a.k.a. Ryzen 1), would someone be so kind as to post mprime/Prime95 timings at various FFT lengths of interest here? Please not the usual blizzard of various-thread-count-configurations, just the ms/iter (or throughput in iters/sec) for the total-system-throughput-maximizing config at each FFT length.

Back when I did a trial build of Mlucas 17.1 on David Stanfill's then-new octocore Ryzen, I got some very interesting results. Here are ms/iter numbers for 1-core-running at various FFT lengths:
[code]
ewmayer@RyzenBeast:~/mlucas_v17.1/obj_avx2$ cat mlucas.cfg
17.1
1024 msec/iter = 9.96 ROE[avg,max] = [0.233398438, 0.281250000] radices = 32 16 32 32 0 0 0 0 0 0
1152 msec/iter = 11.78 ROE[avg,max] = [0.262165179, 0.312500000] radices = 36 16 32 32 0 0 0 0 0 0
1280 msec/iter = 12.73 ROE[avg,max] = [0.277678571, 0.343750000] radices = 40 16 32 32 0 0 0 0 0 0
1408 msec/iter = 14.78 ROE[avg,max] = [0.286049107, 0.343750000] radices = 44 16 32 32 0 0 0 0 0 0
1536 msec/iter = 15.36 ROE[avg,max] = [0.246595982, 0.312500000] radices = 48 16 32 32 0 0 0 0 0 0
1664 msec/iter = 17.80 ROE[avg,max] = [0.299107143, 0.375000000] radices = 52 16 32 32 0 0 0 0 0 0
1792 msec/iter = 17.98 ROE[avg,max] = [0.292968750, 0.343750000] radices = 56 16 32 32 0 0 0 0 0 0
1920 msec/iter = 20.80 ROE[avg,max] = [0.290178571, 0.375000000] radices = 60 16 32 32 0 0 0 0 0 0
2048 msec/iter = 20.58 ROE[avg,max] = [0.238539342, 0.281250000] radices = 32 32 32 32 0 0 0 0 0 0
2304 msec/iter = 24.23 ROE[avg,max] = [0.237191336, 0.281250000] radices = 36 32 32 32 0 0 0 0 0 0
2560 msec/iter = 26.22 ROE[avg,max] = [0.294642857, 0.375000000] radices = 40 32 32 32 0 0 0 0 0 0
2816 msec/iter = 30.43 ROE[avg,max] = [0.241350446, 0.281250000] radices = 44 32 32 32 0 0 0 0 0 0
3072 msec/iter = 31.48 ROE[avg,max] = [0.235825893, 0.281250000] radices = 48 32 32 32 0 0 0 0 0 0
3328 msec/iter = 36.65 ROE[avg,max] = [0.308035714, 0.375000000] radices = 52 32 32 32 0 0 0 0 0 0
3584 msec/iter = 36.95 ROE[avg,max] = [0.255747768, 0.312500000] radices = 56 32 32 32 0 0 0 0 0 0
3840 msec/iter = 42.50 ROE[avg,max] = [0.255217634, 0.281250000] radices = 240 16 16 32 0 0 0 0 0 0
4096 msec/iter = 43.84 ROE[avg,max] = [0.238957868, 0.281250000] radices = 32 16 16 16 16 0 0 0 0 0
4608 msec/iter = 50.09 ROE[avg,max] = [0.236300223, 0.281250000] radices = 144 16 32 32 0 0 0 0 0 0
5120 msec/iter = 55.05 ROE[avg,max] = [0.298772321, 0.343750000] radices = 160 16 32 32 0 0 0 0 0 0
5632 msec/iter = 64.28 ROE[avg,max] = [0.233816964, 0.281250000] radices = 176 16 32 32 0 0 0 0 0 0
6144 msec/iter = 66.76 ROE[avg,max] = [0.273158482, 0.343750000] radices = 24 16 16 16 32 0 0 0 0 0
6656 msec/iter = 76.08 ROE[avg,max] = [0.249162946, 0.281250000] radices = 208 16 32 32 0 0 0 0 0 0
7168 msec/iter = 76.73 ROE[avg,max] = [0.261049107, 0.312500000] radices = 224 16 32 32 0 0 0 0 0 0
7680 msec/iter = 85.84 ROE[avg,max] = [0.266587612, 0.312500000] radices = 240 16 32 32 0 0 0 0 0 0[/code]
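For comparing timings like these across FFT lengths, here is a minimal parsing sketch. It assumes each timing line starts with the FFT length in K followed by "msec/iter = <value>", as in the listing above; the function name and the three-line sample are illustrative only and are not part of Mlucas.

```python
import re

# Matches lines such as "1024 msec/iter = 9.96 ROE[avg,max] = ..."
CFG_LINE = re.compile(r"^\s*(\d+)\s+msec/iter\s*=\s*([\d.]+)")

def parse_cfg(text):
    """Return {fft_len_k: msec_per_iter} for every timing line in the text."""
    timings = {}
    for line in text.splitlines():
        m = CFG_LINE.match(line)
        if m:
            timings[int(m.group(1))] = float(m.group(2))
    return timings

# Three lines copied from the listing above, plus the version header.
sample = """17.1
1024 msec/iter = 9.96 ROE[avg,max] = [0.233398438, 0.281250000] radices = 32 16 32 32 0 0 0 0 0 0
4608 msec/iter = 50.09 ROE[avg,max] = [0.236300223, 0.281250000] radices = 144 16 32 32 0 0 0 0 0 0
7680 msec/iter = 85.84 ROE[avg,max] = [0.266587612, 0.312500000] radices = 240 16 32 32 0 0 0 0 0 0
"""

timings = parse_cfg(sample)
for k, ms in sorted(timings.items()):
    # Time per K of FFT length shows how far scaling departs from linear.
    print(f"{k:5d}K  {ms:6.2f} ms/iter  {1000 * ms / k:.2f} us per K")
```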
Now for the interesting part - I tried various multithreaded-run configs, here is the upshot in terms of total-system-throughput-maximization:

o 'Overloading' each physical core with 2 threads (1 per logical core on that single phys-core) cuts throughput by ~10%;

o 1 thread per physical core is best, not by a huge amount, but still;

o Running 1 single-thread LL test on each of the 8 physical cores [b]barely dents the timing versus just 1 such job on the entire system[/b]. I.e. if competition for system memory bandwidth were as big an issue here as it is known to be on Intel, running 8 single-thread jobs should appreciably reduce the per-job throughput versus the above numbers. But e.g. with 8 single-thread LL tests running, all @4608K, here are the numbers:
[code]
M85836229 Iter# = 870000 [ 1.01% complete] clocks = 00:08:47.698 [ 0.0528 sec/iter]
M85836271 Iter# = 860000 [ 1.00% complete] clocks = 00:08:46.750 [ 0.0527 sec/iter]
M85836449 Iter# = 860000 [ 1.00% complete] clocks = 00:08:47.069 [ 0.0527 sec/iter]
M85836847 Iter# = 860000 [ 1.00% complete] clocks = 00:08:48.398 [ 0.0528 sec/iter]
M85836869 Iter# = 870000 [ 1.01% complete] clocks = 00:08:45.096 [ 0.0525 sec/iter]
M85836871 Iter# = 860000 [ 1.00% complete] clocks = 00:08:47.868 [ 0.0528 sec/iter]
M85836931 Iter# = 860000 [ 1.00% complete] clocks = 00:09:01.621 [ 0.0542 sec/iter]
M85836953 Iter# = 860000 [ 1.00% complete] clocks = 00:08:45.647 [ 0.0526 sec/iter][/code]
Thus an average timing of 52.9 ms/iter, only ~6% slower than for a single such LL test running on the system, and total throughput of a smidge over 150 iter/sec. (Note that the throughput-maximizing setup for George's code may be very different than this).
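
The figures in the summary above can be rechecked from the log with a few lines. The per-job timings are copied from the eight-job output, and the single-job baseline is the 4608K entry from the mlucas.cfg listing; the variable names are just for illustration.

```python
# Per-job timings (sec/iter) copied from the eight-job log above.
sec_per_iter = [0.0528, 0.0527, 0.0527, 0.0528, 0.0525, 0.0528, 0.0542, 0.0526]

# Single-job baseline: the 4608K line of mlucas.cfg, in ms/iter.
single_job_ms = 50.09

avg_ms = 1000 * sum(sec_per_iter) / len(sec_per_iter)
slowdown_pct = 100 * (avg_ms / single_job_ms - 1)
total_iters_per_sec = sum(1 / t for t in sec_per_iter)

print(f"average:    {avg_ms:.1f} ms/iter")
print(f"slowdown:   {slowdown_pct:.1f}% versus one job alone")
print(f"throughput: {total_iters_per_sec:.1f} iter/sec across 8 jobs")
```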

Here are salient /proc/cpuinfo details (just for the first processor in the file) for the above system:
[code]
processor : 0
vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1800X Eight-Core Processor
stepping : 1
microcode : 0x8001126
cpu MHz : 3850.000
cache size : 512 KB
physical id : 0
siblings : 16
core id : 0
cpu cores : 8
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 7685.18
TLB size : 2560 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14][/code]

mackerel 2018-11-12 15:55

Interesting interview at Anandtech: [url]https://www.anandtech.com/show/13578/naples-rome-milan-zen-4-an-interview-with-amd-cto-mark-papermaster[/url]

One part of interest:

[QUOTE]IC: With the FP units now capable of doing 256-bit on their own, is there a frequency drop when 256-bit code is run, similar to when Intel runs AVX2?

MP: No, we don’t anticipate any frequency decrease. We leveraged 7nm. One of the things that 7nm enables us is scale in terms of cores and FP execution. It is a true doubling because we didn’t only double the pipeline width, but we also doubled the load-store and the data pipe into it.

IC: Now the Zen 2 core has two 256-bit FP pipes, can users perform AVX512-esque calculations?

MP: At the full launch we’ll share with you exact configurations and what customers want to deploy around that.[/QUOTE]

Even if they do support AVX-512 instructions, I think they would only be resourced like a single-unit Intel part would be, so they wouldn't give throughput improvements.

xx005fs 2018-11-13 20:22

Epyc 7nm IO Die
 
I saw the image of the delidded Epyc 7nm 64 core part, in which the central IO die seems massive. Could there possibly be a really fast L4 cache that's a decent size and with very high bandwidth (i.e. much higher than the roughly 150GB/s from 8 channel 2666 memory)?
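
For reference, a quick sketch of the theoretical peak for that memory configuration, assuming standard 64-bit (8 byte) DDR4 channels. The function name is made up for illustration. The theoretical figure comes out somewhat above the roughly 150GB/s quoted, which is plausible for a sustained rather than peak number.

```python
# Theoretical peak DRAM bandwidth from transfer rate and channel count.

def peak_bandwidth_gb_s(mt_per_s, channels, bytes_per_transfer=8):
    # Each DDR4 channel is 64 bits wide, i.e. 8 bytes per transfer.
    return mt_per_s * 1e6 * bytes_per_transfer * channels / 1e9

print(f"8 channel DDR4-2666: {peak_bandwidth_gb_s(2666, 8):.1f} GB/s")
print(f"2 channel DDR4-2666: {peak_bandwidth_gb_s(2666, 2):.1f} GB/s")
```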


All times are UTC.