mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2018-11-07, 00:10   #1
M344587487
 
 
"Composite as Heck"
Oct 2017

3·199 Posts
Zen 2 details announced

Zen 2 details were announced today. Highlights include:
  • MCM design, with a central 14nm I/O die connected to multiple 7nm chiplets
  • Each chiplet has up to 8 cores, probably two four-core CCXes per die as is the case now, though that's unconfirmed
  • Up to 64C/128T per socket by having up to 8 chiplets
  • FPU widened to 256-bit, meaning AVX2 parity with Intel's offerings
  • Vaguely described improvements over Zen+ (improved branch predictor, better instruction prefetching, re-optimised instruction cache, larger op cache)

There are more details about Zen 2, 7nm GPUs and a few other things if you want to sift through the talk:
https://www.youtube.com/watch?v=GwX13bo0RDQ&t=3686s

They also announced gen 1 Epyc AWS instances: https://www.mersenneforum.org/showthread.php?t=23782

For our purposes the biggest news is the upgraded FPU: with it, each core will be able to use more memory bandwidth, meaning the lower-core-count parts will be the sweet spot for saturating available memory bandwidth. A quad core will probably be enough to saturate dual channel on 3rd-gen Ryzen, similar to Intel's sweet spot now. Better power efficiency from the 7nm node shrink is another no-brainer. They didn't mention cache beyond the vague bullet point above.
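As a back-of-envelope sketch of why bandwidth rather than core count becomes the limit (the per-core demand figure here is my own illustrative assumption, not a measured Zen 2 number):

```python
# Rough bandwidth arithmetic for why low-core-count parts should be the
# sweet spot for memory-heavy FFT work once the FPU doubles to 256-bit.

def peak_dram_bandwidth_gb_s(channels: int, mt_per_s: int) -> float:
    """Theoretical peak: channels * (MT/s * 1e6 transfers/s) * 8-byte bus width."""
    return channels * mt_per_s * 1e6 * 8 / 1e9

dual_channel = peak_dram_bandwidth_gb_s(2, 3200)  # dual-channel DDR4-3200
print(dual_channel)  # 51.2 GB/s theoretical peak

# Hypothetical per-core streaming demand for an AVX2 FFT core (assumed, for
# illustration only): ~12 GB/s. Four such cores already want ~48 GB/s.
per_core_demand_gb_s = 12.0
cores_to_saturate = dual_channel // per_core_demand_gb_s
print(cores_to_saturate)  # 4.0
```

Under those assumed numbers a quad core essentially saturates the dual-channel bus, which is the "sweet spot" argument above in numeric form.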

Speculation: having a dedicated I/O die may allow for a more performant memory controller, and it may allow for UMA. It makes sense that they will create two 14nm I/O dies (a smaller one for Ryzen and a bigger one for Epyc and Threadripper) but use the same 7nm chiplet throughout. They've said nothing of an iGPU chiplet, as the event was all about servers. I wouldn't be surprised if the iGPU chiplet were a Zen 2 version of what we have now in the 2400G (quad-core CPU + iGPU on a single chip). That allows for the possibility of Ryzen having up to 12 cores plus an iGPU, and it scales down nicely to quad- and dual-core + iGPU for the low end. I don't think they'll dedicate an entire chiplet to the iGPU, as the one in the 2400G is limited by memory bandwidth as it is, unless they've decided against a 12-core Ryzen and make the iGPU chiplet much smaller than the 8-core chiplet.
Old 2018-11-07, 08:40   #2
mackerel
 
 
Feb 2016
UK

380₁₀ Posts

Yup, I'm most excited about the FPU upgrade, as it means that if I want a farm of LLR crunchers, quad-core offerings on dual-channel RAM will probably hit a sweet spot of price/performance/power. The question is, how long will it be before consumer versions? I'm hoping that, at worst, they'll keep the cadence they've had up to now, with Ryzen 1000 and 2000 both launched around April, from memory.
Old 2018-11-07, 10:49   #3
M344587487
 
 
"Composite as Heck"
Oct 2017

1001010101₂ Posts

Quote:
Originally Posted by mackerel View Post
Yup, I'm most excited about the FPU upgrade, as it means that if I want a farm of LLR crunchers, quad-core offerings on dual-channel RAM will probably hit a sweet spot of price/performance/power. The question is, how long will it be before consumer versions? I'm hoping that, at worst, they'll keep the cadence they've had up to now, with Ryzen 1000 and 2000 both launched around April, from memory.
I think that's wishful thinking, but perhaps that'll be the case; to hit Intel where it hurts they'll probably prioritise servers. The power efficiency is nice for consumers, particularly mobile, but it's a game changer for servers. A likely clock of around 4.5 GHz plus IPC improvements is big news for enthusiasts, but unfortunately for them they'll have to get in line. I'm expecting something along the lines of:
  • Q4 2018: 7nm compute GPUs
  • Q1 2019: Epyc 2 for validation
  • Early Q2 2019: perhaps a trickle of Ryzen 3700 8C/16T and Ryzen 3600 6C/12T while waiting for Epyc 2 validation, catering to the enthusiasts who are willing to pay more
  • Q2 2019: Epyc 2 full steam ahead. Possibly Threadripper
  • Q3 2019: the rest of the Ryzen lineup, which has to wait for the right kind of defective dies to be stockpiled in meaningful supply. Ramp-up of the top-end Ryzen. Threadripper. Possibly some secret sauce
If it's not too taxing on the I/O die they may allow for two chiplets on Ryzen. That could let them keep 10- and 12-core chips with an iGPU (3800/3900?) in their back pocket in case Intel makes an unexpected move next year. Not the most useful of parts, but it would allow enthusiasts to have their cake and eat it too, and it answers a lot of headline-grabbing moves Intel could make.

Last fiddled with by M344587487 on 2018-11-07 at 10:51
Old 2018-11-07, 14:57   #4
M344587487
 
 
"Composite as Heck"
Oct 2017

3×199 Posts

The official video is now up with much better audio and seemingly extra details: https://www.youtube.com/watch?v=kC3ny3LBfi4
Old 2018-11-07, 17:14   #5
Mysticial
 
 
Sep 2016

7·47 Posts

If I understood the presentation and slides correctly, the Zen 2 FPU is the same design as the old one, but doubled in width. That makes it (in theory) better than the non-AVX512 Intels.

On Zen 1, you could do 2 x 128-bit FMA + 2 x 128-bit FADD (though not sustainably).
If it's the same or better on Zen 2, you can probably do 2 x 256-bit FMA + 2 x 256-bit FADD (also probably not sustainably).

So you get FMA parity with non-AVX512 Intel, and you have a couple of extra FADD units to pick up the FADDs lying around (since most code is not going to be 100% FMA and will have FADDs as well).
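The port arithmetic above can be sketched numerically (a simple model for illustration; the pipe counts are from the post, not confirmed specs):

```python
# Peak double-precision flops/cycle = FMA pipes * doubles per vector * 2,
# since an FMA counts as one multiply plus one add.

def dp_flops_per_cycle(fma_pipes: int, vector_bits: int) -> int:
    return fma_pipes * (vector_bits // 64) * 2

zen1 = dp_flops_per_cycle(2, 128)        # 2 x 128-bit FMA -> 8 flops/cycle
zen2 = dp_flops_per_cycle(2, 256)        # 2 x 256-bit FMA -> 16 flops/cycle
intel_avx2 = dp_flops_per_cycle(2, 256)  # Haswell-era 2 x 256-bit FMA -> 16

print(zen1, zen2, intel_avx2)  # 8 16 16
```

The extra FADD pipes sit on top of this FMA peak, which is why real mixed FMA/FADD code could come out ahead of a pure flops/cycle comparison.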
Old 2018-11-07, 18:31   #6
mackerel
 
 
Feb 2016
UK

2²·5·19 Posts

Quote:
Originally Posted by M344587487 View Post
  • Early Q2 2019: perhaps a trickle of Ryzen 3700 8C/16T and Ryzen 3600 6C/12T while waiting for Epyc 2 validation, catering to the enthusiasts who are willing to pay more
My original thinking was that, roughly speaking, Intel had been releasing refreshed consumer desktop CPUs about once a year, at least as far back as Haswell. We only have limited data points for Ryzen, but the two generational launches so far have been in April, a year apart.

You make a good point that this may be skewed to higher-end parts. We don't have Zen+ quad cores, do we? The 2000-series APUs are still Zen.

I need to catch up on the full video tonight.

Quote:
Originally Posted by Mysticial View Post
If I understood the presentation and slides correctly, the Zen 2 FPU is the same design as the old one, but doubled in width. That makes it (in theory) better than the non-AVX512 Intels.
Assuming that is correct, would you dare put a figure on how much faster it could be in FFT implementations? I'm not looking for a detailed analysis, but are we talking ballpark 1% or 10%, for example?

It feels like my world is imploding: having bought Intel for so long due to FPU performance, my next system(s) may switch to Zen 2. I will probably still get a Skylake-X refresh at some point for AVX-512.
Old 2018-11-07, 18:38   #7
Mysticial
 
 
Sep 2016

7·47 Posts

Quote:
Originally Posted by mackerel View Post
Assuming that is correct, would you dare put a figure on how much faster it could be in FFT implementations? I'm not looking for a detailed analysis, but are we talking ballpark 1% or 10%, for example?

It feels like my world is imploding: having bought Intel for so long due to FPU performance, my next system(s) may switch to Zen 2. I will probably still get a Skylake-X refresh at some point for AVX-512.
I'm not even gonna try. Especially if they keep upping the core count without adding more memory channels.
Old 2018-11-07, 19:17   #8
mackerel
 
 
Feb 2016
UK

574₈ Posts

Fortunately for my prime interests, many of them are still searchable at FFT sizes small enough to run out of cache, so peak CPU performance without RAM limiting is still possible. It does limit the value somewhat... I also wonder if the combined effective cache is big enough to take RAM out of the equation, at least on the higher-core-count parts.
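A quick sanity check on "running out of cache" (a sketch only: real FFT working sets also include twiddle factors and scratch space beyond the main data array, and the example sizes are just illustrative):

```python
# A length-N double-precision FFT holds N * 8 bytes of main data, so an
# "FFT size" of K kilo-elements needs roughly K * 1024 * 8 bytes.

def fft_data_mib(fft_k: int) -> float:
    """Main data array size in MiB for a K-kilo-element double FFT."""
    return fft_k * 1024 * 8 / 2**20

print(fft_data_mib(640))   # 5.0  MiB -- fits comfortably in a 16 MiB L3
print(fft_data_mib(4608))  # 36.0 MiB -- needs a much larger combined L3 to stay resident
```

So the "combined effective cache" question comes down to whether a many-chiplet part's total L3 clears the ~tens-of-MiB working sets of the FFT lengths of interest.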
Old 2018-11-09, 20:47   #9
ewmayer
2ω=0
 
 
Sep 2002
República de California

2·5·17·67 Posts

Just by way of reference, for the current Ryzen (a.k.a. Ryzen 1), would someone be so kind as to post mprime/Prime95 timings at various FFT lengths of interest here? Please, not the usual blizzard of various thread-count configurations; just the ms/iter (or throughput in iter/sec) for the total-system-throughput-maximizing config at each FFT length.

Back when I did a trial build of Mlucas 17.1 on David Stanfill's then-new octocore Ryzen, I got some very interesting results. Here are ms/iter numbers for 1-core-running at various FFT lengths:
Code:
ewmayer@RyzenBeast:~/mlucas_v17.1/obj_avx2$ cat mlucas.cfg
17.1
      1024  msec/iter =    9.96  ROE[avg,max] = [0.233398438, 0.281250000]  radices =  32 16 32 32  0  0  0  0  0  0
      1152  msec/iter =   11.78  ROE[avg,max] = [0.262165179, 0.312500000]  radices =  36 16 32 32  0  0  0  0  0  0
      1280  msec/iter =   12.73  ROE[avg,max] = [0.277678571, 0.343750000]  radices =  40 16 32 32  0  0  0  0  0  0
      1408  msec/iter =   14.78  ROE[avg,max] = [0.286049107, 0.343750000]  radices =  44 16 32 32  0  0  0  0  0  0
      1536  msec/iter =   15.36  ROE[avg,max] = [0.246595982, 0.312500000]  radices =  48 16 32 32  0  0  0  0  0  0
      1664  msec/iter =   17.80  ROE[avg,max] = [0.299107143, 0.375000000]  radices =  52 16 32 32  0  0  0  0  0  0
      1792  msec/iter =   17.98  ROE[avg,max] = [0.292968750, 0.343750000]  radices =  56 16 32 32  0  0  0  0  0  0
      1920  msec/iter =   20.80  ROE[avg,max] = [0.290178571, 0.375000000]  radices =  60 16 32 32  0  0  0  0  0  0
      2048  msec/iter =   20.58  ROE[avg,max] = [0.238539342, 0.281250000]  radices =  32 32 32 32  0  0  0  0  0  0
      2304  msec/iter =   24.23  ROE[avg,max] = [0.237191336, 0.281250000]  radices =  36 32 32 32  0  0  0  0  0  0
      2560  msec/iter =   26.22  ROE[avg,max] = [0.294642857, 0.375000000]  radices =  40 32 32 32  0  0  0  0  0  0
      2816  msec/iter =   30.43  ROE[avg,max] = [0.241350446, 0.281250000]  radices =  44 32 32 32  0  0  0  0  0  0
      3072  msec/iter =   31.48  ROE[avg,max] = [0.235825893, 0.281250000]  radices =  48 32 32 32  0  0  0  0  0  0
      3328  msec/iter =   36.65  ROE[avg,max] = [0.308035714, 0.375000000]  radices =  52 32 32 32  0  0  0  0  0  0
      3584  msec/iter =   36.95  ROE[avg,max] = [0.255747768, 0.312500000]  radices =  56 32 32 32  0  0  0  0  0  0
      3840  msec/iter =   42.50  ROE[avg,max] = [0.255217634, 0.281250000]  radices = 240 16 16 32  0  0  0  0  0  0
      4096  msec/iter =   43.84  ROE[avg,max] = [0.238957868, 0.281250000]  radices =  32 16 16 16 16  0  0  0  0  0
      4608  msec/iter =   50.09  ROE[avg,max] = [0.236300223, 0.281250000]  radices = 144 16 32 32  0  0  0  0  0  0
      5120  msec/iter =   55.05  ROE[avg,max] = [0.298772321, 0.343750000]  radices = 160 16 32 32  0  0  0  0  0  0
      5632  msec/iter =   64.28  ROE[avg,max] = [0.233816964, 0.281250000]  radices = 176 16 32 32  0  0  0  0  0  0
      6144  msec/iter =   66.76  ROE[avg,max] = [0.273158482, 0.343750000]  radices =  24 16 16 16 32  0  0  0  0  0
      6656  msec/iter =   76.08  ROE[avg,max] = [0.249162946, 0.281250000]  radices = 208 16 32 32  0  0  0  0  0  0
      7168  msec/iter =   76.73  ROE[avg,max] = [0.261049107, 0.312500000]  radices = 224 16 32 32  0  0  0  0  0  0
      7680  msec/iter =   85.84  ROE[avg,max] = [0.266587612, 0.312500000]  radices = 240 16 32 32  0  0  0  0  0  0
Now for the interesting part: I tried various multithreaded-run configs; here is the upshot in terms of total-system throughput maximization:

o 'Overloading' each physical core with 2 threads (1 per logical core on that single physical core) cuts throughput by ~10%;

o 1 thread per physical core is best, not by a huge amount, but still;

o Running 1 single-thread LL test on each of the 8 physical cores barely dents the timing versus just 1 such job on the entire system. I.e. if competition for system memory bandwidth were as big an issue here as it is known to be on Intel, running 8 single-thread jobs should appreciably reduce the per-job throughput versus the above numbers. But e.g. with 8 single-thread LL tests running, all @4608K, here are the numbers:
Code:
M85836229 Iter# = 870000 [ 1.01% complete] clocks = 00:08:47.698 [  0.0528 sec/iter]
M85836271 Iter# = 860000 [ 1.00% complete] clocks = 00:08:46.750 [  0.0527 sec/iter]
M85836449 Iter# = 860000 [ 1.00% complete] clocks = 00:08:47.069 [  0.0527 sec/iter]
M85836847 Iter# = 860000 [ 1.00% complete] clocks = 00:08:48.398 [  0.0528 sec/iter]
M85836869 Iter# = 870000 [ 1.01% complete] clocks = 00:08:45.096 [  0.0525 sec/iter]
M85836871 Iter# = 860000 [ 1.00% complete] clocks = 00:08:47.868 [  0.0528 sec/iter]
M85836931 Iter# = 860000 [ 1.00% complete] clocks = 00:09:01.621 [  0.0542 sec/iter]
M85836953 Iter# = 860000 [ 1.00% complete] clocks = 00:08:45.647 [  0.0526 sec/iter]
Thus an average timing of 52.9 ms/iter, only ~6% slower than for a single such LL test running on the system, and a total throughput of a smidge over 150 iter/sec. (Note that the throughput-maximizing setup for George's code may be very different from this.)
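For anyone who wants to reproduce the arithmetic, the average and aggregate throughput follow directly from the eight per-job timings listed above:

```python
# Per-job timings (sec/iter) from the eight concurrent LL tests above.
times = [0.0528, 0.0527, 0.0527, 0.0528, 0.0525, 0.0528, 0.0542, 0.0526]

avg_ms = 1000 * sum(times) / len(times)          # mean time per iteration
total_iters_per_sec = sum(1 / t for t in times)  # aggregate system throughput

print(f"{avg_ms:.1f} ms/iter")                  # 52.9 ms/iter
print(f"{total_iters_per_sec:.1f} iter/sec")    # 151.3 iter/sec
```

Compared with the 50.09 ms/iter single-job 4608K timing in the cfg dump, 52.9 ms/iter is indeed only about 6% slower per job.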

Here are salient /proc/cpuinfo details (just for the first processor in the file) for the above system:
Code:
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 23
model		: 1
model name	: AMD Ryzen 7 1800X Eight-Core Processor
stepping	: 1
microcode	: 0x8001126
cpu MHz		: 3850.000
cache size	: 512 KB
physical id	: 0
siblings	: 16
core id		: 0
cpu cores	: 8
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 mwaitx hw_pstate vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic overflow_recov succor smca
bugs		: fxsave_leak sysret_ss_attrs null_seg
bogomips	: 7685.18
TLB size	: 2560 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 48 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate eff_freq_ro [13] [14]

Last fiddled with by ewmayer on 2018-11-09 at 20:49
Old 2018-11-12, 15:55   #10
mackerel
 
 
Feb 2016
UK

380₁₀ Posts

Interesting interview at Anandtech: https://www.anandtech.com/show/13578...rk-papermaster

One part of interest:

Quote:
IC: With the FP units now capable of doing 256-bit on their own, is there a frequency drop when 256-bit code is run, similar to when Intel runs AVX2?

MP: No, we don’t anticipate any frequency decrease. We leveraged 7nm. One of the things that 7nm enables us is scale in terms of cores and FP execution. It is a true doubling because we didn’t only double the pipeline width, but we also doubled the load-store and the data pipe into it.

IC: Now the Zen 2 core has two 256-bit FP pipes, can users perform AVX512-esque calculations?

MP: At the full launch we’ll share with you exact configurations and what customers want to deploy around that.
Even if they do support AVX-512 instructions, I think they would only be resourced as a single unit, as on the lower-end Intel parts, so they wouldn't give a throughput improvement.
Old 2018-11-13, 20:22   #11
xx005fs
 
"Eric"
Jan 2018
USA

7·29 Posts
Epyc 7nm IO Die

I saw the image of the delidded Epyc 7nm 64-core part, in which the central I/O die seems massive. Could there possibly be a really fast L4 cache of a decent size and with very high bandwidth (i.e. much higher than the ~150 GB/s from 8-channel DDR4-2666)?
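For reference, the ~150 GB/s figure quoted above is plausible as a sustained number for that configuration; the theoretical peak works out slightly higher (quick check; "GB/s" here is decimal, and real sustained bandwidth always falls short of peak):

```python
# Theoretical peak for 8-channel DDR4-2666:
# channels * (MT/s * 1e6 transfers/s) * 8-byte bus width per channel.

def peak_dram_bandwidth_gb_s(channels: int, mt_per_s: int) -> float:
    return channels * mt_per_s * 1e6 * 8 / 1e9

epyc_8ch = peak_dram_bandwidth_gb_s(8, 2666)
print(f"{epyc_8ch:.1f} GB/s")  # 170.6 GB/s peak; ~150 GB/s sustained is realistic
```

So an L4 on the I/O die would only be interesting for this workload if it could clearly beat that ~170 GB/s ceiling.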