#1
"Composite as Heck"
Oct 2017
93810 Posts
https://www.tomshardware.com/news/am...ng-improvement
It's actually a 64MiB SRAM chip stacked on top of a Ryzen chiplet which already has 32MiB of L3 cache. That they've billed it as L3 is a good sign, hopefully the 64MiB is transparently unified with the 32MiB. Latency TBD. Bandwidth TBD, the link suggests up to 2TB/s based on the TSV tech used for stacking. They're talking in terms of gaming performance so it seems likely it's destined for consumers. 96MiB of L3 has to at least make a sizeable dent in the amount of memory used in wavefront tests. At best it'll eliminate RAM as a factor entirely, at worst it should severely raise the memory bandwidth it/s cap (hopefully to the point where the cores are the main bottleneck again?). Either way it's a tasty bit of news for the GIMP. |
#2
Feb 2016
UK
3·149 Posts
Very interesting development, but it opens many questions:
- Bigger cache usually = slower cache. For gaming and P95 that might not be a problem; for other use cases it comes down to the size of the data.
- How will it affect thermals?
- Will it be productised any time soon, or is it more a Zen 4 target?
- What are the cost implications?
- It may also imply they don't expect great scaling from Infinity Fabric and/or the RAM interfaces.

Note I haven't seen the original keynote yet, only scanned some news reports.
#3
Sep 2006
The Netherlands
3·269 Posts
Quote:
So you would really want to have 64 DIMMs, so to speak, inside each Threadripper box: 8 DIMMs for each CCD. Yet that's not how it works. In short, adding quite a lot of SRAM (L3) to every CCD is a very clever way of doing things.

Intel has a major problem competing against this. Intel's way of doing business is to make tons of cash selling faster systems with more sockets, DIMMs and other facilities. AMD is packing it all into one CPU, and the price of the 64-core Threadripper is, say, a factor of 10 too cheap for Intel's way of doing business. If Intel were to follow the same path as AMD, they'd have a major problem business-wise, as they'd make less cash.

Last fiddled with by diep on 2021-06-01 at 14:42
#4
Feb 2016
UK
3·149 Posts
There are many different possible workloads, and not all of them stress bandwidth. For those that do, moving data has been a bigger problem than execution for a long time. Caches are one solution to that. More RAM channels, more cache layers/different sizes, even compute-in-RAM are being looked at.

I like monolithic CPUs more but recognise they're on the way out. It is so much simpler to deal with data when it isn't split into different pools, which is unavoidable as we go into chiplets.

I'm wondering where the 2 TB/s claim comes from. Can 8 cores move data in/out of that cache at that speed? Or is it a theoretical maximum based on the stacking tech? I don't have Zen 3 numbers to hand: what does it do as currently sold? I did a quick Aida64 run on my 8-core Cezanne and that's only 100 GB/s copy in L3. The full-fat desktop versions might do a bit better, but that's still far from 2 TB/s.
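For comparison, a crude stdlib-only copy-benchmark sketch. It's nothing like Aida64's hand-tuned assembly, so it will read low, but it shows what a "copy" number typically counts (one read plus one write per byte):

```python
import time

def copy_bandwidth_gbs(size=64 * 1024 * 1024, reps=5):
    """Best-of-N time to copy `size` bytes; counts both the read and the write."""
    src = bytearray(size)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        dst = bytes(src)                 # one full read of src + one full write of dst
        best = min(best, time.perf_counter() - t0)
        del dst
    return 2 * size / best / 1e9         # GB/s, with GB = 10^9 bytes

if __name__ == "__main__":
    print(f"copy: {copy_bandwidth_gbs():.1f} GB/s")
```

With a 64 MiB buffer this streams through RAM rather than cache; shrink `size` well below the L3 capacity to probe cache-resident bandwidth instead.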
#5
Sep 2006
The Netherlands
3·269 Posts
Quote:
For example, SGI sold the Teras supercomputer at the end of the 90s claiming it had 1 TB/s of bandwidth over the network. Actually we can disprove this. The 1024 processors were 256 nodes, each node a quad-socket CPU box. The connectivity from each node to the network was 1 GB/s, of which quite some part was actually protocol overhead. So roughly 600 MB/s, or 0.6 GB/s, would be user data; from memory it was actually about 680 MB/s of user data. It's nearly 20 years ago...

Actual theoretical bandwidth = 0.68 * 256 = 174 GB/s (with a GB as 10^9 bytes). Quite different from the claim of 1 TB/s.

Also, a single CPU core was quite a bit slower, about 20% for my program, than on other partitions. As if the 512-processor partition still had CPUs with slower caches (which would've been far cheaper CPUs), whereas on paper it should have had the more expensive CPUs. But the government never benchmarked the machine, so what did they know about what they received from SGI. It's all fantastic claims there.

The Threadripper, internally, with its 8 CCDs, is basically an 8-socket system. So there will be a complicated cache-coherency protocol of some sort to the crossbar. It's far cheaper to produce 8-core CCDs and clock them at 3 GHz than to produce Xeon CPUs of 22 or even 54+ cores, which by definition need to be clocked far, far lower; 2 GHz is a lot actually for a 20+ core CPU. Threadripper's biggest advantage is its high clock, and as it's a 'gamers' CPU' it's easy to overclock as well.

As for marketing managers: I remember a convention where some Intel guy was giving a presentation, but before showing any data we first had 10 minutes of slides with DISCLAIMERS that anything he was going to say was a big freaking lie. That was about the fantastic claims for Knights Corner and Knights Landing and how they would revolutionize supercomputing. This was around 2010... You can shred any bandwidth claim.

More L3 cache is very important for whatever sort of prime-number software you're busy with.
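The arithmetic above, spelled out. Note the 1 GB/s link and ~68% payload fraction are the post's own recollections, not verified SGI specs:

```python
# Reconstructing the Teras estimate; inputs are the poster's recollections,
# not verified hardware specs.
nodes = 256                # 1024 CPUs as 256 quad-socket boxes
link_gb_s = 1.0            # claimed per-node network bandwidth, GB/s
payload_fraction = 0.68    # ~680 MB/s of user data after protocol overhead

aggregate = nodes * link_gb_s * payload_fraction
print(f"{aggregate:.0f} GB/s")   # ~174 GB/s, far from the marketed 1 TB/s
```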
#6
Feb 2016
UK
677₈ Posts
AMD showed a consumer CPU with that cache, not a Threadripper. It is on that basis I'm asking questions, trying to figure out how it might work in practice. We can worry about it scaling elsewhere later.

I'm starting to get the feeling that the 2 TB/s quoted is the maximum bandwidth the TSVs could allow, not what is attainable in the cache implementation.

Hot info from Ian Cutress of AnandTech: "Confirmed with AMD that V-Cache will be coming to Ryzen Zen 3 products, with production at end of year." https://twitter.com/IanCutress/statu...66139769602058

Last fiddled with by mackerel on 2021-06-01 at 16:37
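If it is a TSV-limit figure, it may simply be interface width times clock. A purely speculative sketch — both numbers below are guesses for illustration, not AMD specs:

```python
# Pure speculation on where an "up to 2 TB/s" figure could come from:
# a wide die-to-die interface clocked at cache speed. Both inputs are guesses.
bus_bytes = 1024            # assumed interface width, bytes per transfer
clock_gts = 2.0             # assumed transfer rate, gigatransfers/s

tb_s = bus_bytes * clock_gts / 1000   # bytes/transfer * 10^9 transfers/s -> TB/s
print(f"{tb_s:.2f} TB/s")
```

A claim built this way says nothing about whether 8 cores can actually issue requests fast enough to saturate it.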
#7
Sep 2006
The Netherlands
3·269 Posts
Quote:
edit: please note that read speed is usually much faster than write speed, because reads can usually proceed concurrently, while writes have to go through some sort of cache-coherency protocol, which kills performance. On typical AMD processors it's 2:1 (2 cacheline reads can be done for every cacheline write).

Last fiddled with by diep on 2021-06-01 at 16:43
#8
Feb 2016
UK
3·149 Posts
Again looking at my Cezanne: benchmarking with Aida64 (an old version, not optimised for recent CPUs) gives around 1 TB/s on L2. I'm not sure if that is aggregate or per core. Even if it's aggregate, it is of a similar magnitude. Then we can argue measured vs peak.
#9
Sep 2006
The Netherlands
3·269 Posts
Note your link doesn't work here. Maybe you can cite some text, or link to an AMD website? Twitter isn't friendly here (I'm on Linux).

I would be amazed if any new CPU from AMD didn't use the CCD concept, because with the CCD concept they have Intel by the balls.
#10
Feb 2016
UK
3×149 Posts
Quote:
Some new news which answers pretty much what I asked.
#11
Sep 2006
The Netherlands
327₁₆ Posts
Especially the benchmarking based upon an 'unknown videocard', which gives a 20% performance boost or 15% fps increase, tells a lot :)

It's a theoretical bandwidth claim.