Unexpected from AMD
I think AMD has gone off the deep end with its new high-end 24 and 32 core Threadripper 2 parts.
[url]https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh[/url] [quote=tldr]At the AMD press event at Computex, it was revealed that these new processors would have up to 32 cores in total, mirroring the 32-core versions of EPYC. On EPYC, those processors have four active dies, with eight active cores on each die (four for each CCX). On EPYC however, there are eight memory channels, and AMD’s X399 platform only has support for four channels. [b]For the first generation this meant that each of the two active die would have two memory channels attached – in the second generation Threadripper this is still the case: the two now ‘active’ parts of the chip do not have direct memory access. This technically adds latency to the platform, however AMD is of the impression that for all but the most memory bound tasks, this should not be an issue (usually it is suggested to just go buy an EPYC for those workloads)[/b]. While it does put more pressure on the internal Infinity Fabric, AMD ultimately designed Infinity Fabric for scalable scenarios like this between different silicon with different levels of cache and memory access.[/quote] That's one way to press the advantage in the core wars, not necessarily one I'd recommend at a glance. Are there other examples of wacky configurations out there? This setup means you basically have 16 first-class cores, as with Threadripper 1, and up to 16 second-class cores beyond that. Obviously not good for memory-intensive things like LL, except for fringe cases like crypto mining that fit in cache. It'll be interesting to see which benchmarks aren't fazed by this, and which ones shit the bed. The 28-core HEDT part from Intel makes more sense now, at least.
[QUOTE=M344587487;489270]I think AMD has gone off the deep end with its new high-end 24 and 32 core Threadripper 2 parts.
[url]https://www.anandtech.com/show/12906/amd-reveals-threadripper-2-up-to-32-cores-250w-x399-refresh[/url] That's one way to press the advantage in the core wars, not necessarily one I'd recommend at a glance. Are there other examples of wacky configurations out there? This setup means you basically have 16 first-class cores, as with Threadripper 1, and up to 16 second-class cores beyond that. Obviously not good for memory-intensive things like LL, except for fringe cases like crypto mining that fit in cache. It'll be interesting to see which benchmarks aren't fazed by this, and which ones shit the bed. The 28-core HEDT part from Intel makes more sense now, at least.[/QUOTE] I don't think it matters too much for LL, as the 16 first-class cores will max out the memory bandwidth anyway.
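A quick back-of-the-envelope check of that claim (a sketch with illustrative assumptions, not measurements: quad-channel DDR4-2666, a hypothetical 4608K FFT, and an assumed handful of full read+write sweeps of the data per iteration):

```python
# Back-of-the-envelope: can 16 cores already saturate four DDR4 channels?
# All figures below are illustrative assumptions, not measurements.

CHANNELS = 4
DDR4_2666_GBPS = 2666e6 * 8 / 1e9      # ~21.3 GB/s per 64-bit channel
PEAK_BW = CHANNELS * DDR4_2666_GBPS    # ~85 GB/s platform peak

FFT_LEN = 4608 * 1024                  # a plausible leading-edge FFT size
BYTES_PER_ELEM = 8                     # one double per element
PASSES = 4                             # assumed full data sweeps per iteration

bytes_per_iter = FFT_LEN * BYTES_PER_ELEM * PASSES
max_iters_per_sec = PEAK_BW * 1e9 / bytes_per_iter
print(f"peak bandwidth: {PEAK_BW:.1f} GB/s")
print(f"bandwidth-limited cap: ~{max_iters_per_sec:.0f} iterations/s")
```

If the cap works out to a few hundred iterations per second regardless of core count, adding 16 more cores behind the same four channels indeed buys nothing for LL.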
What FFT size is leading edge work at? Can it fit in the L3 cache?
[QUOTE=mackerel;489272]What FFT size is leading edge work at? Can it fit in the L3 cache?[/QUOTE]
Possibly: [url]http://www.mersenneforum.org/showpost.php?p=483512&postcount=6[/url] This would almost certainly have 64MiB of L3 cache, but it's split into 8MiB chunks (one for each CCX). It's also a victim cache, whatever that means technically. There's a penalty for sharing data between two CCX on the same die, and a bigger penalty for sharing data between two CCX on different dies. I'm not saying it's impossible, but you'd probably have to be a cache wizard. Can someone more technically minded make an educated guess as to whether it's possible and viable?
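For a ballpark answer, here is a rough footprint estimate (a sketch, assuming ~8 bytes per FFT element for a double-precision transform; real programs also keep twiddle factors and scratch buffers, so this is a lower bound):

```python
# Rough estimate of an LL/PRP residue's size vs. Threadripper 2's 64 MiB of L3.
# Assumes ~8 bytes (one double) per FFT element; ignores twiddles and scratch.

L3_TOTAL_MIB = 64  # 8 CCXes x 8 MiB each, NOT one unified pool

def fft_footprint_mib(fft_len):
    """Minimum MiB needed to hold one FFT-sized residue."""
    return fft_len * 8 / 2**20

for fft_k in (2048, 4096, 4608, 8192):  # FFT lengths in K elements
    mib = fft_footprint_mib(fft_k * 1024)
    fits = "fits" if mib <= L3_TOTAL_MIB else "does not fit"
    print(f"{fft_k}K FFT: {mib:.0f} MiB -> {fits} in {L3_TOTAL_MIB} MiB of L3")
```

Even when the aggregate number "fits", the data would be spread across eight 8MiB slices, so cross-CCX and cross-die traffic is unavoidable; that is exactly the cache-wizardry problem.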
These new cache structures on the Ryzen family, and similarly on Skylake-X... I do wonder if the software would benefit from an update to make use of them. I'm seeing interesting things on scaling, and it isn't as simple as it was with lower core counts and a ring bus.
[QUOTE=M344587487;489275]
This would almost certainly have 64MiB of L3 cache, but it's split into 8MiB chunks (one for each CCX). It's also a victim cache, whatever that means technically.[/QUOTE] Victim cache means it gets filled with data evicted from the higher caches (so L1 and L2). Getting data to the cores fast enough is becoming a real problem with these many-core designs. Intel didn't change to a mesh design with Skylake-SP for no reason; their ring-bus design didn't scale well beyond two rings.
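To make the fill-on-eviction behaviour concrete, here is a toy model of an exclusive (victim) cache. This is an illustration only, not AMD's actual replacement policy:

```python
from collections import OrderedDict

# Toy model of a victim L3: it holds only lines evicted from L2, rather than
# a copy of everything the upper levels hold (i.e. it is exclusive of L2).

class VictimCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()  # maintained in LRU order

    def on_l2_eviction(self, addr, data):
        # A victim cache is filled by evictions, not by demand fetches.
        self.lines[addr] = data
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # drop the least-recently-used line

    def lookup(self, addr):
        # On a hit the line migrates back to L2, leaving the victim cache.
        return self.lines.pop(addr, None)

l3 = VictimCache(capacity=2)
l3.on_l2_eviction(0x100, "a")
l3.on_l2_eviction(0x200, "b")
l3.on_l2_eviction(0x300, "c")   # over capacity: 0x100 is dropped
print(l3.lookup(0x100))         # None -- the LRU line was already evicted
print(l3.lookup(0x200))         # "b" -- hit, line moves back up to L2
```

The practical consequence for an FFT is that data only lands in L3 after it has already been pushed out of L2, so "64MiB of L3" behaves quite differently from an inclusive last-level cache of the same size.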
I bet these will perform exceedingly well at most workloads.
About the only place where the weird latency will get you is in highly branching code, like gaming or databases. But Zen+ has improved latency, so we won't really know until the benchmarks are out.
[QUOTE=mackerel;489272]What FFT size is leading edge work at? Can it fit in the L3 cache?[/QUOTE]
The 2990WX is coming out soon, priced at $1800 USD, so it's an interesting question whether we can do LL/PRP with minimal DRAM access and whether it's worth it. The 2970WX is a 24-core part priced at $1300 USD that may also have 64MiB of cache ([URL]https://en.wikichip.org/wiki/amd/ryzen_threadripper/2970wx[/URL] ). Here are some of the features/limitations of the chips ([URL]https://en.wikichip.org/wiki/amd/microarchitectures/zen+[/URL] ): [LIST][*] A CCX is a quad-core complex where every core has equal access to 8MiB of L3 victim cache (the 2970WX has some cores disabled, either 8+8+4+4 or 6+6+6+6 per die, but I'm just guessing)[*] A Zen+ die contains two CCX[*] The 2970WX and 2990WX contain four Zen+ dies (i.e. 64MiB of cache in total, spread across 8 CCX)[*] The L3 cache has 40 cycles of latency[*] Two of the dies have direct access to two channels of memory each; the other two rely on Infinity Fabric (IF) to access DRAM through them[*] The speed of IF depends on the clock speed of the RAM[*] The bandwidth between dies is 42.67GB/s @1333MHz (that's DDR4-2666; scaled up we get 51.22GB/s @1600MHz and 64.02GB/s @2000MHz) ([URL]https://en.wikichip.org/wiki/amd/infinity_fabric#IFOP[/URL] )[/LIST] My questions to the technically knowledgeable: [LIST][*] How detrimental is this setup to the performance of a divide-and-conquer FFT algorithm? The overhead of using IF when gathering the results from each CCX (or otherwise) could be negligible or massive, I have no idea[*] What would you guess is the biggest bottleneck to doing LL/PRP with minimal DRAM access?[*] Are there any gotchas?[*] What do you think the potential exponent range could be?[/LIST] I know this is theoretical nonsense and no one has the time to create a specific solution for such niche hardware even if it were viable. Still, it's interesting to know whether it's actually viable.
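The bandwidth figures in that list follow directly from the IFOP link width of 32 bytes per fabric clock, with the fabric clock tied to the DRAM clock (half the DDR4 transfer rate). A quick check (small differences from the quoted numbers are rounding):

```python
# Infinity Fabric inter-die (IFOP) bandwidth scales with the memory clock:
# each link moves 32 bytes per fabric clock, and the fabric clock equals the
# DRAM clock (half the DDR4 transfer rate).

IFOP_BYTES_PER_CLOCK = 32

def ifop_bandwidth_gbps(dram_mhz):
    """Per-link bandwidth in GB/s at a given DRAM clock in MHz."""
    return IFOP_BYTES_PER_CLOCK * dram_mhz * 1e6 / 1e9

for ddr, mclk in ((2666, 1333), (3200, 1600), (4000, 2000)):
    print(f"DDR4-{ddr} (memclk {mclk} MHz): "
          f"{ifop_bandwidth_gbps(mclk):.2f} GB/s per link")
```

So faster RAM helps the two memoryless dies twice over: it raises both the DRAM bandwidth itself and the fabric bandwidth they must cross to reach it.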
Maybe not a fair comparison, but in my previous attempts to run multi-threaded across a two-socket Xeon NUMA system, let's just say the 2nd socket wasn't utilised well at all, and I got much better results with one task per socket.
I think the question will come down to how local the data can be kept to the cores working on it. The multi-die approach can be treated as NUMA, if memory serves me correctly from 1st-gen TR, and I don't think it'll be a benefit to turn that off either. Communication between the dies would be my concern.
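The "one task per socket" approach translates to one worker per die here. A sketch of the core grouping (this assumes 8 consecutively numbered cores per die and that the dies with local DRAM enumerate as nodes 0 and 2, as has been reported for the 2990WX on Linux; verify the real topology with `numactl --hardware` or hwloc before pinning anything):

```python
# Sketch: split a 32-core / 4-die Threadripper into one worker group per die,
# mirroring the "one task per socket" approach used on the two-socket Xeon.
# Core numbering and node-to-die mapping are assumptions -- check with
# `lscpu`, `numactl --hardware`, or hwloc on the actual machine.

CORES = 32
DIES = 4

def cores_for_die(die, cores=CORES, dies=DIES):
    """Core IDs for one die, assuming consecutive numbering per die."""
    per_die = cores // dies
    return set(range(die * per_die, (die + 1) * per_die))

for die in range(DIES):
    memory = "local DRAM" if die in (0, 2) else "remote DRAM via IF"
    print(f"die {die}: cores {sorted(cores_for_die(die))} ({memory})")

# Each worker could then be pinned on Linux with, e.g.:
#   os.sched_setaffinity(pid, cores_for_die(die))
# or launched via: numactl --cpunodebind=<die> ./worker
```

Workers on the memoryless dies will still pay the Infinity Fabric toll for every DRAM access, but at least pinning stops the scheduler from bouncing threads between near and far memory mid-run.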
If people were choosing between the best AMD CPU of this batch and the best X series Intel CPU to be released this fall, which would be better for world record LL testing?
[QUOTE=simon389;493292]If people were choosing between the best AMD CPU of this batch and the best X series Intel CPU to be released this fall, which would be better for world record LL testing?[/QUOTE]
The one with the most memory bandwidth.