Originally Posted by lavalamp
That is not my understanding of how the L3 cache is configured.

As I understand it:
4 cores share an L3 cache and form a "core complex" (CCX).
2 CCXs on a die are connected by Infinity Fabric; together these are called a CCD.
2 CCD chiplets on the 3900 and 3950X are connected individually to the IO die, and any access of L3 cache or RAM must occur via the IO die.

Additionally, the post you linked to is based on a 3600, which only has a single CCD chiplet.

Therefore I would still expect 2 workers (1 per chiplet) with either 8 or 16 threads to perform optimally, but as I said, benchmarks would be interesting to see.
  • A CCX is 4 cores sharing 16MB of discrete L3, as you say; "conjoined" was a poor choice of words on my part.
  • It's my understanding that there's no intra-chiplet IF in Zen 2: all communication between CCXs has to go through the IO die, even if the CCXs are on the same chiplet. This makes chiplets simpler and memory latency more uniform, at the cost of extra latency in some situations.
  • A 3600 is a 3700X with one core disabled per CCX and likely a worse bin, but with the same 16MB of cache per CCX.
My post was just about throughput of the single-chiplet SKUs. mackerel's data suggests that joblack should be able to get higher throughput on the 3700X with a single worker spanning both CCXs, but joblack's tests contradict that.
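To make the layout above concrete, here is a minimal Python sketch of the topology as described in this thread. The class name `Zen2Part`, the `PARTS` table and the `locate` helper are made up for illustration, and the linear core numbering (cores filled CCX by CCX, then CCD by CCD) is an assumption; the OS may number cores differently, so check with a topology tool before pinning anything.

```python
from dataclasses import dataclass

L3_PER_CCX_MB = 16  # each CCX has its own 16MB of L3, per the discussion above


@dataclass(frozen=True)
class Zen2Part:
    """Hypothetical model of a Zen 2 desktop part (illustration only)."""
    name: str
    ccds: int           # chiplets attached to the IO die
    ccx_per_ccd: int    # 2 on Zen 2
    cores_per_ccx: int  # 4 when fully enabled, 3 with one core disabled

    @property
    def ccx_count(self) -> int:
        return self.ccds * self.ccx_per_ccd

    @property
    def cores(self) -> int:
        return self.ccx_count * self.cores_per_ccx

    @property
    def l3_mb(self) -> int:
        return self.ccx_count * L3_PER_CCX_MB

    def locate(self, core: int) -> tuple:
        """Map an ASSUMED linear core index to (ccd, ccx_in_ccd, core_in_ccx)."""
        if not 0 <= core < self.cores:
            raise ValueError(f"{self.name} has no core {core}")
        ccx, slot = divmod(core, self.cores_per_ccx)
        ccd, ccx_in_ccd = divmod(ccx, self.ccx_per_ccd)
        return ccd, ccx_in_ccd, slot


PARTS = {
    "3600":  Zen2Part("3600",  ccds=1, ccx_per_ccd=2, cores_per_ccx=3),
    "3700X": Zen2Part("3700X", ccds=1, ccx_per_ccd=2, cores_per_ccx=4),
    "3900X": Zen2Part("3900X", ccds=2, ccx_per_ccd=2, cores_per_ccx=3),
    "3950X": Zen2Part("3950X", ccds=2, ccx_per_ccd=2, cores_per_ccx=4),
}

if __name__ == "__main__":
    for part in PARTS.values():
        print(part.name, part.cores, "cores,", part.ccx_count, "CCXs,", part.l3_mb, "MB L3")
```

The point of the model is just that a worker's threads land in one or more separate 16MB L3 slices, which is what the worker-count question hinges on.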

The jury is still out on how efficiently a worker spans CCXs across chiplets. The lack of a direct intra-chiplet IF link at least doesn't rule out 1 worker across everything as viable: if even CCXs on the same chiplet talk via the IO die, crossing chiplets should not add a new kind of penalty. We have a small data point ( ) indicating that a single worker on a 3900X seems to scale fine, but there's no 2+ worker 3900X data, and the test conditions were not equal to those of the 3600 it was compared against (different RAM and possibly FCLK configurations). Everything needs more data.

My guess is that either 1 or 2 workers will be optimal for the 3900X/3950X, depending partly on how saturated RAM bandwidth is with 2 workers and partly on how detrimental spanning more than 2 CCXs is to throughput with 1 worker. It could be a weird situation where 1 worker across 3 CCXs is ideal, as it mostly relieves the RAM bandwidth bottleneck while not incurring too much of an inter-CCX penalty. Let's hope that a test junkie among us gets their hands on a 3900X/3950X sometime this year.
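For anyone who does get to test, the candidate configurations can be listed mechanically. This is a rough Python sketch, assuming every worker gets whole CCXs and all workers get the same number of them; the function name `worker_layouts` is made up here. It only enumerates what to benchmark and says nothing about which layout is actually faster.

```python
def worker_layouts(ccx_count: int, cores_per_ccx: int) -> list:
    """List (workers, ccx_per_worker, threads_per_worker, idle_ccx) candidates.

    Assumption: workers are given whole CCXs, equally sized, and any
    CCXs left over stay idle (e.g. 1 worker across 3 of 4 CCXs).
    """
    layouts = []
    for workers in range(1, ccx_count + 1):
        max_span = ccx_count // workers  # most CCXs each worker can own
        for span in range(1, max_span + 1):
            idle = ccx_count - workers * span
            layouts.append((workers, span, span * cores_per_ccx, idle))
    return layouts


if __name__ == "__main__":
    # 3900X: 4 CCXs with 3 cores each; 3950X would be (4, 4).
    for workers, span, threads, idle in worker_layouts(4, 3):
        print(f"{workers} worker(s) x {span} CCX ({threads} threads each), {idle} CCX idle")
```

On a 3900X this includes the 1 worker across 3 CCXs case (9 threads, 1 CCX idle) alongside the obvious 1x12, 2x6 and 4x3 splits.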
