mersenneforum.org

Would a massive cache make much difference? (https://www.mersenneforum.org/showthread.php?t=23196)

tServo 2018-03-26 19:37

Would a massive cache make much difference?
 
GW has just started a thread about a new build, and one of the main topics being discussed is the usual memory bottleneck. Recently, I had the opportunity to purchase a Dell Precision workstation with a Xeon E5-2699 v4 processor for less than the cost of the processor itself. This CPU has 55 MB of cache! If only one task is running, it can run at 3.6 GHz and use all the cache for itself. I was wondering what effect that would have on Prime95. Would that make it super fast? Would P95 still have to deal with RAM bottlenecks, or would everything fit in that vast amount of cache? Just wondering.

chalsall 2018-03-26 19:49

[QUOTE=tServo;483455]Just wondering.[/QUOTE]

Run some experiments.

Find out for yourself.

tServo 2018-03-26 20:43

[QUOTE=chalsall;483457]Run some experiments.

Find out for yourself.[/QUOTE]

I didn't buy the machine so I can't.

VictordeHolland 2018-03-26 21:27

The (slower) 128MB L4 cache in Iris Pro products didn't make a huge difference:
[URL]http://www.mersenneforum.org/showpost.php?p=368793&postcount=485[/URL]

It's all about getting enough memory bandwidth to keep the (many) cores occupied.

The E5-2699 v4 has 4 memory channels and a total maximum memory bandwidth of 76.8 GB/s:
[url]https://ark.intel.com/products/91317/Intel-Xeon-Processor-E5-2699-v4-55M-Cache-2_20-GHz[/url]
So beyond 8-10 cores (depending on the turbo speeds under AVX2), throughput will stop scaling because memory bandwidth is saturated. There are probably some benchmarks in the benchmarks thread.
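
As a rough check of the 76.8 GB/s figure, here is a minimal sketch. It assumes DDR4-2400 (2400 MT/s) on a 64-bit (8-byte) channel, which is not stated above, and the function names are made up for illustration. It also shows how thin the per-core share becomes with all 22 cores busy.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in decimal GB/s.

    Assumes every channel moves `bus_bytes` per transfer (8 bytes for a
    standard 64-bit DDR channel). Real sustained bandwidth is lower.
    """
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


def per_core_share_gb_s(total_gb_s: float, active_cores: int) -> float:
    """Bandwidth available to each core if it is shared evenly."""
    return total_gb_s / active_cores


# Assumption: 4 channels of DDR4-2400.
total = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=2400)
print(f"peak: {total:.1f} GB/s")

# Saturation at 8-10 cores would imply each core wants roughly this much:
for cores in (8, 10, 22):
    print(f"{cores:2d} cores -> {per_core_share_gb_s(total, cores):.2f} GB/s per core")
```

These are theoretical peaks only; the point is that 22 cores each get well under half of what a core has when only 8-10 are active.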

mackerel 2018-03-27 08:32

I found the L4 eDRAM of Broadwell to help a lot, relative to similar Haswell CPUs. The problem there was that my systems only had 1600 RAM, which was choking things. The eDRAM, from memory, was specified at 50 GB/s bandwidth, or ballpark comparable to dual-channel 3200 RAM. For practical purposes, my i5-5675C was not bandwidth limited, which helps, as they reduced its L3 compared to other CPUs.

A word of caution: I'm not sure the new mesh cache introduced with Skylake-X works as well with current code as the historic ring cache did. I overclocked mine 50%, which helps a bit, but I suspect more consideration from software may help with that.

fivemack 2018-03-27 09:01

[QUOTE=tServo;483455]GW has just started a thread about a new build, and one of the main topics being discussed is the usual memory bottleneck. Recently, I had the opportunity to purchase a Dell Precision workstation with a Xeon E5-2699 v4 processor for less than the cost of the processor itself. This CPU has 55 MB of cache! If only one task is running, it can run at 3.6 GHz and use all the cache for itself. I was wondering what effect that would have on Prime95. Would that make it super fast? Would P95 still have to deal with RAM bottlenecks, or would everything fit in that vast amount of cache? Just wondering.[/QUOTE]

For a 4M FFT, everything would indeed fit in the cache (32 MB of active data plus about 10 MB of weights), and it probably still would for a 5M FFT. Running double-checks with a 2M FFT on a 20 MB-cache Xeon is definitely pleasingly fast.
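
The 32 MB figure follows from 4M elements at 8 bytes each. A minimal sketch of that estimate is below. It assumes one double-precision value per FFT element, and the 5M case assumes the weights scale linearly from the 10 MB quoted above; that scaling is a guess, not a measured number. The function names are hypothetical.

```python
MIB = 1024 * 1024


def fft_data_mib(fft_len: int, bytes_per_element: int = 8) -> float:
    """Size of the active FFT data in MiB, assuming one double per element."""
    return fft_len * bytes_per_element / MIB


def fits_in_cache(fft_len: int, cache_mib: float, weights_mib: float) -> bool:
    """True if FFT data plus weight tables fit within the given cache size."""
    return fft_data_mib(fft_len) + weights_mib <= cache_mib


four_m = 4 * MIB  # "4M FFT" = 4 * 2**20 elements
five_m = 5 * MIB

print(fft_data_mib(four_m))  # 32.0
print(fits_in_cache(four_m, cache_mib=55, weights_mib=10))

# Assumption: weights grow linearly with FFT length (10 MiB * 5/4 = 12.5 MiB).
print(fits_in_cache(five_m, cache_mib=55, weights_mib=12.5))
```

This ignores everything else competing for L3 (code, scratch buffers, other processes), so "fits" here is an upper bound on how comfortable the fit really is.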

But accessing the L3 cache from a core on the other side of the other ring will still take quite perceptible time.

retina 2018-03-27 09:14

[QUOTE=fivemack;483512]But accessing the L3 cache from a core on the other side of the other ring will still take quite perceptible time.[/QUOTE]

Yeah. And therein lies the inherent problem of caches: the larger they are, the slower they are. It's just the nature of how the chips are constructed, not some conspiracy with DRAM manufacturers.

Maybe a "better" way of having more cache is to break it down and distribute it among the cores and keep it close to them. Oh wait, we already do that with L1 (and sometimes with L2). And if one were able to make the L1 larger then it would also become slower. Oh the trade-offs ... we hates thems.

So be wary of CPUs with super-large caches: they might harm performance rather than enhance it. It depends on how the cache is used.

mackerel 2018-03-27 17:13

A question, if I may. Assume we have a multi-core CPU running an FFT with multiple threads, fitting within the L3 cache. For most of the time, does each thread need to access all the available data, or just a specific part of it?

I'm speculating that if the same data is needed by more than one thread at the same time, then the inclusive L3 cache traditionally used by Intel might be better suited than the exclusive or non-inclusive structures in Ryzen and Skylake-X respectively. In the latter cases, would there be extra data shuffling between L2 and L3? That is, unless the code were written to avoid such a scenario.

Not a programmer, not a CPU architect. Don't shoot me if the above doesn't make sense!

Mark Rose 2018-03-27 17:54

[QUOTE=mackerel;483545]A question, if I may. Assume we have a multi-core CPU running an FFT with multiple threads, fitting within the L3 cache. For most of the time, does each thread need to access all the available data, or just a specific part of it?

I'm speculating that if the same data is needed by more than one thread at the same time, then the inclusive L3 cache traditionally used by Intel might be better suited than the exclusive or non-inclusive structures in Ryzen and Skylake-X respectively. In the latter cases, would there be extra data shuffling between L2 and L3? That is, unless the code were written to avoid such a scenario.

Not a programmer, not a CPU architect. Don't shoot me if the above doesn't make sense![/QUOTE]

You may find this PDF insightful: [url]https://people.freebsd.org/~lstewart/articles/cpumemory.pdf[/url]

It doesn't cover the new Skylake-X mesh cache, but it will explain just about everything you ever wanted to know about memory.

kladner 2018-03-27 22:50

[QUOTE=Mark Rose;483552]You may find this PDF insightful: [URL]https://people.freebsd.org/~lstewart/articles/cpumemory.pdf[/URL]

It doesn't cover the new Skylake-X mesh cache, but it will explain just about everything you ever wanted to know about memory.[/QUOTE]
Thanks for (re?)-posting that article. I am sure I have seen something like this, and that I have a copy "somewhere." I had forgotten its existence. :confused2:

NookieN 2018-03-30 02:29

I have the chip in question, and it is fast, but I didn't find any compelling reason to run more than one worker. With all 22 cores active, they run at 2.6 GHz. For a 4M FFT, the peak throughput was about 725 iter/s (1.38 ms per iteration) with a single worker using all 22 cores. Two workers are noticeably slower, and with three or more workers the total throughput was always around 300 iter/s. I can rerun and post benchmarks or gwnum if anyone is really curious.
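
For anyone comparing these numbers with their own timings, the conversion between iterations per second and milliseconds per iteration is just a reciprocal. A small sketch using the figures above (the helper name is made up):

```python
def ms_per_iteration(iters_per_s: float) -> float:
    """Convert total throughput in iterations/second to ms per iteration."""
    return 1000.0 / iters_per_s


one_worker = 725      # iter/s, single worker on all 22 cores
many_workers = 300    # iter/s total, three or more workers

print(round(ms_per_iteration(one_worker), 2))    # 1.38
print(round(ms_per_iteration(many_workers), 2))  # 3.33

# Relative throughput of the single-worker setup over 3+ workers.
print(round(one_worker / many_workers, 2))       # 2.42
```

So on these figures the single-worker configuration delivers roughly 2.4 times the total throughput of running three or more workers.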

