mersenneforum.org #1

 2020-01-30, 00:32 #111 Xyzzy     "Mike" Aug 2002 5×23×67 Posts
2020-02-01, 21:55   #112
chalsall
If I May

"Chris Halsall"
Sep 2002

2×4,643 Posts

Quote:
 Originally Posted by chalsall I'm going to run an experiment, and use most of it for a swap partition. I sometimes want to do Blender rendering jobs which won't fit in my main workstation's RAM, so I have to spin up a "cloud" instance.
So, my world finally stabilized a little bit, and I was just now able to install the 256 GB SSD. 100 GB for a Fedora 31 root partition, and 120 GB of swap. Left 36 GB un-partitioned for additional fail-over capacity.

I haven't tried running a big Blender job which causes swapping yet, but that will be very interesting. I have to say I've never worked on an SSD based machine before. Wow! Snappy! The latency in the polished rust is nice to get away from.

2020-02-01, 23:09   #113
Runtime Error

Sep 2017
USA

 2⁵·5 Posts

Quote:
 Originally Posted by VBCurtis [...] when the FFT data fits into the CPU cache. That's why on some machines (and for small-enough FFT sizes), one can get nice timings without crushing memory bandwidth. This applies generally either to tests with FFT sizes far below the Prime95 interest level (say, exponents below 10M on other projects), or on Xeons with large L3 caches.
Thanks for the great explanation!

My next naive question is of course, "Why don't CPUs come with larger caches?" Eons ago, I bought a wicked awesome custom computer from a local company and remember upgrading the cache. I'm not aware that this is a build option anymore. Do motherboards still support cache expansion? Is there a non-monetary benefit to having smaller caches? I imagine that if you had a (say) whole gigabyte of L4 cache it wouldn't be very efficient to quickly access the bits you need immediately next, but wouldn't it be faster than going to RAM?

P.S. Sorry for taking this thread away from "big memory", and thanks for the ELI5 explanations, perhaps I should send in tuition payments. You folks rock.

2020-02-01, 23:38   #114
M344587487

"Composite as Heck"
Oct 2017

655₁₀ Posts

Quote:
 Originally Posted by Runtime Error Thanks for the great explanation! My next naive question is of course, "Why don't CPUs come with larger caches?" Eons ago, I bought a wicked awesome custom computer from a local company and remember upgrading the cache. I'm not aware that this is a build option anymore. Do motherboards still support cache expansion? Is there a non-monetary benefit to having smaller caches? I imagine that if you had a (say) whole gigabyte of L4 cache it wouldn't be very efficient to quickly access the bits you need immediately next, but wouldn't it be faster than going to RAM? P.S. Sorry for taking this thread away from "big memory", and thanks for the ELI5 explanations, perhaps I should send in tuition payments. You folks rock.
Motherboards do not support cache expansion. For L3 cache (and probably the rest) there is a latency penalty associated with larger caches but the benefits generally outweigh the negatives.

2020-02-01, 23:41   #115
mackerel

Feb 2016
UK

389 Posts

Not an expert in the area, but I understand that bigger cache = slower access. That's why there are usually three levels of cache on x86 CPUs: small but ultra-fast next to the cores, medium size and medium speed in the second tier, and relatively big but slower in the third tier. Then you reach RAM. There is some overhead to keeping track of what data is in the cache. I think L2 is generally tied to the core, but L3 can be shared between cores to some extent.

Broadwell consumer desktop CPUs were an oddball, with 128 MB of L4 cache. For its time, that was great, as it was practically unconstrained by RAM bandwidth for prime number finding. I didn't lose performance even running a single stick of slow RAM, since it worked out of the cache. However, its cache speed isn't amazing today, so if the design were to be revisited, the cache would have to be much faster.

Removable cache isn't really a thing any more, unless you count Intel Optane, but that acts more like an extra tier between RAM and bulk storage, so it's not of direct help here.
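To make the tier trade-off concrete, here is a toy average-memory-access-time (AMAT) model; all latencies and hit rates below are made-up illustrative numbers, not measurements of any real CPU:

```python
# Toy model of average memory access time (AMAT) in a three-level
# cache hierarchy. Every access that reaches a level pays that level's
# latency; misses fall through to the next tier and finally to RAM.
# All latencies (ns) and hit rates are illustrative assumptions.

def amat(levels, ram_latency):
    """levels: list of (hit_latency_ns, hit_rate), fastest first."""
    total = 0.0
    reach_prob = 1.0  # probability an access gets this far down the hierarchy
    for latency, hit_rate in levels:
        total += reach_prob * latency
        reach_prob *= (1.0 - hit_rate)
    total += reach_prob * ram_latency  # remaining misses go all the way to RAM
    return total

hierarchy = [(1.0, 0.90), (4.0, 0.80), (12.0, 0.75)]  # L1, L2, L3
print(f"AMAT: {amat(hierarchy, 70.0):.2f} ns")  # -> AMAT: 1.99 ns
```

Even with a 70 ns trip to RAM, the average stays near the L1 latency because so few accesses fall all the way through, which is why the small-but-fast tiers pay off.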
2020-02-02, 01:27   #116
xx005fs

"Eric"
Jan 2018
USA

2⁴×13 Posts

Quote:
 Originally Posted by Runtime Error Thanks for the great explanation! My next naive question is of course, "Why don't CPUs come with larger caches?" Eons ago, I bought a wicked awesome custom computer from a local company and remember upgrading the cache. I'm not aware that this is a build option anymore. Do motherboards still support cache expansion? Is there a non-monetary benefit to having smaller caches? I imagine that if you had a (say) whole gigabyte of L4 cache it wouldn't be very efficient to quickly access the bits you need immediately next, but wouldn't it be faster than going to RAM? P.S. Sorry for taking this thread away from "big memory", and thanks for the ELI5 explanations, perhaps I should send in tuition payments. You folks rock.
If you have ever seen a die shot of a Ryzen 3000 series chip, you'll realize that the cache blocks take up a significant amount of die area for the tiny amount of capacity provided. In the future, as cores get more complex and process technology improves, more cache can definitely be crammed onto the die, or even an L4 cache structure like Broadwell's that sits on the package (HBM memory on video cards follows a similar idea to Broadwell's L4 cache). As the cache gets bigger, bandwidth increases, but latency also increases, though not significantly so.

The whole point of having a cache integrated within a CPU is that its latency is significantly lower than memory latency, and the closer it is to the CPU, the lower the latency. Motherboard expansions would just defeat the whole purpose of having a CPU cache, and they would perform similarly to RAM (for example, L3 cache latency can be in the single-digit nanoseconds, while memory latency is generally above 60 ns).

2020-02-02, 05:44   #117
LaurV
Romulan Interpreter

Jun 2011
Thailand

2²×3×739 Posts

Paraphrasing what earlier posters said: CPUs do come with larger caches, but larger cache = slower access, and larger cache + faster access = buckets of money, because each row of cell blocks you add may double or triple the silicon and decrease the fabrication yield.

Re upgrading the cache: you're confusing different layers of cache. The fastest one was always inside the CPU (except for the very early models of CPUs, which had no cache). The slowest, least expensive one may be outside the CPU and could still be upgradable on some mobos, but current RAM is fast enough and provides a wide-enough bus to make external cache obsolete. Many systems could have multiple layers of cache, with pipelines for both instructions and data, of which the fastest (in small amounts) is always internal (in the CPU) and the slowest (in larger amounts) is inside your RAM stick.

What makes the RAM "slow" is the multiplexing of the pins, and the refresh cycle (dynamic RAM). Memories can be static or dynamic. In their quest to make larger memories with smaller dimensions at a cheaper price, manufacturers shrank the RAM cells to such small, micron-scale dimensions that a cell is no longer able to hold its information for long, so the memory cell needs a periodic "refresh". This means "somebody" must read the content of the memory every few milliseconds and write it back. If you find a zero, write a zero. If you find a one, write a one. If you forget to do that, after a few more milliseconds the content of the cell is lost (the charge stored there discharges through the parasitic circuits of the cell) and the memory will contain unreliable, random data. That is called "dynamic" memory, as opposed to "static" memory, where the cells are big enough to retain the electric charge as long as power is applied to them, and no refresh cycle is needed.
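The refresh idea can be sketched as a toy simulation; the decay rate, readability threshold, and refresh interval below are invented for illustration and are not real DRAM physics:

```python
# Toy illustration of why DRAM needs refresh: a cell's stored charge
# leaks over time, and once it drops below a threshold the stored 1 is
# lost. A periodic refresh (read the bit, write it back) restores full
# charge. All constants are made-up illustrative numbers.
THRESHOLD = 0.5     # below this, a stored 1 can no longer be read reliably
DECAY_PER_MS = 0.9  # fraction of charge remaining after each millisecond

def simulate(ms, refresh_every=None):
    """Return the cell's charge after `ms` milliseconds, optionally
    refreshing (rewriting the bit) every `refresh_every` milliseconds."""
    charge = 1.0
    for t in range(1, ms + 1):
        charge *= DECAY_PER_MS               # parasitic leakage each millisecond
        if refresh_every and t % refresh_every == 0:
            charge = 1.0                     # "found a one, write a one"
    return charge

print(simulate(20))                  # no refresh: charge decayed past readability
print(simulate(20, refresh_every=4)) # refreshed: the bit survives
```

Without the refresh loop the charge falls below the threshold within a couple of dozen "milliseconds"; with it, the bit is readable indefinitely, which is the whole job of the DRAM controller's refresh cycle.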
Also, to reduce cost, manufacturers multiplex the access lines for data and addresses. Imagine the memory like a neighborhood map, with horizontal and vertical streets, and houses at the corners, arranged in a square fashion. You can tell the postman "deliver this letter to the house at the intersection of horizontal street 17 and vertical street 23", and he will know exactly where to go. Alternatively, imagine the postman is a bit stupid and can't remember two numbers, only one. You will have to tell him "go to horizontal street 17, and when you are there, call me back". Then, when he is there, he'll call you and you can tell him "go along that street until you intersect vertical street 23 and leave the letter there". This is called a "multiplexed address". It takes more time to reach the target, and you need more communication to get there.

In silicon terms, what makes an integrated circuit expensive nowadays is not the amount of memory, but the package and the number of pins. For example, one ARM Cortex microcontroller (MCU, for short) with 64 pins and 128K of memory costs $2; if you need to double the memory to 256K, the new MCU will cost $2.20, or if you need only half the memory, 64K, you could pay $1.80. That is because the memory is just one grain of silicon more or less inside the package. But if you only need a few inputs/outputs, and move to a package with 48 pins, then the price is $0.80, and if you need more I/Os, because you have to drive a lot of stuff or LCDs, then you will have to pay $4 or $5 for the 80-pin or 100-pin packages, which are much larger and have a lot more metal (the pins) and wires bonded inside (wires made of gold, or gold bumps for non-bonded dies), in spite of the fact that exactly the same chip/die is used in all four packages (in the packages with low pin counts, some MCU pins are simply not used and not bonded internally).
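The postman analogy maps directly onto splitting a flat address into a row half and a column half sent over the same pins in two steps; the bit widths here are illustrative assumptions:

```python
# Sketch of multiplexed addressing: instead of presenting the full
# address on a wide bus at once, the row half and the column half are
# sent over the same narrow pins one after the other (the postman who
# can only remember one number). Bit widths are illustrative.
ROW_BITS = 8
COL_BITS = 8

def split_address(addr):
    """Split a flat address into the (row, col) halves sent in turn."""
    row = addr >> COL_BITS               # high bits: which "horizontal street"
    col = addr & ((1 << COL_BITS) - 1)   # low bits: which "vertical street"
    return row, col

def join_address(row, col):
    """What the memory chip reconstructs after both transfers arrive."""
    return (row << COL_BITS) | col

addr = 17 * (1 << COL_BITS) + 23  # the house at row 17, column 23
print(split_address(addr))        # -> (17, 23)
```

Halving the address pins roughly halves that part of the package cost, at the price of needing two transfers (and the extra strobe signaling) before the cell can even be selected.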
Back to memories: the manufacturers reduce costs by making integrated circuits (ICs) with fewer pins and smaller packages, but the drawback is that you have less bandwidth to communicate with them; your postman can only remember one number. In memory terms, you need to send the address of the cell you want to access in two (or more) chunks: send the first half now, and the second half a bit later, because you do not have enough communication lines (channels) to transmit it all at once. The same goes for reading the data back. So, static memories are much faster (no refresh needed, and most of them need no multiplexing for addresses or data, though there are static RAMs which are multiplexed too), but they are bulky and expensive.

Cache memories are derived from static memories, interposed between your dynamic RAM and your CPU. Every time you read something from the dynamic RAM, the information is stored in the cache too. The next time you need the same data, it is already in the cache and is read from there, without accessing the (slow) dynamic RAM. That is all the trick. Well, mostly. You can have one or more layers of cache: the fastest, most expensive, and smallest in size toward the CPU; the slower, cheaper, larger ones toward the RAM.

There are also systems built entirely from static RAM with a wide enough bus to make multiplexing unnecessary. You could consider these systems to have "only cache memory", as they have minimal latency to read/write the static cells. But they can get bloody expensive. Imagine that it is not only the 512 pins to read the data non-muxed, but you also need 512 pins on the CPU/GPU side, and 512 copper tracks on the PCB/mobo/chipset, whatever... A lot of metal, a lot of money, not to mention how much EMI (electromagnetic interference) all these parallel lines generate, and the investment you need to make to shield against such things... That is why new/fast GPUs are so bloody expensive...
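The caching trick described above can be sketched as a minimal direct-mapped cache; the line count and the backing "RAM" list are toy assumptions:

```python
# Minimal direct-mapped cache sketch of the trick described above: every
# read from "RAM" also lands in the cache, and a repeat access to the
# same address is served from the cache without touching the slow RAM.
# The line count and backing store are toy assumptions.

class DirectMappedCache:
    def __init__(self, ram, num_lines=8):
        self.ram = ram
        self.lines = [None] * num_lines  # each entry: (address, value) or None
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        idx = addr % len(self.lines)     # which cache line this address maps to
        line = self.lines[idx]
        if line is not None and line[0] == addr:
            self.hits += 1               # already cached: no RAM access needed
            return line[1]
        self.misses += 1                 # miss: fetch from RAM and keep a copy
        value = self.ram[addr]
        self.lines[idx] = (addr, value)
        return value

ram = list(range(100, 164))
cache = DirectMappedCache(ram)
for addr in [3, 7, 3, 7, 3]:             # repeated accesses hit after the first fetch
    cache.read(addr)
print(cache.hits, cache.misses)          # -> 3 2
```

Real caches add associativity, dirty/valid bits, and eviction policy on top of this, but "keep a copy of what you fetched, check the copy first" is the entire core idea.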
Last fiddled with by LaurV on 2020-02-02 at 07:11
2020-02-02, 07:03   #118
VBCurtis

"Curtis"
Feb 2005
Riverside, CA

 2³×3²×61 Posts

Quote:
 Originally Posted by Runtime Error Thanks for the great explanation! .... P.S. Sorry for taking this thread away from "big memory", and thanks for the ELI5 explanations, perhaps I should send in tuition payments. You folks rock.
You're quite welcome, and thank you for the kind words. This place is full of inquisitive idiots, some of whom enjoy sharing our findings with the freshly arrived ones. Welcome!

2020-02-02, 09:37   #119
M344587487

"Composite as Heck"
Oct 2017

655₁₀ Posts

Quote:
 Originally Posted by xx005fs ... In the future, as cores get more complex and process technology improves, more cache can definitely be crammed onto the die, or even an L4 cache structure like Broadwell's that sits on the package (HBM memory on video cards follows a similar idea to Broadwell's L4 cache). ....
I can see HBM potentially becoming an L4 of sorts, particularly as we try to break the bandwidth limit for iGPUs. Currently an iGPU shares DDR4 memory bandwidth with the cores, which heavily caps how performant iGPUs can be. A stack of HBM would solve that problem and has the potential to be used as a victim cache (or something similar) by the CPU cores.

I'm probably giving Intel/AMD too much credit; such a thing might exist one day in a mobile form factor that does away with DDR and a discrete card altogether, but it's unlikely to exist in a desktop form factor.

2020-02-02, 11:35   #120
Xyzzy

"Mike"
Aug 2002

5·23·67 Posts

Quote:
 Originally Posted by LaurV In their quest to make larger memories with smaller dimensions at a cheaper price, manufacturers shrank the RAM cells to such small, micron-scale dimensions that a cell is no longer able to hold its information for long, so the memory cell needs a periodic "refresh". This means "somebody" must read the content of the memory every few milliseconds and write it back. If you find a zero, write a zero. If you find a one, write a one. If you forget to do that, after a few more milliseconds the content of the cell is lost (the charge stored there discharges through the parasitic circuits of the cell) and the memory will contain unreliable, random data. That is called "dynamic" memory, as opposed to "static" memory, where the cells are big enough to retain the electric charge as long as power is applied to them, and no refresh cycle is needed.
We use a memory testing program that does this sub-test:
Quote:
 Bit fade test, 2 patterns The bit fade test initializes all of memory with a pattern and then sleeps for 5 minutes (or a custom user-specified time interval). Then memory is examined to see if any memory bits have changed. All-ones and all-zeros patterns are used.
Do you think the memory can hold its contents that long? We've never experienced an error with this particular sub-test.
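For reference, the logic of that sub-test can be sketched as follows; the interval is shortened here for illustration (the real test sleeps for five minutes), and on healthy hardware, with DRAM refresh running underneath, no faded bytes are expected:

```python
# Sketch of the bit fade test logic quoted above: fill a buffer with a
# pattern, wait, then check that no bit has changed. The interval is
# shortened for illustration; the real test sleeps for 5 minutes. Note
# this checks memory *with* the normal DRAM refresh running, which is
# why errors are rare on working hardware.
import time

def bit_fade_test(size_bytes, pattern, interval_s):
    """Return the offsets of any bytes that no longer match the pattern."""
    buf = bytearray([pattern] * size_bytes)
    time.sleep(interval_s)
    return [i for i, b in enumerate(buf) if b != pattern]

for pattern in (0xFF, 0x00):  # all-ones and all-zeros patterns
    bad = bit_fade_test(1 << 20, pattern, interval_s=1)
    print(f"pattern {pattern:#04x}: {len(bad)} faded bytes")
```

A user-space sketch like this can only catch gross failures (a stuck refresh controller, a dying module); it cannot bypass the refresh cycle the way a dedicated memory tester running on bare metal can.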

2020-02-02, 11:45   #121
mackerel

Feb 2016
UK

389 Posts

Quote:
 Originally Posted by M344587487 I can see HBM potentially becoming an L4 of sorts particularly as we try to break the bandwidth limit for iGPU's.
The strength and weakness of HBM is that it is very wide but relatively low-clocked. For a highly parallel workload like a GPU, that's not a problem. For a CPU it might be too wide to be effective unless you have a lot of cores working on the same kind of data.
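As a back-of-envelope illustration of "wide but slow" versus "narrow but fast", here are rough public figures for a first-generation HBM stack and dual-channel DDR4-3200, assumed here only for illustration:

```python
# Back-of-envelope bandwidth comparison: HBM is wide but low-clocked,
# DDR4 is narrow but faster per pin. Figures are rough public specs for
# a first-gen HBM stack and dual-channel DDR4-3200, for illustration.

def bandwidth_gb_s(bus_width_bits, transfers_per_s):
    """Peak bandwidth = bus width (bytes) x transfer rate."""
    return bus_width_bits / 8 * transfers_per_s / 1e9

hbm = bandwidth_gb_s(1024, 1.0e9)  # 1024-bit stack at 1 GT/s
ddr4 = bandwidth_gb_s(128, 3.2e9)  # 2x64-bit channels at 3200 MT/s
print(f"HBM:  {hbm:.1f} GB/s")     # -> HBM:  128.0 GB/s
print(f"DDR4: {ddr4:.1f} GB/s")    # -> DDR4: 51.2 GB/s
```

The HBM stack wins on aggregate bandwidth despite running at less than a third of the per-pin rate, which is exactly the trade-off that suits a GPU's many parallel accesses more than a CPU's latency-sensitive ones.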

Quote:
 I'm probably giving intel/AMD too much credit, such a thing might exist one day in a mobile form factor that does away with DDR and a discrete card altogether but it's unlikely to exist in a desktop form factor.
Intel Foveros might some day turn into that, as well as a future implementation of AMD's chiplet strategy. It'll be interesting to see how things go.