mersenneforum.org Zen 3 speculation

2020-11-14, 22:53   #45
chalsall
If I May

"Chris Halsall"
Sep 2002

2²·2,393 Posts

Quote:
 Originally Posted by PhilF That's a cool test bench!
ROFLMAO... Clearly Mike doesn't have cats...

2020-11-15, 00:30   #46
Xyzzy

"Mike"
Aug 2002

2⁵×3×5×17 Posts

Our study is a cat-free environment. The rest of the house?

2020-11-15, 00:52   #47
Xyzzy

"Mike"
Aug 2002

2⁵×3×5×17 Posts

Quote:
 Originally Posted by Xyzzy The board we ended up with (C8I) "trained" to a 1T command rate for our memory. We have never gotten 1T to work on any previous board with this memory and certainly not automatically.
Further detective work has revealed that we have "geardown mode" enabled. This is apparently a stability option.

Here is an explanation from https://www.reddit.com/r/overclockin...ram_overclock/
Quote:
 What GDM does is essentially forces the tCL and tCWL timings* to use an internal half-frequency clock instead of the memory clock. That is, if you're running for example 3000MHz, instead of the timings running off of 1500MHz (the real memory clock), they will reference a 750MHz clock. To make this work, the timings have to** be rounded up and divided by two. So if you're running CAS 15 with GDM on, the system will tell you you're running at CAS 16, but technically you're actually running CAS 8 at half the frequency. The latency works out the same, CAS commands are just asserted half as often. So that's why it increases stability: it both loosens tCL and tCWL if they are odd and reduces the rate at which the corresponding signals are asserted, which all means the memory is a little bit less stressed.
 * - not sure if there are others but that's what I've read
 ** - "have to" might be strong wording
Quote:
 Memory has two communication interfaces - the data bus which goes direct from pins on the CPU to pins on a memory chip and runs at the full DDR speed (eg 3200MT/s for DDR4-3200), and the command/address bus which goes from the CPU to ALL the memory chips via a loop-the-loop* and runs at the reduced "physical clock" speed (eg 1600MHz for DDR4-3200). The command/address bus can often be a limit on memory speed. As /u/varexos717 said, geardown mode slows down the command/address bus by only allowing communication to take place every other cycle. The communication still only takes one cycle (as opposed to 2T command rate where a command has to be sent over two cycles), but then the bus can return to a 'neutral' level between a 1 and 0 which make it easier for the next signal to get through.
 *This is not a joke. DDR5 will have a much more sensible layout.
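
The rounding described in the first quote is easy to check with a little arithmetic. Below is a minimal Python sketch, not anything taken from a BIOS or memory controller: the function names are hypothetical, and DDR4-3200 is assumed only because the second quote uses it as its example.

```python
def geardown_cas(cl: int) -> int:
    """Round an odd CAS latency up to the next even value.

    This mirrors the behaviour the quote attributes to geardown mode:
    even values are left alone, odd values are bumped up by one.
    """
    return cl + (cl % 2)


def cas_latency_ns(cl: int, transfer_rate_mts: float) -> float:
    """CAS latency in nanoseconds.

    The memory clock is half the transfer rate (DDR4-3200 -> 1600 MHz),
    so one clock cycle lasts 2000 / transfer_rate_mts nanoseconds.
    """
    return cl * 2000.0 / transfer_rate_mts


if __name__ == "__main__":
    rate = 3200  # assumed DDR4-3200 kit, for illustration only
    for cl in (14, 15, 16):
        effective = geardown_cas(cl)
        print(f"CL{cl} -> CL{effective} with geardown, "
              f"{cas_latency_ns(effective, rate):.2f} ns")
```

Passing half the transfer rate together with half the CAS value gives the same number of nanoseconds, which is the "CAS 8 at half the frequency" point made in the quote.
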
We put in the new memory, which is the exact same as the old memory except it is dual rank instead of single rank and it has twice the capacity.

Old: https://www.gskill.com/product/165/1...35V16GB-(2x8GB)
New: https://www.gskill.com/product/165/1...5V32GB-(2x16GB)

We "erased" the motherboard's memory timings and had it go through the training process again. It ended up with the same numbers as before even though the signal/clock/whatever load is significantly increased.

As an experiment, we then forced geardown mode off and the command rate to 1. It passed a severe memory check with that setting, but any gain we measured was lost in the run-to-run variation of our benchmarks. IOW, we think the difference was negligible. So we enabled geardown mode to have a safety net for stability. We like fast things but only if they are utterly reliable.

2020-11-15, 00:54   #48
Xyzzy

"Mike"
Aug 2002

2⁵×3×5×17 Posts

Attached are benchmark timings for our usual 560K and 6144K FFT lengths.

You may notice that the 560K FFT single rank and dual rank timings are very similar. We figure this is because the data is cached. The 6144K FFT data shows a surprising (to us) increase in throughput, up to 25% higher with six cores running.

Attached Files
 560K-SR.txt (4.5 KB, 47 views)
 560K-DR.txt (4.5 KB, 46 views)
 6144K-SR.txt (4.5 KB, 50 views)
 6144K-DR.txt (4.5 KB, 51 views)

2020-11-16, 05:41   #49
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

5563₈ Posts

25% higher! Wow! That does say something about how memory-starved the 6-core chip is, though, if rank interleaving can provide that much more bandwidth.

Could you test that dual rank memory configuration at different CPU clock speeds? I'm curious where the "knee" in performance is for the 6144K FFT. That could save a lot of power.

2020-11-16, 20:32   #50
Xyzzy

"Mike"
Aug 2002

2⁵×3×5×17 Posts

Quote:
 Originally Posted by Mark Rose Could you test that dual rank memory configuration at different CPU clock speeds? I'm curious where the "knee" in performance is for the 6144K FFT. That could save a lot of power.
We tested at three levels of power. We only tested one worker because additional workers are rarely if ever faster and by using just one we can benchmark in a reasonable time.

ECO = 40W
STK = 57W
PBO = 105W

Code:
ECO
Timings for 6144K FFT length (1 core, 1 worker): 18.20 ms.  Throughput: 54.95 iter/sec.
Timings for 6144K FFT length (2 cores, 1 worker):  9.67 ms.  Throughput: 103.40 iter/sec.
Timings for 6144K FFT length (3 cores, 1 worker):  6.94 ms.  Throughput: 144.17 iter/sec.
Timings for 6144K FFT length (4 cores, 1 worker):  5.55 ms.  Throughput: 180.20 iter/sec.
Timings for 6144K FFT length (5 cores, 1 worker):  4.85 ms.  Throughput: 206.26 iter/sec.
Timings for 6144K FFT length (6 cores, 1 worker):  4.30 ms.  Throughput: 232.78 iter/sec.

STK
Timings for 6144K FFT length (1 core, 1 worker): 18.11 ms.  Throughput: 55.21 iter/sec.
Timings for 6144K FFT length (2 cores, 1 worker):  9.49 ms.  Throughput: 105.32 iter/sec.
Timings for 6144K FFT length (3 cores, 1 worker):  6.67 ms.  Throughput: 149.83 iter/sec.
Timings for 6144K FFT length (4 cores, 1 worker):  5.23 ms.  Throughput: 191.16 iter/sec.
Timings for 6144K FFT length (5 cores, 1 worker):  4.41 ms.  Throughput: 226.98 iter/sec.
Timings for 6144K FFT length (6 cores, 1 worker):  4.00 ms.  Throughput: 249.94 iter/sec.

PBO
Timings for 6144K FFT length (1 core, 1 worker): 18.33 ms.  Throughput: 54.55 iter/sec.
Timings for 6144K FFT length (2 cores, 1 worker):  9.58 ms.  Throughput: 104.38 iter/sec.
Timings for 6144K FFT length (3 cores, 1 worker):  6.85 ms.  Throughput: 146.04 iter/sec.
Timings for 6144K FFT length (4 cores, 1 worker):  5.24 ms.  Throughput: 190.98 iter/sec.
Timings for 6144K FFT length (5 cores, 1 worker):  4.48 ms.  Throughput: 223.26 iter/sec.
Timings for 6144K FFT length (6 cores, 1 worker):  3.84 ms.  Throughput: 260.43 iter/sec.
PS - Note that ~330 iterations per second with six cores (six times the ~55 iter/sec one-core rate) would be perfect scaling.
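
For anyone who wants to play with the numbers above, here is a rough Python sketch that turns the timings into speedup, scaling efficiency, and six-core throughput per watt. The millisecond values are copied from the table; the function names are hypothetical, and the wattages are the ECO/STK/PBO figures listed above rather than power measured during these runs, so the per-watt numbers are only a loose comparison.

```python
# Iteration times in ms for the 6144K FFT, 1 to 6 cores, copied from the table above.
TIMINGS_MS = {
    "ECO": [18.20, 9.67, 6.94, 5.55, 4.85, 4.30],
    "STK": [18.11, 9.49, 6.67, 5.23, 4.41, 4.00],
    "PBO": [18.33, 9.58, 6.85, 5.24, 4.48, 3.84],
}

# Power figure listed for each mode, in watts (not measured during these runs).
POWER_W = {"ECO": 40, "STK": 57, "PBO": 105}


def scaling(ms_by_cores):
    """Return (cores, iter_per_sec, speedup, efficiency) for each core count.

    Speedup is relative to the one-core time; efficiency is speedup / cores,
    so 1.0 would be perfect scaling.
    """
    base_ms = ms_by_cores[0]
    rows = []
    for cores, ms in enumerate(ms_by_cores, start=1):
        speedup = base_ms / ms
        rows.append((cores, 1000.0 / ms, speedup, speedup / cores))
    return rows


def six_core_iter_per_watt(mode):
    """Six-core throughput divided by the mode's listed power figure."""
    return 1000.0 / TIMINGS_MS[mode][-1] / POWER_W[mode]


if __name__ == "__main__":
    for mode, ms_list in TIMINGS_MS.items():
        print(mode)
        for cores, ips, speedup, eff in scaling(ms_list):
            print(f"  {cores} cores: {ips:7.2f} iter/sec, "
                  f"speedup {speedup:4.2f}x, efficiency {eff:5.1%}")
        print(f"  six-core throughput per watt: "
              f"{six_core_iter_per_watt(mode):.2f} iter/sec/W")
```

Throughput here is recomputed as 1000 / ms from the rounded times, so it can differ slightly from the iter/sec column in the table.
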

2020-11-16, 20:51   #51
Mark Rose

"/X\(‘-‘)/X\"
Jan 2013

5563₈ Posts

That eco mode is super efficient! What clock speeds do you see running in eco?

2020-11-16, 23:07   #52
Xyzzy

"Mike"
Aug 2002

1FE0₁₆ Posts

Quote:
 Originally Posted by Mark Rose That eco mode is super efficient! What clock speeds do you see running in eco?
We were going to try to write down the frequency, but it changes every second and the swings are pretty wild. That is why we used the power figure instead. Note that the power figure was actually measured during a twelve-thread small-FFT torture test, so it is the worst-case scenario.

2020-11-16, 23:18   #53
M344587487

"Composite as Heck"
Oct 2017

2·397 Posts

Ah this is the good stuff. If only there were a way to pipe benchmarks directly into a vein.

2020-11-16, 23:46   #54
PhilF

Feb 2005

617 Posts

Quote:
 Originally Posted by M344587487 Ah this is the good stuff. If only there were a way to pipe benchmarks directly into a vein.
Lol!

2020-11-17, 00:03   #55
M344587487

"Composite as Heck"
Oct 2017

1432₈ Posts

I can quit whenever I want! But there'll be more right...