mersenneforum.org Skylake and RAM scaling

 2016-02-20, 19:59 #1 mackerel     Feb 2016 UK 2⁵×13 Posts

Skylake and RAM scaling

I tested. I tested some more. Had a mug of tea, then went back for more testing. Then I made some charts, and here are the results! (It looks like I can't embed images here, so you'll have to do a little more work clicking the links to the charts.)

Test system:
CPU: i7-6700k at 4.2 GHz, ring at 4.1 GHz, HT off
Mobo: MSI Gaming Pro, BIOS 1.7
GPU: 9500 GT (just to make sure no ram bandwidth is stolen by integrated graphics)
RAM: for the results presented I used two kits:
G.Skill F4-3333C16-4GRRD Ripjaws 4, 4x4GB kit
G.Skill F4-3200C16-8GVK Ripjaws V, 2x8GB kit

Testing was performed using the Prime95 28.7 built-in benchmark in Windows 7 64-bit. Each setting was run once, after the PC had been given time to settle down after rebooting. All test configurations had the ram in dual channel mode. Timing values listed are ordered CAS-RCD-RP-RAS, as commonly shown in most software.

Most testing was done with all 4 modules of the Ripjaws 4 kit fitted, for reasons discussed later. This ram is known from previous experience not to boot in this mobo at 3333 with 4 modules fitted, so I tested at common ram speeds from 2133 to 3200. To avoid complicating matters with timings, these were fixed at 16-18-18-38 for the scaling tests, which may disadvantage the slower speeds since the values would typically be lower in practice. Latency is considered separately later.

http://s5.postimg.org/uibfgcfvr/p95_1w.png
With 1 worker, as the ram clock increases we see no significant difference in performance. This is not ram limited.

http://s5.postimg.org/tl507h4dj/p95_2w.png
With 2 workers there is a slight increase in performance as ram clocks go up, but not much.

http://s5.postimg.org/ac7xhgk87/p95_3w.png
With 3 workers we are starting to see something happen.

http://s5.postimg.org/vzwvrwkmf/p95_4w.png
And with 4 workers we see a clear relation between ram speed and performance.
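For reference, the theoretical peak bandwidth at each tested speed can be worked out from the transfer rate; a quick sketch (dual-channel DDR4, 8 bytes per transfer per 64-bit channel — these are theoretical peaks, not measured figures):

```python
# Theoretical peak DDR4 bandwidth: transfer rate (MT/s) x 8 bytes per
# transfer per 64-bit channel x number of channels.
def ddr4_peak_gbps(mt_per_s, channels=2):
    return mt_per_s * 8 * channels / 1000  # GB/s, decimal units

for speed in (2133, 2400, 2666, 2800, 3200):
    print(f"DDR4-{speed}: {ddr4_peak_gbps(speed):.1f} GB/s")
```

So the tested range spans roughly 34 GB/s (DDR4-2133) to 51 GB/s (DDR4-3200) in dual channel.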
http://s5.postimg.org/paqcbvzaf/p95_4w_latency.png
Here we alter the display a bit so we can compare ram timings. Three speeds are tested. Actually, two of these are not exciting. At 3200 the results for 15-16-16-36 and 16-18-18-38 are practically identical. At 2800, 14-16-16-36 and 16-18-18-38 gave a 1% average advantage to C14, but this is so small it is hard to say whether it is just measurement variation. It gets a little more interesting at 2133, where 3 timing sets were tested: 14-14-14-35, 15-15-15-35, and 16-18-18-38. The last one is on average 4% slower than the other two, which were the same as each other. This may be an area for future research, although it seems ram speed is more important to performance. Timings might get you a little more as a secondary optimisation.

http://s5.postimg.org/jp3x7tylj/p95_cpu_clock.png
Putting the ram to one side, how does CPU speed affect performance? These 4 lines show the combinations of the CPU at 3.5 and 4.2 GHz, with 1 or 4 workers active. With one worker, the scaling is near perfect: the faster CPU is 19% faster, compared to 20% for ideal clock scaling. With 4 workers, it would seem the ram is the limit: we only see a 4% increase for the 20% clock increase. This may present opportunities for power saving, as the higher clock doesn't help much here. It may be interesting to see how scaling applies over a wider range of CPU speeds.

http://s5.postimg.org/8e19jgrqf/p95_rank.png
And finally, this is the cause of some unexpected behaviour I saw. I had two comparable systems, but I saw a massive performance difference between them which I struggled to explain. I tried various things and even wrongly blamed the mobo for being rubbish, but it would seem module rank has a major influence. This isn't commonly discussed or even specified. I found Thaiphoon Burner as free software that can read this. The Ripjaws 4 modules are single rank, and the Ripjaws V modules are dual rank (caution: other parts in the series may vary!).
General consensus seems to be that having more ranks can slightly increase bandwidth, at the cost of slightly higher latency.

This chart is going to take some explaining. It again shows the 4-worker throughput. The grey line is the Ripjaws V kit, and the light blue line is the Ripjaws 4 kit with 4 modules fitted, both at 3200. So each memory channel carries a total of 2 ranks, and performance is so identical you can't see the light blue line under the grey line! So far so good?

Let's take two of the Ripjaws 4 modules out, leaving it running in dual channel mode. Logically, this shouldn't make a difference: it is still 2 channels, running at the same clock and timings. Nope. We see a 19% drop in performance (orange line). This is massive! How massive? The yellow and blue lines are 4 modules running at 2666 and 2400 respectively, and they fall neatly either side of the orange line. That is quite a performance drop!

The tentative conclusion is that it seems worth having the higher rank modules, or running more modules to get the same effect; otherwise you will reduce your potential significantly. Unfortunately it doesn't seem that easy to find out what rank a module is. Ideally more testing could be done to make sure it is the rank, and not something else. I'd need, for example, 8GB modules with single rank to make sure the module capacity isn't in some way influencing it. Or alternatively, 4GB modules with dual rank.

I have quite a lot of data from this testing, so if there are different ways the data could be cut, I could have a go at showing it.
2016-02-21, 01:32   #2
Madpoo
Serpentine Vermin Jar

Jul 2014

3293₁₀ Posts

Quote:
 Originally Posted by mackerel I tested. I tested some more. Had a mug of tea, then went back for more testing.
Those are interesting graphs indeed.

It's a little eye-opening to me to see how CPU speed didn't alter the iter/s measurement with multiple workers, but on reflection it makes sense. You're probably right that it's because it's being memory-bandwidth throttled at that point, so more CPU cycles/sec isn't going to do much.

The comparisons of memory speed are also interesting, and pretty much demonstrate what we knew (although we hadn't tested it as exhaustively): that faster memory makes a HUGE difference with multiple workers.

Now, none of that reflects what happens when you have a single worker, but with 1-4 cores assigned to that single worker. As in, how does the performance scale with different mem speeds as more cores are added in to that worker. That might be interesting to see in the same level of detail, but I know it's not easy or quick to do tests like that.

 2016-02-21, 02:19 #3 Prime95 P90 years forever!     Aug 2002 Yeehaw, FL 3²×823 Posts If possible, it would be useful to see power draw for all these combinations. Some are trying to maximize (iter/s)/watt.
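To make that metric concrete, efficiency could be tabulated as iterations per second per watt. A minimal sketch with made-up throughput and power figures (only the metric itself is the point; real values would come from the benchmark and a wall meter):

```python
# Hypothetical illustration of (iter/s)/watt for two CPU clocks.
# The iter/s and wattage numbers below are invented for illustration.
def iters_per_watt(iters_per_sec, watts):
    return iters_per_sec / watts

configs = {"4.2 GHz": (10.4, 120.0),   # placeholder iter/s and watts
           "3.5 GHz": (10.0, 90.0)}    # placeholder iter/s and watts
for name, (ips, w) in configs.items():
    print(f"{name}: {iters_per_watt(ips, w):.4f} iter/s/W")
```

With numbers shaped like mackerel's 4-worker result (only ~4% more throughput for a 20% higher clock), the lower clock would win comfortably on this metric.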
 2016-02-21, 10:22 #4 mackerel     Feb 2016 UK 640₈ Posts CPU speed did have an influence, just a lot less than would be expected based on CPU clock alone. I guess we can also look at that the other way: we could reduce the clock (or choose a lower-clocked CPU to start with) and give up less performance than you'd think, with what could be a more significant power saving. With hindsight I could have measured the power during the test, but I missed that opportunity. Power measurements might not be best during benchmarking anyway, since the load, and therefore power, would be varying. I'll have a bit of a think about how to approach that; more likely I'll have to do something with the stress test to get a more stable scenario. What size FFT would be used for leading edge real work right now?
2016-02-21, 10:55   #5
ATH
Einyen

Dec 2003
Denmark

2×29×53 Posts

Quote:
 Originally Posted by Prime95 If possible, it would be useful to see power draw for all these combinations. Some are trying to maximize (iter/s)/watt.
You almost sound like you are living in Denmark, paying $0.33 for each kWh. With my new computer, including a Titan Black running all the time, it looks like I'm going to pay $150+ every month in electricity.

Last fiddled with by ATH on 2016-02-21 at 10:57
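For what it's worth, those two figures imply an average draw that can be back-calculated (the $0.33/kWh and $150/month come from the post; the 720-hour month is an approximation):

```python
# Back-of-envelope: what average draw costs ~$150/month at $0.33/kWh?
rate = 0.33          # $/kWh, from the post
hours = 24 * 30      # ~720 hours in a month
cost = 150.0         # $/month, from the post
avg_watts = cost / (rate * hours) * 1000
print(f"~{avg_watts:.0f} W average draw")  # prints ~631 W
```

So roughly 630 W sustained, which is plausible for a loaded CPU plus a Titan Black.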

2016-02-21, 16:35   #6
chris2be8

Sep 2009

3751₈ Posts

Quote:
 Originally Posted by mackerel Let's take two of the Ripjaws 4 modules out, leaving it running in dual channel mode.
Are you sure it was running in dual channel mode? If the slots are numbered 1,2,3,4 some motherboards need modules in 1 and 3 to get dual channel mode, not 1 and 2.

Comparing results with modules in slots 1 and 2 against modules in slots 1 and 3 would be one way to find out. And comparing with only 1 module would give you a baseline speed for single channel mode.

Chris

2016-02-21, 18:01   #7
Prime95
P90 years forever!

Aug 2002
Yeehaw, FL

3²×823 Posts

Quote:
 Originally Posted by mackerel I'll have a bit of a think about how to approach that, but more likely I'll have to do something with the stress test to get a more stable scenario. What size FFT would be used for leading edge real work right now?
The 4096K FFT size is typical for leading edge work.

I just replaced my broken kill-a-watt, plus I just got a Z170 board and DDR4-3200 memory, so I'll be generating some of that data myself... maybe. I got it running using Ubuntu 14.04 installed on a USB stick. I powered down to install the kill-a-watt and now it won't boot off the USB stick.
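As a sanity check on why large FFTs are bandwidth-bound: a 4096K FFT of double-precision values already exceeds the i7-6700K's 8 MB L3 just for the raw data (a rough sketch; Prime95's real working set is somewhat larger than the bare array):

```python
# Rough working-set estimate for a 4096K-point FFT of doubles.
fft_size = 4096 * 1024        # FFT points
bytes_per_double = 8
data_mib = fft_size * bytes_per_double / (1024 * 1024)
l3_mib = 8                    # i7-6700K L3 cache size in MiB
print(f"FFT data: {data_mib:.0f} MiB vs {l3_mib} MiB L3")  # 32 MiB vs 8 MiB
```

With the data several times larger than L3, every pass over the array goes out to main memory, which is consistent with the ram-speed scaling seen at the larger FFT sizes.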

 2016-02-21, 19:30 #8 mackerel     Feb 2016 UK 2⁵·13 Posts Chris, I use slots 2 and 4 (counting 1 as nearest the CPU). This seems consistent across pretty much all 4-slot mobos as the best configuration for dual channel mode, even better than 1 and 3 for some reason (when overclocking). Software like CPU-Z reports dual channel mode in this scenario. The performance drop from running single channel is very obvious.
 2016-02-25, 00:12 #9 mackerel     Feb 2016 UK 2⁵·13 Posts http://s5.postimg.org/dwdnzw7s7/p95_bwpc.png

I just tried to find a way to re-express the previous data to give better insight into ram requirements. I think I found it.

I took the 3200-speed ram result as the reference, as it was the fastest I ran and thus should be least limited. Strictly speaking it would be nice to have even more bandwidth, but that's not happening unless I get a quad channel ram system. Anyway, I took the 1, 2, 3, 4 worker results for the tested ram speeds (2133 to 3200) and divided them by the reference results to give a scaling indication. That was then divided again by the number of workers. I then divided the resulting value into the calculated ram bandwidth at each speed. For indication, 3200 ram in dual channel mode should offer about 51.2 GB/s (3200 MT/s × 8 bytes × 2 channels). I tested with 4 single rank modules, so this is the higher performing state with 2 ranks per channel. The CPU was the i7-6700k at 4.2 GHz with cache at 4.1 GHz.

As there was a lot of data, I tried to simplify it by only showing the 1, 2, 4, 8M FFT sizes. They follow a similar trend, with minor variations throughout. I will leave that for another day, but the overall trend is clear enough: more bandwidth = faster, up to a point of diminishing returns.

I should add this only applies to the Skylake at 4.2 GHz. Presumably a slower clocked CPU will have more ram bandwidth relative to the CPU speed, which would shift the charts up a bit. I need to test this and will have to work in some lower clocked CPU results later. I need to do more checking to see how well this fits with past real data on scaling.
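The normalisation described above could be sketched like this (the throughput values here are placeholders, not the measured data; the steps are: scale to the 3200 reference, divide by worker count, then divide that into the bandwidth):

```python
# Sketch of the normalisation: ram bandwidth divided by per-worker
# throughput scaling relative to the 3200 reference run.
def bandwidth_per_scaling(throughput, ref_throughput, workers, bw_gbps):
    scaling = throughput / ref_throughput   # fraction of the 3200 result
    per_worker = scaling / workers
    return bw_gbps / per_worker

# Example: a 4-worker run at 90% of the reference throughput, with the
# ~38.4 GB/s theoretical peak of dual-channel DDR4-2400 (placeholder values).
print(bandwidth_per_scaling(0.9, 1.0, 4, 38.4))  # ≈ 170.7
```

A run that matches the reference exactly yields bandwidth × workers, so lower values on this metric mean less bandwidth was needed per unit of relative throughput.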
 2016-02-25, 10:21 #10 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 13×491 Posts Thank you for doing these measurements, it's really interesting data. When you run with one worker, is that one worker using all four cores, or one worker using one core? It's mostly through impatience - I quite like seeing a 50M number double-checked in 48 hours flat - but I tend to run with one worker using all the cores in the machine. My guess is that that will be memory-independent at the smaller sizes (where the working set fits in L3) and memory-limited at larger sizes.
 2016-02-25, 19:51 #11 mackerel     Feb 2016 UK 640₈ Posts It is doing whatever the built-in benchmark is doing. I think one worker = 1 core, and for multiple workers they each use a core independently. I haven't actually looked around for other options. Maybe I should... I also gave a bit of thought to adding CPU clock to the chart in some way. Let's say, if you halve the clock rate of the CPU without changing anything else, you would have the same efficiency if you likewise halved the memory bandwidth. Would it really be as simple as changing the horizontal axis to scale for CPU clock too? I have yet to test this idea, but I think all I need to do is re-scale it as something like ram bandwidth per working-core GHz or similar.
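That re-scaling amounts to normalising bandwidth by total compute. One reading of the suggestion, with illustrative numbers (this is a sketch of the proposed metric, not something verified by the data yet):

```python
# Bandwidth per working-core-GHz: if this is the metric that matters,
# halving both CPU clock and memory bandwidth should land on the same point.
def gbps_per_core_ghz(bw_gbps, cores, ghz):
    return bw_gbps / (cores * ghz)

print(gbps_per_core_ghz(51.2, 4, 4.2))   # 4 cores at 4.2 GHz, DDR4-3200
print(gbps_per_core_ghz(25.6, 4, 2.1))   # halved clock, halved bandwidth: same value
```

Both calls print the same number, which is exactly the equivalence the halving thought-experiment above proposes; whether real runs collapse onto one curve this way is what remains to be tested.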
