mersenneforum.org  

2015-10-26, 02:19   #1
ATH

Quad Channel DDR4 vs Dual Channel DDR4

I got my new computer: a Haswell-E 8-core 5960X with 4x8 GB of 3000 MHz 15-16-16-39 RAM (XMP).

I have been curious about the benefits of quad-channel vs. dual-channel RAM for Prime95, so almost the first thing I did was test Prime95 with 4x8 GB in quad channel and 2x8 GB in dual channel. Both setups ran at 3000 MHz 15-16-16-39, with the 5960X running at 3500 MHz and Hyper-Threading turned off.
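For scale, you can work out each setup's theoretical peak bandwidth from the transfer rate alone. Below is a minimal Python sketch. It assumes the "3000 MHz" rating is the effective DDR4 transfer rate (3000 MT/s) and that each channel is 64 bits (8 bytes) wide. Sustained bandwidth in practice is lower than this peak.

```python
def peak_bandwidth_gb_s(mega_transfers_per_sec: float, channels: int,
                        bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (10**9 bytes per second).

    Assumes a 64-bit (8-byte) channel and that the advertised "MHz" figure
    is the effective transfer rate in MT/s, as is usual for DDR4 modules.
    """
    return mega_transfers_per_sec * 1e6 * bytes_per_transfer * channels / 1e9


for channels in (4, 2):
    print(f"DDR4-3000, {channels} channels: {peak_bandwidth_gb_s(3000, channels):.0f} GB/s peak")
```

At identical speed and timings, quad channel doubles the theoretical ceiling (96 vs 48 GB/s). The timings below test whether Prime95 can actually use that extra headroom.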

I tested 40M, 60M and 80M exponents with different combinations of workers and cores per worker. Each timing is the average over 100,000 iterations:

Code:
                           Quad Channel DDR4             Dual Channel DDR4             Dual-channel
                           3000 MHz 15-16-16-39          3000 MHz 15-16-16-39          slowdown
                           iteration times in ms         iteration times in ms

40M exponent(s)
FMA3 FFT 2240K
1 worker 8 cores/worker    1.356                         1.366
2 workers 4 cores/worker   2.935/2.937                   4.159/4.122                   +41%
4 workers 2 cores/worker   5.961/5.968/5.924/5.923       9.472/9.469/9.472/9.473       +59%
8 workers 1 core/worker    11.611/11.590/11.590/11.594   19.372/19.353/19.374/19.379   +67%
                           11.598/11.599/11.622/11.623   19.371/19.370/19.368/19.372

60M exponent(s)
FMA3 FFT 3200K
1 worker 8 cores/worker    2.015                         2.487                         +23%
2 workers 4 cores/worker   4.296/4.297                   6.524/6.525                   +52%
4 workers 2 cores/worker   8.563/8.559/8.565/8.606       13.777/13.751/13.751/13.751   +60%
8 workers 1 core/worker    16.960/16.946/16.931/16.946   27.827/27.823/27.826/27.837   +64%
                           16.988/16.961/16.943/16.942   27.842/27.835/27.843/27.874

80M exponent(s)
FMA3 FFT 4608K
1 worker 8 cores/worker    3.045                         4.622                         +52%
2 workers 4 cores/worker   6.156/6.156                   9.908/9.923                   +61%
4 workers 2 cores/worker   12.348/12.347/12.401/12.348   20.203/20.126/20.211/20.222   +63%
8 workers 1 core/worker    24.542/24.517/24.700/24.736   39.979/39.725/39.934/40.065   +62%
                           24.729/24.717/24.720/24.785   40.199/40.080/40.239/40.057
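The percentages appear to compare the mean dual-channel iteration time against the mean quad-channel iteration time for the same worker setup. The Python sketch below reproduces several of them from the rows above. It assumes the per-worker times are simply averaged; that is an assumption, though it matches the posted figures.

```python
from statistics import mean


def dual_channel_slowdown_pct(quad_ms: list[float], dual_ms: list[float]) -> float:
    """How much longer (in %) the mean dual-channel iteration takes than quad-channel.

    Averaging the per-worker times is an assumption about how the
    table's percentages were computed.
    """
    return (mean(dual_ms) / mean(quad_ms) - 1.0) * 100.0


# Iteration times in ms, copied from the table.
rows = {
    "40M, 2 workers": ([2.935, 2.937], [4.159, 4.122]),
    "40M, 4 workers": ([5.961, 5.968, 5.924, 5.923], [9.472, 9.469, 9.472, 9.473]),
    "60M, 1 worker": ([2.015], [2.487]),
    "80M, 1 worker": ([3.045], [4.622]),
    "40M, 1 worker": ([1.356], [1.366]),
}

for name, (quad, dual) in rows.items():
    print(f"{name}: +{dual_channel_slowdown_pct(quad, dual):.0f}%")
```

The 40M single-worker case works out to under 1% (1.366 vs 1.356 ms), and the table leaves that cell empty.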

Last fiddled with by ATH on 2015-10-26 at 03:09

2015-10-26, 02:24   #2
ATH

I also ran the Prime95 benchmark with these settings in prime.txt:
BenchAllComplex=1
BenchHyperthreads=0
BenchMultithreads=1

Prime95Benchmark.html


And I ran standard memory speed benchmarks:

MemoryBenchmark.html

2015-10-26, 02:48   #3
Prime95

Fascinating, for most FFT lengths it appears that you are better off running 1 multi-threaded worker.

2015-10-26, 03:01   #4
ATH

Looking at the (8 CPUs, 1 worker) benchmark, quad- and dual-channel throughput are tied up to the 2048K FFT. Above that, quad channel takes over as memory becomes the bottleneck.

But at (8 CPUs, 8 workers), quad channel beats dual channel at every FFT length.
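Here is one rough way to see why the 8-worker case could be memory-bound at every FFT size: each worker holds its own FFT data, while the 5960X's 20 MB L3 cache is shared by all eight cores. The Python sketch below estimates the data size. It assumes an FFT length like "2240K" means 2240 × 1024 elements stored as 8-byte doubles, and it ignores Prime95's auxiliary tables, so the sizes are lower bounds.

```python
MIB = 1024 * 1024
L3_BYTES = 20 * MIB  # i7-5960X: 20 MB L3, shared by all 8 cores


def fft_data_bytes(fft_len_k: int, bytes_per_element: int = 8) -> int:
    """Lower-bound size of one worker's FFT data array.

    Assumes "2240K" means 2240 * 1024 elements stored as 8-byte doubles,
    and ignores auxiliary tables and buffers.
    """
    return fft_len_k * 1024 * bytes_per_element


for fft_k in (2048, 2240, 3200, 4608):
    for workers in (1, 8):
        total = fft_data_bytes(fft_k) * workers
        verdict = "within" if total <= L3_BYTES else "above"
        print(f"{fft_k}K FFT x {workers} worker(s): {total / MIB:.1f} MiB ({verdict} L3 size)")
```

Under these assumptions, a single 2048K or 2240K array (16 and 17.5 MiB) is smaller than the L3, while 3200K (25 MiB) and 4608K (36 MiB) are not. Eight separate arrays exceed it at every size. That lines up with the observations above, but it is only a rough size comparison, not a model of Prime95's actual memory traffic.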
2015-10-26, 03:02   #5
retina

Can you please clarify the meaning of "thread" here? I don't understand how you can have 8 cores on a single thread. Do you mean "worker" instead of "thread"?
Quote:
Code:
1 thread 8 cores/thread

2015-10-26, 03:07   #6
ATH

Yes, sorry, I mean workers. I called them threads because they are called "WorkerThreads" in local.txt, and somehow I got stuck on "threads" instead of "workers".

2015-10-26, 03:11   #7
Madpoo

Quote:
Originally Posted by Prime95
Fascinating, for most FFT lengths it appears that you are better off running 1 multi-threaded worker.
That's been my thought as well regarding my own systems with triple or quad channel memory, for the most part.

I still see some percentage loss when adding more cores to a worker, but quad-channel DDR4 definitely keeps up a LOT better than quad-channel DDR3 (Xeon E5-26xx v1/v2) or triple-channel DDR3 (Xeon 56xx, for example).

I've probably crippled some of my triple-channel systems because I had to add 3 DIMMs per channel to some of them, which lowers the memory clock... oh well. So don't do that: if you want max performance, stick with 1 DIMM per channel at the fastest speed the system allows.

2015-10-26, 03:39   #8
ATH

I added total iterations per second to my manual LL test times, and 1 worker with 8 cores is the most efficient setup, just like in the benchmarks:

Code:
                           Quad Channel DDR4                         Dual Channel DDR4                         Dual-channel
                           3000 MHz 15-16-16-39                      3000 MHz 15-16-16-39                      slowdown
                           iteration times in ms         total it/s  iteration times in ms         total it/s

40M exponent(s)
FMA3 FFT 2240K
1 worker 8 cores/worker    1.356                         737         1.366                         732
2 workers 4 cores/worker   2.935/2.937                   681         4.159/4.122                   483         +41%
4 workers 2 cores/worker   5.961/5.968/5.924/5.923       673         9.472/9.469/9.472/9.473       422         +59%
8 workers 1 core/worker    11.611/11.590/11.590/11.594   689         19.372/19.353/19.374/19.379   413         +67%
                           11.598/11.599/11.622/11.623               19.371/19.370/19.368/19.372

60M exponent(s)
FMA3 FFT 3200K
1 worker 8 cores/worker    2.015                         496         2.487                         402         +23%
2 workers 4 cores/worker   4.296/4.297                   465         6.524/6.525                   306         +52%
4 workers 2 cores/worker   8.563/8.559/8.565/8.606       466         13.777/13.751/13.751/13.751   291         +60%
8 workers 1 core/worker    16.960/16.946/16.931/16.946   472         27.827/27.823/27.826/27.837   287         +64%
                           16.988/16.961/16.943/16.942               27.842/27.835/27.843/27.874

80M exponent(s)
FMA3 FFT 4608K
1 worker 8 cores/worker    3.045                         328         4.622                         216         +52%
2 workers 4 cores/worker   6.156/6.156                   325         9.908/9.923                   202         +61%
4 workers 2 cores/worker   12.348/12.347/12.401/12.348   324         20.203/20.126/20.211/20.222   198         +63%
8 workers 1 core/worker    24.542/24.517/24.700/24.736   324         39.979/39.725/39.934/40.065   200         +62%
                           24.729/24.717/24.720/24.785               40.199/40.080/40.239/40.057
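The "total it/s" figures are consistent with summing each worker's own rate, 1000 / (iteration time in ms), across all workers. The short Python sketch below shows that calculation, using values copied from the 40M rows. The summing rule is inferred from the numbers, not stated in the post.

```python
def total_iterations_per_sec(times_ms: list[float]) -> float:
    """Combined throughput of all workers: each worker contributes 1000 / t_ms."""
    return sum(1000.0 / t for t in times_ms)


one_worker = total_iterations_per_sec([1.356])
two_workers = total_iterations_per_sec([2.935, 2.937])
eight_workers = total_iterations_per_sec(
    [11.611, 11.590, 11.590, 11.594, 11.598, 11.599, 11.622, 11.623]
)
dual_two_workers = total_iterations_per_sec([4.159, 4.122])

print(f"quad: {one_worker:.0f} / {two_workers:.0f} / {eight_workers:.0f} it/s")
print(f"dual, 2 workers: {dual_two_workers:.0f} it/s")
```

With quad channel, one 8-core worker (737 it/s) edges out eight 1-core workers (689 it/s) at 40M. Prime95 and LaurV comment on this same effect in the thread.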
2015-10-26, 05:19   #9
LaurV

Very good job ATH!
Very informative, too.

It also seems curious to me that a single worker with 8 cores is, in total, more productive than 8 single-threaded workers. The single worker still needs time to put the pieces of the FFT together from the 8 cores. It is only explainable if Intel did wonders with internal process switching and their MESI protocols (or whatever they were called) when sharing cache between cores... (I still have in mind the document about memory that was shared here on the forum some time ago.)

Last fiddled with by LaurV on 2015-10-26 at 05:20

2015-10-26, 07:52   #10
Karl M Johnson

Welcome to the club.

2015-10-26, 10:16   #11
henryzz

Could you run some tests with only 4, and maybe 6, cores in use rather than 8?
That should give an indication of how much the cheaper Haswell CPUs would be limited with DDR4 rather than DDR3.

All times are UTC.


Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.