mersenneforum.org  

Old 2009-09-13, 03:03   #1
nucleon
 
 
Mar 2003
Melbourne

5×103 Posts
Core i7 memory speeds

I wanted to see the effect of memory speed on a Core i7 920.

My Core i7 920 is a Linux box, so I downloaded Stream from http://www.cs.virginia.edu/stream/ref.html to test memory throughput. I used mprime 25.11 for the prime benchmarks.

My RAM is rated 9-9-9 @ 1333MHz. The Core i7 920, on the other hand, is only rated for 1066MHz memory. So I ran two sets of tests, with the RAM set to 9-9-9 @ 1333MHz and to 8-8-8 @ 1066MHz. To get the memory to run at 1333MHz, I changed BCLK from 133MHz to 166MHz and dropped the CPU clock multiplier from 20 (20x133MHz) to 16 (16x166MHz). So I kept the CPU clock roughly constant, but the uncore was overclocked.
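
For anyone who wants to check the clock arithmetic, here is a minimal Python sketch. The BCLK and CPU multiplier values are the ones quoted above; the memory multiplier of 8 is an assumption (it only comes up later in the thread), and the function name is just for illustration.

```python
# Clock arithmetic for the two configurations described above.
# BCLK and CPU multipliers come from the post; the memory multiplier
# of 8 is an assumption and is not stated in this post.

def derived_clocks(bclk_mhz, cpu_multi, mem_multi=8):
    """Return (cpu_mhz, mem_mhz) for a base clock and its multipliers."""
    return bclk_mhz * cpu_multi, bclk_mhz * mem_multi

stock = derived_clocks(133, 20)   # stock BCLK, stock CPU multiplier
raised = derived_clocks(166, 16)  # raised BCLK, lowered CPU multiplier

cpu_change = (raised[0] - stock[0]) / stock[0] * 100

print("stock  (cpu MHz, mem MHz):", stock)
print("raised (cpu MHz, mem MHz):", raised)
print(f"CPU clock change: {cpu_change:.2f}%")
```

Under these assumptions the CPU clock moves by well under 1%, while the memory clock goes from roughly 1066MHz to roughly 1333MHz.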

Raw memory speeds: (min/avg/max MByte/sec)
8-8-8@1066MHz 8073/8139/8173
9-9-9@1333MHz 10799/10801/10805

With the mprime bench speeds:
!9-9-9 @ 1333
!single thread
Best time for 2560K FFT length: 52.424 ms.
!Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 2560K FFT length: 18.051 ms.

!8-8-8 @ 1066
!single thread
Best time for 2560K FFT length: 52.861 ms.
!Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 2560K FFT length: 18.717 ms.

Not much improvement for what is essentially overclocking the uncore.
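
To put numbers on that comparison, here is a rough Python sketch using only the figures posted above (the average Stream throughput and the best 2560K FFT times). The helper names are made up for this example.

```python
# Compare the memory throughput gain with the mprime benchmark gain.
# All input numbers are copied from the results above.

def pct_gain(before, after):
    """Percentage increase from before to after (higher is better)."""
    return (after - before) / before * 100

def pct_faster(before_ms, after_ms):
    """Percentage reduction in time (lower time is better)."""
    return (before_ms - after_ms) / before_ms * 100

stream_gain = pct_gain(8139, 10801)        # average MB/s, 1066 vs 1333
single_gain = pct_faster(52.861, 52.424)   # 1 thread, 2560K FFT
multi_gain = pct_faster(18.717, 18.051)    # 8 threads, 2560K FFT

print(f"Stream throughput: +{stream_gain:.1f}%")
print(f"mprime, 1 thread:  {single_gain:.2f}% faster")
print(f"mprime, 8 threads: {multi_gain:.2f}% faster")
```

The gap between the throughput gain and the FFT gain is what suggests the benchmark is not limited by memory bandwidth here.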

If I enable all the 'turbo' options in the BIOS, I get these timings:
!8-8-8 @ 1066
!acceleration features enabled
!single thread
Best time for 2560K FFT length: 50.880 ms.
!Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 2560K FFT length: 35.003 ms.

I don't know what to make of the 8-thread/4-CPU timing.

Iteration times in normal operation:
!9-9-9 @ 1333
[Worker #3 Sep 13 01:42] Iteration: 18950000 / 22629017 [83.74%]. Per iteration time: 0.025 sec.
[Worker #4 Sep 13 01:42] Iteration: 40530000 / 44317951 [91.45%]. Per iteration time: 0.054 sec.

!8-8-8 @ 1066
[Worker #3 Sep 13 03:08] Iteration: 18970000 / 22629017 [83.83%]. Per iteration time: 0.026 sec.
[Worker #4 Sep 13 03:09] Iteration: 40540000 / 44317951 [91.47%]. Per iteration time: 0.056 sec.

There looks to be 'some' benefit, but it is all within error margins. M22629017 normally hovers between 0.025 and 0.026 sec.

It looks like the Core i7, on one thread at least, isn't memory limited. But I guess we already knew that.

-- Craig
Old 2009-09-13, 21:49   #2
ldesnogu
 
 
Jan 2008
France

1000100110₂ Posts

I don't know how Stream works, but I get slightly higher numbers for memory reads. I started from this bench: http://home.comcast.net/~fbui/bandwidth.html. I tested various tricks, including explicit prefetch and non-temporal loads; none increased bandwidth. I guess the hardware prefetcher does a good job on regular streams.

On my i7 920, with my memory clocked at 1066 MHz and no overclocking at all I get >10000 MB/s. With multithreading I reached about 17000 MB/s with 4 threads.
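
For context, these figures can be set against a theoretical peak. This is only a sketch under stated assumptions: three populated DDR3 channels (not stated for this machine), each 64 bits wide, at 1066 mega-transfers per second, and MB taken as 10^6 bytes.

```python
# Theoretical peak DDR3 bandwidth versus the measured read figures.
# Assumptions (not from the post): 3 channels in use, 8 bytes per
# transfer per channel, 1066 MT/s, MB = 10**6 bytes.

BYTES_PER_TRANSFER = 8
CHANNELS = 3

def peak_mb_per_s(mega_transfers_per_s, channels=CHANNELS):
    """Peak bandwidth in MB/s for the given transfer rate."""
    return mega_transfers_per_s * BYTES_PER_TRANSFER * channels

peak = peak_mb_per_s(1066)

# Measured values quoted above; the single-thread one is a lower bound.
for label, measured in [("1 thread", 10000), ("4 threads", 17000)]:
    print(f"{label}: {measured} MB/s is {measured / peak:.0%} of {peak} MB/s")
```

If the machine actually runs fewer channels, the peak drops proportionally and the measured fractions rise.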
Old 2009-09-14, 04:37   #3
CADavis
 
 
Jul 2005
Des Moines, Iowa, USA

2·5·17 Posts

I'm currently running 1140MHz @ 7-7-7-21, and I recently ordered some memory that is rated 1600MHz @ 6-7-6-18, so I'll post some comparisons when I get the new memory installed.

I run P-1 on worker #1 and LL tests on workers #2, #3, and #4. All three LL tests are currently working with 2560K FFT length.

Currently, with the CPU at 3800 MHz, my best time for the 2560K FFT is around 37.49 ms (benchmark), and normal timings (when all 4 tests are running simultaneously) are ~41.5-42.5ms for worker #2, ~45-48ms for worker #3, and ~43.5-45ms for worker #4.
Old 2009-09-14, 22:48   #4
lavalamp
 
 
Oct 2007
Manchester, UK

1355₁₀ Posts

Why did you not simply increase the memory multiplier from 8 to 10?

Alternatively, once setting the BCLK to 166, why not knock the uncore multiplier back down so that it maintains the same speed?

As long as the RAM is running at less than or equal to half the uncore speed, the system should be perfectly stable.
Old 2009-09-14, 23:13   #5
nucleon
 
 
Mar 2003
Melbourne

203₁₆ Posts

Quote:
Originally Posted by lavalamp
Why did you not simply increase the memory multiplier from 8 to 10?
Didn't work for me. The machine didn't complete boot.

Quote:
Originally Posted by lavalamp
Alternatively, once setting the BCLK to 166, why not knock the uncore multiplier back down so that it maintains the same speed?
Didn't work for me. The machine didn't even complete the BIOS checks.

Quote:
Originally Posted by lavalamp
As long as the RAM is running at less than or equal to half the uncore speed, the system should be perfectly stable.
There's theory and then there's practice. I'm leaving things at the default. A couple of percentage points of improvement isn't worth the risk of instability.

-- Craig
Old 2009-09-14, 23:18   #6
lavalamp
 
 
Oct 2007
Manchester, UK

5·271 Posts

That's very curious. What is your motherboard?

All X58 boards are enthusiast boards, basically meant for overclocking. Something as simple as changing the memory multiplier shouldn't trip them up. What was the QPI multi at? If it was at the lowest 36 (or 18), then try increasing it to 48 (or 24).

Edit: Someone had a theory a while back that the uncore multi should be 2x or 2x+1 the RAM multi. Additionally, the QPI multi should be at least 2x the uncore multi, or 18/8 times the uncore multi for best stability.

The second part I'm not so sure about, but the first part does make a certain kind of sense.

Last fiddled with by lavalamp on 2009-09-14 at 23:21
Old 2009-09-16, 01:51   #7
CADavis
 
 
Jul 2005
Des Moines, Iowa, USA

2·5·17 Posts

OK, this might be a bit of information/data overload for some, so the TL;DR version is at the bottom of my post.

I got my new memory today and have spent about the past 3 hours getting it to work and running all these benchmarks. I decided to use exponents with an FFT length of 2560K, since that is where all of my current LL tests are. I loaded one exponent into a worktodo file in a completely new folder with Prime95, set the priority to 9 to get the most stable times, and closed all other applications running on my computer. I'm using 64-bit Windows 7 Ultimate RTM and 64-bit Prime95 v25.11.

My system before was configured as follows:
Bus speed 190MHz
CPU multiplier 20x, 3800MHz
QPI multiplier 18x, 3420MHz
UnCore multiplier 12x, 2280MHz
Memory multiplier x3 (x6), 570MHz (DDR3-1140MHz)
Memory timings 7-7-7-21-1N (CL-tRCD-tRP-tRAS-CR)
Triple channel, 3x 2048MB

Best times for the benchmark were:
1 thread Best time for 2560K FFT length: 37.439 ms.
2 threads Best time for 2560K FFT length: 19.590 ms.
3 threads Best time for 2560K FFT length: 13.335 ms.
4 threads Best time for 2560K FFT length: 10.353 ms.

In each of the following results, every line of iteration times is a separate thread/core. Times were output every 1000 iterations, and I show one representative 1000-iteration line per test from the first 5000 iterations.

With one LL test running:
[Sep 15 15:25:12] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 37.575 ms.

With two LL tests running:
[Sep 15 15:36:12] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 38.963 ms.
[Sep 15 15:36:12] Iteration: 1000 / 41542693 [0.002407%]. Per iteration time: 38.953 ms.

With three LL tests running:
[Sep 15 15:42:44] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 39.464 ms.
[Sep 15 15:42:44] Iteration: 1000 / 41542693 [0.002407%]. Per iteration time: 39.634 ms.
[Sep 15 15:42:44] Iteration: 1000 / 41544631 [0.002407%]. Per iteration time: 39.482 ms.

With four LL tests running:
[Sep 15 15:18:59] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 40.572 ms.
[Sep 15 15:19:39] Iteration: 2000 / 41542693 [0.004814%]. Per iteration time: 40.357 ms.
[Sep 15 15:19:39] Iteration: 2000 / 41544631 [0.004814%]. Per iteration time: 40.191 ms.
[Sep 15 15:19:39] Iteration: 2000 / 41546737 [0.004813%]. Per iteration time: 40.385 ms.

Now I installed the new memory and without too much tweaking yet I got my system running stable enough to complete the benchmark and at least 5000 iterations of each test:
Bus speed 200MHz
CPU multiplier 19x, 3800MHz
QPI multiplier 18x, 3600MHz
UnCore multiplier 16x, 3200MHz
Memory multiplier x4 (x8), 800MHz (DDR3-1600MHz)
Memory timings 8-8-8-22-2N (CL-tRCD-tRP-tRAS-CR)
Triple channel, 3x 2048MB

So what really changed: the CPU speed stayed exactly the same, memory increased from 570MHz to 800MHz, uncore increased from 2280MHz to 3200MHz, and QPI increased from 3420MHz to 3600MHz.
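
The derived clocks in the two configurations can be reproduced with a short Python sketch. The bus speeds and multipliers are taken from the two listings above (using the effective x6 and x8 memory multipliers); the dictionary layout and function name are only illustrative. It also checks the rule of thumb mentioned earlier in the thread, that the effective RAM speed should be at most half the uncore speed.

```python
# Derived clocks (MHz) for the old and new configurations, from
# bus speed x multiplier. Values are copied from the listings above;
# "mem" uses the effective DDR multiplier (x6 and x8).

configs = {
    "old": {"bus": 190, "cpu": 20, "qpi": 18, "uncore": 12, "mem": 6},
    "new": {"bus": 200, "cpu": 19, "qpi": 18, "uncore": 16, "mem": 8},
}

def clocks(cfg):
    """Multiply every multiplier by the bus speed."""
    return {name: cfg["bus"] * multi
            for name, multi in cfg.items() if name != "bus"}

old = clocks(configs["old"])
new = clocks(configs["new"])

for name in old:
    change = (new[name] - old[name]) / old[name] * 100
    print(f"{name:6s} {old[name]:5d} -> {new[name]:5d} MHz ({change:+.1f}%)")

# Rule of thumb from earlier in the thread: RAM <= half the uncore speed.
for label, c in (("old", old), ("new", new)):
    print(label, "uncore >= 2 x RAM:", c["uncore"] >= 2 * c["mem"])
```

Note that the uncore rises by the same proportion as the memory in this setup, so the two effects cannot be separated from these runs alone.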

Benchmarks (best time, %faster):
1 thread Best time for 2560K FFT length: 36.756 ms., ~1.8%
2 threads Best time for 2560K FFT length: 18.996 ms., ~3%
3 threads Best time for 2560K FFT length: 12.857 ms., ~3.5%
4 threads Best time for 2560K FFT length: 09.814 ms., ~5.2%

Running 1 LL test:
[Sep 15 19:50:17] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 36.809 ms.

As we mostly expect, with only one LL test running, the difference in iteration times is only ~0.7ms, ~1.8%.

Running 2 LL tests:
[Sep 15 19:55:47] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 37.498 ms.
[Sep 15 19:55:09] Starting primality test of M41542693 using FFT length 2560K
[Sep 15 19:55:47] Iteration: 1000 / 41542693 [0.002407%]. Per iteration time: 37.519 ms.

With 2 LL tests running, the difference in iteration times is ~1.4ms, ~3.6%.

Running 3 LL tests:
[Sep 15 20:05:58] Iteration: 3000 / 41544119 [0.007221%]. Per iteration time: 37.515 ms.
[Sep 15 20:05:58] Iteration: 3000 / 41542693 [0.007221%]. Per iteration time: 37.572 ms.
[Sep 15 20:05:20] Iteration: 2000 / 41544631 [0.004814%]. Per iteration time: 37.467 ms

With 3 LL tests running, the difference in iteration times is ~2ms, ~5%.

Running 4 LL tests:
[Sep 15 19:34:50] Iteration: 2000 / 41544119 [0.004814%]. Per iteration time: 37.920 ms.
[Sep 15 19:35:28] Iteration: 3000 / 41542693 [0.007221%]. Per iteration time: 37.916 ms
[Sep 15 19:35:29] Iteration: 3000 / 41544631 [0.007221%]. Per iteration time: 38.058 ms.
[Sep 15 19:34:50] Iteration: 2000 / 41546737 [0.004813%]. Per iteration time: 37.892 ms.

So I'm getting the same performance running 4 LLs that I was getting before with only one LL running. The ~2.5ms decrease in iteration times makes for a ~6% speed increase.

TL;DR version (Conclusion and consolidation):

Keeping the Core i7 CPU at 3800MHz, increasing the memory from 570MHz to 800MHz, increasing the QPI from 3420MHz to 3600MHz, and the UnCore from 2280MHz to 3200MHz, iteration times decreased as follows (rounded to 0.1ms):

1 thread: 37.5ms to 36.8ms, 0.7ms 1.8%
2 threads: 38.9ms to 37.5ms, 1.4ms 3.6%
3 threads: 39.5ms to 37.5ms, 2.0ms 5.0%
4 threads: 40.5ms to 38.0ms, 2.5ms 6.0%

So I conclude that a 40% increase in memory speed has a negligible 2-6% decrease in iteration times on the Core i7 architecture for my system that is overclocked from stock 2.66GHz to 3.8GHz.
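
For anyone who wants to recompute the summary from the unrounded logs, here is a minimal Python sketch. The per-iteration times are copied from the log lines above; averaging across workers is my own choice for this example, so the results may differ slightly from the rounded table.

```python
from statistics import mean

# Per-iteration times (ms) from the log lines above, keyed by the
# number of LL tests running at once.
before = {
    1: [37.575],
    2: [38.963, 38.953],
    3: [39.464, 39.634, 39.482],
    4: [40.572, 40.357, 40.191, 40.385],
}
after = {
    1: [36.809],
    2: [37.498, 37.519],
    3: [37.515, 37.572, 37.467],
    4: [37.920, 37.916, 38.058, 37.892],
}

def summarize(n):
    """Return (mean before, mean after, ms saved, percent saved)."""
    b, a = mean(before[n]), mean(after[n])
    return b, a, b - a, (b - a) / b * 100

for n in sorted(before):
    b, a, saved, pct = summarize(n)
    print(f"{n} LL test(s): {b:.3f} ms -> {a:.3f} ms, "
          f"saved {saved:.3f} ms ({pct:.1f}%)")
```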
Attached Files
File Type: txt benchmarks.txt (15.3 KB, 117 views)

Last fiddled with by CADavis on 2009-09-16 at 02:04 Reason: lots of formatting b/c I accidentally hit submit instead of preview :-/
Old 2009-09-16, 02:18   #8
lavalamp
 
 
Oct 2007
Manchester, UK

5×271 Posts

Quote:
Originally Posted by CADavis
So I conclude that a 40% increase in memory speed has a negligible 2-6% decrease in iteration times on the Core i7 architecture for my system that is overclocked from stock 2.66GHz to 3.8GHz.
I wouldn't say 6% is negligible; it's like you suddenly overclocked your CPU further to 4.028 GHz.
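
As a rough illustration of that equivalence, here is a small Python sketch. It assumes performance scales linearly with core clock, which is a simplification, and shows two ways of reading "6%": as a 6% higher clock, or as a 6% shorter iteration time converted to throughput.

```python
# Equivalent clock speed for a 6% gain, assuming (as a simplification)
# that performance scales linearly with core clock.

base_ghz = 3.8
gain = 0.06

# Reading 1: treat the 6% as a 6% higher clock.
simple = base_ghz * (1 + gain)

# Reading 2: treat the 6% as a 6% shorter iteration time, so
# throughput rises by a factor of 1 / (1 - 0.06).
by_throughput = base_ghz / (1 - gain)

print(f"6% higher clock:          {simple:.3f} GHz")
print(f"6% shorter iteration time: {by_throughput:.3f} GHz")
```

Either way the result is a little over 4 GHz, so the two readings differ only slightly.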

I suspect that raising the RAM speed didn't effect the 6% gain though, rather that increasing the Uncore frequency by almost a full GHz is responsible for most if not all of the gains. The Uncore contains the L3 cache after all, and with triple channel memory the system was already swimming in memory bandwidth.
Old 2009-09-17, 11:28   #9
nucleon
 
nucleon's Avatar
 
Mar 2003
Melbourne

5×103 Posts

motherboard: GA-EX58-UD5

http://www.gigabyte.com.au/Products/...ProductID=2958

-- Craig
Old 2009-09-17, 11:47   #10
nucleon
 
 
Mar 2003
Melbourne

5·103 Posts

Quote:
Originally Posted by lavalamp
All X58 boards are enthusiast boards, basically meant for overclocking. Something as simple as changing the memory multiplier shouldn't trip them up.
The memory controller for the Core i7 is on the CPU. It's the Core i7 920 CPU that doesn't support the higher multiplier. The Core i7 8xx CPUs do support the higher multiplier.

Quote:
Originally Posted by lavalamp
What was the QPI multi at? If it was at the lowest 36 (or 18), then try increasing it to 48 (or 24).
QPI talks to PCI-E and other devices, so you don't want to muck around with it too much. The BIOS couldn't clock it down any lower.


-- Craig

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.