|
|
#1 |
|
Mar 2003
Melbourne
5·103 Posts |
I wanted to see the effect of memory speeds on a Core i7 920.
My Core i7 920 is a Linux box, so I downloaded Stream from http://www.cs.virginia.edu/stream/ref.html as an app to test memory throughput. I used mprime 25.11 for benching prime.

My RAM is rated 9-9-9 @ 1333MHz. The Core i7 920, on the other hand, is only rated for 1066MHz memory speeds. So I ran two tests, setting the RAM to 9-9-9 @ 1333MHz and 8-8-8 @ 1066MHz. To get the memory to run at 1333MHz, I changed BCLK from 133 to 166MHz and dropped the clock multiplier from 20x133MHz to 16x166MHz. So I kept the CPU clock roughly constant, but the uncore was overclocked.

Raw memory speeds (min/avg/max MByte/sec):
8-8-8 @ 1066MHz  8073/8139/8173
9-9-9 @ 1333MHz  10799/10801/10805

With the mprime bench speeds:
!9-9-9 @ 1333
!single thread
Best time for 2560K FFT length: 52.424 ms.
!Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 2560K FFT length: 18.051 ms.

!8-8-8 @ 1066
!single thread
Best time for 2560K FFT length: 52.861 ms.
!Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 2560K FFT length: 18.717 ms.

Not much improvement for what is essentially overclocking the uncore.

If I enable all the 'turbo' options in the BIOS I get these timings:
!8-8-8 @ 1066
!acceleration features enabled
!single thread
Best time for 2560K FFT length: 50.880 ms.
!Timing FFTs using 8 threads on 4 physical CPUs.
Best time for 2560K FFT length: 35.003 ms.

I don't know what to make of the 8 threads/4 CPUs timing.

If I look at iteration times in normal operation:
!9-9-9 @ 1333
[Worker #3 Sep 13 01:42] Iteration: 18950000 / 22629017 [83.74%]. Per iteration time: 0.025 sec.
[Worker #4 Sep 13 01:42] Iteration: 40530000 / 44317951 [91.45%]. Per iteration time: 0.054 sec.
!8-8-8 @ 1066
[Worker #3 Sep 13 03:08] Iteration: 18970000 / 22629017 [83.83%]. Per iteration time: 0.026 sec.
[Worker #4 Sep 13 03:09] Iteration: 40540000 / 44317951 [91.47%]. Per iteration time: 0.056 sec.

There looks to be 'some' benefit, but it is all within error margins. M22629017 normally hovers between 0.025 and 0.026 sec.
It looks like the Core i7, on one thread at least, isn't memory limited. But I guess we already knew that. -- Craig |
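To put the figures from the post above side by side, here is a small Python sketch (not part of the original post) comparing the relative gain in Stream bandwidth with the relative gain in the mprime benchmark. The helper names `pct_gain` and `pct_faster` are made up for illustration; the numbers are copied from the post.

```python
def pct_gain(old, new):
    """Percentage by which `new` exceeds `old`."""
    return (new - old) / old * 100.0


def pct_faster(old_ms, new_ms):
    """Percentage reduction in time going from old_ms to new_ms."""
    return (old_ms - new_ms) / old_ms * 100.0


# Average Stream throughput in MByte/sec, from the post.
bw_1066, bw_1333 = 8139, 10801

# mprime best times for the 2560K FFT in ms, from the post.
single_1066, single_1333 = 52.861, 52.424
multi_1066, multi_1333 = 18.717, 18.051

bandwidth_gain = pct_gain(bw_1066, bw_1333)
single_gain = pct_faster(single_1066, single_1333)
multi_gain = pct_faster(multi_1066, multi_1333)

print(f"Stream bandwidth: +{bandwidth_gain:.1f}%")
print(f"1 thread FFT:     {single_gain:.1f}% faster")
print(f"8 threads FFT:    {multi_gain:.1f}% faster")
```

Roughly a third more measured bandwidth buys less than 1% on a single thread and under 4% with 8 threads, which matches the conclusion that a single thread is not memory limited.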
|
|
|
|
|
#2 |
|
Jan 2008
France
2×5²×11 Posts |
I don't know how Stream works, but I get slightly higher numbers for memory reads. I started from this bench: http://home.comcast.net/~fbui/bandwidth.html. I tested various tricks, including explicit prefetch and non-temporal loads; none increased bandwidth. I guess the hardware prefetcher does a good job on regular streams.
On my i7 920, with my memory clocked at 1066 MHz and no overclocking at all, I get >10000 MB/s. With multithreading I reached about 17000 MB/s with 4 threads. |
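For context, the measured read figures can be compared against the theoretical peak of the memory bus. The sketch below is an illustration only: it assumes triple-channel DDR3-1066 (the post does not state the channel count), uses the fact that a 64-bit DDR3 channel moves 8 bytes per transfer, and treats the benchmark's MB as 10^6 bytes, which may not be how the tool counts. The function name `peak_bandwidth_mb_s` is hypothetical.

```python
BYTES_PER_TRANSFER = 8  # one 64-bit DDR3 channel moves 8 bytes per transfer


def peak_bandwidth_mb_s(transfers_mt_s, channels):
    """Theoretical peak in MB/s (10^6 bytes/s) for a data rate and channel count."""
    return transfers_mt_s * BYTES_PER_TRANSFER * channels


# Assumption: triple-channel DDR3-1066, not confirmed by the post.
peak = peak_bandwidth_mb_s(1066, channels=3)

# Measured figures quoted in the post (the single-thread one is a lower bound).
for label, measured in [("1 thread", 10000), ("4 threads", 17000)]:
    print(f"{label}: {measured} MB/s is {measured / peak:.0%} of the {peak} MB/s peak")
```

Under those assumptions a single thread reaches well under half of the theoretical peak, and four threads about two thirds of it.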
|
|
|
|
|
#3 |
|
Jul 2005
Des Moines, Iowa, USA
2·5·17 Posts |
I'm currently running 1140MHz @ 7-7-7-21, and I recently ordered some memory that is rated 1600MHz @ 6-7-6-18, so I'll post some comparisons when I get the new memory installed.
I run P-1 on worker #1 and LL tests on workers #2, #3, and #4. All three LL tests are currently working with a 2560K FFT length. Currently, with the CPU at 3800 MHz, my best time for the 2560K FFT is around 37.49 ms (benchmark), and normal timings (when all 4 tests are running simultaneously) are ~41.5-42.5ms for worker #2, ~45-48ms for worker #3, and ~43.5-45ms for worker #4. |
|
|
|
|
|
#4 |
|
Oct 2007
Manchester, UK
2²·3·113 Posts |
Why did you not simply increase the memory multiplier from 8 to 10?
Alternatively, once setting the BCLK to 166, why not knock the uncore multiplier back down so that it maintains the same speed? As long as the RAM is running at less than or equal to half the uncore speed, the system should be perfectly stable. |
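The stability rule of thumb in the post above (RAM at no more than half the uncore speed) can be written as a quick check. This is a sketch, not a tool from the thread: `ram_within_uncore_limit` is a made-up name, the memory multiplier is taken as the effective data-rate multiplier (the "8 to 10" in the post), and the uncore multipliers in the examples are hypothetical because the thread does not give the original poster's values.

```python
def ram_within_uncore_limit(bclk_mhz, mem_mult, uncore_mult):
    """True if the RAM data rate is at most half the uncore clock."""
    ram_mhz = bclk_mhz * mem_mult
    uncore_mhz = bclk_mhz * uncore_mult
    return ram_mhz <= uncore_mhz / 2


# Hypothetical uncore multipliers, chosen only to show both outcomes.
print(ram_within_uncore_limit(133, mem_mult=10, uncore_mult=20))  # RAM 1330, uncore 2660
print(ram_within_uncore_limit(166, mem_mult=8, uncore_mult=12))   # RAM 1328, uncore 1992
```

Because BCLK appears on both sides, the rule reduces to the uncore multiplier being at least twice the memory multiplier, whatever the base clock.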
|
|
|
|
|
#5 | |||
|
Mar 2003
Melbourne
5·103 Posts |
Quote:
Quote:
Quote:
-- Craig |
|||
|
|
|
|
|
#6 |
|
Oct 2007
Manchester, UK
10101001100₂ Posts |
That's very curious, what is your motherboard?
All X58 boards are enthusiast boards, basically meant for overclocking. Something as simple as changing the memory multiplier shouldn't trip them up. What was the QPI multi at? If it was at the lowest 36 (or 18), then try increasing it to 48 (or 24).

Edit: Someone had a theory a while back that the uncore multi should be 2x or 2x+1 the RAM multi. Additionally, the QPI multi should be at least 2x the uncore multi, or 18/8 times the uncore multi for best stability. The second part I'm not so sure about, but the first part does make a certain kind of sense.

Last fiddled with by lavalamp on 2009-09-14 at 23:21 |
|
|
|
|
|
#7 |
|
Jul 2005
Des Moines, Iowa, USA
10101010₂ Posts |
Ok this might be a little bit of information/data overload for some, so the TL;DR version is at the bottom of my post.
Since I got my new memory today, I have spent about the past 3 hours getting it to work and doing all these benchmarks. I decided to use exponents with an FFT length of 2560K, since that is where all of my current LL tests are. I loaded one exponent into a worktodo in a completely new folder with Prime95, set the priority to 9 to get the most stable times, and closed all other applications running on my computer. Oh, and I'm using 64-bit Windows 7 Ultimate RTM and Prime95 v25.11 64-bit.

My system before was configured as follows:
Bus speed 190MHz
CPU multiplier 20x, 3800MHz
QPI multiplier 18x, 3420MHz
UnCore multiplier 12x, 2280MHz
Memory multiplier x3 (x6), 570MHz (DDR3-1140MHz)
Memory timings 7-7-7-21-1N (CL-tRCD-tRP-tRAS-CR)
Triple channel, 3x 2048MB

Best times for the benchmark were:
1 thread: Best time for 2560K FFT length: 37.439 ms.
2 threads: Best time for 2560K FFT length: 19.590 ms.
3 threads: Best time for 2560K FFT length: 13.335 ms.
4 threads: Best time for 2560K FFT length: 10.353 ms.

For each of the following results, each line of iteration times is a separate thread/core, showing a representative 1000 iterations from the first 5000, with times output every 1000 iterations.

With one LL test running:
[Sep 15 15:25:12] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 37.575 ms.

With two LL tests running:
[Sep 15 15:36:12] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 38.963 ms.
[Sep 15 15:36:12] Iteration: 1000 / 41542693 [0.002407%]. Per iteration time: 38.953 ms.

With three LL tests running:
[Sep 15 15:42:44] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 39.464 ms.
[Sep 15 15:42:44] Iteration: 1000 / 41542693 [0.002407%]. Per iteration time: 39.634 ms.
[Sep 15 15:42:44] Iteration: 1000 / 41544631 [0.002407%]. Per iteration time: 39.482 ms.

With four LL tests running:
[Sep 15 15:18:59] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 40.572 ms.
[Sep 15 15:19:39] Iteration: 2000 / 41542693 [0.004814%]. Per iteration time: 40.357 ms.
[Sep 15 15:19:39] Iteration: 2000 / 41544631 [0.004814%]. Per iteration time: 40.191 ms.
[Sep 15 15:19:39] Iteration: 2000 / 41546737 [0.004813%]. Per iteration time: 40.385 ms.

Then I installed the new memory, and without too much tweaking yet I got my system running stable enough to complete the benchmark and at least 5000 iterations of each test:
Bus speed 200MHz
CPU multiplier 19x, 3800MHz
QPI multiplier 18x, 3600MHz
UnCore multiplier 16x, 3200MHz
Memory multiplier x4 (x8), 800MHz (DDR3-1600MHz)
Memory timings 8-8-8-22-2N (CL-tRCD-tRP-tRAS-CR)
Triple channel, 3x 2048MB

So what really changed: the CPU speed is exactly the same, memory increased from 570MHz to 800MHz, uncore increased from 2280MHz to 3200MHz, and QPI increased from 3420MHz to 3600MHz.

Benchmarks (best time, % faster):
1 thread: Best time for 2560K FFT length: 36.756 ms., ~1.8%
2 threads: Best time for 2560K FFT length: 18.996 ms., ~3%
3 threads: Best time for 2560K FFT length: 12.857 ms., ~3.5%
4 threads: Best time for 2560K FFT length: 9.814 ms., ~5.2%

Running 1 LL test:
[Sep 15 19:50:17] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 36.809 ms.
As we mostly expect, with only one LL test running, the difference in iteration times is only ~0.7ms, ~1.8%.

Running 2 LL tests:
[Sep 15 19:55:47] Iteration: 1000 / 41544119 [0.002407%]. Per iteration time: 37.498 ms.
[Sep 15 19:55:09] Starting primality test of M41542693 using FFT length 2560K
[Sep 15 19:55:47] Iteration: 1000 / 41542693 [0.002407%]. Per iteration time: 37.519 ms.
With 2 LL tests running, the difference in iteration times is ~1.4ms, ~3.6%.

Running 3 LL tests:
[Sep 15 20:05:58] Iteration: 3000 / 41544119 [0.007221%]. Per iteration time: 37.515 ms.
[Sep 15 20:05:58] Iteration: 3000 / 41542693 [0.007221%]. Per iteration time: 37.572 ms.
[Sep 15 20:05:20] Iteration: 2000 / 41544631 [0.004814%]. Per iteration time: 37.467 ms.
With 3 LL tests running, the difference in iteration times is ~2ms, ~5%.

Running 4 LL tests:
[Sep 15 19:34:50] Iteration: 2000 / 41544119 [0.004814%]. Per iteration time: 37.920 ms.
[Sep 15 19:35:28] Iteration: 3000 / 41542693 [0.007221%]. Per iteration time: 37.916 ms.
[Sep 15 19:35:29] Iteration: 3000 / 41544631 [0.007221%]. Per iteration time: 38.058 ms.
[Sep 15 19:34:50] Iteration: 2000 / 41546737 [0.004813%]. Per iteration time: 37.892 ms.
So I'm now getting roughly the same performance running 4 LLs that I was getting before with only one LL running. A ~2.5ms decrease in iteration times makes for a ~6% speed increase.

TL;DR version (conclusion and consolidation):
Keeping the Core i7 CPU at 3800MHz, increasing the memory from 570MHz to 800MHz, the QPI from 3420MHz to 3600MHz, and the UnCore from 2280MHz to 3200MHz, iteration times decreased as follows (rounded to 0.1ms):
1 thread: 37.5ms to 36.8ms, 0.7ms, 1.8%
2 threads: 38.9ms to 37.5ms, 1.4ms, 3.6%
3 threads: 39.5ms to 37.5ms, 2.0ms, 5.0%
4 threads: 40.5ms to 38.0ms, 2.5ms, 6.0%

So I conclude that a 40% increase in memory speed yields only a negligible 2-6% decrease in iteration times on the Core i7 architecture, for my system that is overclocked from stock 2.66GHz to 3.8GHz.

Last fiddled with by CADavis on 2009-09-16 at 02:04 Reason: lots of formatting b/c I accidentally hit submit instead of preview :-/ |
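The TL;DR figures in the post above can be recomputed with a few lines of Python. This is only a sketch for checking the arithmetic; the `reduction` helper is a made-up name, and the inputs are the rounded times quoted in the post.

```python
# (LL tests running, old ms, new ms), rounded figures from the TL;DR in the post.
results = [
    (1, 37.5, 36.8),
    (2, 38.9, 37.5),
    (3, 39.5, 37.5),
    (4, 40.5, 38.0),
]


def reduction(old_ms, new_ms):
    """Return (milliseconds saved, percentage saved)."""
    saved = old_ms - new_ms
    return saved, saved / old_ms * 100.0


# Memory clock went from 570MHz to 800MHz.
memory_gain = (800 - 570) / 570 * 100.0
print(f"Memory clock: +{memory_gain:.1f}%")

for n, old_ms, new_ms in results:
    saved, pct = reduction(old_ms, new_ms)
    print(f"{n} LL test(s): {saved:.1f} ms saved ({pct:.1f}%)")
```

The recomputed percentages land within a few tenths of the ones quoted, so the 2-6% range against a roughly 40% memory clock increase holds up.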
|
|
|
|
|
#8 | |
|
Oct 2007
Manchester, UK
2²×3×113 Posts |
Quote:
I suspect that raising the RAM speed didn't effect the 6% gain though, rather that increasing the Uncore frequency by almost a full GHz is responsible for most if not all of the gains. The Uncore contains the L3 cache after all, and with triple channel memory the system was already swimming in memory bandwidth. |
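One way to see why the RAM and uncore effects are hard to separate in that comparison is to look at how much each clock domain changed between the two posted configurations. The sketch below is illustrative only and uses the frequencies from the earlier post; the dictionary names are my own.

```python
# Clock domains in MHz, before and after, from the two configurations posted earlier.
before = {"CPU": 3800, "QPI": 3420, "Uncore": 2280, "Memory": 570}
after = {"CPU": 3800, "QPI": 3600, "Uncore": 3200, "Memory": 800}

# Relative change of each domain, in percent.
changes = {
    name: (after[name] - before[name]) / before[name] * 100.0
    for name in before
}

for name, pct in changes.items():
    print(f"{name:7s} {before[name]:5d} -> {after[name]:5d} MHz ({pct:+.1f}%)")
```

The uncore and memory clocks rise by the same proportion, so that benchmark on its own cannot tell which of the two produced the gain. Separating them would need a run where only one of the two multipliers changes.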
|
|
|
|
|
|
#9 |
|
Mar 2003
Melbourne
5×103 Posts |
|
|
|
|
|
|
#10 | ||
|
Mar 2003
Melbourne
1000000011₂ Posts |
Quote:
Quote:
-- Craig |
||
|
|
|