George, I run mostly my own code, so am not intimately familiar with the Prime95 self-test protocols:
o The above timing sets start with the expected "4 cores" blurb re. the CPU, but are the ensuing timings for running on 1 core or all 4?
o How does the very slight speedup for the faster DDR3 memory compare with what you expected?
[QUOTE=ewmayer;343177]
o The above timing sets start with the expected "4 cores" blurb re. the CPU, but are the ensuing timings for running on 1 core or all 4?
o How does the very slight speedup for the faster DDR3 memory compare with what you expected?[/QUOTE]
The first batch of numbers are running on 1 core. I didn't really have any expectations for the single-core numbers with the higher-bandwidth DDR3. As I expected, the huge gains from faster DDR3 come when running a worker on all 4 cores.

I really need to rerun these benchmarks after turning off turbo boost. The CPU frequency could have been different at times in the two benchmark runs.
[QUOTE=Prime95;343188]The first batch of numbers are running on 1 core.
I didn't really have any expectations for the single-core numbers with the higher-bandwidth DDR3. As I expected, the huge gains from faster DDR3 come when running a worker on all 4 cores.[/QUOTE]
That's what I guessed, since the multiworker tests amount to "1 single-thread process per core" tests of overall system memory bandwidth.
I actually had the same question as Ernst.
If you run the benchmark and observe the task manager at the same time, you may see that when the test reports "running on 1 cpu", all N cores are 100% busy; then, during "Timing FFT using 2 threads", (N-1) cores are 100% busy (!); then, during "Timing FFT using 3 threads", (N-2) cores are 100% busy (!); and so on. This is 27.9 on a 6-core Xeon.
[QUOTE=Batalov;343193]you may see that when the test reports "running on 1 cpu", all N cores are 100% busy;[/QUOTE]
Now that you mention it, I remember adding some code to the benchmarking routine that starts dummy tasks on the other cores to simulate a fully loaded system. This should limit the degree to which turbo / SpeedStep skews the benchmarks. Your description makes it sound like the code isn't working perfectly.
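A minimal sketch of the dummy-load idea in Python (my own illustration; Prime95's actual implementation is in C/assembly and differs): spawn busy-spin processes on the other cores while timing the workload, so that turbo / SpeedStep sees a fully loaded system and the timed core cannot opportunistically boost. The spinners do pure ALU work and touch almost no memory, so memory bandwidth stays available to the timed worker, matching the discussion below.

```python
import os
import subprocess
import sys
import time

# Pure-ALU busy loop, run in child interpreters; bounded so x never overflows.
SPIN = "x = 1.0\nwhile True: x = x * 1.0000001 % 10.0"

def timed_under_full_load(work, n_cores=None):
    """Run work() while n_cores-1 dummy spinner processes occupy the other cores."""
    n = n_cores or os.cpu_count()
    spinners = [subprocess.Popen([sys.executable, "-c", SPIN])
                for _ in range(n - 1)]
    try:
        t0 = time.perf_counter()
        result = work()
        elapsed = time.perf_counter() - t0
    finally:
        for p in spinners:
            p.terminate()
            p.wait()
    return result, elapsed

result, secs = timed_under_full_load(lambda: sum(i * i for i in range(10**5)))
print(f"sum={result}, took {secs:.3f}s with all cores loaded")
```

Using subprocesses rather than threads matters here: in pure Python, threads would serialize on the interpreter lock and fail to load the other cores.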
Ah. Well, if the dummy tasks don't use the bus, then the results should be "simulatedly" correct. (The activity would make all cores wake up from being downclocked, e.g. to 1600 MHz, while at the same time the full memory bandwidth would remain available to the cores being tested for computational throughput.)
Now it makes sense; it was just hard to conjecture this from what the task manager was showing.
We seem to be stable at 4.2GHz, 1.2V, DDR3-2400. Temps in the mid-70s running large FFTs and 80 running small FFTs. When I tried running 4.4GHz at 1.25V, as some overclockers have done, the large torture test seemed to be OK with temps in the low 80s. I then tried the small FFT torture test. Instantly, temps spiked to 100 and throttling began.
Note that version 27 definitely has problems determining the CPU speed on Haswell. All cores at 4.2 GHz, DDR3-2400:
[code]
Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
CPU speed: 3909.47 MHz, 4 cores
Prime95 64-bit version 27.9, RdtscTiming=1
Best time for 768K FFT length: 3.082 ms., avg: 3.103 ms.
Best time for 896K FFT length: 3.844 ms., avg: 3.885 ms.
Best time for 1024K FFT length: 4.346 ms., avg: 4.360 ms.
Best time for 1280K FFT length: 5.413 ms., avg: 5.472 ms.
Best time for 1536K FFT length: 6.613 ms., avg: 6.684 ms.
Best time for 1792K FFT length: 7.933 ms., avg: 7.962 ms.
Best time for 2048K FFT length: 9.223 ms., avg: 9.409 ms.
Best time for 2560K FFT length: 11.368 ms., avg: 11.388 ms.
Best time for 3072K FFT length: 13.849 ms., avg: 13.882 ms.
Best time for 3584K FFT length: 16.915 ms., avg: 16.935 ms.
Best time for 4096K FFT length: 18.718 ms., avg: 18.746 ms.
Best time for 5120K FFT length: 24.605 ms., avg: 24.632 ms.
Best time for 6144K FFT length: 29.271 ms., avg: 29.313 ms.
Best time for 7168K FFT length: 35.181 ms., avg: 36.428 ms.
Best time for 8192K FFT length: 41.031 ms., avg: 41.083 ms.

Times for LL test on 77000003 (4M FFT):
1 worker: 18.8 ms.
2 workers: 19.2, 19.2 ms.
3 workers: 20.2, 20.1, 20.1 ms.
4 workers: 22.5 ms. each
[/code]
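The LL worker timings at the end imply aggregate throughput as follows (my own back-of-envelope arithmetic, not Prime95 output; I take the slowest worker in each run): each worker slows down as workers are added, but total throughput still rises.

```python
# Aggregate LL-iteration throughput from the per-worker timings quoted above.
# Key: number of workers; value: slowest worker's ms/iteration in that run.
per_iter_ms = {1: 18.8, 2: 19.2, 3: 20.2, 4: 22.5}

# All n workers iterate in parallel, so aggregate = n / (seconds per iter).
throughput = {n: n / (ms / 1000.0) for n, ms in per_iter_ms.items()}
for n, total in throughput.items():
    print(f"{n} worker(s): {total:6.1f} LL iters/sec aggregate")
```

So going from 1 worker to 4 roughly triples total throughput here, despite each worker running ~20% slower.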
In preparation for the upgrade to Haswell, some benchmark Mlucas timings on my soon-to-be-legacy CPU:
[b]Test #1: SSE2 mode on 3.3 GHz Sandy Bridge quad, DDR3 SDRAM 1333 (PC3 10600):[/b]
[code]
           Mersenne-mod:            Fermat-mod:
FFT len    sec/iter                 sec/iter
 (Kdbl)  1-thr  2-thr  4-thr    1-thr  2-thr  4-thr
  1024   0.019  0.010  0.005    .0179  .0093  .0050
  1152   0.026  0.013  0.007
  1280   0.025  0.013  0.008
  1408   0.038  0.019  0.010
  1536   0.030  0.015  0.008
  1664   0.045  0.023  0.013
  1792   0.037  0.019  0.010    .0343  .0175  .0096
  1920   0.052  0.027  0.014    .0498  .0250  .0135
  2048   0.041  0.021  0.012    .0385  .0197  .0107
  2304   0.052  0.027  0.015
  2560   0.054  0.028  0.018
  2816   0.078  0.040  0.022
  3072   0.064  0.034  0.019
  3328   0.094  0.048  0.027
  3584   0.078  0.040  0.024    .0730  .0371  .0213
  3840   0.109  0.055  0.030    .1027  .0520  .0281
  4096   0.083  0.044  0.027    .0813  .0416  .0236
  4608   0.111  0.057  0.033
  5120   0.108  0.058  0.038
  5632   0.165  0.084  0.047
  6144   0.128  0.069  0.050
  6656   0.191  0.099  0.057
  7168   0.157  0.083  0.059    .1473  .0771  .0571
  7680   0.228  0.116  0.065    .2153  .1094  .0605
  8192   0.176  0.094  0.068    .1689  .0867  .0641
--------------------------------------------------
Avg || Scaling:  1.934x 3.347x    ----  1.955x 3.374x
[/code]
Note that Fermat-mod convolution mode is only supported for runlengths which are powers of 2 or just slightly less than such: runlengths shorter than the 7*2^n ones are infeasible due to excess roundoff error, and ones of the form 9*2^n are wasteful for the above size range because of "too little roundoff error". (The latter variety of lengths will become useful out around F34-F35 or so, when 16 bits per double will be too much, based on ROE levels. But that's still a few years down the road.) "Avg parallel scaling" is computed as the arithmetic average of the 1-thread column runtimes divided by their 2- and 4-threaded counterparts. Note how many of the FFT lengths involving the larger odd primes 11 and 13, or odd composites such as 15, scale better than their smoother brethren as the thread count increases. For example 7680K = 15*2^19 is crap running 1-threaded compared to its neighbors 7168K and 8192K, but interpolates them nicely at 4 threads.
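As a small illustration of how the "Avg || Scaling" row is computed (my own sketch, not Mlucas code, using only four sample Mersenne-mod rows from the table, so the averages come out a bit below the full-table 1.934x/3.347x):

```python
# Average parallel scaling = arithmetic mean over FFT lengths of
# (1-thread sec/iter) / (N-thread sec/iter). Sample Mersenne-mod rows
# copied from the table above: (1-thr, 2-thr, 4-thr).
rows = [
    (0.019, 0.010, 0.005),  # 1024K
    (0.041, 0.021, 0.012),  # 2048K
    (0.083, 0.044, 0.027),  # 4096K
    (0.176, 0.094, 0.068),  # 8192K
]
avg2 = sum(t1 / t2 for t1, t2, _ in rows) / len(rows)
avg4 = sum(t1 / t4 for t1, _, t4 in rows) / len(rows)
print(f"2-thread scaling: {avg2:.3f}x, 4-thread scaling: {avg4:.3f}x")
```

The power-of-2 lengths sampled here scale slightly worse than the full-table average, consistent with the observation below the table that the less-smooth lengths scale best.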
AVX timings later ... I need to get some dinner and chores done.
[QUOTE=sdbardwick;343078]George, (please forgive silly question) did the doubling of cache bandwidth from IB to Haswell offer any improvement, or is the latency (unchanged) more important?[/QUOTE]
Haswell is faster than Sandy Bridge (I don't have Ivy). My guess is the difference is due to the doubling of cache bandwidths, but it may be due to other architectural improvements. Here is the evidence I'm using to come to that conclusion: remember the 22-clock macro I rewrote to use FMA? The same macro runs in 25 clocks on SB (both running out of the L1 cache). The macro runs in 30 clocks on Haswell when the data is in the L2 cache (an 8-clock penalty), versus 40 clocks on SB when the data is in the L2 cache (a 15-clock penalty).
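Tabulating the quoted clock counts (my arithmetic, just restating the numbers above): the L2-residency penalty nearly halves from Sandy Bridge to Haswell.

```python
# (L1-resident clocks, L2-resident clocks) for the FMA'd macro, as quoted above.
clocks = {"Haswell": (22, 30), "Sandy Bridge": (25, 40)}

# L2 penalty = extra clocks paid when the working set spills out of L1.
penalty = {cpu: l2 - l1 for cpu, (l1, l2) in clocks.items()}
for cpu, (l1, l2) in clocks.items():
    print(f"{cpu}: L1 {l1} clk, L2 {l2} clk, L2 penalty {penalty[cpu]} clk")
```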
Thanks, George!
Here are the timings on my SB quad for the AVX-based Mlucas code. Note that AVX support for Mersenne-mod convolution is still very recent - when I started my SSE2 -> AVX port at the beginning of the year, my first priority was to get things working/optimized for Fermat-mod arithmetic. As you can see, there is a large disparity in improvement-versus-SSE2 between the two convolution modes.
[b]Test #2: AVX mode [/b][but with SSE2-based carry macros in the Mersenne-mod case][b] on 3.3 GHz Sandy Bridge quad, DDR3 SDRAM 1333 (PC3 10600):[/b]
[code]
           Mersenne-mod:            Fermat-mod:
FFT len    sec/iter                 sec/iter
 (Kdbl)  1-thr  2-thr  4-thr    1-thr  2-thr  4-thr
  1024   0.018  0.009  0.005    .0133  .0070  .0039
  1152   0.025  0.012  0.007
  1280   0.024  0.012  0.007
  1408   0.033  0.017  0.009
  1536   0.029  0.014  0.008
  1664   0.040  0.020  0.011
  1792   0.036  0.018  0.010    .0247  .0127  .0073
  1920   0.046  0.023  0.013    .0321  .0166  .0091
  2048   0.037  0.020  0.012    .0291  .0151  .0086
  2304   0.052  0.027  0.015
  2560   0.048  0.025  0.016
  2816   0.069  0.036  0.020
  3072   0.058  0.030  0.019
  3328   0.083  0.042  0.023
  3584   0.074  0.039  0.023    .0525  .0275  .0168
  3840   0.095  0.048  0.027    .0679  .0349  .0190
  4096   0.073  0.040  0.027    .0579  .0310  .0194
  4608   0.107  0.056  0.033
  5120   0.093  0.050  0.037
  5632   0.141  0.072  0.042
  6144   0.115  0.062  0.048
  6656   0.169  0.086  0.050
  7168   0.146  0.077  0.059    .1034  .0557  .0487
  7680   0.198  0.101  0.059    .1425  .0729  .0427(!)
  8192   0.156  0.093  0.070    .1220  .0658  .0569
--------------------------------------------------
Avg || Scaling:        1.936x 3.229x    ----   1.909x 3.099x
AvgGain, AVX vs SSE2:  1.100x 1.101x 1.059x    1.424x 1.390x 1.300x
[/code]
The SSE2 -> AVX speedup (summarized in the last line of the table; you may need to scroll the code-window display down to see it) is much less for Mersenne-mod convolution mode than for Fermat-mod, and only some of that (at most ~1/3 by my estimation) is attributable to the lack of a fast AVX-mode Mersenne-mod carry macro, which I noted in a recent post above. Clearly I have much work to do here.
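The "AvgGain" row is just the per-column arithmetic mean of (SSE2 time / AVX time). A sketch of the computation (my own illustration, using only the four fully populated 1-thread Fermat-mod rows from the two tables, so it lands a little below the full-table 1.424x):

```python
# 1-thread Fermat-mod sec/iter, copied from the Test #1 (SSE2) and
# Test #2 (AVX) tables above, keyed by FFT length in Kdbl.
sse2 = {1024: .0179, 1792: .0343, 4096: .0813, 8192: .1689}
avx  = {1024: .0133, 1792: .0247, 4096: .0579, 8192: .1220}

# AvgGain = mean over FFT lengths of the per-length speedup ratio.
gain = sum(sse2[k] / avx[k] for k in sse2) / len(sse2)
print(f"avg SSE2 -> AVX gain over sampled rows: {gain:.3f}x")
```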