mersenneforum.org  

Old 2013-06-12, 19:27   #100
ewmayer
2ω=0
 
Sep 2002
República de California

19·613 Posts

George, I run mostly my own code, so am not intimately familiar with the Prime95 self-test protocols:

o The above timing sets start with the expected "4 cores" blurb re. the CPU, but are the ensuing timings for running on 1 core or all 4?

o How does the very slight speedup for the faster ddr3 memory compare with what you expected?
Old 2013-06-12, 20:49   #101
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19×397 Posts

Quote:
Originally Posted by ewmayer

o The above timing sets start with the expected "4 cores" blurb re. the CPU, but are the ensuing timings for running on 1 core or all 4?

o How does the very slight speedup for the faster ddr3 memory compare with what you expected?
The first batch of numbers are running on 1 core.

I didn't really have any expectations for the single-core numbers with the higher-bandwidth DDR3. As I expected, the huge gains for faster DDR3 come when running a worker on all 4 cores.

I really need to rerun these benchmarks after turning off turbo boost. The CPU frequency could be different at times in the two benchmark runs.
Old 2013-06-12, 20:59   #102
ewmayer
2ω=0
 
Sep 2002
República de California

19×613 Posts

Quote:
Originally Posted by Prime95
The first batch of numbers are running on 1 core.

I didn't really have any expectations for the single-core numbers with the higher-bandwidth DDR3. As I expected, the huge gains for faster DDR3 come when running a worker on all 4 cores.
That's what I guessed, since the multiworker tests amount to "1 single-thread process per core" tests of overall system memory bandwidth.
Old 2013-06-12, 21:04   #103
Batalov
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

10010100011001₂ Posts

I actually had the same question as Ernst.

If you run the benchmark and at the same time observe the task manager, you may see that when the test reports "running on 1 cpu", all N cores are 100% busy; then, during "Timing FFT using 2 threads", (N-1) cores are 100% busy (!); then, during "Timing FFT using 3 threads", (N-2) cores are 100% busy (!), and so on. This is with version 27.9 on a 6-core Xeon.
Old 2013-06-12, 22:52   #104
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19·397 Posts

Quote:
Originally Posted by Batalov
you may see that when the test reports "running on 1 cpu", all N cores are 100% busy;
Now that you mention it, I remember adding some code to the benchmarking routine that starts dummy tasks on the other cores to simulate a fully loaded system. This should limit the way turbo / speedstep messes with the benchmarks.

Your description makes it sound like the code isn't working perfectly.
Old 2013-06-12, 23:00   #105
Batalov
 
"Serge"
Mar 2008
Phi(4,2^7658614+1)/2

9,497 Posts

Ah. Well, if they don't use the bus, then the results should be "simulatedly" correct. (Activity would make all cores wake up from being downclocked, e.g. to 1600MHz, and at the same time the bandwidth would all be available to the cores that are being tested for computational throughput.)

Now it makes sense; it was just hard to conjecture this from what task manager was showing.
Old 2013-06-13, 01:36   #106
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19×397 Posts

We seem to be stable at 4.2GHz, 1.2V, DDR3-2400. Temps are in the mid-70s °C running large FFTs and 80 °C running small FFTs. When I tried running 4.4GHz at 1.25V, as some overclockers have done, the large FFT torture test seemed to be OK with temps in the low 80s. I then tried the small FFT torture test. Instantly, temps spiked to 100 °C and throttling began.

Note that version 27 definitely has problems determining the CPU speed on Haswell.


All cores at 4.2 GHz, DDR3-2400

Intel(R) Core(TM) i5-4670K CPU @ 3.40GHz
CPU speed: 3909.47 MHz, 4 cores
Prime95 64-bit version 27.9, RdtscTiming=1
Best time for 768K FFT length: 3.082 ms., avg: 3.103 ms.
Best time for 896K FFT length: 3.844 ms., avg: 3.885 ms.
Best time for 1024K FFT length: 4.346 ms., avg: 4.360 ms.
Best time for 1280K FFT length: 5.413 ms., avg: 5.472 ms.
Best time for 1536K FFT length: 6.613 ms., avg: 6.684 ms.
Best time for 1792K FFT length: 7.933 ms., avg: 7.962 ms.
Best time for 2048K FFT length: 9.223 ms., avg: 9.409 ms.
Best time for 2560K FFT length: 11.368 ms., avg: 11.388 ms.
Best time for 3072K FFT length: 13.849 ms., avg: 13.882 ms.
Best time for 3584K FFT length: 16.915 ms., avg: 16.935 ms.
Best time for 4096K FFT length: 18.718 ms., avg: 18.746 ms.
Best time for 5120K FFT length: 24.605 ms., avg: 24.632 ms.
Best time for 6144K FFT length: 29.271 ms., avg: 29.313 ms.
Best time for 7168K FFT length: 35.181 ms., avg: 36.428 ms.
Best time for 8192K FFT length: 41.031 ms., avg: 41.083 ms.


Times for LL test on 77000003 (4M FFT):

1 worker: 18.8 ms.
2 workers: 19.2, 19.2 ms.
3 workers: 20.2, 20.1, 20.1 ms.
4 workers: 22.5 ms. each
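
One way to read the worker timings above is as aggregate throughput: each worker finishes one iteration every t ms, so the total rate is the sum of 1000/t over the running workers. The sketch below only does that arithmetic on the quoted figures; the names `ll_times_ms` and `total_throughput` are made up for illustration and are not part of Prime95.

```python
# Per-iteration times (ms) for each worker, copied from the LL-test figures above.
ll_times_ms = {
    1: [18.8],
    2: [19.2, 19.2],
    3: [20.2, 20.1, 20.1],
    4: [22.5, 22.5, 22.5, 22.5],  # quoted as "22.5 ms. each"
}


def total_throughput(times_ms):
    """Aggregate iterations per second across all concurrently running workers."""
    return sum(1000.0 / t for t in times_ms)


throughput = {n: total_throughput(ts) for n, ts in ll_times_ms.items()}
baseline = throughput[1]

for n in sorted(throughput):
    print(f"{n} worker(s): {throughput[n]:6.1f} iter/s, "
          f"{throughput[n] / baseline:.2f}x the 1-worker rate")
```

On these figures, four workers deliver roughly 3.3 times the single-worker throughput, even though each individual worker slows from 18.8 ms to 22.5 ms per iteration.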
Old 2013-06-13, 03:07   #107
ewmayer
2ω=0
 
Sep 2002
República de California

10110101111111₂ Posts

In preparation for the upgrade to Haswell, some benchmark Mlucas timings on my soon-to-be-legacy CPU:

Test #1: SSE2 mode on 3.3 GHz Sandy Bridge quad, DDR3 SDRAM 1333 (PC3 10600):
Code:
            Mersenne-mod:          Fermat-mod:
FFT len       sec/iter              sec/iter
(Kdbl)   1-thr  2-thr  4-thr   1-thr  2-thr  4-thr
1024     0.019  0.010  0.005   .0179  .0093  .0050
1152     0.026  0.013  0.007
1280     0.025  0.013  0.008
1408     0.038  0.019  0.010
1536     0.030  0.015  0.008
1664     0.045  0.023  0.013
1792     0.037  0.019  0.010   .0343  .0175  .0096
1920     0.052  0.027  0.014   .0498  .0250  .0135
2048     0.041  0.021  0.012   .0385  .0197  .0107
2304     0.052  0.027  0.015
2560     0.054  0.028  0.018
2816     0.078  0.040  0.022
3072     0.064  0.034  0.019
3328     0.094  0.048  0.027
3584     0.078  0.040  0.024   .0730  .0371  .0213
3840     0.109  0.055  0.030   .1027  .0520  .0281
4096     0.083  0.044  0.027   .0813  .0416  .0236
4608     0.111  0.057  0.033
5120     0.108  0.058  0.038
5632     0.165  0.084  0.047
6144     0.128  0.069  0.050
6656     0.191  0.099  0.057
7168     0.157  0.083  0.059   .1473  .0771  .0571
7680     0.228  0.116  0.065   .2153  .1094  .0605
8192     0.176  0.094  0.068   .1689  .0867  .0641
--------------------------------------------------
Avg || Scaling: 1.934x 3.347x   ----  1.955x 3.374x
Note that Fermat-mod convolution mode is only supported for runlengths which are powers of 2 or just slightly less than such - runlengths shorter than the 7*2^n ones are infeasible due to excess roundoff error, and ones of form 9*2^n are wasteful for the above size range because of "too little roundoff error". (The latter variety of lengths will become useful out around F34-F35 or so, when 16 bits per double will be too much, based on ROE levels. But that's still a few years down the road).

"Avg parallel scaling" is computed as the arithmetic average of the 1-thread column runtimes divided by their 2 and 4-threaded counterparts.
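
As a check on that definition, here is a minimal sketch that takes it to mean the mean of the per-row ratios (1-thread time divided by the n-thread time), applied to the Fermat-mod columns of the SSE2 table above. That reading is an assumption, and `fermat_sse2` and `avg_scaling` are illustrative names, not anything from Mlucas.

```python
# Fermat-mod sec/iter (1-thr, 2-thr, 4-thr), copied from the SSE2 table above.
fermat_sse2 = {
    1024: (.0179, .0093, .0050),
    1792: (.0343, .0175, .0096),
    1920: (.0498, .0250, .0135),
    2048: (.0385, .0197, .0107),
    3584: (.0730, .0371, .0213),
    3840: (.1027, .0520, .0281),
    4096: (.0813, .0416, .0236),
    7168: (.1473, .0771, .0571),
    7680: (.2153, .1094, .0605),
    8192: (.1689, .0867, .0641),
}


def avg_scaling(rows, col):
    """Mean over FFT lengths of (1-thread time / time in column `col`)."""
    ratios = [times[0] / times[col] for times in rows.values()]
    return sum(ratios) / len(ratios)


scaling_2thr = avg_scaling(fermat_sse2, 1)
scaling_4thr = avg_scaling(fermat_sse2, 2)
print(f"2-thr: {scaling_2thr:.3f}x   4-thr: {scaling_4thr:.3f}x")
```

With the table's rounded timings this comes out at about 1.955x and 3.374x, in line with the Fermat-mod entries on the "Avg || Scaling" line, which suggests the per-row reading is the intended one.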

Note how many of the FFT lengths involving the larger odd primes 11 and 13 or the odd composite 15 scale better than their smoother brethren as the thread count increases. For example, 7680K = 15·2^19 is crap running 1-threaded compared to its neighbors 7168K and 8192K, but interpolates them nicely at 4 threads.
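
To see which odd factor sits behind each FFT length in the table, the following sketch splits a length into odd part times power of two. It assumes 1K = 1024 doubles, which is consistent with 7680K = 15·2^19; the helper name `odd_times_pow2` is hypothetical.

```python
def odd_times_pow2(fft_len_k):
    """Split an FFT length given in K (assumed 1K = 2**10 doubles) into
    (odd, exponent) such that length = odd * 2**exponent."""
    n = fft_len_k * 1024
    exponent = 0
    while n % 2 == 0:
        n //= 2
        exponent += 1
    return n, exponent


# A few lengths from the table: odd parts 11, 13, 7, 15, and a pure power of 2.
for k in (1408, 1664, 7168, 7680, 8192):
    odd, exp = odd_times_pow2(k)
    print(f"{k}K = {odd} * 2^{exp}")
```

Lengths with odd part 1 or 7 are the ones with Fermat-mod entries in the table, alongside those with odd part 15.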

AVX timings later ... I need to get some dinner and chores done.

Last fiddled with by ewmayer on 2013-06-13 at 23:59 Reason: Added Fermat-mod data to table
Old 2013-06-13, 03:36   #108
Prime95
P90 years forever!
 
Aug 2002
Yeehaw, FL

19×397 Posts

Quote:
Originally Posted by sdbardwick
George, (please forgive silly question) did the doubling of cache bandwidth from IB to Haswell offer any improvement, or is the latency (unchanged) more important?
Haswell is faster than Sandy Bridge (I don't have Ivy). My guess is the difference is due to doubling of cache bandwidths, but it may be due to other architectural improvements.

Here is the evidence I'm using to come to the conclusion above:

Remember the 22 clock macro I rewrote to use FMA? The same macro runs in 25 clocks on SB (both are running out of the L1 cache). On Haswell the macro runs in 30 clocks when data is in the L2 cache (an 8 clock penalty); on SB it takes 40 clocks when data is in the L2 cache (a 15 clock penalty).
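
The penalty figures are just the difference between the L2 and L1 clock counts. A small sketch of that arithmetic, reading the 22-clock figure as the Haswell L1 timing (an interpretation of the post, since 30 - 22 gives the quoted 8 clock penalty); the dictionary layout and function names are illustrative only.

```python
# Clock counts for the same macro, as quoted in the post above.
macro_clocks = {
    "Haswell": {"L1": 22, "L2": 30},
    "Sandy Bridge": {"L1": 25, "L2": 40},
}


def l2_penalty(timings):
    """Extra clocks paid when the data comes from L2 instead of L1."""
    return timings["L2"] - timings["L1"]


def l2_slowdown(timings):
    """L2 time relative to L1 time for the same macro."""
    return timings["L2"] / timings["L1"]


for cpu, timings in macro_clocks.items():
    print(f"{cpu}: +{l2_penalty(timings)} clocks, "
          f"{l2_slowdown(timings):.2f}x the L1 time")
```

In relative terms the macro costs about 1.36 times its L1 time from L2 on Haswell versus 1.6 times on Sandy Bridge.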
Old 2013-06-13, 04:12   #109
sdbardwick
 
Aug 2002
North San Diego County

1255₈ Posts

Thanks, George!
Old 2013-06-13, 23:55   #110
ewmayer
2ω=0
 
Sep 2002
República de California

19×613 Posts

Here are the timings on my SB quad for the AVX-based Mlucas code. Note AVX support for Mersenne-mod convolution is still very recent - when I started my SSE2 -> AVX port at the beginning of the year, my first priority was to get things working/optimized for Fermat-mod arithmetic. As you can see, there is a large disparity in improvement-versus-SSE2 for the two distinct convolution modes.

Test #2: AVX mode [but with SSE2-based carry macros in the Mersenne-mod case] on 3.3 GHz Sandy Bridge quad, DDR3 SDRAM 1333 (PC3 10600):
Code:
            Mersenne-mod:          Fermat-mod:
FFT len       sec/iter              sec/iter
(Kdbl)   1-thr  2-thr  4-thr   1-thr  2-thr  4-thr
1024     0.018  0.009  0.005   .0133  .0070  .0039
1152     0.025  0.012  0.007   
1280     0.024  0.012  0.007   
1408     0.033  0.017  0.009   
1536     0.029  0.014  0.008   
1664     0.040  0.020  0.011   
1792     0.036  0.018  0.010   .0247  .0127  .0073
1920     0.046  0.023  0.013   .0321  .0166  .0091
2048     0.037  0.020  0.012   .0291  .0151  .0086
2304     0.052  0.027  0.015   
2560     0.048  0.025  0.016   
2816     0.069  0.036  0.020   
3072     0.058  0.030  0.019   
3328     0.083  0.042  0.023   
3584     0.074  0.039  0.023   .0525  .0275  .0168
3840     0.095  0.048  0.027   .0679  .0349  .0190
4096     0.073  0.040  0.027   .0579  .0310  .0194
4608     0.107  0.056  0.033   
5120     0.093  0.050  0.037   
5632     0.141  0.072  0.042   
6144     0.115  0.062  0.048   
6656     0.169  0.086  0.050   
7168     0.146  0.077  0.059   .1034  .0557  .0487
7680     0.198  0.101  0.059   .1425  .0729  .0427(!)
8192     0.156  0.093  0.070   .1220  .0658  .0569
--------------------------------------------------
Avg || Scaling: 1.936x 3.229x   ----  1.909x 3.099x
AvgGain, AVX
vs SSE2: 1.100x 1.101x 1.059x  1.424x 1.390x 1.300x
The SSE2 -> AVX speedup (summarized in the last line of the table; you may need to scroll the code-window display down to see it) is much less for Mersenne-mod convolution mode than for Fermat-mod, and only some of that (at most ~1/3 in my estimation) is attributable to the lack of a fast AVX-mode Mersenne-mod carry macro which I noted in a recent post above. Clearly I have much work to do here.
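
For reference, a minimal sketch of how an "AvgGain" entry can be recomputed from the two tables, assuming it is the mean over FFT lengths of the SSE2 time divided by the AVX time (an assumption about the method, not something stated in the post). It uses only the 1-thread Fermat-mod columns; the variable names are illustrative.

```python
# 1-thread Fermat-mod sec/iter by FFT length (Kdbl), copied from the two tables.
fermat_1thr_sse2 = {
    1024: .0179, 1792: .0343, 1920: .0498, 2048: .0385, 3584: .0730,
    3840: .1027, 4096: .0813, 7168: .1473, 7680: .2153, 8192: .1689,
}
fermat_1thr_avx = {
    1024: .0133, 1792: .0247, 1920: .0321, 2048: .0291, 3584: .0525,
    3840: .0679, 4096: .0579, 7168: .1034, 7680: .1425, 8192: .1220,
}


def avg_gain(old, new):
    """Mean over the shared FFT lengths of (old time / new time)."""
    lengths = sorted(old.keys() & new.keys())
    ratios = [old[n] / new[n] for n in lengths]
    return sum(ratios) / len(ratios)


gain_1thr = avg_gain(fermat_1thr_sse2, fermat_1thr_avx)
print(f"Fermat-mod, 1 thread: AVX is {gain_1thr:.3f}x faster than SSE2 on average")
```

With the rounded table values this gives about 1.424x, in line with the first Fermat-mod entry on the "AvgGain, AVX vs SSE2" line.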

Last fiddled with by ewmayer on 2013-06-13 at 23:58

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.