mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
Old 2016-06-24, 12:09   #672
xtreme2k
 
 
Aug 2002

174₁₀ Posts

Would you post the 1w/14t throughput please for the 2690V4?
Old 2016-06-24, 17:58   #673
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

CF5₁₆ Posts

Quote:
Originally Posted by xtreme2k
Would you post the 1w/14t throughput please for the 2690V4?
Yeah, although I don't have any confidence in the benchmark timings.

Code:
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (14 cpus, 1 worker):  2.22 ms.  Throughput: 451.28 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (14 cpus, 2 workers):  3.62,  3.59 ms.  Throughput: 554.82 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (14 cpus, 7 workers): 15.66, 15.37, 15.66, 18.62, 14.13, 14.49, 14.09 ms.  Throughput: 457.31 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (14 cpus, 14 workers): 28.98, 29.16, 29.45, 28.96, 28.89, 28.92, 29.07, 28.96, 29.27, 29.52, 28.93, 28.95, 29.20, 29.48 ms.  Throughput: 480.73 iter/sec.
You'll see that it's saying the max throughput is from 2 workers of 7 threads each, but in reality that was not the case. I set it up like that and the actual iteration times were more than twice as long as with 1 worker using all 14 threads.

FYI, I focused on just the 4M FFT size since that's around the current leading edge of first-time LL tests.

Similarly, with all 28 cores (it's a dual-CPU system, after all), the benchmark tells the same story... supposedly it works best with workers of 7 cores each, but again that was not the case in practice.

Code:
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (28 cpus, 1 worker):  2.42 ms.  Throughput: 413.78 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (28 cpus, 2 workers):  5.56,  5.44 ms.  Throughput: 363.66 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (28 cpus, 4 workers):  5.69,  5.26,  5.51,  6.02 ms.  Throughput: 713.61 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (28 cpus, 7 workers): 14.08, 15.61, 12.16, 15.31, 13.06, 16.80, 11.98 ms.  Throughput: 502.18 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (28 cpus, 14 workers): 28.38, 29.96, 27.59, 29.77, 19.85, 21.59, 20.19, 28.38, 27.04, 28.78, 29.24, 20.06, 21.79, 19.90 ms.  Throughput: 571.84 iter/sec.
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (28 cpus, 28 workers): 43.21, 43.40, 42.99, 43.45, 44.02, 43.51, 43.15, 45.82, 45.41, 45.46, 45.58, 46.21, 45.32, 44.86, 43.86, 45.31, 44.95, 43.48, 44.06, 43.94, 44.85, 45.71, 45.13, 45.53, 45.68, 45.36, 45.33, 45.59 ms.  Throughput: 626.90 iter/sec.
I don't know if the Prime95 benchmark timings are done in a totally different way than Prime95's normal operation, but the benchmark seems to miss the effects of memory contention that show up when it's actually running. I'd just caution anyone who relies on the benchmark output to set up their system: also look at the actual iteration times with the different numbers of cores/workers and see what works best in reality.
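For reference, the worker/thread split for such a real-world test lives in local.txt. A minimal sketch, assuming the v28-era option names from undoc.txt (check your own copy before relying on them):

Code:
; local.txt -- run one worker using all 14 cores of one CPU
; (WorkerThreads/ThreadsPerTest are the v28 option names as I
; recall them; values are purely illustrative)
WorkerThreads=1
ThreadsPerTest=14
Change it to WorkerThreads=2 / ThreadsPerTest=7 (or 14, with both CPUs) and compare the per-iteration times each worker prints against what the benchmark predicted.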
Old 2016-06-24, 18:15   #674
henryzz
Just call me Henry
 
 
"David"
Sep 2007
Cambridge (GMT/BST)

1732₁₆ Posts

It is worth noting that the second CPU didn't add much compared with the first.
As I mentioned in the other thread, each CPU is split into two NUMA nodes. That could explain the 7-core behaviour.
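Sysinternals Coreinfo can confirm that layout, i.e. whether cluster-on-die is splitting each socket into two 7-core NUMA nodes. A sketch (switch letters from memory; run coreinfo -? to confirm):

Code:
REM Dump sockets, NUMA nodes, and the core map with Sysinternals Coreinfo.
REM Switch letters from memory -- run "coreinfo -?" to confirm.
coreinfo -s -n -c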
Old 2016-06-24, 20:40   #675
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2³·31² Posts

Quote:
Originally Posted by Madpoo

Code:
FFTlen=4096K, Type=0, Arch=0, Pass1=4194304, Pass2=0, clm=0 (14 cpus, 1 worker):  2.22 ms.
What are your benchmark settings in prime.txt? There is no type=0, pass1=4M FFT! Probably just an output bug.
Old 2016-06-26, 02:48   #676
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

31·107 Posts

Quote:
Originally Posted by Prime95
What are your benchmark settings in prime.txt? There is no type=0, pass1=4M FFT! Probably just an output bug.
Oh, I was using the AllBench=1 or whatever option and I think that's where the funny output came from. I used min/max FFT of 4096K so it didn't really do anything else too interesting.

I just used the AllBench=1 because the undoc said "This is only useful during the development cycle to find the optimal FFT implementations for each CPU."

I thought it might do something useful, but it only did the tests twice with no difference in the timings.
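For anyone following along, the prime.txt lines in question would look something like this (option names per the v28 undoc.txt, as best I can tell; values illustrative):

Code:
; prime.txt -- benchmark only the 4096K FFT, timing every implementation
MinBenchFFT=4096
MaxBenchFFT=4096
AllBench=1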

I do wonder if the benchmark isn't using the same affinity map I'm telling the program to use. For instance, with the two CPUs set up as "2 workers, 14 cores each, all cores on a single CPU", I can run two tests side by side with no slowdown whatsoever compared to running one worker with 14 cores while the other CPU sits idle.

However, that's not what the benchmark results indicate... I would expect double the throughput going from 1 worker on 14 cores up to 2 workers on 28 cores. It didn't work out that way, though: it went from 451.28 iter/sec to 363.66 iter/sec (it actually went down... that ain't right).
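As a side note, the Throughput figure appears to be nothing more than the sum of each worker's iteration rate. A quick sanity check, assuming that formula (it is an assumption; I haven't checked the source):

Code:
# Assumed formula: throughput = sum over workers of 1000/ms_per_iter.
def throughput(ms_per_iter):
    return sum(1000.0 / t for t in ms_per_iter)

print(throughput([3.62, 3.59]))  # 14 cpus, 2 workers -> ~554.79 (printed: 554.82)
print(throughput([5.56, 5.44]))  # 28 cpus, 2 workers -> ~363.68 (printed: 363.66)
So the benchmark is just adding up rates measured during its own short run, which says nothing about whether those rates hold up in a long-running test.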

Thus my recommendation that if you want real data, do real tests for now.
Old 2016-06-27, 09:13   #677
xtreme2k
 
 
Aug 2002

10101110₂ Posts

Madpoo:
In your posts #667 and #668 you indicate the 2690V4 is much faster than the 2697V3, yet in all your other posts that is not the case?

Is there a way to ensure the first 1w/14t runs on CPU1, and the second 1w/14t on CPU2?

Can you install HWiNFO64 to see what MHz the CPU is actually running at on a per-core level? CPU-Z only shows the first core.

It is also interesting to see bmurray7JHU's 6950X results, as P95 v28.9 actually recognises his Broadwell-E.
Old 2016-06-27, 21:38   #678
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

31·107 Posts

Quote:
Originally Posted by xtreme2k
Madpoo:
In your posts #667 and #668 you indicate the 2690V4 is much faster than the 2697V3, yet in all your other posts that is not the case?
...
That's because both are true.

It's faster for smaller FFT sizes, but it's slower for larger FFT sizes, for whatever reason.

I saw the v4 would typically run about one turbo multiplier higher than the v3, and it has faster DDR4 on top of that, which makes it even stranger that it does worse with larger exponents. Hopefully it's just a software thing, with Prime95 doing something "interesting" because it doesn't quite know what kind of CPU this is, or needing some tuning to optimize.

Since I'm not doing 100M digit tests, right now it's not bugging me too much. It runs faster at the current LL and DC wavefronts (call it ~68M for LL and ~37M for DC). I do wonder if it could actually be even faster with some tuning, but whatever... that'll come, if possible.
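(For a rough sense of why those exponents land on those FFT sizes: each double in the FFT carries on the order of 18 bits of the number, so the required FFT length is roughly p/18 bits. The 18-bits-per-word figure is a back-of-the-envelope assumption; prime95's real limits are tabulated per FFT size.)

Code:
# Rough FFT length for exponent p, assuming ~18 bits packed per double.
# (Assumed figure; prime95's actual per-FFT-size limits are tabulated.)
def rough_fft_k(p, bits_per_word=18.0):
    return p / bits_per_word / 1024.0  # FFT length in K

for p in (37_000_000, 68_000_000, 332_000_000):
    print(p, "->", round(rough_fft_k(p)), "K, rounded up to a standard size")
# 37M -> ~2007K (2048K), 68M -> ~3689K (4096K), 332M -> ~18012K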

Last fiddled with by Madpoo on 2016-06-27 at 22:02
Old 2016-06-27, 22:12   #679
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

31·107 Posts

Quote:
Originally Posted by xtreme2k
Is there a way to ensure the first 1w/14t runs on CPU1, and the second 1w/14t on CPU2?
Yup, I'm sure about that. I verified with the Sysinternals "CoreInfo" utility that it's still following the Windows scheme of mapping cores 0+1 as the physical/hyperthread pair, just like before. Then I used the same AffinityScramble2 mapping that the 14-core v3 chip uses. And finally, looking at Task Manager with one graph per CPU (56 separate graphs show up), if I start worker #1 the correct cores all chug along at 100%, and the same if I only run worker #2.
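Rather than eyeballing 56 Task Manager graphs, a small script can do the same check. A sketch using the third-party psutil package (an assumption on my part; any per-CPU monitor works):

Code:
# Watch per-logical-CPU load while starting one worker, to confirm the
# AffinityScramble2 mapping pins it where expected.
# Requires the third-party psutil package (pip install psutil).
import psutil

for _ in range(10):
    loads = psutil.cpu_percent(interval=1, percpu=True)
    print("busy logical CPUs:", [i for i, pct in enumerate(loads) if pct > 80])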

Quote:
Can you install HWiNFO64 to see what MHz the CPU is actually running at on a per-core level? CPU-Z only shows the first core.
Did that, and confirmed it's running at a 29x multiplier whether I'm doing a 68M exponent or a 332M exponent. I also confirmed that without any tests going on, the cores sit at a static 32x, which matches the spec (26x stock plus 6x turbo with all cores enabled).

Quote:
It is also interesting to see bmurray7JHU's 6950X results, as P95 v28.9 actually recognises his Broadwell-E.
I guess the 6950X is slightly different from the Broadwell-EP processors... I don't know how Prime95 chooses to display the CPU model info. Other programs like CPU-Z, or even the old CoreInfo build I have from 2014, can use the brand string reported by the CPUID instruction, but I wonder if Prime95 is also using the cpu family/model in a table? I guess I could look at the source, but all I know is that Prime95 28.9 starts up and says:
Code:
[Main thread Jun 27 22:03] Mersenne number primality test program version 28.9
[Main thread Jun 27 22:03] Optimizing for CPU architecture: Unknown Intel, L2 cache size: 256 KB, L3 cache size: 35 MB
[Main thread Jun 27 22:03] Using AffinityScramble2 setting to set affinity mask.
[Main thread Jun 27 22:03] Starting workers.
Old 2016-06-28, 01:01   #680
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2³×31² Posts

Quote:
Originally Posted by Madpoo
but I wonder if Prime95 is also using the cpu family/model in a table?
Yes.

However, almost all of prime95's decisions about which FFT implementation is appropriate are based on other CPUID flags (like FMA support, prefetch support, etc). Thus, my fixing the family/model table will make no difference.
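To see those flags on a given machine, here is an illustrative sketch using the third-party py-cpuinfo package (an assumption; prime95 itself reads CPUID directly, this is just a convenient way to peek):

Code:
# List CPU feature flags of the kind prime95's FFT selection keys off.
# Uses the third-party py-cpuinfo package (pip install py-cpuinfo).
import cpuinfo

info = cpuinfo.get_cpu_info()
flags = set(info.get("flags", []))
for f in ("sse2", "avx", "avx2", "fma"):
    print(f, "->", f in flags)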

Last fiddled with by Prime95 on 2016-06-28 at 01:06
Old 2016-06-28, 02:47   #681
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

31·107 Posts

Quote:
Originally Posted by Prime95
Yes.

However, almost all of prime95's decisions about which FFT implementation is appropriate are based on other CPUID flags (like FMA support, prefetch support, etc). Thus, my fixing the family/model table will make no difference.
That's kind of what I guessed. In my limited testing, it seemed like it picked the same FFT sizes that a "known" CPU would, but I confess I wasn't testing with exponents near the FFT boundaries.

Well, specifically for George, just holler if there are any tests or info you'd like me to run which might help out. I don't know enough about the operation of the program to even make a guess on whether there's something there that could make it slower at the larger FFTs. On the hardware side I'm not aware of anything either; on the contrary, everything suggests it should still run faster just like it does at the smaller FFTs.

I suppose it could be something else server-centric... not the CPU or memory. Although the same server model, motherboard, firmware, etc. are being used on the E5-2697 v3 and the E5-2690 v4... the only differences are the CPU and memory. Heck, they even have the same array controller, the same number/size of hard drives, the same number of fans, power supplies, etc.

I had to reinstall the 2nd DIMM per channel in the new box today, in prep for shipping it to its new home, so I can't test anything related to the 2400 MHz memory speed, but otherwise, just holler if there's something you'd like me to test or whatever.
Old 2016-07-08, 01:15   #682
Xyzzy
 
 
Aug 2002

10000010101001₂ Posts

Code:
AMD Athlon(tm) X4 880K Quad Core Processor     
CPU speed: 3992.52 MHz, 4 cores
CPU features: 3DNow! Prefetch, SSE, SSE2, SSE4, AVX, FMA
L1 cache size: 16 KB
L2 cache size: 2 MB
L1 cache line size: 64 bytes
L2 cache line size: 64 bytes
L1 TLBS: 64
L2 TLBS: 1024
Prime95 64-bit version 28.9, RdtscTiming=1
Best time for 1024K FFT length: 10.010 ms., avg: 11.270 ms.
Best time for 1280K FFT length: 12.614 ms., avg: 12.906 ms.
Best time for 1536K FFT length: 15.556 ms., avg: 15.815 ms.
Best time for 1792K FFT length: 18.877 ms., avg: 19.044 ms.
Best time for 2048K FFT length: 20.456 ms., avg: 20.495 ms.
Best time for 2560K FFT length: 26.180 ms., avg: 26.282 ms.
Best time for 3072K FFT length: 32.313 ms., avg: 32.739 ms.
Best time for 3584K FFT length: 39.220 ms., avg: 39.497 ms.
Best time for 4096K FFT length: 43.073 ms., avg: 44.114 ms.
Best time for 5120K FFT length: 56.635 ms., avg: 57.516 ms.
Best time for 6144K FFT length: 68.302 ms., avg: 69.329 ms.
Best time for 7168K FFT length: 82.788 ms., avg: 83.573 ms.
Best time for 8192K FFT length: 89.876 ms., avg: 91.534 ms.
Timing FFTs using 2 threads.
Best time for 1024K FFT length: 7.857 ms., avg: 7.994 ms.
Best time for 1280K FFT length: 9.959 ms., avg: 10.085 ms.
Best time for 1536K FFT length: 12.108 ms., avg: 12.372 ms.
Best time for 1792K FFT length: 14.877 ms., avg: 15.078 ms.
Best time for 2048K FFT length: 16.218 ms., avg: 16.828 ms.
Best time for 2560K FFT length: 20.900 ms., avg: 20.945 ms.
Best time for 3072K FFT length: 25.536 ms., avg: 25.943 ms.
Best time for 3584K FFT length: 30.956 ms., avg: 31.765 ms.
Best time for 4096K FFT length: 33.613 ms., avg: 34.454 ms.
Best time for 5120K FFT length: 43.658 ms., avg: 45.174 ms.
Best time for 6144K FFT length: 54.065 ms., avg: 55.816 ms.
Best time for 7168K FFT length: 68.734 ms., avg: 69.902 ms.
Best time for 8192K FFT length: 70.916 ms., avg: 72.768 ms.
Timing FFTs using 3 threads.
Best time for 1024K FFT length: 5.203 ms., avg: 5.248 ms.
Best time for 1280K FFT length: 6.508 ms., avg: 6.891 ms.
Best time for 1536K FFT length: 7.884 ms., avg: 8.572 ms.
Best time for 1792K FFT length: 9.488 ms., avg: 9.552 ms.
Best time for 2048K FFT length: 10.400 ms., avg: 10.825 ms.
Best time for 2560K FFT length: 13.212 ms., avg: 13.310 ms.
Best time for 3072K FFT length: 16.194 ms., avg: 16.283 ms.
Best time for 3584K FFT length: 19.408 ms., avg: 19.581 ms.
Best time for 4096K FFT length: 21.471 ms., avg: 21.603 ms.
Best time for 5120K FFT length: 27.852 ms., avg: 28.741 ms.
Best time for 6144K FFT length: 34.192 ms., avg: 35.219 ms.
Best time for 7168K FFT length: 42.224 ms., avg: 43.648 ms.
Best time for 8192K FFT length: 44.121 ms., avg: 44.815 ms.
Timing FFTs using 4 threads.
Best time for 1024K FFT length: 4.659 ms., avg: 4.852 ms.
Best time for 1280K FFT length: 5.828 ms., avg: 6.408 ms.
Best time for 1536K FFT length: 7.082 ms., avg: 7.163 ms.
Best time for 1792K FFT length: 8.429 ms., avg: 8.568 ms.
Best time for 2048K FFT length: 9.309 ms., avg: 9.432 ms.
Best time for 2560K FFT length: 11.875 ms., avg: 12.728 ms.
Best time for 3072K FFT length: 14.419 ms., avg: 14.583 ms.
Best time for 3584K FFT length: 17.422 ms., avg: 17.557 ms.
Best time for 4096K FFT length: 19.073 ms., avg: 19.235 ms.
Best time for 5120K FFT length: 25.090 ms., avg: 26.228 ms.
Best time for 6144K FFT length: 31.130 ms., avg: 31.377 ms.
Best time for 7168K FFT length: 39.248 ms., avg: 40.266 ms.
Best time for 8192K FFT length: 40.231 ms., avg: 41.124 ms.

Timings for 1024K FFT length (4 cpus, 4 workers): 17.15, 16.66, 16.73, 16.63 ms.  Throughput: 238.28 iter/sec.
Timings for 1280K FFT length (4 cpus, 4 workers): 30.26, 25.75, 27.13, 25.69 ms.  Throughput: 147.66 iter/sec.
Timings for 1536K FFT length (4 cpus, 4 workers): 46.37, 38.02, 34.43, 26.32 ms.  Throughput: 114.91 iter/sec.
Timings for 1792K FFT length (4 cpus, 4 workers): 36.96, 32.14, 32.23, 31.86 ms.  Throughput: 120.59 iter/sec.
Timings for 2048K FFT length (4 cpus, 4 workers): 75.09, 42.93, 47.02, 45.64 ms.  Throughput: 79.79 iter/sec.
Timings for 2560K FFT length (4 cpus, 4 workers): 179.39, 33.97, 44.16, 43.37 ms.  Throughput: 80.71 iter/sec.
Timings for 3072K FFT length (4 cpus, 4 workers): 175.11, 181.60, 175.32, 173.85 ms.  Throughput: 22.67 iter/sec.
Timings for 3584K FFT length (4 cpus, 4 workers): 217.28, 181.99, 97.64, 96.79 ms.  Throughput: 30.67 iter/sec.
Timings for 4096K FFT length (4 cpus, 4 workers): 209.09, 139.08, 101.27, 77.22 ms.  Throughput: 34.80 iter/sec.
Timings for 5120K FFT length (4 cpus, 4 workers): 403.35, 149.38, 122.86, 162.36 ms.  Throughput: 23.47 iter/sec.
Timings for 6144K FFT length (4 cpus, 4 workers): 298.96, 220.48, 143.27, 181.14 ms.  Throughput: 20.38 iter/sec.
Timings for 7168K FFT length (4 cpus, 4 workers): 220.75, 187.36, 173.04, 155.11 ms.  Throughput: 22.09 iter/sec.
Timings for 8192K FFT length (4 cpus, 4 workers): 285.44, 156.77, 210.51, 173.70 ms.  Throughput: 20.39 iter/sec.
Memory = DDR 1333 (No XMP)
Attached Thumbnails: CPU.png (27.2 KB), Memory.png (15.5 KB)