2016-09-08, 11:48   #56
airsquirrels

At this point I would take the benchmarks with a grain of salt. There are many ways to configure a KNL system, and until we can benchmark those configurations more exhaustively it will be hard to tell both the peak performance of current prime95 and what we'll gain from using the new technology.

Which FFTs are these using: AVX? FMA3? And is the HBM configured as cache or as flat memory?

2016-09-08, 13:34   #57
Madpoo

Quote:
Originally Posted by airsquirrels
At this point I would take the benchmarks with a grain of salt. There are many ways to configure a KNL system, and until we can benchmark those configurations more exhaustively it will be hard to tell both the peak performance of current prime95 and what we'll gain from using the new technology.

Which FFTs are these using: AVX? FMA3? And is the HBM configured as cache or as flat memory?
Yeah, I wondered too about how the HBM was set up on that system. I don't know whether the reference motherboards for these things have a default BIOS setting for it, or what that default would be... If I had to guess, it would NOT be the NUMA (flat) mode, since that would require some OS support and tweaking to take best advantage of it. So I'll go with the default being the caching mode.

Prime95 uses CPUID info to detect the CPU's capabilities and picks its code paths accordingly (I think?). The benchmark snippet I included showed "CPU features: Prefetchw, SSE, SSE2, SSE4, AVX, AVX2, FMA", so I'd assume it's using the AVX2/FMA code.
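
(For the curious, here's a minimal sketch of the kind of runtime check I mean. This is not prime95's actual code, just GCC's __builtin_cpu_supports() used the same general way; it needs a reasonably recent GCC to know about "avx512f".)

#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();   /* only strictly required before main(), but harmless here */
    printf("AVX:     %s\n", __builtin_cpu_supports("avx")     ? "yes" : "no");
    printf("AVX2:    %s\n", __builtin_cpu_supports("avx2")    ? "yes" : "no");
    printf("FMA:     %s\n", __builtin_cpu_supports("fma")     ? "yes" : "no");
    printf("AVX512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
    return 0;
}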

And remember that this is running at 1.3 GHz without any kind of turbo boost (I think), so on a core-for-core comparison to a Haswell it really won't impress anyone. Its advantage comes from:
a) lots of cores
b) that fast HBM
c) AVX-512

a) and b) should be good benefits out of the box, although the multithreading could probably be improved. c) is where we should see the biggest gains, but until we get the dev system and a test version compiled to use AVX-512, we can only estimate what it'll do.

EDIT: Oh yeah, another benefit is d) 6-channel main memory ... in addition to the HBM, that's also a huge bonus.
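
(To put rough numbers on a) through d), here's a back-of-the-envelope sketch in C. The 64-core count and DDR4-2400 speed are my assumptions, since we don't know the exact SKU or memory config of that box, and the MCDRAM figure is just Intel's ">400 GB/s" quoted number, not a measurement.)

#include <stdio.h>

int main(void)
{
    /* c) AVX-512 peak: cores * GHz * 2 VPUs/core * 8 DP lanes * 2 flops per FMA */
    double knl_gflops = 64 * 1.3 * 2 * 8 * 2;      /* ~2662 GFLOPS double precision */

    /* d) 6-channel DDR4-2400: channels * MT/s * 8 bytes per transfer */
    double ddr_gbs = 6 * 2400e6 * 8 / 1e9;         /* ~115 GB/s */

    printf("peak DP: ~%.0f GFLOPS, DDR4: ~%.0f GB/s, MCDRAM: >400 GB/s (quoted)\n",
           knl_gflops, ddr_gbs);
    return 0;
}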

2016-09-08, 14:12   #58
Madpoo

Quote:
Originally Posted by ewmayer
Aaron, how do the timings/parallel-scalings you posted above compare to those of the same benchmarks on the Xeon system you used to do the quick-verify of the latest M-prime?
I'm attaching the results from a benchmark I just ran. To keep it short and sweet, I used these options:
MinBenchFFT=2048
MaxBenchFFT=2048
BenchHyperthreads=0
BenchMultithreads=1
OnlyBenchThroughput=1
OnlyBenchMaxCPUs=0
Attached: results.txt (13.3 KB)

2016-09-08, 14:15   #59
Madpoo

Also, here's the results.txt from the user who ran the KNL benchmark. It has the hyperthread benchmarks enabled, so it's a bit more cluttered.
Attached: results.txt (85.6 KB)

2016-09-08, 14:27   #60
airsquirrels

I also suspect the default is to use the HBM as cache, which from what I've read actually introduces quite a bit of latency on cache misses. I'm not sure if the user with the KNL system has anything else active on the machine.

In theory, P95 shouldn't cache miss with a 16GB HBM cache.
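
(Ballpark: a 2048K FFT is about 2M doubles, i.e. roughly 16 MB of main data per worker, and even allowing a few times that for scratch and twiddle data, 60-odd workers should only touch a couple of GB, comfortably inside 16 GB of MCDRAM. That's my rough math, not a measured figure.)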

It does look like peak throughput (iterations/second) at 2048K is pretty comparable between the two machines, both unoptimized. The 2690v4 is also comparably priced.

Interesting: on your 2690v4 benchmarks, using the hyperthreaded cores really gave a nice ~30% boost over just the physical cores, with 28 vs. 14 workers...

2016-09-08, 15:28   #61
ldesnogu

I can't seem to find the reference any more, but IIRC a miss served from MCDRAM costs about 170 ns vs. 150 ns from DDR.
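
(If someone with a KNL box wants to measure it themselves, a crude pointer-chasing probe like the sketch below gives a usable latency number; it's a rough tool under my assumptions, not a calibrated benchmark. In flat mode you could run it under numactl --membind=<node> to aim it at MCDRAM or DDR4 separately, assuming MCDRAM shows up as its own NUMA node there.)

/* build: gcc -O2 -o latprobe latprobe.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (64u * 1024u * 1024u)   /* 64M slots * 8 bytes = 512 MB, far past any cache */
#define STEPS (50u * 1000u * 1000u)   /* number of dependent loads to time */

int main(void)
{
    size_t *next = malloc((size_t)N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle over all N slots, so every load
       depends on the previous one and the prefetchers get no help. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    srand(12345);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;            /* j in [0, i-1] */
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t s = 0; s < STEPS; s++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (ended at index %zu)\n", ns / STEPS, p);
    return 0;
}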

2016-09-08, 17:36   #62
airsquirrels

GTX Titan Blacks still seem to be the fastest single-exponent solution:

FFTSize: 4096K Exponent: 73955041 (38.79%) Error: 0.14062 ms: 2.6468 eta: 1:10:56:44
Card 1 (GeForce GTX TITAN Black - 87.00C, 100% Load [862/1202]@248.79W/250.00W, M73955041 using 4096K) GhzDay: 90.47
FFTSize: 4096K Exponent: 73955131 (30.07%) Error: 0.14844 ms: 2.8341 eta: 1:17:25:13
Card 2 (GeForce GTX TITAN - 78.00C, 100% Load [797/1254]@212.78W/250.00W, M73955131 using 4096K) GhzDay: 84.50
FFTSize: 4096K Exponent: 73955033 (39.98%) Error: 0.14062 ms: 2.7179 eta: 1:09:58:15
Card 3 (GeForce GTX TITAN Black - 85.00C, 100% Load [928/1280]@248.87W/250.00W, M73955033 using 4096K) GhzDay: 88.11
FFTSize: 4096K Exponent: 73955039 (39.63%) Error: 0.14062 ms: 2.7722 eta: 1:10:22:39
Card 4 (GeForce GTX TITAN Black - 87.00C, 100% Load [862/1202]@249.46W/250.00W, M73955039 using 4096K) GhzDay: 86.38
FFTSize: 4096K Exponent: 73955683 (2.65%) Error: 0.14844 ms: 2.6862 eta: 2:07:22:37
Card 5 (GeForce GTX TITAN Black - 83.00C, 100% Load [862/1202]@246.95W/250.00W, M73955683 using 4096K) GhzDay: 89.15
FFTSize: 4096K Exponent: 73955591 (3.63%) Error: 0.13281 ms: 2.6926 eta: 2:06:59:10
Card 6 (GeForce GTX TITAN Black - 84.00C, 100% Load [862/1202]@248.93W/250.00W, M73955591 using 4096K) GhzDay: 88.94
FFTSize: 4096K Exponent: 73955653 (0.77%) Error: 0.14941 ms: 2.6784 eta: 2:09:25:46
Card 7 (GeForce GTX TITAN Black - 83.00C, 100% Load [862/1202]@246.70W/250.00W, M73955653 using 4096K) GhzDay: 89.41
FFTSize: 2592K Exponent: 48321017 (0.06%) Error: 0.26953 ms: 1.7453 eta: 1:18:23:13
Card 8 (GeForce GTX TITAN Black - 80.00C, 100% Load [888/1202]@246.63W/250.00W, M48321017 using 2592K) GhzDay: 91.09
Total GhzDay(8 cards): 708.05
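
(If anyone wants to sanity-check those ETA figures: an LL test of M(p) needs roughly p squarings, so the remaining time is about (1 - fraction_done) * p * ms_per_iteration. Here's a quick sketch using Card 1's numbers from above -- the rounding of the printed percentage means it only lands in the right ballpark:)

#include <stdio.h>

int main(void)
{
    double p    = 73955041.0;   /* exponent, roughly the number of LL iterations */
    double done = 0.3879;       /* 38.79% complete, as printed */
    double ms   = 2.6468;       /* ms per iteration, as printed */

    double secs = (1.0 - done) * p * ms / 1000.0;
    int d = (int)(secs / 86400.0);
    int h = (int)(secs / 3600.0) % 24;
    int m = (int)(secs / 60.0) % 60;
    printf("remaining: ~%d d %02d h %02d m (the card reported 1:10:56:44)\n", d, h, m);
    return 0;
}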

2016-09-08, 19:41   #63
ATH

Quote:
Originally Posted by airsquirrels
GTX Titan Blacks still seem to be the fastest single-exponent solution:
Xeons are faster. Aaron double-checked M74207281 in 34 hours on a dual Xeon v3 with 20 cores on one worker.

2016-09-08, 20:34   #64
Prime95

Quote:
Originally Posted by airsquirrels
GTX Titan Blacks still seem to be the fastest single-exponent solution:
We also have not seen Ernst's timings on KNL. His code scales much better than mine -- I'll have to research that.

2016-09-10, 03:22   #65
Madpoo

Quote:
Originally Posted by airsquirrels
I also suspect the default is to use the HBM as cache, which from what I've read actually introduces quite a bit of latency on cache misses. I'm not sure if the user with the KNL system has anything else active on the machine.

In theory, P95 shouldn't cache miss with a 16GB HBM cache.

It does look like peak throughput (iterations/second) at 2048K is pretty comparable between the two machines, both unoptimized. The 2690v4 is also comparably priced.

Interesting: on your 2690v4 benchmarks, using the hyperthreaded cores really gave a nice ~30% boost over just the physical cores, with 28 vs. 14 workers...
Oh, I wasn't testing with hyperthreads; that was with the 2-socket system, using cores from both CPUs.

2016-09-10, 21:17   #66
ewmayer

Quote:
Originally Posted by Prime95
We also have not seen Ernst's timings on KNL. His code scales much better than mine -- I'll have to research that.
We also don't yet know how much difference ICC-for-KNL compilation will make, nor core-affinity tuning, nor the MCDRAM setup (flat/NUMA/hybrid) and numactl job control in that context (as described in the Colfax 'clustering modes' whitepaper linked a few posts back). Lots of experiments to do!
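
For the flat/NUMA case, the basic recipe (assuming the MCDRAM shows up as its own memory-only NUMA node, which is what the Colfax docs describe) would be something like:

numactl --hardware                 (check which node is the 16GB MCDRAM-only one)
numactl --membind=1 ./mprime       (force all allocations into that node, here assumed to be node 1)
numactl --preferred=1 ./mprime     (prefer MCDRAM but fall back to DDR4 if it fills up)

with the node number being whatever --hardware actually reports on that box.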

Aaron, did you just go with the default single-user license for the Intel tools? If so, will sudo-enablement allow multiple folks to use them?