mersenneforum.org  

Old 2016-09-17, 11:53   #155
airsquirrels
 
 
"David"
Jul 2015
Ohio

11·47 Posts

Quote:
Originally Posted by LaurV View Post
Yes, at the time, and I went back right now and read it again. (Lots of numbers; it can count from 0 to 255, so it can play minesweeper )
That post does not say much besides the fact that it scales quite well. The number of iterations, without the attached FFT size, gives no indication of the performance. Anyhow, am I very optimistic when I say that I expect a 20-fold performance increase from the actual P95/mlucas to the "tuned for Phi" P95/mlucas? 10-fold? 5-fold? Then if so, I won't pay much attention to the actual benchmarks. They mean nothing when I ask "what can this beast do", as opposed to "what is it doing right now". I will wait until I see the "can do".
(of course, not my work... it is easy to criticize others' work - don't pay much attention to me!)
I can sum up the current state of affairs.

Good Today: Total LL throughput using the best settings in mprime or mlucas is competitive with massively (2-5x) more expensive dual Xeon v2/v3/v4 systems. The downside is that this is best achieved by running lots of exponents slowly.

Bad Today: mlucas scales better than mprime at the moment, but single-exponent performance is not particularly great. This has more to do with the threading model, FFT-splitting choices, memory locality, etc. needing to be optimized for KNL in these programs than with anything else.

Good Tomorrow: We should see about a 2x speedup from AVX-512, maybe more. KNL is sensitive to instruction count given its limited resources in that department, so denser compute code will also scale better. Improving the threading will help single-exponent performance scale closer to the current multi-exponent numbers.

Bad Tomorrow: Eventually the Xeons of tomorrow with AVX-512 are likely to outperform KNL, unless the FFT code in mprime/mlucas is adjusted to provide something closer to a streaming memory-access model. KNL is significantly better than KNC for sparse-memory-access applications; however, even Intel's own docs declare that Xeons will outperform it here. AVX-512 prefetch instructions may also help us manage sparse access.

Cost: In terms of hardware cost, even the KNL developer system beats Xeons. I expect other released versions to be cheaper. If we could multiply it by 8x with the PCIe variant in a GPU SuperServer plus host Xeons, we could put nearly 2000 GHz-days/day in 4U. The TDP of the current system is around 300W, so it is significantly cheaper to operate as well. It takes 600+ watts of GPUs or 400+ watts of Xeons to match the current performance.
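As a back-of-the-envelope check of those figures (the per-system throughput of ~250 GHz-days/day is simply inferred from the 8x and 2000 numbers above, not independently measured):

```python
# Sanity-check the density/efficiency claims above (all inputs from the post).
box_throughput = 2000           # GHz-days/day claimed for 8 PCIe KNLs in 4U
per_knl = box_throughput / 8    # implied throughput of one KNL system
assert per_knl == 250

knl_watts = 300                 # TDP of the current KNL system
gpu_watts = 600                 # minimum GPU wattage quoted for the same work
xeon_watts = 400                # minimum Xeon wattage quoted for the same work

# KNL needs at most half the GPU power and 3/4 the Xeon power per unit of work:
assert knl_watts / gpu_watts <= 0.5
assert knl_watts / xeon_watts <= 0.75
```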

Last fiddled with by airsquirrels on 2016-09-17 at 11:56
Old 2016-09-17, 14:39   #156
Karl M Johnson
 
 
Mar 2010

411₁₀ Posts

Quote:
Originally Posted by airsquirrels View Post
Bad Tomorrow: Eventually the Xeons of tomorrow with AVX-512 are likely to outperform KNL, unless the FFT code in mprime/mlucas is adjusted to provide something closer to a streaming memory-access model. KNL is significantly better than KNC for sparse-memory-access applications; however, even Intel's own docs declare that Xeons will outperform it here. AVX-512 prefetch instructions may also help us manage sparse access.
Intel also wants to put FPGAs inside future Xeons, which could result in an extra speed boost if properly utilised.
Old 2016-09-17, 18:21   #157
Madpoo
Serpentine Vermin Jar
 
 
Jul 2014

3313₁₀ Posts

Quote:
Originally Posted by xathor View Post
I think the only advantage KNL has is the AVX-512 VPUs... it's hampered by the overall low clock speeds of each core....
It's true that the slower clock speed is a disadvantage.

However, as AirSquirrels points out, the benefits of AVX-512 (which we haven't even tested yet), plus the many cores and the faster memory, mean you can run a lot of simultaneous tests, giving an effective throughput much higher than a top-end dual-socket Xeon system at a fraction of the price.

For now I think the benefit to the project is giving devs a chance to work with AVX-512 on real hardware and also work on the multi-threading aspect. Tuning those two things should keep GIMPS ahead of the curve for what (I assume) will eventually trickle down to desktop CPUs (512-bit VPUs and more cores).

Intel has already thrown in the towel on making chips faster through clock speed alone, so if we have any hope of continuing the steady march of making LL tests faster, it rests on these other areas.
Old 2016-09-17, 21:50   #158
ewmayer
2ω=0
 
 
Sep 2002
República de California

2·3·29·67 Posts

Quote:
Originally Posted by airsquirrels View Post
I can sum up the current state of affairs.
...
Good Tomorrow: We should see about a 2x speedup from AVX-512, maybe more. KNL is sensitive to instruction count given its limited resources in that department, so denser compute code will also scale better. Improving the threading will help single-exponent performance scale closer to the current multi-exponent numbers.
Thanks for the nice summary, David. Twice as many (16 ==> 32) vector registers on KNL should be a big help w.r.t. the instruction-count bottleneck ... in my AVX-512 code I intend to use that to ruthlessly eliminate register copies, use lots of memory-operand forms of instructions to reduce explicit loads, etc. I'm sure George can provide some excellent ideas here as well.

Also, we still haven't used ICC and its allegedly marvelous suite of profiling and multithread-tuning tools.
Old 2016-09-17, 23:08   #159
ldesnogu
 
 
Jan 2008
France

1046₈ Posts

Quote:
Originally Posted by airsquirrels View Post
Cost: In terms of hardware cost, even the KNL developer system beats Xeons. I expect other released versions to be cheaper. If we could multiply it by 8x with the PCIe variant in a GPU SuperServer plus host Xeons, we could put nearly 2000 GHz-days/day in 4U. The TDP of the current system is around 300W, so it is significantly cheaper to operate as well. It takes 600+ watts of GPUs or 400+ watts of Xeons to match the current performance.
I wonder if the dev-system price isn't so low because Intel is heavily subsidizing it, to help KNL proliferate outside of large supercomputer centers. As far as the chip goes, the base price is driven by silicon area, and KNL is huge. So I wouldn't bet that other KNL systems will be as cheap as this dev system.
Old 2016-09-18, 02:42   #160
xathor
 
Sep 2016

19 Posts

Quote:
Originally Posted by ldesnogu View Post
I wonder if the dev-system price isn't so low because Intel is heavily subsidizing it, to help KNL proliferate outside of large supercomputer centers. As far as the chip goes, the base price is driven by silicon area, and KNL is huge. So I wouldn't bet that other KNL systems will be as cheap as this dev system.
Several large supercomputer centers I have spoken with (including mine) aren't going to go down the KNL path, mainly because of the low clock speeds, and because recompiling software for AVX-512 is a slow and often fruitless process.
Old 2016-09-18, 04:45   #161
VBCurtis
 
 
"Curtis"
Feb 2005
Riverside, CA

2⁷·3·13 Posts

How do you know it's fruitless if you haven't done it? And won't the AVX-512 optimizations be useful on future Xeons anyway?
I'm merely an observer, but there's quite a gap between "we might double mprime performance" and "fruitless".

Last fiddled with by VBCurtis on 2016-09-18 at 04:47
Old 2016-09-18, 04:52   #162
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

16720₈ Posts

Good news: I wrote the AVX-512 TF code.
Bad news: I've run into HJWasm bugs.
Old 2016-09-18, 05:07   #163
airsquirrels
 
 
"David"
Jul 2015
Ohio

1005₈ Posts

Quote:
Originally Posted by Prime95 View Post
Good news: I wrote the AVX-512 TF code.
Bad news: I've run into HJWasm bugs.
The last public HJWasm release still has bugs / requires strict casting. I have a patched version if you want to be able to build current mprime, vs. fixing our explicit casts.

Last fiddled with by airsquirrels on 2016-09-18 at 05:08
Old 2016-09-18, 05:54   #164
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

7632₁₀ Posts

Quote:
Originally Posted by airsquirrels View Post
The last public HJWasm release still has bugs / requires strict casting. I have a patched version if you want to be able to build current mprime, vs. fixing our explicit casts.
I've fixed all the type cast problems in the prime95 source code.

I've created two new bug reports on GitHub. Unfortunately, I never got my verification email for a masm32 forum account.
Old 2016-09-18, 06:55   #165
ewmayer
2ω=0
 
 
Sep 2002
República de California

2×3×29×67 Posts

Some TF data ... I built my Mfactor code in || (parallel) mode, using the 960-distinct-residue-classes-of-k scheme, which allows up to that many threads to be used. We start with pure-integer modmul, which is very fast on x86_64. The timing test was the double Mersenne MM31, TFed to a depth of 68 bits, sufficient to find the smallest 3 of the known factors of this number. That needed 22 min running 2-threaded on my 2GHz Core2. Here are the timings on KNL:

16-threads:

M(2147483647) has 3 factors in range k = [0, 69004615680], passes 0-959
Performed 3350616141 trial divides
real 7m3.665s <*** Only 3x faster than 2-threaded on Core2 ... ugh. ***
user 110m8.104s
sys 0m1.163s


64-threads:

real 1m48.711s <*** Almost exactly 4x faster than 16-thread ***
user 109m50.797s
sys 0m0.465s


192-threads (I used that rather than 256 since 192 divides 960, i.e. leads to 5 fully-occupied threadpool waves getting done):

real 1m13.171s
user 217m42.613s
sys 0m1.813s


240 threads (4 full threadpool waves):

real 1m9.089s
user 249m24.402s
sys 0m3.070s

So we see more or less perfect ||ism up to 64 threads, still see a nice further improvement using 3x as many threads as physical cores, and a few % more going up to 240 threads (4x thread/core ratio). But I suspect these timings suck compared to any decent GPU - can someone confirm, using the same test case?

Tomorrow I will try AVX2 build mode, which uses vector-double FMA arithmetic to effect a modmul, allowing candidate factors up to 78 bits. That cuts about 1/3 off the runtime (for TFing above 64 bits, that is) compared to int64-based TF on my Haswell.
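For anyone curious what the class sieve and the factor test being timed above look like, here is a toy scalar sketch in Python (an illustration only, not the vectorized Mfactor code; the modulus 4620 = 4·3·5·7·11 behind the 960 classes and the q ≡ ±1 (mod 8) condition are standard trial-factoring facts, assumed here rather than taken from Mfactor's source):

```python
# Toy trial-factoring sketch for MM31 = 2^p - 1, p = 2^31 - 1 (itself prime).
# Any factor q of 2^p - 1 has the form q = 2*k*p + 1 with q == +/-1 (mod 8).
# Sieving k mod 4620 = 4*3*5*7*11 for q not divisible by 3, 5, 7, 11 leaves
# exactly the 960 residue classes mentioned in the post.
p = 2**31 - 1

def valid_class(k):
    q = 2 * k * p + 1
    return q % 8 in (1, 7) and all(q % s for s in (3, 5, 7, 11))

classes = [k for k in range(4620) if valid_class(k)]
assert len(classes) == 960

def is_factor(k):
    # q divides 2^p - 1 iff 2^p == 1 (mod q); the exponent p has only
    # 31 bits, so this is ~31 modular squarings.
    q = 2 * k * p + 1
    return pow(2, p, q) == 1

# The smallest known factor of MM31, q = 295257526626031, lies at k = 68745:
assert is_factor(68745)
assert not is_factor(68744)
```

The real code distributes the 960 classes across threads (hence 192 and 240 being natural thread counts, since they divide 960 evenly or give full threadpool waves) and replaces the big-integer `pow` with a vectorized modmul.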