2016-09-17, 11:53  #155  
"David"
Jul 2015
Ohio
517_{10} Posts 
Quote:
Good Today: Total LL throughput using the best settings on mprime or mlucas is competitive with massively (25x) more expensive dual Xeon v2/v3/v4 systems. The downside is that this is best achieved by running lots of exponents slowly.

Bad Today: mlucas scales better than mprime at the moment, but single-exponent performance is not particularly great. This has more to do with the threading model/FFT splitting choices, memory locality, etc. needing to be optimized (for KNL) in these programs than anything else.

Good Tomorrow: We should see about a 2x speedup from AVX-512, maybe more. KNL is sensitive to instruction count given its limited resources in that department, so denser compute code will also scale better. Improving the threading will help single-exponent performance scale closer to the current multi-exponent numbers.

Bad Tomorrow: Eventually the Xeons of tomorrow with AVX-512 are likely to outperform KNL, unless the FFT code in mprime/mlucas is adjusted to provide something more akin to a streaming memory-access model. KNL is significantly better than KNC for sparse-memory-access applications, but even Intel's own docs declare that Xeons will outperform it here. AVX-512 prefetch instructions may also help us manage sparse access....

Cost: In terms of hardware cost, even the KNL developer system beats Xeons. I expect other released versions to be cheaper. If we could multiply it by 8x with the PCIe variant in a GPU SuperServer + host Xeons, we could put nearly 2000 GHz-days/day in 4U. TDP of the current system is around 300 W, so it is significantly cheaper to operate as well. It takes 600+ watts of GPUs or 400+ watts of Xeons to meet the current performance.

Last fiddled with by airsquirrels on 2016-09-17 at 11:56 
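A back-of-the-envelope sketch of the cost/power arithmetic in the post above. The inputs are the figures quoted there (8 cards, ~2000 GHz-days/day, 300 W KNL TDP, 600 W of GPUs or 400 W of Xeons for comparable work); the derived per-card and per-watt numbers are illustrative inferences, not measurements.

```python
# Illustrative only: all inputs are the figures quoted in the post;
# the per-card throughput and efficiency ratios are derived from them.

knl_cards = 8          # PCIe KNL cards in one 4U GPU SuperServer (post's scenario)
box_throughput = 2000  # claimed GHz-days/day for the full 4U box
knl_tdp_w = 300        # approximate TDP of one KNL system, watts
gpu_tdp_w = 600        # watts of GPUs needed to match one KNL's output
xeon_tdp_w = 400       # watts of Xeons needed to match one KNL's output

per_card = box_throughput / knl_cards  # implied GHz-days/day per KNL card
knl_eff = per_card / knl_tdp_w         # GHz-days/day per watt on KNL
gpu_eff = per_card / gpu_tdp_w         # same work done with GPUs
xeon_eff = per_card / xeon_tdp_w       # same work done with Xeons

print(f"per KNL card: {per_card:.0f} GHz-days/day")
print(f"GHz-days/day per watt -- KNL: {knl_eff:.3f}, GPU: {gpu_eff:.3f}, Xeon: {xeon_eff:.3f}")
```

On these quoted numbers, KNL comes out roughly 2x the perf-per-watt of the GPU setup and about 1.3x the Xeon setup, which matches the post's "significantly cheaper to operate" claim.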

2016-09-17, 14:39  #156  
Mar 2010
3·137 Posts 
Quote:


2016-09-17, 18:21  #157  
Serpentine Vermin Jar
Jul 2014
D0C_{16} Posts 
Quote:
However, as AirSquirrels points out, the benefits of AVX-512 (which we haven't even tested yet), plus the benefits of many cores and faster memory, mean you can run a lot of simultaneous tests, giving an effective throughput much higher than a top-end dual-CPU Xeon system, at a fraction of the price.

For now I think the benefit to the project is giving devs a chance to work with AVX-512 on real hardware and also to work on the multithreading aspect. Tuning those two things should keep GIMPS ahead of the curve for what (I assume) will eventually trickle down to desktop CPUs (512-bit VPUs and more cores). Intel has already thrown in the towel on making chips faster through clock speed alone, so if we have any hope of continuing the steady march of making LL tests faster, it rests on these other areas. 

2016-09-17, 21:50  #158  
∂^{2}ω=0
Sep 2002
República de California
26752_{8} Posts 
Quote:
Also, we still haven't used ICC and its allegedly marvelous suite of profiling and multithread-tuning tools. 

2016-09-17, 23:08  #159  
Jan 2008
France
1001010010_{2} Posts 
Quote:


2016-09-18, 02:42  #160  
Sep 2016
19 Posts 
Quote:


2016-09-18, 04:45  #161 
"Curtis"
Feb 2005
Riverside, CA
12655_{8} Posts 
How do you know it's fruitless if you haven't done it? And won't the AVX-512 optimizations be useful on future Xeons anyway?
I'm merely an observer, but there's quite a gap between "we might double mprime performance" and "fruitless". Last fiddled with by VBCurtis on 2016-09-18 at 04:47 
2016-09-18, 04:52  #162 
P90 years forever!
Aug 2002
Yeehaw, FL
2^{2}×43×47 Posts 
Good news: I wrote the AVX512 TF code.
Bad news: I've run into HJWasm bugs. 
2016-09-18, 05:07  #163 
"David"
Jul 2015
Ohio
1000000101_{2} Posts 
The last public HJWasm release still has bugs / requires strict casting. I have a patched version if you want to be able to build current mprime, vs. fixing our explicit casts.
Last fiddled with by airsquirrels on 2016-09-18 at 05:08 
2016-09-18, 05:54  #164  
P90 years forever!
Aug 2002
Yeehaw, FL
2^{2}·43·47 Posts 
Quote:
I've created 2 new bug reports at github. Unfortunately, I never got my verification email for a masm32 forum account. 

2016-09-18, 06:55  #165 
∂^{2}ω=0
Sep 2002
República de California
26752_{8} Posts 
Some TF data ... built my Mfactor code in parallel mode, using the 960-distinct-k-mod-residue-classes mode, allowing up to that many threads to be used. We start with pure-integer modmul, which is very fast on x86_64. The timing test was the double Mersenne MM31 to a depth of 68 bits, sufficient to find the smallest 3 of the known factors of this number. That needed 22 min running 2-threaded on my 2 GHz Core2. Here are the timings on KNL:

16 threads:
M(2147483647) has 3 factors in range k = [0, 69004615680], passes 0-959
Performed 3350616141 trial divides
real 7m3.665s <*** Only 3x faster than 2-threaded on Core2 ... ugh. ***
user 110m8.104s
sys 0m1.163s

64 threads:
real 1m48.711s <*** Almost exactly 4x faster than 16-thread ***
user 109m50.797s
sys 0m0.465s

192 threads (I used that rather than 256 since 192 divides 960, i.e. leads to 5 fully-occupied threadpool waves getting done):
real 1m13.171s
user 217m42.613s
sys 0m1.813s

240 threads (4 full threadpool waves):
real 1m9.089s
user 249m24.402s
sys 0m3.070s

So we see more or less perfect parallelism up to 64 threads, still see a nice further improvement using 3x as many threads as physical cores, and a few % more going up to 240 threads (4x thread/core ratio). But I suspect these timings suck compared to any decent GPU - can someone confirm, using the same test case?

Tomorrow I will try the AVX2 build mode, which uses vector-double FMA arithmetic to effect a modmul, allowing candidate factors up to 78 bits. That cuts about 1/3 off the runtime (for TFing > 64 bits, that is) over int64-based TF on my Haswell. 
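For anyone wanting to see what a trial-factoring pass like the one above actually computes, here is a minimal sketch. It relies on two standard facts: any factor q of M(p) has the form q = 2kp+1 with q ≡ ±1 (mod 8), and q divides M(p) exactly when 2^p ≡ 1 (mod q). This toy deliberately omits everything that makes Mfactor fast (the 960-residue-class sieving, small-prime sieving, threading); the k bound of 70000 is chosen just to reach MM31's smallest known factor quickly.

```python
# Minimal trial-factoring sketch for Mersenne numbers M(p) = 2^p - 1.
# Any factor q must have the form q = 2*k*p + 1 and satisfy q % 8 in {1, 7};
# q divides M(p) iff pow(2, p, q) == 1. Illustrative toy only -- no
# residue-class sieving, no small-prime sieve, no threading (unlike Mfactor).

def tf_mersenne(p, k_max):
    """Return the k-values for which q = 2*k*p + 1 divides M(p), 1 <= k <= k_max."""
    found = []
    for k in range(1, k_max + 1):
        q = 2 * k * p + 1
        if q % 8 not in (1, 7):   # 2 is a quadratic residue mod q only in these classes
            continue
        if pow(2, p, q) == 1:     # fast modular exponentiation: q | 2^p - 1
            found.append(k)
    return found

# The smallest known factor of the double Mersenne MM31 = M(2147483647)
# is q = 295257526626031, which corresponds to k = 68745.
p = 2**31 - 1
hits = tf_mersenne(p, 70000)
for k in hits:
    print(f"k = {k}: factor q = {2 * k * p + 1}")
```

The q % 8 filter alone discards half the candidates; Mfactor's 960-class mode generalizes this idea, eliminating all k-classes that cannot yield a factor modulo a product of small primes, which is also what makes the work split so evenly across threads.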