mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Old 2013-10-10, 20:34   #23
ewmayer
Sep 2002
República de California

Quote:
Originally Posted by ldesnogu View Post
I guess for AVX-512 most of the SIMD work will translate without any trouble from Phi, while the multi-threading changes won't be required, given that mlucas already supports 32 threads. Is that correct?
The AVX512 stuff is something I planned to do anyway for the next-gen x86 CPUs which we knew were coming - as long as the work to support that can be leveraged for both CPUs and Phi-PUs, that's a good use of coding effort. The way I have the || stuff now makes it transparent to support scalar-double and SIMD in the same threading model.

Quote:
Originally Posted by TheJudger View Post
How much time does Ernst want to spend on Xeon Phi coding? Even if it's ten times faster than a high-end consumer-grade GPU for LL, how many people will run LL tests on a Xeon Phi?
Can't tell now - but if the AVX512 version is as fast as one might hope, with Intel behind it, the numbers could quickly exceed those of current high-end GPUs.
Old 2013-10-10, 22:15   #24
ldesnogu
Jan 2008
France

Quote:
Originally Posted by ewmayer View Post
Can't tell now - but if the AVX512 version is as fast as one might hope, with Intel behind it, the numbers could quickly exceed those of current high-end GPUs.
For Phi perhaps. But do we know if future AVX-512 standard chips will have enough memory bandwidth?
Old 2013-10-10, 22:36   #25
ewmayer

Quote:
Originally Posted by ldesnogu View Post
For Phi perhaps. But do we know if future AVX-512 standard chips will have enough memory bandwidth?
Surely the Intel engineers are keenly aware that cranking up the theoretical throughput of the processor is of little use if the thing is data-starved. While overall memory bandwidth might not be increasing fast enough to keep us Xtreme bandwidth addicts here at GIMPS happy, it is roughly keeping pace with CPU appetites. And on the RAM side the GPUs, with their truly massive appetites, are helping that trend.

While George and I may be somewhat disappointed that we didn't get a 2x throughput boost from the SSE2->AVX transition with our respective codes, we nonetheless got enough of a boost to make the coding effort worthwhile. And for my part, once Haswell came out, I got another nice per-cycle boost without any added coding whatsoever, simply due to Haswell's larger caches and overall system-bandwidth improvements.

Anyway, if you're trying to dissuade me from looking at AVX512, it ain't gonna work. :)
Old 2013-10-11, 07:00   #26
ldesnogu

I'm certainly not trying to dissuade you from having fun with that beast, quite the contrary: I'm jealous, I'd like to play with such a toy.

I'm just questioning some of Intel's moves. Their x86-everywhere motto has become utterly stupid with segmentation, castrated ISAs (Quark), or a SIMD extension that will be used in a single product (Phi). Even my Haswell is lacking some features such as TSX. Compatibility doesn't mean anything anymore, so I'd like them to innovate more aggressively in the instruction-set department for some markets. But I still love the brutal speed of my 4770K, which is twice as fast as my i7 920 for gmp.

As far as your Haswell speedup goes, isn't it the result of cache bandwidth increase? My understanding is that external memory BW didn't increase.
Old 2013-10-12, 10:41   #27
henryzz
"David"
Sep 2007
Cambridge (GMT/BST)

Just bear in mind that the next-gen memory architecture, DDR4, is only a couple of years away. It wouldn't surprise me if that swings the balance toward CPUs needing more throughput rather than memory bandwidth limiting us.
If the large L4 caches on the CPUs with Iris Pro GPUs become commonplace, then that could prove valuable as well (assuming they aren't too small to be taken advantage of).
Old 2013-10-12, 20:32   #28
ewmayer

Quote:
Originally Posted by ldesnogu View Post
As far as your Haswell speedup goes, isn't it the result of cache bandwidth increase? My understanding is that external memory BW didn't increase.
Sure - but my point is that the various pieces of the memory hierarchy keep advancing, not necessarily in perfect sync, but the long-term effect roughly keeps pace with CPU data appetites. When DDR4 becomes widespread, that will likely be the next big speedup in external memory; the chipmakers will also keep boosting closer-to-chip data rates, etc.

Old 2013-10-12, 22:14   #29
TheJudger
"Oliver"
Mar 2005
Germany

Hi Ernst,

here we go:

Quote:
Originally Posted by ewmayer View Post
I am working on the code needed to support > 32-threads in Mersenne-mod mode, but for now the way to test more threads is to switch to Fermat-mod mode, e.g. your 4096K test could be done on F26 instead:

time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X

That will allow up to 64 threads, but you can also test 56 and 60-threaded on the same F-number:

time ./Mlucas.mic -f 26 -fftlen 3584 -iters 100 -radset 0 -nthread X
time ./Mlucas.mic -f 26 -fftlen 3840 -iters 100 -radset 0 -nthread X
Xeon Phi 3120A (57 cores @ 1100MHz), using the latest source you sent me, including my hacks in platform.h and threadpool.c; Intel compiler 14.0.0.080.

Code:
time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X
1 thread per core *1, leaving remaining cores idle!

Code:
nthread 1: real    8m 40.76s
nthread 2: real    4m 22.96s
nthread 4: real    2m 23.40s
nthread 6: real    1m 14.24s
nthread 8: real    1m 9.27s
nthread 10: real    0m 41.30s
nthread 12: real    0m 39.35s
nthread 14: real    0m 39.78s
nthread 16: real    0m 37.30s
nthread 20: real    0m 25.26s
nthread 24: real    0m 23.29s
nthread 28: real    0m 22.96s
nthread 32: real    0m 22.40s
nthread 40: real    0m 20.21s
nthread 48: real    0m 19.84s
nthread 56: real    0m 19.27s
nthread 64: real    0m 19.28s
*1 (nthread = 40, 48, 56 and 64 are using 64 threads in carry step, some cores have to work on 2 threads!)

2 threads per core, leaving remaining cores idle!
Code:
nthread 2: real    6m 55.63s
nthread 4: real    3m 30.98s
nthread 6: real    1m 54.23s
nthread 8: real    1m 48.25s
nthread 10: real    1m 2.28s
nthread 12: real    1m 1.39s
nthread 14: real    0m 58.63s
nthread 16: real    0m 57.25s
nthread 20: real    0m 34.31s
nthread 24: real    0m 32.99s
nthread 28: real    0m 32.63s
nthread 32: real    0m 31.29s
nthread 40: real    0m 21.36s
nthread 48: real    0m 20.95s
nthread 56: real    0m 20.53s
nthread 64: real    0m 19.42s
56 and 60 thread tests with 3584k or 3840k FFT size aren't better than 64 threads for these FFT sizes. Seems that the carry step (which is always a power of 2?) is the limiting part on Xeon Phi.

Oliver
Old 2013-10-12, 23:09   #30
ewmayer

Many thanks for the timings, Oliver. Let's have a look inside:

Quote:
Originally Posted by TheJudger View Post
Code:
time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X
1 thread per core *1, leaving remaining cores idle!

Code:
nthread 1: real    8m 40.76s
nthread 2: real    4m 22.96s
nthread 4: real    2m 23.40s
nthread 6: real    1m 14.24s
nthread 8: real    1m 9.27s
nthread 10: real    0m 41.30s
nthread 12: real    0m 39.35s
nthread 14: real    0m 39.78s
nthread 16: real    0m 37.30s
nthread 20: real    0m 25.26s
nthread 24: real    0m 23.29s
nthread 28: real    0m 22.96s
nthread 32: real    0m 22.40s
nthread 40: real    0m 20.21s
nthread 48: real    0m 19.84s
nthread 56: real    0m 19.27s
nthread 64: real    0m 19.28s
*1 (nthread = 40, 48, 56 and 64 are using 64 threads in carry step, some cores have to work on 2 threads!)
So e.g. for 8/16/32/64-threads we get speedups of 7.5, 14.0, 23.2 and 27.0x, respectively.

Quote:
2 threads per core, leaving remaining cores idle!
Code:
nthread 2: real    6m 55.63s
nthread 4: real    3m 30.98s
nthread 6: real    1m 54.23s
nthread 8: real    1m 48.25s
nthread 10: real    1m 2.28s
nthread 12: real    1m 1.39s
nthread 14: real    0m 58.63s
nthread 16: real    0m 57.25s
nthread 20: real    0m 34.31s
nthread 24: real    0m 32.99s
nthread 28: real    0m 32.63s
nthread 32: real    0m 31.29s
nthread 40: real    0m 21.36s
nthread 48: real    0m 20.95s
nthread 56: real    0m 20.53s
nthread 64: real    0m 19.42s
For 8/16/32/64-threads we get speedups of 3.8, 7.3, 13.3 and 21.4x relative to the 2-thread run, respectively. The absolute max throughput, though, is the same as for 1-thread-per-core. The baseline 1-core throughput is alas rather too dismal to make current Phis attractive for GIMPS LL-test work.

Quote:
56 and 60 thread tests with 3584k or 3840k FFT size aren't better than 64 threads for these FFT sizes. Seems that the carry step (which is always a power of 2?) is the limiting part on Xeon Phi.
The ||ization of the code is a compromise between two major factors:

1. The 'natural' way to partition a length-n FFT into large independently executable sub-chunks, which depends on the leading radix (a.k.a. radix0) in my implementation - that makes the optimal threadcount for processing of the resulting chunks be a divisor of radix0 for Fermat-mod and of radix0/2 for Mersenne-mod.

2. Each of the independently executable sub-chunks in [1] is itself a power of 2 in length - this makes the optimal threadcount for the fused final-iFFT-radix0/carry/initial fFFT-radix0 step be a divisor of that power of 2.

In practice, were I running on (say) a 6-core system I would consider running one job on 4 of the cores and another on the remaining 2, or having the remaining 2 do some other task.

Your 57-core system is, as you note, bad for the carry step because it's just a little less than a power of 2. Still, being able to get a 23x speedup using 32 of those cores is pretty good - were one doing "production work" on such a system one could use the other cores for something else.

It'll be interesting to see what kind of core counts the AVX512-capable Phis will have.
Old 2013-10-16, 20:42   #31
ewmayer

Friend just sent me this nVidia marketing blurb, let me comment on the 2 main assertions:

[1] "FACT: A GPU is significantly faster than Intel's Xeon Phi on real HPC applications."

Based on data seen around this forum, true ... for now.

[2] "FACT: Programming for a GPU and Xeon Phi require similar effort — but the results are significantly better on a GPU."

False. Oliver and I were able to get a working build of my scalar-double pthreaded FFT code with only trivial header-file changes, and no special recoding of the FFT source.

OTOH, once AVX512 comes to Phi, that should significantly change the HPC-throughput-comparison in [1], but mainly for folks who take advantage of the vector SIMD capability, which *does* require nontrivial effort if one has not put in place such code targeted at x86 CPUs.

Perhaps the most telling part of the blurb is that nVidia feels compelled to publish such stuff at all. Competition can only be good in this arena, I say.
Old 2013-10-17, 07:37   #32
ldesnogu

I have talked with several people who have a Phi. The feedback is the same for all of them: it's easy to get code to run on it but the performance is very low, and that is the experience you had, right? Some of them had issues with intrinsics for code already tuned for Intel CPU. The end result is that total dev time is no lower than for GPU. So nVidia point #2 would be correct for the whole project.

Ernst, Oliver, is it really that much harder to get a working program on a GPU? Of course there are some changes to make, but they seem rather small if all you want is something that works (i.e. something comparable to the initial porting effort on Phi). Am I completely wrong?
Old 2013-10-17, 13:01   #33
bsquared
"Ben"
Feb 2007

Quote:
Originally Posted by ldesnogu View Post
I have talked with several people who have a Phi. The feedback is the same for all of them: it's easy to get code to run on it but the performance is very low, and that is the experience you had, right? Some of them had issues with intrinsics for code already tuned for Intel CPU. The end result is that total dev time is no lower than for GPU. So nVidia point #2 would be correct for the whole project.

Ernst, Oliver, is it really that harder to get a working program on a GPU? Of course there are some changes to do, but they seem rather small if all you want is something working (that is something similar to the initial porting effort to Phi). Am I completely wrong?
If the code is written correctly, then Intel's compilers, together with a few carefully considered #pragmas to guide vectorization, can improve performance a *lot*. Not to the level that a hand-tuned intrinsics implementation can yield (which might take effort similar to a GPU port of the codebase), but there is a nice middle ground on the performance/effort curve using only compiler auto-vectorization and #pragma guidance on top of the original code.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.