#23
∂²ω=0
Sep 2002
República de California
10110101111101₂ Posts
Quote:
Can't tell now - but if the AVX512 version is as fast as one might hope, with Intel behind it, the numbers could quickly exceed those of current high-end GPUs.
#24

Jan 2008
France
2×5²×11 Posts
#25

∂²ω=0
Sep 2002
República de California
10110101111101₂ Posts
Quote:
While George and I may have been somewhat disappointed that we didn't get a 2x throughput boost from the SSE2->AVX transition with our respective codes, we nonetheless got enough of a boost to make the coding effort worthwhile. And for my part, once Haswell came out, I got another nice per-cycle boost without any added coding whatsoever, simply due to Haswell's larger caches and overall system-bandwidth improvements. Anyway, if you're trying to dissuade me from looking at AVX512, it ain't gonna work. :)
#26

Jan 2008
France
1046₈ Posts
I'm certainly not trying to dissuade you from having fun with that beast, quite the contrary: I'm jealous, I'd like to play with such a toy.

I'm just questioning some of Intel's moves. Their x86-everywhere motto has become utterly stupid with segmentation, a castrated ISA (Quark), or simply a SIMD extension that will be used in a single product (Phi). Even my Haswell is lacking some features such as TSX. Compatibility doesn't mean anything anymore, so I'd like them to innovate more aggressively in the instruction-set department for some markets. But I still love the brutal speed of my 4770K, which is twice as fast as my i7 920 for gmp.

As far as your Haswell speedup goes, isn't it the result of the cache bandwidth increase? My understanding is that external memory BW didn't increase.
#27

Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
5,881 Posts
Just bear in mind that the next-gen memory architecture, DDR4, is only a couple of years away. It wouldn't surprise me if that swings the balance towards CPUs needing more throughput rather than memory bandwidth limiting us.

If the large L4 caches on the CPUs with Iris Pro GPUs become commonplace, then that could prove valuable as well (assuming they aren't too small to be taken advantage of).
#28

∂²ω=0
Sep 2002
República de California
5×17×137 Posts
Sure - but my point is that the various pieces of the memory hierarchy keep advancing, not necessarily in perfect sync, but with the long-term effect of roughly keeping pace with CPU data appetites. When DDR4 becomes widespread, that will likely be the next big speedup in external memory; the chipmakers will also keep boosting closer-to-chip data rates, etc.
Last fiddled with by ewmayer on 2013-10-12 at 20:33 |
#29

"Oliver"
Mar 2005
Germany
11·101 Posts
Hi Ernst,

here we go:
Quote:
Code:
time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X

Code:
nthread  1: real 8m 40.76s
nthread  2: real 4m 22.96s
nthread  4: real 2m 23.40s
nthread  6: real 1m 14.24s
nthread  8: real 1m  9.27s
nthread 10: real 0m 41.30s
nthread 12: real 0m 39.35s
nthread 14: real 0m 39.78s
nthread 16: real 0m 37.30s
nthread 20: real 0m 25.26s
nthread 24: real 0m 23.29s
nthread 28: real 0m 22.96s
nthread 32: real 0m 22.40s
nthread 40: real 0m 20.21s
nthread 48: real 0m 19.84s
nthread 56: real 0m 19.27s
nthread 64: real 0m 19.28s

2 threads per core, leaving remaining cores idle!
Code:
nthread  2: real 6m 55.63s
nthread  4: real 3m 30.98s
nthread  6: real 1m 54.23s
nthread  8: real 1m 48.25s
nthread 10: real 1m  2.28s
nthread 12: real 1m  1.39s
nthread 14: real 0m 58.63s
nthread 16: real 0m 57.25s
nthread 20: real 0m 34.31s
nthread 24: real 0m 32.99s
nthread 28: real 0m 32.63s
nthread 32: real 0m 31.29s
nthread 40: real 0m 21.36s
nthread 48: real 0m 20.95s
nthread 56: real 0m 20.53s
nthread 64: real 0m 19.42s

Oliver
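(Editor's aside: the first run's scaling can be reduced to speedup/efficiency figures with a quick throwaway script; this uses a subset of the times quoted above, converted to seconds, and is purely illustrative.)

```python
# Speedup and parallel efficiency for the all-cores run above,
# relative to the single-thread baseline.
times = {  # nthread -> wall-clock seconds, from the 'real' values quoted
    1: 8 * 60 + 40.76,
    2: 4 * 60 + 22.96,
    4: 2 * 60 + 23.40,
    8: 60 + 9.27,
    16: 37.30,
    32: 22.40,
    64: 19.28,
}

base = times[1]
for n in sorted(times):
    speedup = base / times[n]
    efficiency = speedup / n
    print(f"nthread {n:2d}: speedup {speedup:5.2f}x, efficiency {efficiency:4.0%}")
```

Note that the 32-thread speedup comes out to about 23.2x, which is the "23x speedup using 32 of those cores" figure Ernst cites in the next post.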
#30

∂²ω=0
Sep 2002
República de California
5·17·137 Posts
Many thanks for the timings, Oliver. Let's have a look inside:
Quote:
Quote:
Quote:
1. The 'natural' way to partition a length-n FFT into large independently executable sub-chunks depends on the leading radix (a.k.a. radix0) in my implementation - that makes the optimal threadcount for processing of the resulting chunks a divisor of radix0 for Fermat-mod and of radix0/2 for Mersenne-mod.

2. Each of the independently executable sub-chunks in [1] is itself a power of 2 in length - this makes the optimal threadcount for the fused final-iFFT-radix0/carry/initial-fFFT-radix0 step a divisor of that power of 2.

In practice, were I running on (say) a 6-core system I would consider running one job on 4 of the cores and another on the remaining 2, or having the remaining 2 do some other task. Your 57-core system is, as you note, bad for the carry step because it's just a little less than a power of 2. Still, being able to get a 23x speedup using 32 of those cores is pretty good - were one doing "production work" on such a system one could use the other cores for something else.

It'll be interesting to see what kind of core counts the AVX512-capable Phis will have.
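(Editor's aside: point [1] can be made concrete with a small sketch. The function names and the radix0 value below are illustrative only - the thread doesn't specify Oliver's actual leading radix.)

```python
def divisors(n):
    """All positive divisors of n, in increasing order."""
    return [d for d in range(1, n + 1) if n % d == 0]

def good_threadcounts(radix0, mersenne_mod=True):
    # Per point [1]: the optimal threadcount for the sub-chunk
    # processing divides radix0 (radix0/2 for Mersenne-mod runs).
    m = radix0 // 2 if mersenne_mod else radix0
    return divisors(m)

# Hypothetical leading radix of 32, for illustration only:
print(good_threadcounts(32))                     # Mersenne-mod: divisors of 16
print(good_threadcounts(32, mersenne_mod=False)) # Fermat-mod: divisors of 32
```

This also shows why a 57-core part is awkward: 57 = 3·19 has only the divisors 1, 3, 19, 57, none of which line up well with power-of-2-friendly chunk counts.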
#31

∂²ω=0
Sep 2002
República de California
11645₁₀ Posts
Friend just sent me this nVidia marketing blurb, let me comment on the 2 main assertions:

[1] "FACT: A GPU is significantly faster than Intel's Xeon Phi on real HPC applications." Based on data seen around this forum, true ... for now.

[2] "FACT: Programming for a GPU and Xeon Phi require similar effort — but the results are significantly better on a GPU." False. Oliver and I were able to get a working build of my scalar-double pthreaded FFT code with only trivial header-file changes, and no special recoding of the FFT source. OTOH, once AVX512 comes to Phi, that should significantly change the HPC-throughput comparison in [1], but mainly for folks who take advantage of the vector SIMD capability, which *does* require nontrivial effort if one has not already put in place such code targeted at x86 CPUs.

Perhaps the most telling part of the blurb is that nVidia feels compelled to even publish such stuff. Competition can only be good in this arena, I say.
#32

Jan 2008
France
1000100110₂ Posts
I have talked with several people who have a Phi. The feedback is the same for all of them: it's easy to get code to run on it, but the performance is very low - and that is the experience you had, right? Some of them had issues with intrinsics for code already tuned for Intel CPUs. The end result is that total dev time is no lower than for a GPU, so nVidia's point #2 would be correct for the whole project.

Ernst, Oliver, is it really that much harder to get a working program on a GPU? Of course there are some changes to make, but they seem rather small if all you want is something that works (that is, something similar to the initial porting effort to Phi). Am I completely wrong?
#33

"Ben"
Feb 2007
2²×3×293 Posts
Quote: