Xeon Phi
60 LL tests running side-by-side, anyone?
[url]http://www.tomshardware.com/news/Intel-Xeon-Phi-Coprocessor-CPU,22700.html[/url] |
I don't know if P95 can handle more than 32 cores. I recall a thread that stated that even with two instances of P95 running, it wouldn't use all 48 cores of the machine.
|
I may be wrong, but I believe the 64-bit version of Prime95 could handle 64 threads.
|
That would not be a problem: VirtualBox (or equivalent), ten copies of the guest OS running with 32 cores allocated to each, P95 running there... just give me the wheelbarrow with the 320 cores... For the Phi, the real problem will be [URL="http://www.intel.co.uk/content/dam/www/public/us/en/documents/product-briefs/xeon-phi-datasheet.pdf"]its common caches[/URL], in fact... because no matter how fast your dog can eat, you still must be able to feed him that fast...
So, in spite of having 7-8 times more cores than a "top" 8-core Xeon E5-2670, it can only get [URL="http://www.intel.co.uk/content/dam/www/public/us/en/documents/performance-briefs/xeon-phi-product-family-performance-brief.pdf"]3 times more[/URL] DP performance. That would be about 5-6 times faster compared with an i7-2600k (as everybody knows what a 2600k is, but not everybody knows what an E5-2xxx is :razz:). edit: and this is clock for clock; don't forget the 2600k has a faster clock, and is often heavily overclocked in water-cooled rigs. The pessimist in me says we won't see more than a 5x performance increase, and that is already reachable with CUDALucas. The next big hit would be an FFT library for Radeons... |
[QUOTE=firejuggler;341305]I don't know if P95 can handle more than 32 core. I recall a thread that stated that even with 2 instance of P95 running, it wouldn't use the 48 core of the machine.[/QUOTE]
TObject has it right as far as maximizing overall throughput - 1 LL test per core, or perhaps one LL test per 2-4 physically 'clustered' cores. Would be interesting to experiment and see what the optimal test/core matching strategy is here. |
[QUOTE=TObject;341298]60 LL tests running side-by-side, anyone?
[url]http://www.tomshardware.com/news/Intel-Xeon-Phi-Coprocessor-CPU,22700.html[/url][/QUOTE] Yepp, I tried in Feb 2013... [QUOTE=ixfd64;341307]I may be wrong, but I believe the 64-bit version of Prime95 could handle 64 threads.[/QUOTE] Yes, but Prime95/mprime won't run on Xeon Phi... Xeon Phi[LIST][*]has no MMX[*]has no SSE[*]has no AVX[*]is x86 compatible[*]is [B]not[/B] binary compatible; you [B]must[/B] recompile your code for Xeon Phi[/LIST] I tried ewmayer's Mlucas. It compiled, but it runs single-threaded (only the SSE code is multithreaded, which I can't use), and it is horribly slow: single-threaded Mlucas on an [B]E5-2670 was ~50 times faster[/B] than a single core of Xeon Phi in my test. I failed to run Glucas on Xeon Phi. Oliver P.S. My feeling tells me that you'll need to code an LL test explicitly for Xeon Phi to achieve good performance... but how many users would run it? There are no consumer cards for Xeon Phi. |
The Xeon + Xeon Phi powered supercomputer takes the top spot. Titan moves to second place.
[url]http://www.anandtech.com/show/7075/june-2013-top500-list-published-xeon-phi-takes-top-spot[/url] |
[QUOTE=TheJudger;341387]Xeon Phi[LIST][*]has no MMX[*]has no SSE[*]has no AVX[*]is x86 compatible[*]is [B]not[/B] binary compatible, you [B]must[/B] recompile your code for Xeon Phi[/LIST][/QUOTE]
The above may be true for current releases, but the [url=http://en.wikipedia.org/wiki/Intel_MIC]roadmap here[/url] looks very promising: within 2 years Xeon Phi will feature next-gen AVX with 512-bit SIMD registers [versus 256-bit for current AVX], twice as many of them [32 versus the current 16], in a GPU-like manycore compute fabric. It sounds frickin' awesome. Oliver, I have just finished work on a beta version of Mlucas with pthreading also for non-SIMD builds ... PM me if you're interested in giving it a try. It can make use of up to 32 threads in LL-test mode [and up to 64 in Fermat-mod mode]. |
I've played with the Xeon Phi. As previously mentioned, you need to recompile the code for the Phi without ASM, and to sufficiently hide memory latency you need to run 240 threads.
|
Does anyone know if Xeon Phi would be useful for trial factoring?
|
[QUOTE=ixfd64;355152]Does anyone know if Xeon Phi would be useful for trial factoring?[/QUOTE]
I imagine it would be much worse than a comparably-priced GPU -- the extra power of these cores wouldn't help much for sieving. |
[QUOTE=frmky;354570]I've played with the Xeon Phi. As previously mentioned, you need to recompile the code for the Phi without ASM, and to sufficiently hide memory latency you need to run 240 threads.[/QUOTE]
Again, that is for the *current* implementations - that picture looks set to change quite radically in the next 2 years. While the AVX512-capable Phi may have a slightly different "flavor" of ASM than the mainline Intel CPUs, semi-automated ASM translation should be feasible. I'm hoping emulators for the AVX512 version will be available by this time next year. |
[QUOTE=ewmayer;354552]Oliver, I have just finished work on a beta version of Mlucas with pthreading-also-for-non-SIMD-builds ... PM me if you're interested in giving it a try. It can make use of up to 32 threads in LL-test mode [and up to 64 in Fermat-mod mode].[/QUOTE]
Just got some very preliminary timings from Oliver - not sure if these are 'suck!' or 'promising', but at least we are getting some potentially useful data. Btw, the code I PM'ed Oliver has since been posted to the web; the Mlucas README has links/build notes. The slow build times are an icc 'feature' - GCC builds for x86 are very fast for me. The timings below are very interesting ... scaling is poor for <= 4 threads, but going from 4 to 12 threads more than triples the speed [i.e. the opposite of what I see on CPU multicores]. I wonder if there's some core-clustering there which we need to take account of in the thread-affinity code. [quote]Hi Ernst, next round. In addition to my previous email I've patched platform.h and threadpool.c (see attachments; I don't recommend adding my quick&dirty hacks to your official source tree!). After that I compiled Mlucas using the compile script (see attachment). On a dual Xeon E5-2670 this takes nearly 10 minutes.... Intel C Compiler 14.0.0.080, compiled for Xeon Phi (-mmic): # time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread X [...] Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.252008929. MaxErr = 0.312500000. Program: E3.0x Res mod 2^36 = 837935290 Res mod 2^35 - 1 = 6238131189 Res mod 2^36 - 1 = 41735145962 Clocks = 00:00:00.000 Done ... Xeon Phi 3120A (57 cores @ 1100MHz, 6GiB RAM, 384-bit bus, 240GB/s): 1 thread (1 core): real 2m 17.93s 2 threads (1 core): real 1m 49.07s 4 threads (1 core): real 1m 35.22s 8 threads (2 cores): real 0m 48.94s 12 threads (3 cores): real 0m 27.91s 16 threads (4 cores): real 0m 25.95s Xeon Phi 5110P (60 cores @ 1053MHz, 8GiB RAM, 512-bit bus, 320GB/s): 1 thread (1 core): real 2m 23.98s 2 threads (1 core): real 1m 53.29s 4 threads (1 core): real 1m 38.56s 8 threads (2 cores): real 0m 50.99s 12 threads (3 cores): real 0m 28.99s 16 threads (4 cores): real 0m 27.04s[/quote] |
[QUOTE=ewmayer;355253]
I'm hoping emulators for the AVX512 version will be available by this time next year.[/QUOTE] The AVX-512 manual and emulator are available now: [url]http://software.intel.com/en-us/intel-isa-extensions[/url] |
Hello!
Parts of the email conversation with Ernst: Scaling 1..4 threads: remember that these four threads are on the same physical core. Scaling 4 -> 12 threads: I just noticed that it is actually using 12 + 16 threads (compared to 4 + 4, 8 + 8 or 16 + 16) -- cut here -- # time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread 12 [...] NTHREADS = 12 [...] Using 16 threads in carry step [...] -- cut here -- I can't turn off hyperthreading (4-way) on Xeon Phi. I need to hack the affinity code to spread the threads (I already did, because Intel has some strange core mapping anyway...) Spread each thread to its own core (threadpool.c): -- cut here -- i = (4 * my_id + 1) % pool->num_of_cores; // get cpu mask using sequential thread ID modulo #available cores -- cut here -- Xeon Phi 3120A [CODE]time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread X 1 thread (1 core): real 2m 17.93s 2 threads (2 cores): real 1m 10.21s 3 threads (3 cores): real 0m 38.94s # using 4 threads in carry step 4 threads (4 cores): real 0m 36.94s 5 threads (5 cores): real 0m 21.95s # using 8 threads in carry step 6 threads (6 cores): real 0m 20.96s # using 8 threads in carry step 7 threads (7 cores): real 0m 20.84s # using 8 threads in carry step 8 threads (8 cores): real 0m 19.96s 9 threads (9 cores): real 0m 13.96s # using 16 threads in carry step 10 threads (10 cores): real 0m 13.94s # using 16 threads in carry step 11 threads (11 cores): real 0m 12.90s # using 16 threads in carry step 12 threads (12 cores): real 0m 12.97s # using 16 threads in carry step 13 threads (13 cores): real 0m 12.82s # using 16 threads in carry step 14 threads (14 cores): real 0m 12.94s # using 16 threads in carry step 15 threads (15 cores): real 0m 12.86s # using 16 threads in carry step 16 threads (16 cores): real 0m 11.91s[/CODE] Now (with each thread on its own core) scaling from 1 -> 2 -> 4 -> 8 -> 16 threads is quite good.
Xeon Phi 3120A [CODE]time ./Mlucas.mic -fftlen 4096 -iters 100 -radset 0 -nthread X 4 threads (4 cores): real 2m 45.58s 8 threads (8 cores): real 1m 18.21s 16 threads (16 cores): real 0m 42.32s 32 threads (32 cores): real 0m 25.20s[/CODE] Oliver |
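Oliver's one-line affinity hack amounts to the following standalone sketch. This is a hypothetical illustration, not Mlucas's actual threadpool code: it assumes Linux's `sched_setaffinity` and the Phi's usual enumeration, where the four hardware threads of physical core k show up as logical CPUs 4k+1..4k+4.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Map sequential thread IDs to one logical CPU per physical core,
 * mirroring the (4 * my_id + 1) % num_of_cores hack quoted above.
 * On Xeon Phi, logical CPUs 4k+1..4k+4 share physical core k. */
static int cpu_for_thread(int my_id, int num_logical_cpus)
{
    return (4 * my_id + 1) % num_logical_cpus;
}

/* Pin the calling thread to that CPU (Linux-specific; illustrative only). */
static int pin_thread(int my_id, int num_logical_cpus)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_for_thread(my_id, num_logical_cpus), &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

With 228 logical CPUs (57 cores x 4), threads 0, 1, 2, ... land on CPUs 1, 5, 9, ..., one per physical core, instead of four threads piling onto one core as in the first round of timings.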
How does that compare to your Haswell?
|
[QUOTE=Prime95;355739]The AVX-512 manual and emulator are available now: [url]http://software.intel.com/en-us/intel-isa-extensions[/url][/QUOTE]
Thanks - that is going to be fun. Oliver - it occurred to me before dropping off to sleep last night that I had forgotten to ask you whether you had a way to assign 1-thread-per-core and thus better test the parallelism, I see you've addressed that above - nice. I am working on the code needed to support > 32-threads in Mersenne-mod mode, but for now the way to test more threads is to switch to Fermat-mod mode, e.g. your 4096K test could be done on F26 instead: [i] time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X [/i] That will allow up to 64 threads, but you can also test 56 and 60-threaded on the same F-number: [i] time ./Mlucas.mic -f 26 -fftlen 3584 -iters 100 -radset 0 -nthread X time ./Mlucas.mic -f 26 -fftlen 3840 -iters 100 -radset 0 -nthread X [/i] Note that for a set of FFT radices starting with radix0, Fermat-mod can use up to radix0 threads, but Mers-mod can use at most radix0/2 threads. The general rule for the FFT-length-versus-F-number index is that each increment in F-index needs a doubling of FFT length. Thus to test F27 you'd need ~8192K [but 7168 and 7680K also work there]. There is also a more-gradual loss of accuracy as lengths get larger, so e.g. 14336K is no longer sufficient to test F28, although a 100-or-1000-iteration timing test might pass w/o fatal RO errors. |
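The FFT-length-versus-F-number rule above fits in a couple of lines. A hedged sketch (the helper name is ours, not a Mlucas API): anchored at F26 <-> 4096K, each increment of the Fermat index doubles the default power-of-2 FFT length, while slightly smaller lengths such as 3584K/3840K at F26, or 7168K/7680K at F27, also work.

```c
/* Default power-of-2 FFT length, in units of 1K doubles, for testing
 * the Fermat number F_m = 2^(2^m) + 1. Anchored at F26 <-> 4096K;
 * each +1 in m doubles the length. Valid for m >= 26.
 * Illustrative helper only, not actual Mlucas code. */
static unsigned long fermat_default_fftlen_k(int m)
{
    return 4096UL << (m - 26);
}
```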
[QUOTE=ldesnogu;355772]How does that compare to your Haswell?[/QUOTE]
just don't ask... compared to Prime95 on any recent desktop CPU it is... sloooow, very sloooow! Scalar-double Mlucas build (no SSE assembly code) on the host CPU (Xeon E5-2670): # time ./Mlucas -fftlen 1024 -iters 100 -radset 0 -nthread X 1 thread (1 core): real 0m8.657s 2 threads (2 cores): real 0m4.563s 4 threads (4 cores): real 0m2.415s 8 threads (8 cores): real 0m1.394s 16 threads (16 cores): real 0m0.949s The SSE-enabled build is faster; Prime95 is even faster. Oliver |
Ouch... I guess a lot of work will be needed to get decent speed from Phi. Note I'm not surprised, I never bought Intel claims about the supposed ease of getting high perf from such a beast :smile:
|
I fully expect current Phi to be slow - pure-C code, no SIMD, and we have just begun playing with thread-management related optimizations.
More interesting is where we will be in ~18 months - I fully expect at least a 10x per-[core*cycle] speedup from using 8-way-double SIMD assembler on the AVX512 Phi, as a conservative lower bound. Geez, how many times does one need to say "this Phi here is a very different beast from what is coming in 2015" before people get it? This is "laying the foundations" work here, people. |
I guess for AVX-512 most of the SIMD work will translate without any trouble from Phi, while the multi-threading changes won't be required, given that mlucas already supports 32 threads. Is that correct?
|
[QUOTE=ldesnogu;355827]I guess for AVX-512 most of the SIMD work will translate without any trouble from Phi[/QUOTE]
How much time does Ernst want to spend on Xeon Phi coding? Even if it's ten times faster than a high-end consumer-grade GPU for LL, how many people will run LL tests on a Xeon Phi? [B]For now[/B] I guess a 1% improvement in Prime95 has a higher impact for GIMPS than a superfast LL test for Xeon Phi. But as usual it is hard to predict the future. On the other hand, it is good to have a superfast and independent LL test for double-checking Mersenne primes. [QUOTE=ldesnogu;355827]While the multi-threading changes won't be required, given that mlucas already supports 32 threads. Is that correct?[/QUOTE] The maximum number of threads depends on the FFT length: (at least the double-scalar build) can't run 32 threads at 2M FFT length, while it can at 4M. The current Xeon Phi has 57, 60 or 61 cores (228, 240 or 244 hardware threads, since each core has 4 hardware threads); similar to GPUs, multithreading is needed to keep the cores busy. Check [URL="http://mersenneforum.org/showpost.php?p=355716&postcount=13"]this[/URL] post for scaling on one physical core. This *might* change with more optimized code, but I'm not sure. Oliver |
[QUOTE=ldesnogu;355827]I guess for AVX-512 most of the SIMD work will translate without any trouble from Phi, while the multi-threading changes won't be required, given that mlucas already supports 32 threads. Is that correct?[/QUOTE]
The AVX512 stuff is something I planned to do anyway for the next-gen x86 CPUs which we knew were coming - as long as the work to support that can be leveraged for both CPUs and Phi-PUs, that's a good use of coding effort. The way I have the || stuff now makes it transparent to support scalar-double and SIMD in the same threading model. [QUOTE=TheJudger;355852]How much time does Ernst want to spent on Xeon Phi coding, even if its ten times faster than a highend consumer grade GPU for LL, how many people will run LL tests on a Xeon Phi?[/QUOTE] Can't tell now - but if the AVX512 version is as fast as one might hope, with Intel behind it, the numbers could quickly exceed those of current high-end GPUs. |
[QUOTE=ewmayer;355875]Can't tell now - but if the AVX512 version is as fast as one might hope, with Intel behind it, the numbers could quickly exceed those of current high-end GPUs.[/QUOTE]
For Phi perhaps. But do we know if future AVX-512 standard chips will have enough memory bandwidth? |
[QUOTE=ldesnogu;355883]For Phi perhaps. But do we know if future AVX-512 standard chips will have enough memory bandwidth?[/QUOTE]
Surely the Intel engineers are keenly aware that cranking up the theoretical throughput of the processor is of little use if the thing is data-starved. While overall memory bandwidth might not be increasing fast enough to keep us Xtreme bandwidth addicts here at GIMPS happy, it is roughly keeping pace with CPU appetites. And on the RAM side the GPUs, with their truly massive appetites, are helping that trend. While George and I may be somewhat disappointed that we didn't get a 2x throughput boost from the SSE2->AVX transition with our respective codes, we nonetheless got enough of a boost to make the coding effort worthwhile. And for my part, once Haswell came out, I got another nice per-cycle boost without any added coding whatsoever, simply due to Haswell's larger caches and overall system-bandwidth improvements. Anyway, if you're trying to dissuade me from looking at AVX512, it ain't gonna work. :) |
I'm certainly not trying to dissuade you from having fun with that beast, quite the contrary: I'm jealous, I'd like to play with such a toy :smile:
I'm just questioning some of Intel's moves. Their x86-everywhere motto has become utterly stupid with segmentation, a castrated ISA (Quark), or simply a SIMD extension that will be used in a single product (Phi). Even my Haswell is lacking some features such as TSX. Compatibility doesn't mean anything anymore, so I'd like them to innovate more aggressively in the instruction-set department for some markets. But I still love the brutal speed of my 4770K, which is twice as fast as my i7 920 for GMP. As far as your Haswell speedup goes, isn't it the result of the cache bandwidth increase? My understanding is that external memory BW didn't increase. |
Just bear in mind that the next-gen memory architecture, DDR4, is only a couple of years away. It wouldn't surprise me if that swings the balance back toward CPUs needing more throughput rather than memory bandwidth limiting us.
If the large L4 caches on the CPUs with Iris Pro GPUs become commonplace, then that could prove valuable as well (assuming that they aren't too small to be taken advantage of). |
[QUOTE=ldesnogu;355915]As far as your Haswell speedup goes, isn't it the result of cache bandwidth increase? My understanding is that external memory BW didn't increase.[/QUOTE]
Sure - but my point is that the various pieces of the memory hierarchy keep advancing, not necessarily in perfect sync, but the long-term effect roughly keeps pace with CPU data appetites. When DDR4 becomes widespread it will likely be the next big speedup in external memory; the chipmakers will also keep boosting closer-to-chip data rates, etc. |
Hi Ernst,
here we go: [QUOTE=ewmayer;355777]I am working on the code needed to support > 32-threads in Mersenne-mod mode, but for now the way to test more threads is to switch to Fermat-mod mode, e.g. your 4096K test could be done on F26 instead: [i] time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X [/i] That will allow up to 64 threads, but you can also test 56 and 60-threaded on the same F-number: [i] time ./Mlucas.mic -f 26 -fftlen 3584 -iters 100 -radset 0 -nthread X time ./Mlucas.mic -f 26 -fftlen 3840 -iters 100 -radset 0 -nthread X [/i] [/QUOTE] Xeon Phi 3120A (57cores @ 1100MHz), using the latest source you sent me including my hacks in platform.h and threadpool.c, Intel compiler 14.0.0.080. [CODE]time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X[/CODE] [B]1 thread per core *[SUP]1[/SUP][/B], leaving remaining cores idle! [CODE]nthread 1: real 8m 40.76s nthread 2: real 4m 22.96s nthread 4: real 2m 23.40s nthread 6: real 1m 14.24s nthread 8: real 1m 9.27s nthread 10: real 0m 41.30s nthread 12: real 0m 39.35s nthread 14: real 0m 39.78s nthread 16: real 0m 37.30s nthread 20: real 0m 25.26s nthread 24: real 0m 23.29s nthread 28: real 0m 22.96s nthread 32: real 0m 22.40s [COLOR="Gray"]nthread 40: real 0m 20.21s nthread 48: real 0m 19.84s nthread 56: real 0m 19.27s nthread 64: real 0m 19.28s[/COLOR][/CODE] *[SUP]1[/SUP] (nthread = 40, 48, 56 and 64 are using 64 threads in carry step, some cores have to work on 2 threads!) [B]2 threads per core[/B], leaving remaining cores idle! 
[CODE]nthread 2: real 6m 55.63s nthread 4: real 3m 30.98s nthread 6: real 1m 54.23s nthread 8: real 1m 48.25s nthread 10: real 1m 2.28s nthread 12: real 1m 1.39s nthread 14: real 0m 58.63s nthread 16: real 0m 57.25s nthread 20: real 0m 34.31s nthread 24: real 0m 32.99s nthread 28: real 0m 32.63s nthread 32: real 0m 31.29s nthread 40: real 0m 21.36s nthread 48: real 0m 20.95s nthread 56: real 0m 20.53s nthread 64: real 0m 19.42s[/CODE] 56- and 60-thread tests with 3584k or 3840k FFT size aren't better than 64 threads for these FFT sizes. It seems that the carry step (which always uses a power-of-2 number of threads?) is the limiting part on Xeon Phi. Oliver |
Many thanks for the timings, Oliver. Let's have a look inside:
[QUOTE=TheJudger;356096][CODE]time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X[/CODE] [B]1 thread per core *[SUP]1[/SUP][/B], leaving remaining cores idle! [CODE]nthread 1: real 8m 40.76s nthread 2: real 4m 22.96s nthread 4: real 2m 23.40s nthread 6: real 1m 14.24s nthread 8: real 1m 9.27s nthread 10: real 0m 41.30s nthread 12: real 0m 39.35s nthread 14: real 0m 39.78s nthread 16: real 0m 37.30s nthread 20: real 0m 25.26s nthread 24: real 0m 23.29s nthread 28: real 0m 22.96s nthread 32: real 0m 22.40s [COLOR="Gray"]nthread 40: real 0m 20.21s nthread 48: real 0m 19.84s nthread 56: real 0m 19.27s nthread 64: real 0m 19.28s[/COLOR][/CODE] *[SUP]1[/SUP] (nthread = 40, 48, 56 and 64 are using 64 threads in carry step, some cores have to work on 2 threads!)[/quote] So e.g. for 8/16/32/64-threads we get speedups of 7.5, 13.9, 23.2 and 26.9x, respectively. [quote][B]2 threads per core[/B], leaving remaining cores idle! [CODE]nthread 2: real 6m 55.63s nthread 4: real 3m 30.98s nthread 6: real 1m 54.23s nthread 8: real 1m 48.25s nthread 10: real 1m 2.28s nthread 12: real 1m 1.39s nthread 14: real 0m 58.63s nthread 16: real 0m 57.25s nthread 20: real 0m 34.31s nthread 24: real 0m 32.99s nthread 28: real 0m 32.63s nthread 32: real 0m 31.29s nthread 40: real 0m 21.36s nthread 48: real 0m 20.95s nthread 56: real 0m 20.53s nthread 64: real 0m 19.42s[/CODE][/quote] For 8/16/32/64-threads we get speedups of 3.8, 7.3, 13.3 and 21.4x, respectively. The absolute max-throughput, though, is the same as for 1-thread-per-core. The baseline 1-core throughput is alas rather too dismal to make current Phis attractive for GIMPS LL-test work. [quote]56 and 60 thread tests with 3584k or 3840k FFT size aren't better than 64 threads for these FFT sizes. Seems that the carry step (which is always a power of 2?) is the limiting part on Xeon Phi.[/QUOTE] The ||ization of the code is a compromise between two major factors: 1. 
The 'natural' way to partition a length-n FFT into large independently executable sub-chunks, which depends on the leading radix (a.k.a. radix0) in my implementation - that makes the optimal threadcount for processing of the resulting chunks be a divisor of radix0 for Fermat-mod and of radix0/2 for Mersenne-mod. 2. Each of the independently executable sub-chunks in [1] is itself a power of 2 in length - this makes the optimal threadcount for the fused final-iFFT-radix0/carry/initial fFFT-radix0 step be a divisor of that power of 2. In practice, were I running on (say) a 6-core system I would consider running one job on 4 of the cores and another on the remaining 2, or having the remaining 2 do some other task. Your 57-core system is, as you note, bad for the carry step because it's just a little less than a power of 2. Still, being able to get a 23x speedup using 32 of those cores is pretty good - were one doing "production work" on such a system one could use the other cores for something else. It'll be interesting to see what kind of core counts the AVX512-capable Phis will have. |
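The two constraints above can be made concrete with a small sketch (names are illustrative, not Mlucas's): given leading radix radix0, the chunking supports up to radix0 threads in Fermat-mod mode and radix0/2 in Mersenne-mod mode, and a thread count that evenly divides that limit keeps all sub-chunks balanced across threads.

```c
/* Illustrative helpers for the threading constraint described above.
 * fermat_mod != 0 selects Fermat-mod mode. Not actual Mlucas code. */
static int max_threads_for_radix0(int radix0, int fermat_mod)
{
    return fermat_mod ? radix0 : radix0 / 2;
}

/* A thread count is "balanced" if it divides the mode's maximum,
 * so every thread processes the same number of independent sub-chunks. */
static int is_balanced_threadcount(int nthread, int radix0, int fermat_mod)
{
    int limit = max_threads_for_radix0(radix0, fermat_mod);
    return nthread >= 1 && nthread <= limit && limit % nthread == 0;
}
```

This is also why Oliver's 57-core count is awkward: 57 divides neither the power-of-2 carry-step chunking nor a typical radix0, whereas 32 cores fit cleanly.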
A friend just sent me [url=http://www.nvidia.com/object/justthefacts.html]this nVidia marketing blurb[/url]; let me comment on the two main assertions:
[1] [i]"FACT: A GPU is significantly faster than Intel's Xeon Phi on real HPC applications."[/i] Based on data seen around this forum, true ... for now. [2] [i]"FACT: Programming for a GPU and Xeon Phi require similar effort — but the results are significantly better on a GPU."[/i] False. Oliver and I were able to get a working build of my scalar-double pthreaded FFT code with only trivial header-file changes, and no special recoding of the FFT source. OTOH, once AVX512 comes to Phi, that should significantly change the HPC-throughput comparison in [1], but mainly for folks who take advantage of the vector SIMD capability, which *does* require nontrivial effort if one has not already put in place such code targeted at x86 CPUs. Perhaps the most telling part of the blurb is that nVidia feels compelled to publish such stuff at all. Competition can only be good in this arena, I say. |
I have talked with several people who have a Phi. The feedback is the same for all of them: it's easy to get code to run on it but the performance is very low, and that is the experience you had, right? Some of them had issues with intrinsics for code already tuned for Intel CPUs. The end result is that total dev time is no lower than for a GPU. So nVidia's point #2 would be correct for the whole project.
Ernst, Oliver, is it really that much harder to get a working program on a GPU? Of course there are some changes to make, but they seem rather small if all you want is something working (that is, something similar to the initial porting effort to Phi). Am I completely wrong? |
[QUOTE=ldesnogu;356509]I have talked with several people who have a Phi. The feedback is the same for all of them: it's easy to get code to run on it but the performance is very low, and that is the experience you had, right? Some of them had issues with intrinsics for code already tuned for Intel CPUs. The end result is that total dev time is no lower than for a GPU. So nVidia's point #2 would be correct for the whole project.
Ernst, Oliver, is it really that much harder to get a working program on a GPU? Of course there are some changes to make, but they seem rather small if all you want is something working (that is, something similar to the initial porting effort to Phi). Am I completely wrong?[/QUOTE] If the code is written correctly, then Intel's compilers, together with a few carefully considered #pragmas to guide vectorization, can improve performance a *lot*. Not to the point that a hand-tuned intrinsics implementation can yield (which might take effort similar to a GPU port of the codebase), but there can be a nice middle ground on the performance/effort curve using only compiler auto-vectorization and #pragma guidance beyond the original code. |
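As a toy illustration of the pragma-guided approach mentioned above (a generic example of ours, not code from any of the projects discussed): `restrict` qualifiers plus a no-dependence hint such as `#pragma omp simd` (or Intel's `#pragma ivdep` / `#pragma vector aligned`) let the compiler emit the wide vector instructions the Phi needs, without hand-written intrinsics.

```c
#include <stddef.h>

/* Scaled vector accumulate: y[i] += a * x[i]. The restrict qualifiers and
 * the simd pragma tell the compiler the loop has no carried dependences,
 * so it is free to auto-vectorize (512-bit wide on Xeon Phi). */
static void axpy(double a, const double *restrict x,
                 double *restrict y, size_t n)
{
#pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

The pragma is only a hint; compiled without OpenMP support it is ignored and the loop still computes the same result, just scalar.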
This is actually something I'm in the middle of too.
Porting GNFS polynomial selection to use a GPU was several months of spare-time work, even though the main primitive involved (radix sorting) had a very fast library implementation ready-made, which was used as a black box. Most of the effort involved retooling the host code to issue sorting calls in parallel, and also retooling the support code that generated sort data so that it ran on the GPU and didn't take tons of time. It already ran on the GPU thanks to jrk's work, so doing that from scratch would have been more effort. In the case of poly selection it paid off handsomely, the speedup is on the order of 70x. I've also been porting the sparse linear algebra in Msieve to use a GPU. This is going to be a lot harder, because there are no ready-made GPU library routines that do sparse matrix multiplies in a Galois field. One can use a segmented scan (available in library form) to implement a sparse matrix multiply, but that still would require support code to assemble scan problems and would triple the memory use so that one would have to use a GPU cluster to assemble enough GPU memory for even medium-size problems. My initial proof-of-concept code was a complete rewrite of the original matrix multiply source in order to use an algorithm specifically made for vector processors, and the performance on my low-end card is only a little better than running the same thing on my low-end CPU. Meanwhile, Greg tried tuning the current CPU code a little bit to run on the Phi; it worked out of the box, and was 8x slower than running on a Haswell. After using MPI plus threads he got it down to 2.5x slower. I suspect improving on that will also take major work. |
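To make concrete what "sparse matrix multiplies in a Galois field" means here: msieve's linear algebra works over GF(2), where addition is XOR, and 64 bit-vectors are processed at once by packing them into one uint64_t per row. A minimal CSR-style sketch (the data layout and names are ours, not msieve's actual structures):

```c
#include <stdint.h>

/* Sparse GF(2) matrix in CSR form: row r's nonzero columns are
 * col_idx[row_off[r] .. row_off[r+1]-1]. Entries are implicitly 1. */
typedef struct {
    uint32_t nrows;
    const uint32_t *row_off;  /* length nrows + 1 */
    const uint32_t *col_idx;  /* one entry per nonzero */
} gf2_csr_t;

/* y = A*x over GF(2): each output word is the XOR of the input words
 * selected by that row's nonzero columns (64 bit-vectors at once). */
static void gf2_spmv(const gf2_csr_t *A, const uint64_t *x, uint64_t *y)
{
    for (uint32_t r = 0; r < A->nrows; r++) {
        uint64_t acc = 0;
        for (uint32_t k = A->row_off[r]; k < A->row_off[r + 1]; k++)
            acc ^= x[A->col_idx[k]];
        y[r] = acc;
    }
}
```

The irregular, gather-heavy access pattern into x is exactly what makes this hard to map onto vector processors, and what the segmented-scan reformulation mentioned above trades extra memory for.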
[QUOTE=ldesnogu;356509]Ernst, Oliver, is it really that much harder to get a working program on a GPU? Of course there are some changes to make, but they seem rather small if all you want is something working (that is, something similar to the initial porting effort to Phi). Am I completely wrong?[/QUOTE]
I'll let you know once I have successfully done so. :) My main point here is this: many/most HPC codes have already been parallelized using one of the small number of widespread threading APIs. It is really annoying that GPU vendors have not made such code "work out of the box", i.e. have not done the work needed to map that extant parallelism to their particular architectures during the long process of developing their own compilers and APIs. As a developer, I fully expect to have to *tune* the code to the particularities of a given architecture, but I should not have to completely rewrite the parallelization interface. This part of the game Intel has right. |