#12
∂2ω=0
Sep 2002
República de California
5·17·137 Posts
Quote:
I'm hoping emulators for the AVX-512 version will be available by this time next year.
#13
∂2ω=0
Sep 2002
República de California
5·17·137 Posts
Quote:
The timings below are very interesting ... scaling is poor for <= 4 threads, but going from 4 to 12 threads more than triples the speed [i.e. the opposite of what I see on CPU multicores]. I wonder if there's some core-clustering there which we need to take account of in the thread-affinity code.
#14
P90 years forever!
Aug 2002
Yeehaw, FL
2²·7·269 Posts
Quote:
#15 |
"Oliver"
Mar 2005
Germany
11·101 Posts
Hello!
Parts of the email conversation with Ernst:

Scaling 1..4 threads: remember that these four threads are on the same physical core.

Scaling 4 -> 12 threads: I just noticed that it is actually using 12 + 16 threads (compared to 4 + 4, 8 + 8 or 16 + 16):

-- cut here --
# time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread 12
[...]
NTHREADS = 12
[...]
Using 16 threads in carry step
[...]
-- cut here --

I can't turn off hyperthreading (4-way) for the Xeon Phi, so I need to hack the affinity code to spread the threads (I already did, because Intel has some strange core mapping anyway...).

Spread each thread to its own core (threadpool.c):

-- cut here --
i = (4 * my_id + 1) % pool->num_of_cores; // get cpu mask using sequential thread ID modulo #available cores
-- cut here --

Xeon Phi 3120A
Code:
time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread X

 1 thread   (1 core):   real 2m 17.93s
 2 threads  (2 cores):  real 1m 10.21s
 3 threads  (3 cores):  real 0m 38.94s  # using 4 threads in carry step
 4 threads  (4 cores):  real 0m 36.94s
 5 threads  (5 cores):  real 0m 21.95s  # using 8 threads in carry step
 6 threads  (6 cores):  real 0m 20.96s  # using 8 threads in carry step
 7 threads  (7 cores):  real 0m 20.84s  # using 8 threads in carry step
 8 threads  (8 cores):  real 0m 19.96s
 9 threads  (9 cores):  real 0m 13.96s  # using 16 threads in carry step
10 threads (10 cores):  real 0m 13.94s  # using 16 threads in carry step
11 threads (11 cores):  real 0m 12.90s  # using 16 threads in carry step
12 threads (12 cores):  real 0m 12.97s  # using 16 threads in carry step
13 threads (13 cores):  real 0m 12.82s  # using 16 threads in carry step
14 threads (14 cores):  real 0m 12.94s  # using 16 threads in carry step
15 threads (15 cores):  real 0m 12.86s  # using 16 threads in carry step
16 threads (16 cores):  real 0m 11.91s

16 threads is quite good.

Xeon Phi 3120A
Code:
time ./Mlucas.mic -fftlen 4096 -iters 100 -radset 0 -nthread X

 4 threads  (4 cores):  real 2m 45.58s
 8 threads  (8 cores):  real 1m 18.21s
16 threads (16 cores):  real 0m 42.32s
32 threads (32 cores):  real 0m 25.20s
#16 |
Jan 2008
France
550₁₀ Posts
How does that compare to your Haswell?
#17
∂2ω=0
Sep 2002
República de California
5×17×137 Posts
Quote:
Oliver - it occurred to me before dropping off to sleep last night that I had forgotten to ask whether you had a way to assign one thread per core and thus better test the parallelism; I see you've addressed that above - nice.

I am working on the code needed to support > 32 threads in Mersenne-mod mode, but for now the way to test more threads is to switch to Fermat-mod mode, e.g. your 4096K test could be done on F26 instead:

time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X

That will allow up to 64 threads, but you can also test 56 and 60 threads on the same F-number:

time ./Mlucas.mic -f 26 -fftlen 3584 -iters 100 -radset 0 -nthread X
time ./Mlucas.mic -f 26 -fftlen 3840 -iters 100 -radset 0 -nthread X

Note that for a set of FFT radices starting with radix0, Fermat-mod can use up to radix0 threads, but Mers-mod can use at most radix0/2 threads. The general rule for the FFT-length-versus-F-number index is that each increment in the F-index needs a doubling of FFT length. Thus to test F27 you'd need ~8192K [but 7168K and 7680K also work there]. There is also a more-gradual loss of accuracy as lengths get larger, so e.g. 14336K is no longer sufficient to test F28, although a 100-or-1000-iteration timing test might pass w/o fatal RO errors.
#18 |
"Oliver"
Mar 2005
Germany
2127₈ Posts
Just don't ask... compared to Prime95 on any recent desktop CPU it is... sloooow, very sloooow!

Scalar-double Mlucas build (no SSE assembly code) on the host CPU (Xeon E5-2670):

Code:
# time ./Mlucas -fftlen 1024 -iters 100 -radset 0 -nthread X

 1 thread   (1 core):   real 0m8.657s
 2 threads  (2 cores):  real 0m4.563s
 4 threads  (4 cores):  real 0m2.415s
 8 threads  (8 cores):  real 0m1.394s
16 threads (16 cores):  real 0m0.949s

The SSE-enabled build is faster; Prime95 is faster still.

Oliver

Last fiddled with by TheJudger on 2013-10-09 at 21:47
#19 |
Jan 2008
France
2×5²×11 Posts
Ouch... I guess a lot of work will be needed to get decent speed from Phi. Note I'm not surprised, I never bought Intel claims about the supposed ease of getting high perf from such a beast
#20 |
∂2ω=0
Sep 2002
República de California
5·17·137 Posts
I fully expect current Phi to be slow - pure-C code, no SIMD, and we have just begun playing with thread-management related optimizations.
More interesting is where we will be in ~18 months - I fully expect at least a 10x per-[core*cycle] speedup from using 8-way-double SIMD assembler on the AVX-512 Phi, as a conservative lower bound. Geez, how many times does one need to say "this Phi here is a very different beast from what is coming in 2015" before people get it? This is "laying the foundations" work here, people.

Last fiddled with by ewmayer on 2013-10-09 at 22:10
#21 |
Jan 2008
France
2×5²×11 Posts
I guess for AVX-512 most of the SIMD work will translate without any trouble from the Phi, while the multi-threading changes won't need redoing, given that Mlucas already supports 32 threads. Is that correct?
#22
"Oliver"
Mar 2005
Germany
11×101 Posts
Quote:
For now I guess a 1% improvement in Prime95 has a higher impact for GIMPS than a superfast LL test for the Xeon Phi. But as usual it is hard to predict the future. On the other hand, it is good to have a superfast and independent LL test for double-checking Mersenne primes.

Quote:
Current Xeon Phi chips have 57, 60 or 61 cores (228, 240 or 244 hardware threads, since each core has 4 hardware threads); similar to GPUs, multithreading is needed to keep the cores busy. Check this post for scaling on one physical core. This *might* change with more optimized code, but I'm not sure.

Oliver