mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Hardware (https://www.mersenneforum.org/forumdisplay.php?f=9)
-   -   Xeon Phi (https://www.mersenneforum.org/showthread.php?t=18223)

ewmayer 2013-10-04 20:31

[QUOTE=frmky;354570]I've played with the Xeon Phi. As previously mentioned, you need to recompile the code for the Phi without ASM, and to sufficiently hide memory latency you need to run 240 threads.[/QUOTE]

Again, that is for the *current* implementations - that picture looks set to change quite radically in the next 2 years. While the AVX512-capable Phi may have a slightly different "flavor" of ASM than the mainline Intel CPUs, semi-automated ASM translation should be feasible.

I'm hoping emulators for the AVX512 version will be available by this time next year.

ewmayer 2013-10-09 06:26

[QUOTE=ewmayer;354552]Oliver, I have just finished work on a beta version of Mlucas with pthreading-also-for-non-SIMD-builds ... PM me if you're interested in giving it a try. It can make use of up to 32 threads in LL-test mode [and up to 64 in Fermat-mod mode].[/QUOTE]

Just got some very-preliminary timings from Oliver - not sure if these are 'suck!' or 'promising', but at least we are getting some potentially useful data - btw, the code I pm'ed Oliver has since been posted to the web, Mlucas README has links/build notes. The slow build times are an icc 'feature' - GCC builds-for-x86 are very fast for me.

The timings below are very interesting ... scaling poor for <= 4 threads, but going from 4 to 12 threads more than triples the speed [i.e. the opposite of what I see on CPU multicores]. I wonder if there's some core-clustering there which we need to take account of in the thread-affinity code.

[quote]Hi Ernst,

next round.

In addition to my previous email I've patched platform.h and
threadpool.c (see attachments; I don't recommend adding my quick&dirty
hacks to your official source tree!). After that I've compiled Mlucas
using the compile script (see attachment). On a dual Xeon E5-2670 this
takes nearly 10 minutes....

Intel C Compiler 14.0.0.080, compiled for Xeon Phi (-mmic):

# time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread X
[...]
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.252008929. MaxErr = 0.312500000.
Program: E3.0x
Res mod 2^36 = 837935290
Res mod 2^35 - 1 = 6238131189
Res mod 2^36 - 1 = 41735145962
Clocks = 00:00:00.000

Done ...

Xeon Phi 3120A (57 cores @ 1100MHz), 6GiB RAM (384-bit bus width, 240GB/sec):
1 thread (1 core): real 2m 17.93s
2 threads (1 core): real 1m 49.07s
4 threads (1 core): real 1m 35.22s
8 threads (2 cores): real 0m 48.94s
12 threads (3 cores): real 0m 27.91s
16 threads (4 cores): real 0m 25.95s

Xeon Phi 5110P (60 cores @ 1053MHz), 8GiB RAM (512-bit bus width, 320GB/sec):
1 thread (1 core): real 2m 23.98s
2 threads (1 core): real 1m 53.29s
4 threads (1 core): real 1m 38.56s
8 threads (2 cores): real 0m 50.99s
12 threads (3 cores): real 0m 28.99s
16 threads (4 cores): real 0m 27.04s[/quote]

Prime95 2013-10-09 14:34

[QUOTE=ewmayer;355253]
I'm hoping emulators for the AVX512 version will be available by this time next year.[/QUOTE]

The AVX-512 manual and emulator is available now: [url]http://software.intel.com/en-us/intel-isa-extensions[/url]

TheJudger 2013-10-09 19:10

Hello!

Parts of the email conversation with Ernst:

Scaling 1..4 threads: remember that these four threads are on the same
physical core.
Scaling 4 -> 12 threads: I just noticed that it is actually using 12 +
16 threads (compared to 4 + 4, 8 + 8 or 16 + 16)
-- cut here --
# time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread 12
[...]
NTHREADS = 12
[...]
Using 16 threads in carry step
[...]
-- cut here --

I can't turn off hyperthreading (4way) for Xeon Phi. I need to hack the
affinity code to spread the threads (I already did because Intel has
some strange core mapping anyway...)

Spread each thread to its own core (threadpool.c):
-- cut here --
i = (4 * my_id + 1) % pool->num_of_cores; // get cpu mask using sequential thread ID modulo #available cores
-- cut here --

Xeon Phi 3120A
[CODE]time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread X

1 thread (1 core): real 2m 17.93s

2 threads (2 cores): real 1m 10.21s

3 threads (3 cores): real 0m 38.94s # using 4 threads in carry step
4 threads (4 cores): real 0m 36.94s

5 threads (5 cores): real 0m 21.95s # using 8 threads in carry step
6 threads (6 cores): real 0m 20.96s # using 8 threads in carry step
7 threads (7 cores): real 0m 20.84s # using 8 threads in carry step
8 threads (8 cores): real 0m 19.96s

9 threads (9 cores): real 0m 13.96s # using 16 threads in carry step
10 threads (10 cores): real 0m 13.94s # using 16 threads in carry step
11 threads (11 cores): real 0m 12.90s # using 16 threads in carry step
12 threads (12 cores): real 0m 12.97s # using 16 threads in carry step
13 threads (13 cores): real 0m 12.82s # using 16 threads in carry step
14 threads (14 cores): real 0m 12.94s # using 16 threads in carry step
15 threads (15 cores): real 0m 12.86s # using 16 threads in carry step
16 threads (16 cores): real 0m 11.91s[/CODE]

Now (with each thread on its own core) scaling from 1 -> 2 -> 4 -> 8 ->
16 threads is quite good.

Xeon Phi 3120A
[CODE]time ./Mlucas.mic -fftlen 4096 -iters 100 -radset 0 -nthread X
4 threads (4 cores): real 2m 45.58s
8 threads (8 cores): real 1m 18.21s
16 threads (16 cores): real 0m 42.32s
32 threads (32 cores): real 0m 25.20s[/CODE]

Oliver

ldesnogu 2013-10-09 20:13

How does that compare to your Haswell?

ewmayer 2013-10-09 20:34

[QUOTE=Prime95;355739]The AVX-512 manual and emulator is available now: [url]http://software.intel.com/en-us/intel-isa-extensions[/url][/QUOTE]

Thanks - that is going to be fun.

Oliver - it occurred to me before dropping off to sleep last night that I had forgotten to ask you whether you had a way to assign 1-thread-per-core and thus better test the parallelism - I see you've addressed that above - nice.

I am working on the code needed to support > 32-threads in Mersenne-mod mode, but for now the way to test more threads is to switch to Fermat-mod mode, e.g. your 4096K test could be done on F26 instead:
[i]
time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X
[/i]
That will allow up to 64 threads, but you can also test 56 and 60-threaded on the same F-number:
[i]
time ./Mlucas.mic -f 26 -fftlen 3584 -iters 100 -radset 0 -nthread X
time ./Mlucas.mic -f 26 -fftlen 3840 -iters 100 -radset 0 -nthread X
[/i]
Note that for a set of FFT radices starting with radix0, Fermat-mod can use up to radix0 threads, but Mers-mod can use at most radix0/2 threads.

The general rule for the FFT-length-versus-F-number index is that each increment in F-index needs a doubling of FFT length. Thus to test F27 you'd need ~8192K [but 7168 and 7680K also work there]. There is also a more-gradual loss of accuracy as lengths get larger, so e.g. 14336K is no longer sufficient to test F28, although a 100-or-1000-iteration timing test might pass w/o fatal RO errors.

TheJudger 2013-10-09 21:06

[QUOTE=ldesnogu;355772]How does that compare to your Haswell?[/QUOTE]

just don't ask... compared to Prime95 on any recent desktop CPU it is... sloooow, very sloooow!

Scalar double Mlucas build (no SSE assembly code) on host CPU (Xeon E5-2670):
# time ./Mlucas -fftlen 1024 -iters 100 -radset 0 -nthread X
1 thread (1 core): real 0m8.657s
2 threads (2 cores): real 0m4.563s
4 threads (4 cores): real 0m2.415s
8 threads (8 cores): real 0m1.394s
16 threads (16 cores): real 0m0.949s

The SSE-enabled build is faster.
Prime95 is even faster.

Oliver

ldesnogu 2013-10-09 21:35

Ouch... I guess a lot of work will be needed to get decent speed from the Phi. Note that I'm not surprised; I never bought Intel's claims about the supposed ease of getting high perf from such a beast :smile:

ewmayer 2013-10-09 22:04

I fully expect current Phi to be slow - pure-C code, no SIMD, and we have just begun playing with thread-management related optimizations.

More interesting is where we will be in ~18 months - I fully expect at least a 10x per-[core*cycle] speedup from using 8-way-double SIMD assembler on the AVX512 Phi, as a conservative lower bound.

Geez, how many times does one need to say "this Phi here is a very different beast from what is coming in 2015" before people get it? This is "laying the foundations" work here, people.

ldesnogu 2013-10-10 08:16

I guess for AVX-512 most of the SIMD work will translate without any trouble from the Phi, while the multi-threading changes won't be required, given that Mlucas already supports 32 threads. Is that correct?

TheJudger 2013-10-10 16:10

[QUOTE=ldesnogu;355827]I guess for AVX-512 most of the SIMD work will translate without any trouble from Phi[/QUOTE]

How much time does Ernst want to spend on Xeon Phi coding? Even if it's ten times faster than a high-end consumer-grade GPU for LL, how many people will run LL tests on a Xeon Phi?
[B]For now[/B] I guess a 1% improvement in Prime95 has a higher impact for GIMPS than a superfast LL test for Xeon Phi. But as usual it is hard to predict the future.
On the other hand it is good to have a superfast and independent LL test for double-checking Mersenne primes.

[QUOTE=ldesnogu;355827]While the multi-threading changes won't be required, given that mlucas already supports 32 threads. Is that correct?[/QUOTE]

The maximum number of threads depends on the FFT length: (at least for the scalar-double build) it can't run 32 threads at 2M FFT length, while it can at 4M.

Current Xeon Phi has 57, 60 or 61 cores (228, 240 or 244 hardware threads, since each core has 4 hardware threads). As with GPUs, multithreading is needed to keep the cores busy. Check [URL="http://mersenneforum.org/showpost.php?p=355716&postcount=13"]this[/URL] post for scaling on one physical core. This *might* change with more optimized code but I'm not sure.

Oliver

