mersenneforum.org > Great Internet Mersenne Prime Search > Hardware
2013-10-04, 20:31   #12
ewmayer

Quote:
Originally Posted by frmky
I've played with the Xeon Phi. As previously mentioned, you need to recompile the code for the Phi without ASM, and to sufficiently hide memory latency you need to run 240 threads.
Again, that is for the *current* implementations - that picture looks set to change quite radically in the next 2 years. While the AVX512-capable Phi may have a slightly different "flavor" of ASM than the mainline Intel CPUs, semi-automated ASM translation should be feasible.

I'm hoping emulators for the AVX512 version will be available by this time next year.
2013-10-09, 06:26   #13
ewmayer

Quote:
Originally Posted by ewmayer
Oliver, I have just finished work on a beta version of Mlucas with pthreading-also-for-non-SIMD-builds ... PM me if you're interested in giving it a try. It can make use of up to 32 threads in LL-test mode [and up to 64 in Fermat-mod mode].
Just got some very preliminary timings from Oliver - not sure whether these qualify as "suck!" or "promising", but at least we are getting some potentially useful data. Btw, the code I PMed Oliver has since been posted to the web; the Mlucas README has links/build notes. The slow build times are an icc "feature" - GCC builds for x86 are very fast for me.

The timings below are very interesting ... scaling is poor for <= 4 threads, but going from 4 to 12 threads more than triples the speed [i.e. the opposite of what I see on CPU multicores]. I wonder if there's some core-clustering there which we need to take account of in the thread-affinity code.

Quote:
Hi Ernst,

next round.

Additionally to my previous email I've patched platform.h and
threadpool.c (see attachments, I don't recommend adding my quick & dirty
hacks to your official source tree!). After that I've compiled Mlucas
using the compile script (see attachment). On a dual Xeon E5-2670 this
takes nearly 10 minutes....

Intel C Compiler 14.0.0.080, compiled for Xeon Phi (-mmic):

# time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread X
[...]
Res64: DD61B3E031F1E0BA. AvgMaxErr = 0.252008929. MaxErr = 0.312500000.
Program: E3.0x
Res mod 2^36 = 837935290
Res mod 2^35 - 1 = 6238131189
Res mod 2^36 - 1 = 41735145962
Clocks = 00:00:00.000

Done ...

Xeon Phi 3120A (57 cores @ 1100 MHz, 6 GiB RAM, 384-bit bus, 240 GB/s):
1 thread (1 core): real 2m 17.93s
2 threads (1 core): real 1m 49.07s
4 threads (1 core): real 1m 35.22s
8 threads (2 cores): real 0m 48.94s
12 threads (3 cores): real 0m 27.91s
16 threads (4 cores): real 0m 25.95s

Xeon Phi 5110P (60 cores @ 1053 MHz, 8 GiB RAM, 512-bit bus, 320 GB/s):
1 thread (1 core): real 2m 23.98s
2 threads (1 core): real 1m 53.29s
4 threads (1 core): real 1m 38.56s
8 threads (2 cores): real 0m 50.99s
12 threads (3 cores): real 0m 28.99s
16 threads (4 cores): real 0m 27.04s
2013-10-09, 14:34   #14
Prime95

Quote:
Originally Posted by ewmayer
I'm hoping emulators for the AVX512 version will be available by this time next year.
The AVX-512 manual and emulator is available now: http://software.intel.com/en-us/intel-isa-extensions
2013-10-09, 19:10   #15
TheJudger

Hello!

Parts of the email conversation with Ernst:

Scaling 1..4 threads: remember that these four threads are on the same
physical core.
Scaling 4 -> 12 threads: I just noticed that it is actually using 12 + 16
threads (compared to 4 + 4, 8 + 8 or 16 + 16)
-- cut here --
# time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread 12
[...]
NTHREADS = 12
[...]
Using 16 threads in carry step
[...]
-- cut here --

I can't turn off hyperthreading (4way) for Xeon Phi. I need to hack the
affinity code to spread the threads (I already did because Intel has
some strange core mapping anyway...)

Spread each thread to its own core (threadpool.c):
-- cut here --
i = (4 * my_id + 1) % pool->num_of_cores; /* get cpu mask using sequential thread ID modulo #available cores */
-- cut here --

Xeon Phi 3120A
Code:
time ./Mlucas.mic -fftlen 1024 -iters 100 -radset 0 -nthread X

1 thread (1 core): real    2m 17.93s

2 threads (2 cores): real    1m 10.21s

3 threads (3 cores): real    0m 38.94s # using 4 threads in carry step
4 threads (4 cores): real    0m 36.94s

5 threads (5 cores): real    0m 21.95s # using 8 threads in carry step
6 threads (6 cores): real    0m 20.96s # using 8 threads in carry step
7 threads (7 cores): real    0m 20.84s # using 8 threads in carry step
8 threads (8 cores): real    0m 19.96s

9 threads (9 cores): real    0m 13.96s # using 16 threads in carry step
10 threads (10 cores): real    0m 13.94s # using 16 threads in carry step
11 threads (11 cores): real    0m 12.90s # using 16 threads in carry step
12 threads (12 cores): real    0m 12.97s # using 16 threads in carry step
13 threads (13 cores): real    0m 12.82s # using 16 threads in carry step
14 threads (14 cores): real    0m 12.94s # using 16 threads in carry step
15 threads (15 cores): real    0m 12.86s # using 16 threads in carry step
16 threads (16 cores): real    0m 11.91s
Now (with each thread on its own core) scaling from 1 -> 2 -> 4 -> 8 ->
16 threads is quite good.

Xeon Phi 3120A
Code:
time ./Mlucas.mic -fftlen 4096 -iters 100 -radset 0 -nthread X
4 threads (4 cores): real    2m 45.58s
8 threads (8 cores): real    1m 18.21s
16 threads (16 cores): real    0m 42.32s
32 threads (32 cores): real    0m 25.20s
Oliver
2013-10-09, 20:13   #16
ldesnogu

How does that compare to your Haswell?
2013-10-09, 20:34   #17
ewmayer

Quote:
Originally Posted by Prime95
The AVX-512 manual and emulator is available now: http://software.intel.com/en-us/intel-isa-extensions
Thanks - that is going to be fun.

Oliver - it occurred to me before dropping off to sleep last night that I had forgotten to ask whether you had a way to assign 1 thread per core and thus better test the parallelism; I see you've addressed that above - nice.

I am working on the code needed to support > 32 threads in Mersenne-mod mode, but for now the way to test more threads is to switch to Fermat-mod mode, e.g. your 4096K test could be done on F26 instead:

time ./Mlucas.mic -f 26 -fftlen 4096 -iters 100 -radset 0 -nthread X

That will allow up to 64 threads, but you can also test 56- and 60-threaded on the same F-number:

time ./Mlucas.mic -f 26 -fftlen 3584 -iters 100 -radset 0 -nthread X
time ./Mlucas.mic -f 26 -fftlen 3840 -iters 100 -radset 0 -nthread X

Note that for a set of FFT radices starting with radix0, Fermat-mod can use up to radix0 threads, but Mers-mod can use at most radix0/2 threads.

The general rule for the FFT-length-versus-F-number index is that each increment in F-index needs a doubling of FFT length. Thus to test F27 you'd need ~8192K [but 7168 and 7680K also work there]. There is also a more-gradual loss of accuracy as lengths get larger, so e.g. 14336K is no longer sufficient to test F28, although a 100-or-1000-iteration timing test might pass w/o fatal RO errors.
2013-10-09, 21:06   #18
TheJudger

Quote:
Originally Posted by ldesnogu
How does that compare to your Haswell?
Just don't ask... compared to Prime95 on any recent desktop CPU it is... sloooow, very sloooow!

Scalar-double Mlucas build (no SSE assembly code) on the host CPU (Xeon E5-2670):
# time ./Mlucas -fftlen 1024 -iters 100 -radset 0 -nthread X
1 thread (1 core): real 0m8.657s
2 threads (2 cores): real 0m4.563s
4 threads (4 cores): real 0m2.415s
8 threads (8 cores): real 0m1.394s
16 threads (16 cores): real 0m0.949s

The SSE-enabled build is faster; Prime95 is faster still.

Oliver

Last fiddled with by TheJudger on 2013-10-09 at 21:47
2013-10-09, 21:35   #19
ldesnogu

Ouch... I guess a lot of work will be needed to get decent speed from the Phi. Not that I'm surprised - I never bought Intel's claims about the supposed ease of getting high performance from such a beast.
2013-10-09, 22:04   #20
ewmayer

I fully expect current Phi to be slow - pure-C code, no SIMD, and we have just begun playing with thread-management related optimizations.

More interesting is where we will be in ~18 months - I fully expect at least a 10x per-[core*cycle] speedup from using 8-way-double SIMD assembler on the AVX512 Phi, as a conservative lower bound.

Geez, how many times does one need to say "this Phi here is a very different beast from what is coming in 2015" before people get it? This is "laying the foundations" work here, people.

Last fiddled with by ewmayer on 2013-10-09 at 22:10
2013-10-10, 08:16   #21
ldesnogu

I guess for AVX-512 most of the SIMD work will translate from the Phi without any trouble, while the multi-threading changes won't be required, given that Mlucas already supports 32 threads. Is that correct?
2013-10-10, 16:10   #22
TheJudger

Quote:
Originally Posted by ldesnogu
I guess for AVX-512 most of the SIMD work will translate without any trouble from Phi
How much time does Ernst want to spend on Xeon Phi coding? Even if it's ten times faster than a high-end consumer-grade GPU for LL, how many people will run LL tests on a Xeon Phi?
For now I guess a 1% improvement in Prime95 has a higher impact for GIMPS than a superfast LL test for the Xeon Phi. But as usual it is hard to predict the future.
On the other hand, it is good to have a superfast and independent LL test for double-checking Mersenne primes.

Quote:
Originally Posted by ldesnogu
While the multi-threading changes won't be required, given that mlucas already supports 32 threads. Is that correct?
The maximum number of threads depends on the FFT length: (at least with the scalar-double build) it can't run 32 threads at 2M FFT length, while it can at 4M.

The current Xeon Phi has 57, 60 or 61 cores (228, 240 or 244 hardware threads, since each core has 4 hardware threads); similar to GPUs, multithreading is needed to keep the cores busy. Check this post for scaling on one physical core. This *might* change with more optimized code, but I'm not sure.

Oliver
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.