phi-ecm
Lately I've been playing around with a Xeon Phi. I've managed to dust off an old copy of YAFU's ECM and vectorize it for the Phi. On a C150 it runs 3840 curves in parallel with B1=1e7 in 8 minutes and 37 seconds, or 134 milliseconds per curve.
In comparison, GMP-ECM runs one curve on this input (stage 1 only) in about 25 seconds, so phi-ecm is about 185 times faster than a single thread of a Xeon E5-4650 (Ivy Bridge EP). This single Phi card appears to be equivalent to almost 3 entire $3600 server chips. I know there is a CUDA-ECM floating around out there, but I haven't run it... anyone know how this compares to that? |
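(For anyone checking the arithmetic: both quoted figures follow directly from the timings above. A quick sanity check in Python, using only numbers stated in this post:)

```python
curves = 3840
total_s = 8 * 60 + 37                # 8 minutes 37 seconds for all curves
per_curve_s = total_s / curves
print(int(per_curve_s * 1000))       # 134 ms per curve

gmp_ecm_s = 25                       # one GMP-ECM stage-1 curve, one CPU thread
print(int(gmp_ecm_s / per_curve_s))  # 185x a single thread
```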
What C150? I can try it out on the CUDA-ECM version I have. I suspect yours is a good bit faster though, especially if it's running both stages.
|
The cofactor of: 9^229-7^229
[CODE]377572270651617779506493206108874260118436109446199768359730032040155799061028584986245368745094941584241610763448528815511885067505602648309255344301[/CODE] Oh, and no, it's just running stage 1. |
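As context for readers: stage 1 of ECM just multiplies a random curve point by every prime power up to B1, banking on a failed modular inversion along the way to expose a factor. A minimal, unoptimized sketch in Python (textbook Weierstrass arithmetic with explicit inversions; the vectorized codes discussed in this thread use Montgomery-form arithmetic instead, and all names below are mine, for illustration only):

```python
import math
import random

def ecm_stage1(n, B1, seed=None):
    # One curve of textbook ECM stage 1 on y^2 = x^3 + a*x + b (mod n).
    # Returns a nontrivial factor of n, or None if this curve fails.
    rng = random.Random(seed)
    x, y, a = (rng.randrange(n) for _ in range(3))
    b = (y * y - x * x * x - a * x) % n  # forces (x, y) onto the curve;
                                         # b itself is never needed below

    def add(P, Q):
        # Curve addition; a non-invertible denominator reveals a factor,
        # which we smuggle out through the exception.
        if P is None: return Q
        if Q is None: return P
        (x1, y1), (x2, y2) = P, Q
        if x1 == x2 and (y1 + y2) % n == 0:
            return None                  # point at infinity
        if P == Q:
            num, den = 3 * x1 * x1 + a, 2 * y1
        else:
            num, den = y2 - y1, x2 - x1
        g = math.gcd(den, n)
        if g > 1:
            raise ValueError(g)          # gcd(denominator, n) is the find
        s = num * pow(den, -1, n) % n
        x3 = (s * s - x1 - x2) % n
        return (x3, (s * (x1 - x3) - y1) % n)

    def mul(k, P):                       # double-and-add scalar multiply
        R = None
        while k:
            if k & 1:
                R = add(R, P)
            P = add(P, P)
            k >>= 1
        return R

    P = (x, y)
    try:
        for p in range(2, B1 + 1):
            if any(p % q == 0 for q in range(2, int(p ** 0.5) + 1)):
                continue                 # p is composite, skip
            pe = p
            while pe * p <= B1:          # largest power of p not above B1
                pe *= p
            P = mul(pe, P)
    except ValueError as e:
        g = e.args[0]
        if g < n:                        # g == n means the curve failed
            return g
    return None
```

For example, on Lenstra's classic n = 455839 = 599 * 761, a handful of curves at B1=1000 is enough to pop a factor, since the group orders modulo both primes are necessarily below B1.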
Ah, ok. I just grabbed a 150-digit composite from factordb, and it took 34 minutes to complete 480 curves of Stage 1 at B1=1e7 (~4 secs per curve). So, yeah, yours is crazy fast.
|
[QUOTE=wombatman;378841]Ah, ok. I just grabbed a 150-digit composite from factordb, and it took 34 minutes to complete 480 curves of Stage 1 at B1=1e7 (~4 secs per curve). So, yeah, yours is crazy fast.[/QUOTE]
Which video card? You've been playing with altering GPU-ECM for various size numbers; which size did you test for this? A C150 could fit into a 512-bit version of GPU-ECM, which might double (?) speed. A test of, say, a 200-digit number would produce a comparison more meaningful to me. Or, to show the best GPU-ECM has, a 300-digit candidate comparison. Even if the C150 and C300 take the same time on GPU-ECM (as they would using the binary floating around) but take twice as long on a Phi, that's still a ten-fold increase in speed! |
GTX 570. I haven't checked how the size of the GPU-ECM program affects the speed (that is, 512-bit vs. 1024-bit vs. 2048-bit vs. 4096-bit, and so on), but I don't know that it would. [STRIKE]That said, there is some difference in speed as the size of the number increases.[/STRIKE] Again, though, I don't have any hard data for it. Maybe I could do a short test (with B1=1e6 or something) and see how it scales.
[B]EDIT: Looks like I was completely wrong about the speed vs. composite size bit. I just ran a few numbers of increasing size on a 4096-bit and 8192-bit enabled CUDA-ECM. Using B1=1e6, they all take ~9-11 seconds of CPU time and 203-205 seconds of GPU time. For 4096-bit, I tested a C150 and C200. For 8192-bit, I tested C150, C200, C250, C400, C1234, and C2465. I don't have a 512-bit handy at the moment, but I'll build one and see if there's any difference. Double-secret ninja edit: The 512-bit GPU-ECM performs exactly the same (10s CPU, 203-205s GPU), so there's no downside to using the highest bit version.[/B] |
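The flat timings above make sense if the CUDA kernels are fixed-width: each residue is padded out to the compile-time size, and the limb loops run over the whole width whether the high limbs are zero or not. A toy illustration of the padding (the 32-bit limb size and the helper below are my assumptions, not taken from CUDA-ECM):

```python
def to_limbs(n, total_bits, limb_bits=32):
    # Split n into a fixed-length little-endian array of limbs; a
    # fixed-width kernel's cost depends on len(result), not on how
    # many of the limbs are actually nonzero.
    assert n < (1 << total_bits)
    count = total_bits // limb_bits
    mask = (1 << limb_bits) - 1
    return [(n >> (i * limb_bits)) & mask for i in range(count)]

c150 = 10 ** 149 + 7                  # a 150-digit stand-in (~495 bits)
limbs = to_limbs(c150, 8192)
print(len(limbs))                     # 256 limbs, regardless of the value
print(sum(l != 0 for l in limbs))     # only the low handful are nonzero
```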
In my case the size of the input won't matter, as long as it fits within the maximum allowed bit size (currently a compile-time option). The above data was with a 576-bit maximum. I re-ran the benchmark with a 1024-bit maximum and got 1506 seconds for 3840 curves at B1=1e7, or 395 milliseconds/curve. The time will be the same for any input up to 1024 bits.
10x faster than a GTX 570 is better than I hoped! On a CPU, GMP-ECM scales with input size: at 1000 bits one curve at B1=1e7 takes 78 seconds, so at that size the Phi is almost 200x faster than a CPU thread! I'll continue to tinker with it, but initial evidence suggests I'm bandwidth-limited, so I don't know if there are any further speed gains to be had. |
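(The 576-bit vs. 1024-bit figures in this thread, 134 ms and 395 ms per curve, are consistent with a roughly quadratic multiplication cost in the compiled width. The quadratic, schoolbook-style model is an assumption on my part, not something stated above:)

```python
ms_576, ms_1024 = 134, 395      # per-curve stage-1 times quoted in the thread
print(ms_1024 / ms_576)         # ~2.95x measured slowdown at the wider size
print((1024 / 576) ** 2)        # ~3.16x if mul cost grows quadratically
```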
Very impressive work!
|
Impressive indeed :smile:
|
Data for a GTX 460
[code]
pcl@anubis ~ $ ecm -gpu -v 1000000 0
GMP-ECM 7.0-dev [configured with GMP 5.1.3, --enable-asm-redc, --enable-gpu, --enable-assert] [ECM]
Running on anubis
377572270651617779506493206108874260118436109446199768359730032040155799061028584986245368745094941584241610763448528815511885067505602648309255344301
Input number is 377572270651617779506493206108874260118436109446199768359730032040155799061028584986245368745094941584241610763448528815511885067505602648309255344301 (150 digits)
Using MODMULN [mulredc:0, sqrredc:2]
Computing batch product (of 1442099 bits) of primes below B1=1000000 took 20ms
GPU: compiled for a NVIDIA GPU with compute capability 2.0.
GPU: will use device 0: GeForce GTX 460, compute capability 2.1, 7 MPs.
GPU: Selection and initialization of the device took 10ms
Using B1=1000000, B2=0, sigma=3:3259007427-3:3259007650 (224 curves)
dF=0, k=0, d=140209989058952, d2=0, i0=0
Expected number of curves to find a factor of n digits:
35      40      45      50      55      60      65      70      75      80
17880   221980  3168483 5.1e+07 9.4e+08 6.6e+09 Inf     Inf     Inf     Inf
Computing 224 Step 1 took 7950ms of CPU time / 233467ms of GPU time
Throughput: 0.959 curves by second (on average 1042.26ms by Step 1)
Expected time to find a factor of n digits:
35      40      45      50      55      60      65      70      75      80
5.18h   2.68d   38.21d  1.70y   31.18y  218.13y Inf     Inf     Inf     Inf
[/code] |
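(Paul's log can be put on the same footing as the B1=1e7 Phi figures, if one assumes stage-1 work is roughly linear in B1; the log itself supports that, since the batch product of primes below B1=1e6 is about 1.44e6 bits, i.e. proportional to B1:)

```python
gpu_ms_per_curve = 233467 / 224          # GTX 460 GPU time per curve at B1=1e6
gpu_ms_at_1e7 = gpu_ms_per_curve * 10    # scale to B1=1e7, assuming linearity
phi_ms_at_1e7 = 134                      # Phi figure from the first post
print(round(gpu_ms_at_1e7 / phi_ms_at_1e7))  # ~78x
```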
Thank you Paul.
So the next question is... does anyone have access to a Phi card besides me? I should note that the card is my employer's; cost likely prevents most individuals from owning one. |