mersenneforum.org

ECM for CUDA GPUs in latest GMP-ECM ? (https://www.mersenneforum.org/showthread.php?t=16480)

debrouxl 2012-02-12 10:07

Another data point, for numbers between C144 and C29x: C237 is slower on the GPU, but obviously faster on the CPU, than C29x:
[code]$ echo 472367364481324943429608990380363865230376899949857658144588096283146783114430372207621802600829155058766951167631153619328587819346877117165453306904995816614534365740792256712736351604580048562330248528078693598071309876495244264859329 | ./gpu_ecm -vv -n 64 -save 80009_248_3e6_1 3000000
#Compiled for a NVIDIA GPU with compute capability 1.3.
#Will use device 0 : GeForce GT 540M, compute capability 2.1, 2 MPs.
#s has 4328086 bits
Precomputation of s took 0.256s
Input number is 472367364481324943429608990380363865230376899949857658144588096283146783114430372207621802600829155058766951167631153619328587819346877117165453306904995816614534365740792256712736351604580048562330248528078693598071309876495244264859329 (237 digits)
Using B1=3000000, firstinvd=563947071, with 64 curves
[snip]
gpu_ecm took : 1637.614s (0.000+1637.610+0.004)
Throughput : 0.039


$ echo 472367364481324943429608990380363865230376899949857658144588096283146783114430372207621802600829155058766951167631153619328587819346877117165453306904995816614534365740792256712736351604580048562330248528078693598071309876495244264859329 | ./ecm -c 1 3000000
bash: ./ecm: No such file or directory
debrouxl@asus2:~/ecm/gpu/gpu_ecm$ echo 472367364481324943429608990380363865230376899949857658144588096283146783114430372207621802600829155058766951167631153619328587819346877117165453306904995816614534365740792256712736351604580048562330248528078693598071309876495244264859329 | ecm -c 1 3000000
GMP-ECM 6.5-dev [configured with GMP 5.0.90, --enable-asm-redc, --enable-assert] [ECM]
Input number is 472367364481324943429608990380363865230376899949857658144588096283146783114430372207621802600829155058766951167631153619328587819346877117165453306904995816614534365740792256712736351604580048562330248528078693598071309876495244264859329 (237 digits)
Using B1=3000000, B2=5706890290, polynomial Dickson(6), sigma=379651352
Step 1 took 42974ms
Step 2 took 12981ms[/code]

On that number, the core busy with gpu_ecm spends a bit more than half of its time in the "system" state. I guess that's what xilman mentioned above?
[quote]A third is to reduce the (presently extortionate IMO) amount of cpu time used by busy-waiting for the kernels to complete.[/quote]
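
For a rough comparison of the two runs above, the sketch below works out seconds per curve from the logged figures. It assumes that the gpu_ecm time covers step 1 only, so it is set against the CPU's "Step 1" time rather than the total, and that the "Throughput" line is curves divided by seconds, which matches the printed 0.039. The function names are illustrative, not part of gpu_ecm.

```python
# Figures copied from the two runs above (C237, B1=3000000).
gpu_total_s = 1637.614   # "gpu_ecm took : 1637.614s" for the whole batch
gpu_curves = 64          # -n 64
cpu_step1_s = 42.974     # "Step 1 took 42974ms", one curve on one core


def seconds_per_curve(total_s, curves):
    """Average wall time per curve in a batch."""
    return total_s / curves


def throughput(total_s, curves):
    """Curves per second (assumed meaning of the 'Throughput' line)."""
    return curves / total_s


gpu_per_curve = seconds_per_curve(gpu_total_s, gpu_curves)
ratio = cpu_step1_s / gpu_per_curve

print(f"GPU: {gpu_per_curve:.2f} s/curve, throughput {throughput(gpu_total_s, gpu_curves):.3f}")
print(f"CPU step 1: {cpu_step1_s:.2f} s/curve")
print(f"CPU step 1 time / GPU time per curve: {ratio:.2f}")
```

On these numbers the GT 540M comes out somewhat ahead of one CPU core per curve, although the comparison ignores the CPU time the GPU run spends busy-waiting.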

lorgix 2012-02-12 12:58

[QUOTE=xilman;289019]I screwed up computing the time per curve :redface:


1792 curves took 141 hours to run. I evaluated (1792 * 141 / 3600) to obtain the quoted figure of 70 seconds per curve.

The correct expression is (141 * 3600 / 1792), which evaluates to 283 seconds per curve.
Although this is four times worse than the initial figure, it is still 2.4 times faster than a single core.

Sorry about that.[/QUOTE]
Still a very interesting development. Congrats on the factors.
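
The corrected arithmetic in the quote can be checked in a few lines. The single-core figure at the end is not stated in the thread; it is only what the quoted "2.4 times faster" would imply.

```python
curves = 1792
hours = 141

mistaken = curves * hours / 3600         # the expression evaluated by mistake
seconds_per_curve = hours * 3600 / curves  # total seconds divided by curves

print(f"mistaken figure:   {mistaken:.1f}")
print(f"seconds per curve: {seconds_per_curve:.1f}")
print(f"ratio:             {seconds_per_curve / mistaken:.2f}")

# Implied by the quoted 2.4x speed-up, not a measured value.
implied_single_core = 2.4 * seconds_per_curve
print(f"implied single-core seconds per curve: {implied_single_core:.0f}")
```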

xilman 2012-02-12 13:18

Since my first experiments, I've been playing with a version which uses 512-bit arithmetic (fudged with CFLAGS+=-DNB_DIGITS=16 in the relevant line of Makefile). As expected, ECM runs around 3 times faster on ~500 bit numbers with this change.

One of the things on my to-do list is to add greater flexibility to the choice of bignum sizes.

Experiments with both 1024 and 512-bit arithmetic indicate that running more than the default number of curves is a Good Thing, presumably by hiding memory latency. The downside, of course, is that the display stays rather sluggish for a proportionately long time. I'm trying to estimate how long a run will take and then kick it off overnight when display latency is likely to be unimportant.


Paul
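
A small sketch of what the NB_DIGITS setting means for input size. It assumes 32-bit digits, inferred only from NB_DIGITS=16 giving 512-bit arithmetic, and it assumes an input is usable whenever it fits in that many bits; the real code may need some headroom. The helper names are hypothetical.

```python
DIGIT_BITS = 32  # assumption: 512 bits / NB_DIGITS=16


def max_bits(nb_digits, digit_bits=DIGIT_BITS):
    """Width of the bignum type for a given NB_DIGITS."""
    return nb_digits * digit_bits


def fits(n, nb_digits):
    """True if n can be stored in nb_digits digits (no headroom assumed)."""
    return n.bit_length() <= max_bits(nb_digits)


print(max_bits(16), max_bits(32))   # widths for the 512-bit and 1024-bit builds
print(fits(2**500 + 1, 16))         # a ~500-bit number in the 512-bit build
print(fits(10**160, 16))            # a 160-digit number needs the wider build
```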

frmky 2012-02-12 21:26

[QUOTE=xilman;289134]I'm trying to estimate how long a run will take and then kick it off overnight when display latency is likely to be unimportant.[/QUOTE]
I added a percent complete counter in the for loop launching the kernels in cudautil.cu. I don't think adding an ETA would be difficult.
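
A language-neutral sketch of the ETA idea, written in Python rather than CUDA. It extrapolates linearly from the work finished so far. The names `eta_seconds`, `run_with_progress` and `launch` are hypothetical and not taken from cudautil.cu, and the estimate is only meaningful if each loop iteration blocks until its kernel has finished, which is an assumption about that loop.

```python
import time


def eta_seconds(elapsed_s, done, total):
    """Linear extrapolation of the remaining time from the work done so far."""
    if done <= 0:
        return float("inf")
    return elapsed_s * (total - done) / done


def run_with_progress(launch, total, clock=time.monotonic):
    """Call launch(i) for each i, printing percent complete and an ETA."""
    start = clock()
    for i in range(total):
        launch(i)  # hypothetical stand-in for one (blocking) kernel launch
        done = i + 1
        elapsed = clock() - start
        eta = eta_seconds(elapsed, done, total)
        print(f"{100 * done / total:5.1f}% done, ETA {eta:.1f}s")


if __name__ == "__main__":
    # Usage example with a dummy workload.
    run_with_progress(lambda i: time.sleep(0.01), total=5)
```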

R.D. Silverman 2012-02-12 22:07

[QUOTE=xilman;289065]In case it is not clear to bystanders, this code is [b]not[/b] fire and forget. It is [b]not[/b] production quality.

If you want to use it, you will need to get your hands dirty. I'm prepared to help as best I can [b]after[/b] you've followed the instructions in the svn distro and after you've made a sincere effort to get things working by yourself. I am not prepared to bottle-feed, to wipe noses or to change {nappies,diapers}.

That may sound harsh but it's the way the world of alpha-code development works and you'll need to get used to it if you want to play with the big boys and girls. Once you pass the audition you'll find most developers are very friendly and helpful.

Neither am I addressing these remarks to any particular individuals who may, or may not, have posted in this thread.

Paul[/QUOTE]

I totally agree with you.

However, allow me to point out that when I present a similar attitude
toward the learning of the algorithms discussed herein and the mathematics
behind them, I am lambasted for my efforts.

Participants should be willing to put in the effort or they should leave.

xilman 2012-02-12 23:40

[QUOTE=R.D. Silverman;289203]I totally agree with you.

However, allow me to point out that when I present a similar attitude
toward the learning of the algorithms discussed herein and the mathematics
behind them, I am lambasted for my efforts.

Participants should be willing to put in the effort or they should leave.[/QUOTE]It seems to me that one difference is that there is a large amount of fire-and-forget code available and that code is suited to the majority of the people here. [b]Only[/b] those who are prepared for all the frustrations of working at the bleeding edge have any great need to be able to build, debug and install alpha code from a subversion repository.

Much of the mathematics discussed here is [b]not[/b] at the bleeding edge, IMO. It is closer in spirit to oft-times cranky but nonetheless well understood and supported applications such as mainstream gmp-ecm.

IMO, your diatribes against those wishing to perform bleeding edge mathematics are fully justified. They are less appropriate, again IMO, further away from the bleeding edge. I hope I would never feel the urge to issue my earlier warnings to those who only wish to use gmp-ecm and are confused by its jargon and multitudinous options.

R.D. Silverman 2012-02-13 00:40

[QUOTE=xilman;289218]It seems to me that one difference is that there is a large amount of fire-and-forget code available and that code is suited to the majority of the people here. [b]Only[/b] those who are prepared for all the frustrations of working at the bleeding edge have any great need to be able to build, debug and install alpha code from a subversion repository.
[/QUOTE]

We agree.

Indeed. I have even heard one of the people (whom I hold in contempt)
admit that he does not even know how to use a compiler.

[QUOTE]

Much of the mathematics discussed here is [b]not[/b] at the bleeding edge, IMO. It is closer in spirit to oft-times cranky but nonetheless well understood and supported applications such as mainstream gmp-ecm.
[/QUOTE]

And from my point of view too many of the participants herein do
not understand things even at that level. Nor do they seem willing
to make the attempt. They don't even understand mathematics that
was known 150+ years ago. Nor do they want to make the effort.

xilman 2012-02-14 19:12

[QUOTE=xilman;289134]Experiments with both 1024 and 512-bit arithmetic indicate that running more than the default number of curves is a Good Thing[/QUOTE]Another data point shows that even choosing the correct default number is significant.

Out of the box (well, my box anyway) the default build appears to use parameters suitable for a CC1.3 system, despite there being a Fermi card installed. A run on a C302 with these parameters chooses 112 curves arranged 32x16 x 7x1x1 and takes 3845.428 seconds. Rebuilding with "make cc=2" and re-running took 5539.049 seconds for 224 curves arranged 32x32 x 7x1x1. The ratio (224/112) * (3845.428 / 5539.049) is 1.388.

I suggest a 39% speed-up is worth having.
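
The ratio above is the quotient of the two throughputs (curves per second). The sketch below reproduces it and also checks one reading of the launch layout: that "32x16 x 7x1x1" means 32 threads per curve, 16 curves per block and 7 blocks. That reading is an inference from the curve counts, not something stated in the source, and the function names are illustrative.

```python
def throughput(curves, seconds):
    """Curves completed per second."""
    return curves / seconds


def curves_in_launch(block, grid):
    """Curves per launch, assuming block = (threads per curve, curves per block)."""
    _threads_per_curve, curves_per_block = block
    blocks = 1
    for g in grid:
        blocks *= g
    return curves_per_block * blocks


default_build = throughput(112, 3845.428)   # built with CC 1.3 parameters
fermi_build = throughput(224, 5539.049)     # rebuilt with "make cc=2"
speedup = fermi_build / default_build

print(curves_in_launch((32, 16), (7, 1, 1)), curves_in_launch((32, 32), (7, 1, 1)))
print(f"speed-up: {speedup:.3f}")
```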

Ralf Recker 2012-02-14 22:02

A few quick tests with a small B1 value
 
CC 2.0 card (GTX 470, stock clocks), 512 bit arithmetic, CUDA SDK 4.0. The c151 was taken from the Aliquot sequence 890460:i898

[CODE]ralf@quadriga:~/dev/gpu_ecm$ LD_LIBRARY_PATH=/usr/local/cuda/lib64/ ./gpu_ecm -d 0 -save c151.save 250000 < c151
Precomputation of s took 0.004s
Input number is 4355109846524047003246531292211765742521128216321735054909228664961069056051308281896789359834792526662067203883345116753066761522281210568477760081509 (151 digits)
Using B1=250000, firstinvd=24351435, with 448 curves
gpu_ecm took : 116.363s (0.000+116.355+0.008)
Throughput : 3.850[/CODE]Doubling the number of curves improves the throughput:

[CODE]ralf@quadriga:~/dev/gpu_ecm$ LD_LIBRARY_PATH=/usr/local/cuda/lib64/ ./gpu_ecm -d 0 -n 896 -save c151.save 250000 < c151
Precomputation of s took 0.004s
Input number is 4355109846524047003246531292211765742521128216321735054909228664961069056051308281896789359834792526662067203883345116753066761522281210568477760081509 (151 digits)
Using B1=250000, firstinvd=1471710578, with 896 curves
gpu_ecm took : 179.747s (0.000+179.731+0.016)
Throughput : 4.985[/CODE]With 32 fewer curves, the throughput increases by roughly another 30%:
[CODE]ralf@quadriga:~/dev/gpu_ecm$ LD_LIBRARY_PATH=/usr/local/cuda/lib64/ ./gpu_ecm -d 0 -n 864 -save c151.save 250000 < c151
Precomputation of s took 0.004s
Input number is 4355109846524047003246531292211765742521128216321735054909228664961069056051308281896789359834792526662067203883345116753066761522281210568477760081509 (151 digits)
Using B1=250000, firstinvd=1374804691, with 864 curves
gpu_ecm took : 130.964s (0.000+130.948+0.016)
Throughput : 6.597
[/CODE]The throughput on a CC 2.1 card (GTX 460, 725 MHz factory OC) for the same number:

[CODE] 224 curves - Throughput : 2.289
416 curves - Throughput : 4.223
448 curves - Throughput : 4.547
480 curves - Throughput : 3.039
672 curves - Throughput : 4.233
896 curves - Throughput : 4.638
1792 curves - Throughput : 4.753[/CODE]
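
To compare these runs side by side, the sketch below recomputes the GTX 470 throughputs as curves divided by wall time (which reproduces the printed values) and picks the best curve count for each card. The GTX 460 figures are copied from the list above, since no timings were posted for them. The variable and function names are illustrative.

```python
# GTX 470: curves -> wall time in seconds, from the three runs above.
gtx470_runs = {448: 116.363, 896: 179.747, 864: 130.964}
gtx470 = {n: n / t for n, t in gtx470_runs.items()}

# GTX 460: curves -> throughput, as posted.
gtx460 = {224: 2.289, 416: 4.223, 448: 4.547, 480: 3.039,
          672: 4.233, 896: 4.638, 1792: 4.753}


def best(throughputs):
    """Return (curves, throughput) for the highest throughput."""
    return max(throughputs.items(), key=lambda kv: kv[1])


for n in sorted(gtx470):
    print(f"GTX 470, {n:4d} curves: {gtx470[n]:.3f}")
print("best on GTX 470:", best(gtx470)[0], "curves")
print("best on GTX 460:", best(gtx460)[0], "curves")
print(f"864 vs 896 curves on GTX 470: {gtx470[864] / gtx470[896]:.2f}x")
```

The sharp drop at 480 curves on the GTX 460 suggests that throughput is sensitive to how the curve count maps onto the hardware, so it seems worth timing a few values of -n before a long run.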

ET_ 2012-02-15 19:19

gpu_ecm ready to work
 
OK, I downloaded the source code with cc=1.3, and successfully compiled it :smile:

Sadly, I see differences between the Xilman and Ralf Recker outputs.

The executable passes the test.

What does the (required) parameter N on the command line represent? All I can see is that it has to do with the xfin, zfin and xunif parameters, and should be odd...

I also tried ./gpu_ecm 9699691 11000 -n 1 <in, where the file "in" contains the number 65798732165875434667. I got the factor 347, which is not a factor of the input number...

As evidence of my good will:

[code]
./gpu_ecm 9699691 11000 -n 1 <in
#Compiled for a NVIDIA GPU with compute capability 1.3.
#Will use device 0 : GeForce GTX 275, compute capability 1.3, 30 MPs.
#gpu_ecm launched with :
N=9699691
B1=11000
curves=1
firstsigma=11
#used seed 1329332970 to generate sigma

#Begin GPU computation...
#All kernels launched, waiting for results...
#All kernels finished, analysing results...
#Looking for factors for the curves with sigma=11
xfin=3111202
zfin=7720056
#Factor found : 347 (with z)
#Results : 1 factor found

#Temps gpu : 15.080 init&copy=0.040 computation=15.040
[/code]

Now, I understand that the program is not "fire and forget", and I would really, REALLY like to know more about it, but the interface is not documented, its usage differs from that of gmp-ecm, and in the link posted by Jason there is no indication that a README file is present anywhere in the trunk.

Would you mind (now that my hands have been contaminated by bits and compilers) shedding some light on this obscure valley? Even a link explaining what N means in this context would suffice... :smile:

Many thanks...

Luigi

P.S. after some more fiddling, I noticed that 347 is a factor of 9699691, so I think I got the meaning of N after all... :redface:

With N3 and 448 curves, my GTX275 runs at the same speed as my Intel i5-750.
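
The P.S. can be checked directly from the log. ECM conventionally extracts a factor as the gcd of the final z coordinate with the number being factored. Whether this build of gpu_ecm computes it exactly this way is an assumption, but it is consistent with the "(with z)" in the output. Here N is the first positional argument of that run, and the file "in" plays no part in the result.

```python
from math import gcd

N = 9699691      # first positional argument in the run above
zfin = 7720056   # final z coordinate printed for sigma=11

factor = gcd(zfin, N)
print(factor, N // factor)  # prints 347 27953

# The number stored in the file "in" is unrelated to the factor found.
input_number = 65798732165875434667
print(input_number % factor == 0)
```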

jasonp 2012-02-15 19:44

That directory was not the trunk, [url="https://gforge.inria.fr/scm/viewvc.php/trunk/?root=ecm"]this[/url] is, complete with lots of readme files.

