-   Msieve (
-   -   To CPU or to GPU: stage1 polyselect research (

Karl M Johnson 2011-05-20 09:27

To CPU or to GPU: stage1 polyselect research
1 Attachment(s)
As the topic suggests, I did this little research on stage one of polynomial selection. Jason suggests doing only stage 1 on GPU in the latest Msieve NFS manual, instead of combined search.

You will find all the data in the table below(which is actually a picture).

Now, here are the numbers used:
[CODE]c90: 69547303316492118149408022547436991785350544259733986981963408781389041467965565432357913
c100: 776701169123392332156259995503621920821842983186779441829856164218194104805531020576372063083759743
c110: 9852238794222235462151171006209225112694534204051316346901796414666349372723740059709872068581469631350831777
c120: 227026401879198463827454062763506967433946202931364283264305901265796403304094997028726671231101499391384899872706432273
c130: 1142067639655668244421675659299472228892561443039817892634518492620626571274348874167990804267595223355216327568784472374242262183
c140: 2333435938791497921844814519438006187812846866005520174850758490445958886619248051390445682925499819800889470401806186619479812725646985161[/CODE]And the ranges for all of them:
[CODE]1. 1,10000
2. 1000000,1010000
3. 1000000000,1000010000[/CODE]Now, apart from measuring the running time, I also measured how much strings(lines) were produced by searching all 3 ranges on a certain number, for both GPU and CPU, in msieve.dat.m file. The more the better. I did a sort -u before counting.

Some specs:
[CODE]CPU: Core 2 Quad Q6600 @ 3.0 Ghz
GPU: Nvidia GeForce GTX 480 @ 800/950 Mhz with 270.61 ForceWare
OS: W7 x64 with SP1[/CODE]Msieve versions used:
[CODE]CPU - 64 bit binary from latest SVN(thanks Jeff!)
GPU - 64 bit CUDA-enabled binary from latest SVN(thanks Brian!)[/CODE]I used timethis from windows 2000 resource kit to measure the running time of binaries for each range.

Some conclusions:
1. On C90 and C100, CUDA-enabled Msieve performs better than the one which uses the CPU only.
2. Using some kind of script to utilize all available hardware would be the best way to do stage1 of polynomial selection(sort of obvious).

Jeff Gilchrist 2011-05-20 12:06

Interesting, so even at a C140 level, the GPU version still produces more candidates than the CPU version in the range but takes a lot longer.

When you get up to a C232 (for the RSA-768) the GPU version produces many less candidates in the range (50k range at 90M). In this case the GPU and CPU took the same amount of time.

using GPU 1 (Tesla T10 Processor)
elapsed time 25:47:14
500 candidates found

using a Core i5 760
elapsed time 25:47:11
11456 candidates found

Karl M Johnson 2011-05-20 15:27

[QUOTE=Jeff Gilchrist;261827]When you get up to a C232 (for the RSA-768) the GPU version produces many less candidates in the range (50k range at 90M).[/QUOTE]
I wonder why.

Intresting, then using even the latest and greatest GPUs for composites like C232 is a bad idea due to poor candidates generation.

jasonp 2011-05-20 17:30

Kleinjung's algorithm for stage 1 poly selection can be expressed as either memory-bound or compute-bound. The memory-bound version uses the birthday paradox to become efficient as the problem size goes up, and the compute-bound version uses GPU hardware to perform a large number of tests at once but the amount of work to do increases quadratically as the problem size increases. For small problems, the memory bound version cannot build a large enough hashtable to become efficient, and the GPU code gets more results.

Incidentally, you may also want to include the number of stage 1 hits per second as a more accurate measure of efficiency, because msieve checks for time limit exceeded less often in the GPU code, so that especially for small problems you spend much more time running the GPU version. You can also just run stage 1 and count the number of lines in the .m file that are produced, to avoid being misled by the large amount of debug output the library currently produces. Running stage 2 will confuse the timings, because the CPU and GPU versions will both switch off at random times to possibly spend quite a while running stage 2. The two versions will not produce the same answers.

RichD 2011-11-10 03:48

What is the root (no pun intended) cause/concern why a polynomial optimization technique can be employed for poly selection? Or is this more of a TSP (Traveling Salesman Problem)?

jasonp 2011-11-10 12:05

NFS polynomial selection is divided into two phases; the first is a combinatorial search that is essentially clever brute force, the second involves a range of techniques from applied mathematics, including multivariate minimization. Each of the stages is an active research area, and we're going over techniques for the first stage.

Incidentally, the cuda11_test branch of the msieve SVN repository has a lot of work by jrk that uses GPUs for high-performance sorting, and this makes GPUs drastically better for large problems than they used to be.

RichD 2011-11-11 04:14

[QUOTE=jasonp;277782]... the first is a combinatorial search that is essentially clever brute force, ...[/QUOTE]

That would have been my follow-up question. Thanks for your reply.

Now, back to the [strike]drawing[/strike] thinking board. :-)
Oh, what's the use...

sleigher 2011-11-12 19:35

Curious about the cuda11_test branch in svn. I have the regular msieve-149 version running on a
GTX 465 Nvidia card and has been running for about 24 hours. So far that has not really given me much.
About 20 poly candidates.

So I compiled the cuda11_test brach on linux 64-bit and am running that on a GT8800 Nvidia card.
That version gave me about 80 candidates in 2 hours with the best shown below. It's way faster.

Just curious why this branch is so much faster.

[root@msieve-linux]# cat number.poly-1
n: 8570085955425991975416682393322092672178022622762559843295498103055560639891369587
# norm 6.077880e-15 alpha -6.591518 e 3.228e-12 rroots 5
skew: 14813998.57
c0: 271830330224374546052564805914266259840
c1: -928034253578711160424890240906
c2: -26681915219460396213323881
c3: 680113183828316262
c4: 141225440666
c5: 84
Y0: -2521959333809067094313594330127
Y1: 53520081350873741
type: gnfs

jasonp 2011-11-12 19:46

jrk has made a great deal of progress improving the GPU code in the last few months. That branch will get folded into the trunk when he feels it's ready.

All times are UTC. The time now is 09:07.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.