mersenneforum.org  

2010-10-05, 16:49   #1
henryzz ("David")

Bigger and better GPU sieving drive: Discussion

Admin edit (Max): split off from the main Bigger and better GPU sieving drive: k<10000 n<2M thread. Normally we prefer to keep discussion and reservations in a central location for each drive, but in this case the hoopla surrounding the drive was seriously overwhelming the actual drive content and making it a little hard to sort out.

Why can't we start sieving >10T here now, without a sieve file? It would be no different from doing it then.

Last fiddled with by mdettweiler on 2010-10-06 at 16:08
2010-10-05, 17:38   #2
mdettweiler

Quote:
Originally Posted by henryzz
Why can't we start sieving >10T here now, without a sieve file? It would be no different from doing it then.
Good question. I'll drop a note in the PST forum asking if we can do this. (Since their reservation thread is the "primary" one, we can't open it up until they put the range on their books.)
2010-10-05, 19:03   #3
Mini-Geek ("Tim Sorbera")

I don't own a CUDA-ready GPU, so unfortunately I can't efficiently participate in this. Out of curiosity, about how fast (for this or other ppsieve sieves) is a GPU compared to a quad on a 32-bit OS? Does the GPU's speed differ significantly between 32- and 64-bit versions?
Also: Wow, GPUs are really changing the landscape of sieving. That is an incredibly enormous range. It will take quite a while to test, unless a program is released to run LLR tests on a GPU, but it looks like before too long, my n~=1.3M primes, currently ranked ~200, won't be nearly as impressive.

Last fiddled with by Mini-Geek on 2010-10-05 at 19:10
2010-10-05, 19:12   #4
mdettweiler

Quote:
Originally Posted by Mini-Geek
I don't own a CUDA-ready GPU. Out of curiosity, about how fast is a GPU compared to a quad on a 32-bit OS?
It depends on the sieve. I haven't run a comparison on this particular sieve but would expect a GPU to be a number of times faster than 4 CPU cores working together. I'll run some comparisons shortly--stay tuned.

Quote:
Does the GPU's speed differ significantly between 32- and 64-bit versions?
On the earlier k<=1001 sieve, ppsieve was mostly GPU bound; that is, it needed very little CPU (.06 cores) to keep the GPU busy. I haven't tried it (yet) with this sieve, but would expect it to be similarly nonreliant on the CPU. Thus, while 64-bit would cause the CPU portion of the code to run faster (in the above example, I would expect 32-bit to take .12 CPU cores), it's not going to affect the GPU's speed at all. So it's not going to make the sieve run faster, but will leave more free CPU for other applications.

Note that again, this depends on the sieve. For instance, on TPS's variable-n sieve with tpsieve-CUDA, the program is CPU bound on a fast GPU like Gary's; thus 64-bit will run significantly faster than 32-bit.

Last fiddled with by mdettweiler on 2010-10-05 at 19:13
2010-10-05, 23:30   #5
gd_barnes

Tim,

To be more specific: from what Max indicated to me by email, sieving on a GPU is likely to be more than 10 times faster than sieving on a 32-bit machine with sr2sieve. BUT...the sieve must cover all k's below a certain limit and must be base 2. There is no advantage to sieving one or a few k's at a time. Also, it made no sense to sieve only k=300-1001, which is what we originally started doing until Lennart's offer; we could add k<300 for free, so we did k<=1001. The speed depends on the highest k in the sieve and the P-range, and virtually nothing else. At the moment, the extreme speed gain you get from the sieve comes with those very restrictive conditions. Fortunately, such a sieve is highly effective for NPLB because of the way we search across large swaths of k's. But even if GPU sieving allowed non-base-2 sieving, it still would not be effective at CRUS.

We are grateful to Lennart for picking up the effort at PrimeGrid. And to think: it was all spearheaded by Max after I bought my first GPU, which I knew little about.


Gary

Last fiddled with by gd_barnes on 2010-10-05 at 23:30
2010-10-06, 00:24   #6
Mini-Geek ("Tim Sorbera")

Quote:
Originally Posted by gd_barnes
To be more specific: from what Max indicated to me by email, sieving on a GPU is likely to be more than 10 times faster than sieving on a 32-bit machine with sr2sieve.
Hm, ok. For comparison purposes, I benchmarked my quad (on 32-bit) with ppsieve on this sieving drive, at 10T. (I used "ppsieve -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -t4" with the default blocksize and other settings; it reported that it was searching "3 <= k <= 10001, 72 <= n <= 2000000".) I get just under 100K p/sec per core, or about 3 core-seconds per factor. That means it would take about 10,000 sec per billion p per core, or 167 minutes per G. It would take around 29 days with 4 cores to finish 1T. I don't know how ppsieve-CUDA compares to 32-bit sr2sieve or ppsieve, but that would probably put a GPU's time for 1T anywhere from 2.9 to 14 days. Can we just get a p/sec number on some GPU?
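
If anyone wants to redo that arithmetic with their own rate, the back-of-the-envelope version is trivial (plain Python below; the ~100K p/sec figure is just my measurement above, not anything inherent to ppsieve):

Code:
# Python -- sieve-time estimate from the measured ppsieve rate quoted above.

def days_to_sieve(p_per_sec_total, range_in_T=1.0):
    # Days to cover `range_in_T` trillion p at a combined rate of `p_per_sec_total`.
    return range_in_T * 1e12 / p_per_sec_total / 86400

quad_rate = 4 * 100e3                   # 4 cores at just under 100K p/sec each
print(days_to_sieve(quad_rate))         # ~28.9 days per 1T on the 32-bit quad
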
Quote:
Originally Posted by gd_barnes
BUT...the sieve must cover all k's below a certain limit and must be base 2. ... At the moment, the extreme speed gain you get from the sieve comes with those very restrictive conditions. Fortunately, such a sieve is highly effective for NPLB because of the way we search across large swaths of k's. But even if GPU sieving allowed non-base-2 sieving, it still would not be effective at CRUS.
Oh, I didn't realize how restricted ppsieve/ppsieve-CUDA is. It seems that NPLB (and the like, both for Riesel and Sierp sides) is one of the very few projects that can really (efficiently) use it. For small enough k's, and before CUDA versions of other sievers are released, it may still be faster to use ppsieve-CUDA. I wonder just where the cutoff is for that...probably too low to be useful. But then, if your GPU would idle otherwise, and the CPU usage is low enough, it might be useful to higher bounds. But if this sieving effort brings it high enough, and you're working on the Riesel side, then you go right back to it not being useful.
Hopefully, more CUDA sievers (sr2sieve and sr1sieve or equivalents) will be released eventually, and then CRUS could take advantage of GPUs.

Last fiddled with by Mini-Geek on 2010-10-06 at 00:33
2010-10-06, 00:30   #7
mdettweiler

Quote:
Originally Posted by Mini-Geek
Oh, I didn't realize how restricted ppsieve/ppsieve-CUDA is. It seems that NPLB (and the like, both for Riesel and Sierp sides) is one of the very few projects that can really (efficiently) use it. For small enough k's, and before CUDA versions of other sievers are released, it may still be faster to use ppsieve-CUDA. I wonder just where the cutoff is for that...probably too low to be useful. But then, if your GPU would idle otherwise, and the CPU usage is low enough, it might be useful to higher bounds. But if this sieving effort brings it high enough, and you're working on the Riesel side, then you go right back to it not being useful.
Hopefully, more CUDA sievers (sr2sieve and sr1sieve or equivalents) will be released eventually, and then CRUS could take advantage of GPUs.
ppsieve was originally designed by PrimeGrid to be optimized for the very big base 2 sieves they (and we) do. It uses an entirely different algorithm than the sr*sieve programs and as such does not work well with few k's.
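
Roughly, the idea (as I understand it--this is just a toy sketch in Python, not the actual ppsieve code) is that for each prime p you pay one modular inversion, then walk the whole n-range with a single modular multiplication per step; each step identifies the one k (mod p) that p can divide at that n, and you just check whether it falls below kmax. Every step is useful when you're sieving all k's up to some limit, but with only a few k's nearly every step misses, which is where the discrete-log approach of the sr*sieve programs wins.

Code:
# Python -- toy sketch of the "all k's at once, base 2" idea (NOT the real ppsieve).
# Riesel side: p | k*2^n - 1  <=>  k ≡ 2^(-n) (mod p).
# Assumes p > kmax (true for this drive, where p is in the trillions and kmax ~ 10000),
# so at most one k in [kmin, kmax] can match at each n.

def riesel_factors_for_prime(p, kmin, kmax, nmin, nmax):
    inv2 = pow(2, -1, p)        # one modular inversion per prime (Python 3.8+)
    k = pow(inv2, nmin, p)      # k ≡ 2^(-nmin) (mod p)
    found = []
    for n in range(nmin, nmax + 1):
        if kmin <= k <= kmax and k % 2 == 1:   # the drives here use odd k's
            found.append((k, n))               # p divides k*2^n - 1
        k = (k * inv2) % p                     # one multiply per n step
    return found

# Tiny sanity check: 1251*2^3 - 1 = 10007, so (1251, 3) should be reported.
print((1251, 3) in riesel_factors_for_prime(10007, 3, 10000, 1, 50))   # True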

I actually did try ppsieve-CUDA on the Prime Sierpinski Project's sieve a couple of weeks back, and it did NOT work well. For one thing, ppsieve couldn't even allocate a bitmap big enough to hold an n<50M sieve with a kmax as high as theirs. I then tried splitting it up into smaller chunks (5M at a time, IIRC), and while that worked, it was painfully slow--a few orders of magnitude slower than sr2sieve.

Gary mentioned to me a little while back that it might be worthwhile to try Riesel or Sierp. base 256 (one of those, I forget which) with ppsieve-CUDA. Its kmax is very low relative to the number of k's remaining, so it might well be in ppsieve's "sweet spot".
2010-10-06, 01:35   #8
frmky

Quote:
Originally Posted by Mini-Geek
Hm, ok. For comparison purposes, I benchmarked my quad (on 32-bit) with ppsieve on this sieving drive, at 10T. (I used "ppsieve -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -t4" with the default blocksize and other settings; it reported that it was searching "3 <= k <= 10001, 72 <= n <= 2000000".) I get just under 100K p/sec per core, or about 3 core-seconds per factor. That means it would take about 10,000 sec per billion p per core, or 167 minutes per G.
The GTX 480 is about 50x faster than a single core:

Code:
./ppsieve-cuda-x86_64-linux -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -q        
ppsieve version cuda-0.2.1a (testing)
Compiled Oct  5 2010 with GCC 4.3.3
nstart=72, nstep=30
ppsieve initialized: 3 <= k <= 10001, 72 <= n < 2000000
Sieve started: 10000000000000 <= p < 10001000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 480
Detected compute capability: 2.0
Detected 15 multiprocessors.
nstep changed to 22
p=10000918814721, 5.103M p/sec, 0.07 CPU cores, 91.9% done. ETA 05 Oct 18:27  
Thread 0 completed
Waiting for threads to exit
Sieve complete: 10000000000000 <= p < 10001000000000
Found 3617 factors
count=33405006,sum=0x1c1b8d0e01e3e0ea
Elapsed time: 196.30 sec. (0.01 init + 196.29 sieve) at 5094920 p/sec.

Last fiddled with by frmky on 2010-10-06 at 01:35
2010-10-06, 02:16   #9
Mini-Geek ("Tim Sorbera")

Quote:
Originally Posted by frmky
The GTX 480 is about 50x faster than a single core:
Wow! That makes it 55 hours/T (or 2.3 days/T) on the GPU vs. 29 days/T on the quad--and that was with the entire quad working on it. Granted, this is an expensive GPU ($450 at a quick glance at Newegg), but that's still a huge difference in speed: roughly 12.5 times the throughput of a 32-bit quad. On a cost-per-throughput basis (ignoring electricity, motherboard, and other overhead), the quad would have to cost about $36 to compete.
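
Spelling that out with the numbers from this thread (nothing here beyond the figures already quoted):

Code:
# Python -- throughput ratio and break-even price from the figures in this thread.
gpu_rate  = 5.09e6           # GTX 480, p/sec (frmky's benchmark above)
quad_rate = 4 * 100e3        # 32-bit quad, p/sec (my benchmark earlier)

hours_per_T = 1e12 / gpu_rate / 3600    # ~55 hours per 1T on the GPU
ratio = gpu_rate / quad_rate            # ~12.7x the quad's throughput
breakeven = 450 / ratio                 # ~$35: quad price for equal cost per p/sec
print(round(hours_per_T), round(ratio, 1), round(breakeven))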

Last fiddled with by Mini-Geek on 2010-10-06 at 02:23
2010-10-06, 02:45   #10
mdettweiler

Quote:
Originally Posted by Mini-Geek
Wow! That makes it 55 hours/T (or 2.3 days/T) on the GPU vs. 29 days/T on the quad--and that was with the entire quad working on it. Granted, this is an expensive GPU ($450 at a quick glance at Newegg), but that's still a huge difference in speed: roughly 12.5 times the throughput of a 32-bit quad. On a cost-per-throughput basis (ignoring electricity, motherboard, and other overhead), the quad would have to cost about $36 to compete.
FYI, from what I've heard the GTX 460 is only a little slower than the GTX 480; however, it's a lot cheaper ($170). Gary has one of these that's slightly factory overclocked (Newegg has these mixed in the same price range as the non-overclocked GTX 460s--go figure).

I'll get a range reserved for Gary's GPU and started sometime today or tomorrow--at that point I'll be able to give some exact figures for p/sec.
2010-10-06, 03:38   #11
Ken_g6

From what I can tell, a stock-clocked GTX 460 will only be about half as fast as a stock-clocked GTX 480. It seems to have to do with the 460 needing instruction-level parallelism. The only two good ways I can think of to provide that are either to recompile the client with the latest CUDA SDK or to vectorize the work like I did for AMD. Neither option is all that easy, but recompiling with the latest SDK on a clean VM is probably the easier of the two.

The good news is that you're only at 10T right now. Somewhere between 21T and 40T you'll get a significant speed boost.