mersenneforum.org Bigger and better GPU sieving drive: Discussion

 2010-10-05, 16:49   #1
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT/BST)

5×19×61 Posts

Bigger and better GPU sieving drive: Discussion

Admin edit (Max): split off from the main "Bigger and better GPU sieving drive: k<10000 n<2M" thread. Normally we prefer to keep discussion and reservations in a central location for each drive, but in this case the hoopla surrounding the drive was seriously overwhelming the actual drive content and making it a little hard to sort out.

Why can't we start sieving now for >10T here without a sievefile? There would be no difference from doing it then.

Last fiddled with by mdettweiler on 2010-10-06 at 16:08
2010-10-05, 17:38   #2
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

1100001101001₂ Posts

Quote:
 Originally Posted by henryzz Why can't we start sieving now for >10T here without a sievefile? There would be no difference from doing it then.
Good question. I'll drop a note in the PST forum asking if we can do this. (Since their reservation thread is the "primary" one, we can't open it up until they put the range on their books.)

 2010-10-05, 19:03   #3
Mini-Geek
Account Deleted

"Tim Sorbera"
Aug 2006
San Antonio, TX USA

4267₁₀ Posts

I don't own a CUDA-ready GPU, so unfortunately I can't efficiently participate in this. Out of curiosity, about how fast (for this or other ppsieve sieves) is a GPU compared to a quad on a 32-bit OS? Does the GPU's speed differ significantly between 32- and 64-bit versions?

Also: Wow, GPUs are really changing the landscape of sieving. That is an incredibly enormous range. It will take quite a while to test, unless a program is released to run LLR tests on a GPU, but it looks like before too long, my n~=1.3M primes, currently ranked ~200, won't be nearly as impressive.

Last fiddled with by Mini-Geek on 2010-10-05 at 19:10
2010-10-05, 19:12   #4
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
 Originally Posted by Mini-Geek I don't own a CUDA-ready GPU. Out of curiosity, about how fast is a GPU compared to a quad on a 32-bit OS?
It depends on the sieve. I haven't run a comparison on this particular sieve but would expect a GPU to be a number of times faster than 4 CPU cores working together. I'll run some comparisons shortly--stay tuned.

Quote:
 Does the GPU's speed differ significantly between 32- and 64-bit versions?
On the earlier k<=1001 sieve, ppsieve was mostly GPU-bound; that is, it needed very little CPU (0.06 cores) to keep the GPU busy. I haven't tried it (yet) with this sieve, but would expect it to be similarly nonreliant on the CPU. Thus, while 64-bit would make the CPU portion of the code run faster (in the above example, I would expect 32-bit to take 0.12 CPU cores), it's not going to affect the GPU's speed at all. So it's not going to make the sieve run faster, but it will leave more free CPU for other applications.

Note that again, this depends on the sieve. For instance, on TPS's variable-n sieve with tpsieve-CUDA, the program is CPU bound on a fast GPU like Gary's; thus 64-bit will run significantly faster than 32-bit.

Last fiddled with by mdettweiler on 2010-10-05 at 19:13
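The GPU-bound vs. CPU-bound distinction described above can be sketched with a toy throughput model. All rates and core counts below are hypothetical illustrations, not measurements from either sieve:

```python
# Toy model of the bottleneck discussed above: overall sieve rate is capped
# by whichever resource saturates first. All numbers are hypothetical.

def effective_rate(gpu_rate, cpu_cores_needed, cpu_cores_free=1.0):
    """gpu_rate: p/sec the GPU can process when fully fed.
    cpu_cores_needed: CPU cores required to keep the GPU 100% busy.
    cpu_cores_free: CPU cores actually available to the siever."""
    if cpu_cores_needed <= cpu_cores_free:
        return gpu_rate  # GPU-bound: the CPU keeps up, GPU runs flat out
    # CPU-bound: the GPU is only fed a fraction of the time
    return gpu_rate * cpu_cores_free / cpu_cores_needed

# GPU-bound case (like ppsieve here): needing 0.12 cores (32-bit) vs.
# 0.06 cores (64-bit) changes free CPU, but not throughput.
assert effective_rate(5.0e6, 0.12) == effective_rate(5.0e6, 0.06)

# CPU-bound case (like tpsieve-CUDA on a fast GPU): halving the CPU cost
# with 64-bit code doubles the sieve rate.
print(effective_rate(5.0e6, 2.0))  # 2500000.0 p/sec ("32-bit")
print(effective_rate(5.0e6, 1.0))  # 5000000.0 p/sec ("64-bit")
```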

 2010-10-05, 23:30   #5
gd_barnes

May 2007
Kansas; USA

7·13·113 Posts

Tim,

To be more specific: from what Max indicated to me by email, sieving on a GPU would likely be > 10 times faster than sieving on a 32-bit machine with sr2sieve. BUT...the sieve must be for all k's below a certain limit and must be for base 2. There is no advantage to sieving one or a few k's at a time. Also, it made no sense to only sieve k=300-1001, which is what we originally started doing until Lennart's offer. We could add k<300 for free, so we did k<=1001. The speed is based on the highest k in the sieve and the P-range and virtually nothing else.

At this moment in time, the extreme speed gain that you get from the sieve is very restrictive. Fortunately such a sieve is highly effective for NPLB because of the way we search across large swaths of k. But even if GPU sieving allowed non-base-2 sieving, it still would not be effective at CRUS.

We are grateful to Lennart for picking up the effort at PrimeGrid. And to think: it was all spearheaded by Max after I bought my first GPU, which I knew little about.

Gary

Last fiddled with by gd_barnes on 2010-10-05 at 23:30
2010-10-06, 00:24   #6
Mini-Geek
Account Deleted

"Tim Sorbera"
Aug 2006
San Antonio, TX USA

4267₁₀ Posts

Quote:
 Originally Posted by gd_barnes To be more specific: from what Max indicated to me by email, sieving on a GPU would likely be > 10 times faster than sieving on a 32-bit machine with sr2sieve.
Hm, ok. For comparison purposes, I benchmarked my quad (on 32-bit) on ppsieve on this sieving drive, at 10T. (I used "ppsieve -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -t4" and the default blocksize, etc. settings; it reported that it was searching "3 <= k <= 10001, 72 <= n <= 2000000") It is just under 100K p/sec per core. It takes about 3 core seconds per factor. That means it would take about 10,000 sec per billion p per core, or 167 minutes per G. It would take around 29 days with 4 cores to finish 1T. I don't know how ppsieve-CUDA compares to 32-bit sr2sieve or ppsieve, but that'd probably put a GPU's time for the range anywhere from 14 to 2.9 days. Can we just get a p/sec number on some GPU?
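The back-of-envelope arithmetic in that benchmark can be checked with a quick script. The ~100K p/sec/core rate is taken from the post; everything else is just unit conversion:

```python
# Verifying the quad-core estimate above from the reported per-core rate.
p_per_sec_per_core = 100e3  # ~100K p/sec/core, from the ppsieve benchmark
cores = 4

sec_per_G_per_core = 1e9 / p_per_sec_per_core            # seconds per 1G of p
min_per_G_per_core = sec_per_G_per_core / 60
days_per_T_quad = 1e12 / (p_per_sec_per_core * cores) / 86400

print(f"{sec_per_G_per_core:.0f} sec/G/core")    # 10000 sec/G/core
print(f"{min_per_G_per_core:.0f} min/G/core")    # 167 min/G/core
print(f"{days_per_T_quad:.0f} days/T on a quad") # 29 days/T on a quad
```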
Quote:
 Originally Posted by gd_barnes BUT...the sieve must be for all k's below a certain limit and must be for base 2. ... At this moment in time, the extreme speed gain that you get from the sieve is very restrictive. Fortunately such a sieve is highly effective for NPLB because of the way we search across large swaths of k. But even if GPU sieving allowed non-base-2 sieving, it still would not be effective at CRUS.
Oh, I didn't realize how restricted ppsieve/ppsieve-CUDA is. It seems that NPLB (and the like, both for Riesel and Sierp sides) is one of the very few projects that can really (efficiently) use it. For small enough k's, and before CUDA versions of other sievers are released, it may still be faster to use ppsieve-CUDA. I wonder just where the cutoff is for that...probably too low to be useful. But then, if your GPU would idle otherwise, and the CPU usage is low enough, it might be useful to higher bounds. But if this sieving effort brings it high enough, and you're working on the Riesel side, then you go right back to it not being useful.
Hopefully, more CUDA sievers (sr2sieve and sr1sieve or equivalents) will be released eventually, and then CRUS could take advantage of GPUs.

Last fiddled with by Mini-Geek on 2010-10-06 at 00:33

2010-10-06, 00:30   #7
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3×2,083 Posts

Quote:
 Originally Posted by Mini-Geek Oh, I didn't realize how restricted ppsieve/ppsieve-CUDA is. It seems that NPLB (and the like, both for Riesel and Sierp sides) is one of the very few projects that can really (efficiently) use it. For small enough k's, and before CUDA versions of other sievers are released, it may still be faster to use ppsieve-CUDA. I wonder just where the cutoff is for that...probably too low to be useful. But then, if your GPU would idle otherwise, and the CPU usage is low enough, it might be useful to higher bounds. But if this sieving effort brings it high enough, and you're working on the Riesel side, then you go right back to it not being useful. Hopefully, more CUDA sievers (sr2sieve and sr1sieve or equivalents) will be released eventually, and then CRUS could take advantage of GPUs.
ppsieve was originally designed by PrimeGrid to be optimized for the very big base 2 sieves they (and we) do. It uses an entirely different algorithm than the sr*sieve programs and as such does not work well with few k's.

I actually did try ppsieve-CUDA on the Prime Sierpinski Project's sieve a couple of weeks back and it did NOT work well. Firstly, ppsieve actually couldn't allocate a bitmap big enough to hold an n<50M sieve with kmax as high as theirs is. I tried then to split it up into smaller files (5M at a time, IIRC) and while it worked, it was painfully slow--a few orders of magnitude slower than sr2sieve.

Gary mentioned to me a little while back that it might be worthwhile to try Riesel or Sierp. base 256 (one of those, I forget which) with ppsieve-CUDA. Its kmax is very low relative to the number of k's remaining, so it might well be in ppsieve's "sweet spot".

2010-10-06, 01:35   #8
frmky

Jul 2003
So Cal

2³·257 Posts

Quote:
 Originally Posted by Mini-Geek Hm, ok. For comparison purposes, I benchmarked my quad (on 32-bit) on ppsieve on this sieving drive, at 10T. (I used "ppsieve -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -t4" and the default blocksize, etc. settings; it reported that it was searching "3 <= k <= 10001, 72 <= n <= 2000000") It is just under 100K p/sec per core. It takes about 3 core seconds per factor. That means it would take about 10,000 sec per billion p per core, or 167 minutes per G.
The GTX 480 is about 50x faster than a single core:

Code:
./ppsieve-cuda-x86_64-linux -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -q
ppsieve version cuda-0.2.1a (testing)
Compiled Oct  5 2010 with GCC 4.3.3
nstart=72, nstep=30
ppsieve initialized: 3 <= k <= 10001, 72 <= n < 2000000
Sieve started: 10000000000000 <= p < 10001000000000
Detected GPU 0: GeForce GTX 480
Detected compute capability: 2.0
Detected 15 multiprocessors.
nstep changed to 22
p=10000918814721, 5.103M p/sec, 0.07 CPU cores, 91.9% done. ETA 05 Oct 18:27
Sieve complete: 10000000000000 <= p < 10001000000000
Found 3617 factors
count=33405006,sum=0x1c1b8d0e01e3e0ea
Elapsed time: 196.30 sec. (0.01 init + 196.29 sieve) at 5094920 p/sec.

Last fiddled with by frmky on 2010-10-06 at 01:35

2010-10-06, 02:16   #9
Mini-Geek
Account Deleted

"Tim Sorbera"
Aug 2006
San Antonio, TX USA

17×251 Posts

Quote:
 Originally Posted by frmky The GTX 480 is about 50x faster than a single core:
Wow! :surprised That makes it 55 hours/T (or 2.3 days/T) with the GPU vs. 29 days/T with the quad. And that was with an entire quad working on it... Granted, this is an expensive GPU ($450 at a quick glance at Newegg), but that's still a huge difference in speed: roughly 12.5 times the throughput of a 32-bit quad. For the effectiveness per cost to compare (ignoring electricity, motherboard, and other overhead), the quad would have to cost $36.
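Those speed and price ratios follow directly from the two benchmarks. The rates and GPU price come from the posts above; the break-even price is just the GPU price divided by the speedup:

```python
# Reproducing the GPU-vs-quad comparison above.
gpu_rate = 5.09e6   # p/sec, GTX 480 (frmky's run)
core_rate = 100e3   # p/sec per core, 32-bit quad (Mini-Geek's run)
quad_rate = 4 * core_rate

hours_per_T_gpu = 1e12 / gpu_rate / 3600
days_per_T_quad = 1e12 / quad_rate / 86400
speedup_vs_quad = gpu_rate / quad_rate

gpu_price = 450.0
breakeven_quad_price = gpu_price / speedup_vs_quad

print(f"{hours_per_T_gpu:.0f} h/T on the GPU")      # 55 h/T on the GPU
print(f"{days_per_T_quad:.0f} days/T on the quad")  # 29 days/T on the quad
print(f"{speedup_vs_quad:.1f}x a 32-bit quad")      # 12.7x a 32-bit quad
# ~$35; the post rounds the speedup to 12.5x, which gives $36
print(f"${breakeven_quad_price:.0f} break-even quad price")
```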

Last fiddled with by Mini-Geek on 2010-10-06 at 02:23

2010-10-06, 02:45   #10
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
 Originally Posted by Mini-Geek Wow! :surprised That makes it 55 hours/T (or 2.3 days/T) with the GPU vs. 29 days/T with the quad. And that was with an entire quad working on it... Granted, this is an expensive GPU ($450 at a quick glance at Newegg), but that's still a huge difference in speed: roughly 12.5 times the throughput of a 32-bit quad. For the effectiveness per cost to compare (ignoring electricity, motherboard, and other overhead), the quad would have to cost $36.
FYI, from what I've heard the GTX 460 is only a little slower than the GTX 480; however, it's a lot cheaper ($170). Gary has one of these that's slightly factory overclocked (Newegg has these mixed in the same price range as the non-overclocked GTX 460s--go figure).

I'll get a range reserved for and started on Gary's GPU sometime today or tomorrow--at that point I'll be able to give some exact figures for p/sec.

 2010-10-06, 03:38   #11
Ken_g6

Jan 2005
Caught in a sieve

18B₁₆ Posts

From what I can tell, a stock-clocked GTX 460 will only be about half as fast as a stock-clocked GTX 480. It seems to have to do with the 460 needing instruction-level parallelism. The only two good ways I can think of to provide that are to either recompile the client with the latest CUDA SDK or to vectorize the work like I did for AMD. Neither option is all that easy, but compiling with the latest SDK on a clean VM is probably easier.

The good news is that you're only at 10T right now. Somewhere between 21T and 40T you'll get a significant speed boost.


