2010-10-05, 16:49  #1
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
13340_{8} Posts 
Bigger and better GPU sieving drive: Discussion
Admin edit (Max): split off from the main Bigger and better GPU sieving drive: k<10000 n<2M thread. Normally we prefer to keep discussion and reservations in a central location for each drive, but in this case the hoopla surrounding the drive was seriously overwhelming the actual drive content and making it a little hard to sort out.
Why can't we start sieving now for >10T here without a sievefile? There will be no difference to doing it then.
Last fiddled with by mdettweiler on 2010-10-06 at 16:08
2010-10-05, 17:38  #2
A Sunny Moo
Aug 2007
USA (GMT-5)
14151_{8} Posts 
Good question. I'll drop a note in the PST forum asking if we can do this. (Since their reservation thread is the "primary" one, we can't open it up until they put the range on their books.)

2010-10-05, 19:03  #3
Account Deleted
"Tim Sorbera"
Aug 2006
San Antonio, TX USA
10AB_{16} Posts 
I don't own a CUDA-ready GPU, so unfortunately I can't efficiently participate in this. Out of curiosity, about how fast (for this or other ppsieve sieves) is a GPU compared to a quad on a 32-bit OS? Does the GPU's speed differ significantly between 32- and 64-bit versions?
Also: Wow, GPUs are really changing the landscape of sieving. That is an incredibly enormous range. It will take quite a while to test, unless a program is released to run LLR tests on a GPU, but it looks like before too long, my n~=1.3M primes, currently ranked ~200, won't be nearly as impressive.
Last fiddled with by MiniGeek on 2010-10-05 at 19:10
2010-10-05, 19:12  #4
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts 
Quote:
Quote:
Note that, again, this depends on the sieve. For instance, on TPS's variable-n sieve with tpsieve-CUDA, the program is CPU-bound on a fast GPU like Gary's; thus 64-bit will run significantly faster than 32-bit.
Last fiddled with by mdettweiler on 2010-10-05 at 19:13

2010-10-05, 23:30  #5
May 2007
Kansas; USA
10100001100011_{2} Posts 
Tim,
To be more specific: by what Max indicated to me by email, sieving on a GPU would likely be more than 10 times faster than sieving on a 32-bit machine with sr2sieve. BUT...the sieve must be for all k's below a certain limit and must be for base 2. There is no advantage to sieving one or a few k's at a time.
Also, it made no sense to sieve only k=300-1001, which is what we originally started doing until Lennart's offer. We could add k<300 for free, so we did k<=1001. The speed is based on the highest k in the sieve and the P-range and virtually nothing else.
At this moment in time, the extreme speed gain that you get from the sieve is very restrictive. Fortunately, such a sieve is highly effective for NPLB because of the way we search across large swaths of k. But even if GPU sieving allowed non-base-2 sieving, it still would not be effective at CRUS.
We are grateful to Lennart for picking up the effort at PrimeGrid. And to think: it was all spearheaded by Max after I bought my first GPU, which I knew little about.
Gary
Last fiddled with by gd_barnes on 2010-10-05 at 23:30
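Editor's sketch: one way to see why extra k's below the limit can come "for free", as described in post #5. This is only an illustration under stated assumptions, not ppsieve's actual code: the function name `riesel_hits_for_prime` is hypothetical, the Riesel form k*2^n-1 is assumed, and the real program uses further optimizations that are not shown here. For each prime p the loop walks over n and computes the single residue k = 2^(-n) mod p that p would divide, so the work per prime does not grow with the number of k's in range.

```python
def riesel_hits_for_prime(p, kmin, kmax, nmin, nmax):
    """List (k, n) with kmin <= k <= kmax, nmin <= n < nmax and p | k*2^n - 1.

    Hypothetical helper for illustration only.
    Assumes p is odd and larger than kmax, as in a deep sieve.
    """
    inv2 = (p + 1) // 2      # inverse of 2 modulo an odd p
    k = pow(inv2, nmin, p)   # k = 2^(-nmin) mod p
    hits = []
    for n in range(nmin, nmax):
        # p divides k*2^n - 1 exactly when k == 2^(-n) (mod p)
        if kmin <= k <= kmax:
            hits.append((k, n))
        k = (k * inv2) % p   # step from 2^(-n) to 2^(-(n+1))
    return hits


# Small demonstration with toy parameters (not the drive's real ranges).
hits = riesel_hits_for_prime(p=10007, kmin=3, kmax=1001, nmin=72, nmax=2000)
for k, n in hits:
    assert (k * 2**n - 1) % 10007 == 0
```

Note that the loop never iterates over k: lowering kmin from 300 to 3 only widens the range check, which matches the remark that k<300 could be added at no extra cost.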
2010-10-06, 00:24  #6
Account Deleted
"Tim Sorbera"
Aug 2006
San Antonio, TX USA
17×251 Posts 
Quote:
Quote:
Hopefully, more CUDA sievers (sr2sieve and sr1sieve or equivalents) will be released eventually, and then CRUS could take advantage of GPUs.
Last fiddled with by MiniGeek on 2010-10-06 at 00:33

2010-10-06, 00:30  #7
A Sunny Moo
Aug 2007
USA (GMT-5)
3×2,083 Posts 
Quote:
I actually did try ppsieve-CUDA on the Prime Sierpinski Project's sieve a couple of weeks back and it did NOT work well. Firstly, ppsieve couldn't allocate a bitmap big enough to hold an n<50M sieve with kmax as high as theirs is. I then tried splitting it up into smaller files (5M at a time, IIRC), and while that worked, it was painfully slow: a few orders of magnitude slower than sr2sieve.
Gary mentioned to me a little while back that it might be worthwhile to try Riesel or Sierp. base 256 (one of those, I forget which) with ppsieve-CUDA. Its kmax is very low relative to the number of k's remaining, so it might well be in ppsieve's "sweet spot".

2010-10-06, 01:35  #8
Jul 2003
So Cal
2·3·347 Posts 
Quote:
Code:
./ppsieve-cuda-x86_64-linux -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -q
ppsieve version cuda-0.2.1a (testing)
Compiled Oct 5 2010 with GCC 4.3.3
nstart=72, nstep=30
ppsieve initialized: 3 <= k <= 10001, 72 <= n < 2000000
Sieve started: 10000000000000 <= p < 10001000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 480
Detected compute capability: 2.0
Detected 15 multiprocessors.
nstep changed to 22
p=10000918814721, 5.103M p/sec, 0.07 CPU cores, 91.9% done. ETA 05 Oct 18:27
Thread 0 completed
Waiting for threads to exit
Sieve complete: 10000000000000 <= p < 10001000000000
Found 3617 factors
count=33405006,sum=0x1c1b8d0e01e3e0ea
Elapsed time: 196.30 sec. (0.01 init + 196.29 sieve) at 5094920 p/sec.
Last fiddled with by frmky on 2010-10-06 at 01:35

2010-10-06, 02:16  #9
Account Deleted
"Tim Sorbera"
Aug 2006
San Antonio, TX USA
10AB_{16} Posts 
Wow! That makes it 55 hours/T (or 2.3 days/T) with the GPU vs 29 days/T with the quad. And that was with an entire quad working on it... Granted, this is an expensive GPU ($450 at a quick glance at Newegg), but that's still a huge difference in speed: roughly 12.5 times the throughput of a 32-bit quad. In terms of effectiveness for the cost (ignoring electricity, MB, and other overhead), the quad would have to cost $36 to compare.
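Editor's sketch of the arithmetic behind those figures, for anyone who wants to plug in their own rates. It assumes only the numbers already quoted in the thread (5094920 p/sec from the log in post #8, 29 days/T for the quad, $450 for the GPU); the variable and function names are made up for illustration, and unrounded results differ slightly from the rounded ones in the post.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def hours_per_T(p_per_sec):
    """Hours needed to sieve a range of 1T (1e12) in p at the given rate."""
    return 1e12 / p_per_sec / SECONDS_PER_HOUR


gpu_rate = 5_094_920   # p/sec, from the GTX 480 log in post #8
quad_days_per_T = 29.0  # days/T for the 32-bit quad, as quoted in the thread
gpu_price = 450.0       # USD, the Newegg figure quoted in the thread

gpu_hours = hours_per_T(gpu_rate)
gpu_days = gpu_hours / HOURS_PER_DAY
speedup = quad_days_per_T / gpu_days
break_even_quad_price = gpu_price / speedup

print(f"GPU: {gpu_hours:.1f} hours/T ({gpu_days:.2f} days/T)")
print(f"Speedup over the quad: {speedup:.1f}x")
print(f"Quad price for equal throughput per dollar: ${break_even_quad_price:.2f}")
```

The same helper can be reused once real p/sec figures for other cards are posted.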
Last fiddled with by MiniGeek on 2010-10-06 at 02:23
2010-10-06, 02:45  #10
A Sunny Moo
Aug 2007
USA (GMT-5)
3·2,083 Posts 
Quote:
I'll get a range reserved for and started on Gary's GPU sometime today or tomorrow; at that point I'll be able to give some exact figures for p/sec.

2010-10-06, 03:38  #11
Jan 2005
Caught in a sieve
5·79 Posts 
From what I can tell, a stock-clocked GTX 460 will only be about half as fast as a stock-clocked GTX 480. It seems to have to do with the 460 needing instruction-level parallelism. The only two good ways I can think of to provide that are to either recompile the client with the latest CUDA SDK or to vectorize the work like I did for AMD. Neither option is all that easy, but compiling with the latest SDK on a clean VM is probably easier.
The good news is that you're only at 10T right now. Somewhere between 21T and 40T you'll get a significant speed boost. 