20100926, 18:42  #1 
A Sunny Moo
Aug 2007
USA (GMT5)
14151_{8} Posts 
GPU sieving drive for k<=1001 n=1M2M
Update (10/5/10): in expanding this sieve to improve efficiency, we decided to start a new reservation thread rather than continuing this one. See here.
Hi all, With the recent advances in the development of sieve programs for CUDA and OpenCL GPUs, and Gary's recent purchase of a GTX 460 GPU for the testing and use of such, we got to thinking it's high time to start putting all those GPUs to work that many of you have been waiting to help out at NPLB with for a while. We were originally going to start our next big sieving effort, that of k=4001001 for n=1M2M, some months down the road, but after seeing the wealth of GPU resources currently available for sieving, we decided to move it up to now. The program we'll be using is ppsieve, written by Ken Brazier, which is a sieve for k*2^n+1 numbers optimized for the kheavy ranges that projects like NPLB and PrimeGrid do. PrimeGrid has already been using this program to great effect on their Proth Prime Search Extended (k<10000) subproject, as it is rather faster than sr2sieve for kheavy sieves. And that's just on CPUs; GPU versions of ppsieve for CUDA (nVidia) and OpenCL (ATI/AMD) have been developed, and they're many times faster than the CPU version of ppsieve (and thus even faster relative to sr2sieve)! One interesting consequence of using ppsieve is that since it scales based on the highest k in the sieve, rather than the number of k's as does sr2sieve. (For you geekheads out there who understand BigO notation, ppsieve is O((nmaxnmin)/log2(p/kmax)) and sr2sieve is O(k_count*sqrt(nmaxnmin)).) This means that we can include k<400 in the sieve effectively for free. So we'll sieve the entire range of k=31001, n=1M2M. For k<300, which is primarily searched by Riesel Prime Search (RPS) and individual searchers within both RPS and NPLB, the plan is to make the sieve file publicly available once we've reached optimal depth. For k=300400, we'll continue on past the p=140T sieve depth that range is currently at and switch to the bettersieved file for the individualk drive. (Again, this is included effectively for free, so it's easier just to sieve the entire range from scratch rather than skipping k=300400 until p=140T and merging at that point.) To get started: Download ppsieveCUDA (if you have an nVidia GPU) or ppsieveOpenCL (if you have an ATI/AMD GPU) from https://sites.google.com/site/kenscode/primeprograms. Note that you will need to have the respective CUDA or OpenCL toolkits installed on your computer before these will work. If you've done GPU crunching in the past, then you're probably all set; if not, post in this thread or email/PM me and we'll be glad to help you get set up. (Note that I personally have no experience with crunching on ATI GPUs, only nVidia ones; so if you have a question about the OpenCL app, you'll be better off posting it in this thread so someone else can answer it.) Open the file ppconfig.txt and add a line like this somewhere in the file: mthreads=2048 This is equivalent to m on the command line and sets the number of multiprocessor threads that ppsieve will use on your GPU. The optimal setting for this on a particular sieve varies from GPU to GPU and may require a bit of experimentation to determine. If you have an nVidia GTX 460, I've already done this part for you: 2048 is what you want. For other GPUs, I'd recommend starting at 2048 and playing around with it from there. As we try this sieve on more GPUs, I'll summarize the determined optimal settings for various GPUs in a table in the next post. Also add a line somewhere in ppconfig.txt such that the word "riesel" is on a line all by itself. This is very important; if you forget this, you will be cranking out totally useless factors on the Proth (k*2^n+1) side. You can also tweak checkpoint= and report= as desired. Additionally, early in the sieve factors will be produced very quickly; if they are coming at a rate of more than ~1 factor/sec., you'll want to uncomment the line "quiet" farther down the file to keep factors from being printed. (Lots of factors being printed to the screen can slow down the program.) Download the sieve file from http://www.noprimeleftbehind.net/sie...M_135G.txt.bz2 and extract it to the same folder you put ppsieve in. Disclaimer: this file is very big, especially when uncompressed (~80 MB). Reserve a range in this thread. A p=50G range is probably a good size to start with to get a feel for how fast your GPU will crunch (GPU speeds can vary quite a bit from low to highend models). Run the range with a command line like this: ppsievecudax86_64* i sieve_k31001_n1M2M_135G.txt p 135G P 500G Replace * with windows or linux as appropriate. If you are on a 32bit operating system, use "ppsievecudax86*" instead. In the above example, the range is p=135G500G; modify this as necessary for your range. K=10e3, M=10e6, G=10e9, T=10e12, and P=10e15 are all accepted by the program as abbreviations, as is XeY notation (for X*10^Y). When the range completes, zip up ppfactors.txt and email it to me at max@noprimeleftbehind.net. It also wouldn't hurt to CC in Gary (gbarnes017 at gmail dot com or gary@noprimeleftbehind.net) so we have a backup copy of the factors. Rinse, lather, and repeat! FYI, the GPU versions of ppsieve do use a little bit of CPU time for the small prime sieve portion of the algorithm alongside the main algorithm running on the GPU; for this sieve, the amount of CPU used is very small (around 56% for a fast GTX 460, less if you have a slower GPU). You can therefore continue to run other crunching applications (such as LLRnet or PRPnet) on your CPU as usual alongside ppsieve on the GPU. We'll be aiming for GPUoptimal sieve depth here; that is, the point at which it takes the same amount of time to find a factor on a GPU (Gary's GTX 460 will be used as reference) as it takes to run an LLR test on a CPU (again, an Intel C2Q Q6600 of Gary's, the same machine the GPU is on). Note that because of this, it is EXTREMELY INEFFICIENT to run a CPU on this sieve. If you really want to, you can download the CPU version of ppsieve and chip in, but be forewarned: your CPU would be put to much better use doing LLR tests. It will sieve so much more slowly than a GPU that it's a total waste of time to use it when we're going for GPUoptimal depth. Let's see what those GPUs can do! Max Reservations: Code:
Prange reserved by status est. completion date 0G135G gd_barnes complete 135G500G gd_barnes in progress ? 9/27: RESERVATIONS CLOSED Last fiddled with by mdettweiler on 20101005 at 16:02 Reason: see new thread 
20100926, 18:45  #2 
A Sunny Moo
Aug 2007
USA (GMT5)
3×2,083 Posts 
Here's a summary of the empirically determined optimal m values for various GPUs on this sieve:
Code:
Model Optimal m value  nVidia GTX 460 2048 
20100926, 18:51  #3 
Aug 2010
2·7·47 Posts 
As you may already know, I'm testing k=161.
If I use the sieve file that you will provide for that k, should I include NPLB in my prover code (along with RPS, of course)? Last fiddled with by MooMoo2 on 20100926 at 18:52 
20100926, 18:56  #4 
A Sunny Moo
Aug 2007
USA (GMT5)
3×2,083 Posts 
Headsup: ppsieveCUDA 0.2.0 alpha
The main "stable" version of ppsieveCUDA, which you'll get if you click on the link at https://sites.google.com/site/kenscode/primeprograms, is version 0.6.1a. However, just a couple of days ago Ken released a new version, 0.2.0 alpha, which is much fastersometimes as much as 2 times faster in testing over at PrimeGrid.
0.2.0 alpha can be downloaded at: https://sites.google.com/site/kensco...edirects=0&d=1 Note that as of now, the only binaries available for it are the "BOINC version". It can still be used for standalone sieves like here, but note that it prints all of its stderr output to the file stderr.txt instead of to the screen. It also does not print the estimated time of completion, since under BOINC that is automatically calculated by the BOINC software. I'm using this version right now for Gary's currently reserved range of 135G500G; hence why the est. completion date is marked as "?". ATI GPU owners stay tuned: the improvements of this version are expected to be implemented for OpenCL soon as well. 
20100926, 18:59  #5  
A Sunny Moo
Aug 2007
USA (GMT5)
3·2,083 Posts 
Quote:
Note that for now, you can of course still keep reporting primes without NPLB; it's only once you start using the sieve file that the stipulation takes effect. 

20100926, 20:11  #6 
May 2007
Kansas; USA
2^{2}×19×137 Posts 
k<300 is intended for future doublecheck efforts. How such k's are originally reported to the top5000 site by whomever found them is a nonissue. It's up to the person who tested the k. Since it takes no extra time to sieve it, we may as well save ourselves the future effort.
Last fiddled with by gd_barnes on 20100926 at 20:13 
20100927, 06:38  #7 
A Sunny Moo
Aug 2007
USA (GMT5)
14151_{8} Posts 
Reservations closed (for now)
Hi all,
We discovered today that we can gain quite a bit of efficiency by doing this drive a little diferently. I can't go into much detail for now since various details are yet to be finalized; all I can tell you for now is that it will be bigger and better. The new sieve has been started and in fact it passed this one's depth rather quicklyso reservations are closed for now, since any work done here at this point would be wasted. Stay tuned...hopefully we'll have something concrete within a week or so. Max 
20101002, 12:52  #8 
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·3^{3}·109 Posts 
Are you sure that the runtime for sr2sieve is not O(sqrt(k_count)*sqrt(nmaxnmin))? Doubling the number of ks in sr2sieve doesn't multiply the runtime by 2. I am pretty certain I have read sqrt(k_count) before.

20101002, 15:12  #9 
Jan 2005
Caught in a sieve
5·79 Posts 
Having looked into the math, I'm convinced that Max is correct. (Though he may have gotten that figure from me.)
Although going from one to two k's in sr2sieve doesn't double the runtime, going from two to three should increase the runtime as much as going from one to two did. Or maybe I'm wrong and Geoff found a better way? 
20101002, 19:26  #10  
Just call me Henry
"David"
Sep 2007
Cambridge (GMT/BST)
2·3^{3}·109 Posts 
Quote:
I sieved ks(1,3,5 etc excluding squares because of algebraics) to 1e8 with srsieve and then sieved a range of 2e8 starting from 1e12. A spreadsheet gave me a regression line with formula 23.76*k_count^0.51 which is very close to 1/2 which is sqrt(k_count). Code:
k_count seconds 1 27 4 42 9 67 12 80 16 94 18 101 20 118 24 129 30 148 Last fiddled with by henryzz on 20101002 at 19:27 

20101002, 21:30  #11  
A Sunny Moo
Aug 2007
USA (GMT5)
3·2,083 Posts 
Quote:
Of course, note that BigO notation gives only the worst case runtime; often the actual runtime will be lower. I suspect that something of this sort is happening here. From the various large sieves that NPLB's done with sr2sieve, we've empirically determined that there is a significant speed benefit as you add additional k's to a sieve. 

Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Team drive #14: k=6001001 n=1M2M  mdettweiler  No Prime Left Behind  10  20210313 22:32 
Team drive #7 k=8001001 n=600K1M  gd_barnes  No Prime Left Behind  127  20110715 14:25 
Doublecheck drive #1: k=31001 n=100K260K  mdettweiler  No Prime Left Behind  266  20101221 01:41 
Team drive #1: k=4001001 n=333.2K600K  gd_barnes  No Prime Left Behind  675  20090224 16:37 
Team drive #2: k=4001001 n=260K333.2K  gd_barnes  No Prime Left Behind  154  20080331 02:59 