2010-09-26, 18:42   #1
mdettweiler

GPU sieving drive for k<=1001 n=1M-2M

Update (10/5/10): in expanding this sieve to improve efficiency, we decided to start a new reservation thread rather than continuing this one. See here.

Hi all,

With the recent advances in the development of sieve programs for CUDA and OpenCL GPUs, and Gary's recent purchase of a GTX 460 GPU for testing and using them, we got to thinking it's high time to put to work all those GPUs that many of you have been waiting to contribute to NPLB. We were originally going to start our next big sieving effort, that of k=400-1001 for n=1M-2M, some months down the road, but after seeing the wealth of GPU resources currently available for sieving, we decided to move it up to now.

The program we'll be using is ppsieve, written by Ken Brazier, which is a sieve for k*2^n+-1 numbers optimized for the k-heavy ranges that projects like NPLB and PrimeGrid do. PrimeGrid has already been using this program to great effect on their Proth Prime Search Extended (k<10000) subproject, as it is rather faster than sr2sieve for k-heavy sieves. And that's just on CPUs; GPU versions of ppsieve for CUDA (nVidia) and OpenCL (ATI/AMD) have been developed, and they're many times faster than the CPU version of ppsieve (and thus even faster relative to sr2sieve)!

One interesting consequence of using ppsieve is that it scales based on the highest k in the sieve, rather than on the number of k's as sr2sieve does. (For you geekheads out there who understand Big-O notation, ppsieve is O((nmax-nmin)/log2(p/kmax)) and sr2sieve is O(k_count*sqrt(nmax-nmin)).) This means that we can include k<400 in the sieve effectively for free, so we'll sieve the entire range of k=3-1001, n=1M-2M. For k<300, which is primarily searched by Riesel Prime Search (RPS) and individual searchers within both RPS and NPLB, the plan is to make the sieve file publicly available once we've reached optimal depth. For k=300-400, we'll continue on past the p=140T sieve depth that range is currently at and switch to the better-sieved file for the individual-k drive. (Again, this is included effectively for free, so it's easier just to sieve the entire range from scratch rather than skipping k=300-400 until p=140T and merging at that point.)
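To make the "effectively for free" point concrete, here's a rough Python sketch that plugs the two growth terms quoted above into a comparison of k=400-1001 against k=3-1001. It is only an illustration: the function names are made up for this example, the k count assumes every odd k in the range is in the sieve, and Big-O terms hide constant factors, so the numbers say how each program's cost grows, not how the two programs compare with each other.

```python
import math

def ppsieve_cost(n_min, n_max, p, k_max):
    """Growth term quoted in the post: (nmax - nmin) / log2(p / kmax)."""
    return (n_max - n_min) / math.log2(p / k_max)

def sr2sieve_cost(k_count, n_min, n_max):
    """Growth term quoted in the post: k_count * sqrt(nmax - nmin)."""
    return k_count * math.sqrt(n_max - n_min)

def odd_k_count(k_lo, k_hi):
    """Assumption for this sketch: every odd k in [k_lo, k_hi] is sieved."""
    return sum(1 for k in range(k_lo, k_hi + 1) if k % 2 == 1)

N_MIN, N_MAX = 1_000_000, 2_000_000
P = 135 * 10**9  # p=135G, where the first open range starts

narrow_count = odd_k_count(400, 1001)  # k=400-1001 only
wide_count = odd_k_count(3, 1001)      # k=3-1001

# ppsieve's term depends on kmax, which is 1001 either way.
pp_narrow = ppsieve_cost(N_MIN, N_MAX, P, 1001)
pp_wide = ppsieve_cost(N_MIN, N_MAX, P, 1001)

# sr2sieve's term grows in proportion to the number of k's.
sr_ratio = (sr2sieve_cost(wide_count, N_MIN, N_MAX)
            / sr2sieve_cost(narrow_count, N_MIN, N_MAX))

print("ppsieve term, narrow vs wide:", pp_narrow, pp_wide)
print("sr2sieve term grows by a factor of", round(sr_ratio, 3))
```

The ppsieve term is identical for both ranges because adding k<400 doesn't change kmax, while the sr2sieve term grows with the k count.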

To get started:

-Download ppsieve-CUDA (if you have an nVidia GPU) or ppsieve-OpenCL (if you have an ATI/AMD GPU) from Note that you will need to have the respective CUDA or OpenCL toolkits installed on your computer before these will work. If you've done GPU crunching in the past, then you're probably all set; if not, post in this thread or email/PM me and we'll be glad to help you get set up. (Note that I personally have no experience with crunching on ATI GPUs, only nVidia ones; so if you have a question about the OpenCL app, you'll be better off posting it in this thread so someone else can answer it.)

-Open the file ppconfig.txt and add a line like this somewhere in the file:
This is equivalent to -m on the command line and sets the number of multiprocessor threads that ppsieve will use on your GPU. The optimal setting for this on a particular sieve varies from GPU to GPU and may require a bit of experimentation to determine. If you have an nVidia GTX 460, I've already done this part for you: 2048 is what you want. For other GPUs, I'd recommend starting at 2048 and playing around with it from there. As we try this sieve on more GPUs, I'll summarize the determined optimal settings for various GPUs in a table in the next post.

-Also add a line somewhere in ppconfig.txt such that the word "riesel" is on a line all by itself. This is very important; if you forget this, you will be cranking out totally useless factors on the Proth (k*2^n+1) side. You can also tweak checkpoint= and report= as desired. Additionally, early in the sieve factors will be produced very quickly; if they are coming at a rate of more than ~1 factor/sec., you'll want to uncomment the line "quiet" farther down the file to keep factors from being printed. (Lots of factors being printed to the screen can slow down the program.)

-Download the sieve file from and extract it to the same folder you put ppsieve in. Disclaimer: this file is very big, especially when uncompressed (~80 MB).

-Reserve a range in this thread. A p=50G range is probably a good size to start with to get a feel for how fast your GPU will crunch (GPU speeds can vary quite a bit from low- to high-end models).

-Run the range with a command line like this:
ppsieve-cuda-x86_64-* -i sieve_k3-1001_n1M-2M_135G.txt -p 135G -P 500G
Replace * with windows or linux as appropriate. If you are on a 32-bit operating system, use "ppsieve-cuda-x86-*" instead. In the above example, the range is p=135G-500G; modify this as necessary for your range. K=10^3, M=10^6, G=10^9, T=10^12, and P=10^15 are all accepted by the program as abbreviations, as is XeY notation (for X*10^Y).

-When the range completes, zip up ppfactors.txt and email it to me at It also wouldn't hurt to CC in Gary (gbarnes017 at gmail dot com or so we have a backup copy of the factors.

-Rinse, lather, and repeat!
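If you'd like to sanity-check a reserved range before launching it, something like the following Python sketch may help. It expands the K/M/G/T/P and XeY abbreviations described above and assembles the same command line shown in the steps. Note that parse_bound and build_command are hypothetical helpers written for this post, not part of ppsieve, and the parsing is only my reading of the abbreviations, not the program's actual parser.

```python
import re

# Abbreviations as described in the steps above.
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}

def parse_bound(text):
    """Turn '135G' or '5e11' into an integer (hypothetical helper)."""
    text = text.strip()
    m = re.fullmatch(r"(\d+)[eE](\d+)", text)
    if m:
        # XeY notation means X * 10^Y
        return int(m.group(1)) * 10 ** int(m.group(2))
    m = re.fullmatch(r"(\d+)([KMGTP])?", text)
    if m is None:
        raise ValueError("unrecognized bound: %r" % text)
    return int(m.group(1)) * SUFFIXES.get(m.group(2), 1)

def build_command(p_start, p_end, platform="linux", bits=64,
                  sieve_file="sieve_k3-1001_n1M-2M_135G.txt"):
    """Assemble the argument list for one range (hypothetical helper)."""
    if parse_bound(p_start) >= parse_bound(p_end):
        raise ValueError("range start must be below range end")
    arch = "x86_64" if bits == 64 else "x86"
    binary = "ppsieve-cuda-%s-%s" % (arch, platform)
    return [binary, "-i", sieve_file, "-p", p_start, "-P", p_end]

print(" ".join(build_command("135G", "500G")))
```

Only the binary name pattern and the -i, -p, and -P options come from the steps above; everything else is scaffolding for the example.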

FYI, the GPU versions of ppsieve do use a little bit of CPU time for the small prime sieve portion of the algorithm alongside the main algorithm running on the GPU; for this sieve, the amount of CPU used is very small (around 5-6% for a fast GTX 460, less if you have a slower GPU). You can therefore continue to run other crunching applications (such as LLRnet or PRPnet) on your CPU as usual alongside ppsieve on the GPU.

We'll be aiming for GPU-optimal sieve depth here; that is, the point at which it takes the same amount of time to find a factor on a GPU (Gary's GTX 460 will be used as reference) as it takes to run an LLR test on a CPU (an Intel C2Q Q6600, also Gary's, in the same machine as the GPU). Note that because of this, it is EXTREMELY INEFFICIENT to run a CPU on this sieve. If you really want to, you can download the CPU version of ppsieve and chip in, but be forewarned: your CPU would be put to much better use doing LLR tests. It will sieve so much more slowly than a GPU that it's a total waste of time to use it when we're going for GPU-optimal depth.
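As a rough illustration of that stopping rule, the Python sketch below compares the average time to find one factor against the time for one LLR test. The function names and all the numbers are made up for the example (they are not measurements from Gary's machine), and the rule itself is just the one stated above: keep sieving while a factor is cheaper than a test.

```python
def seconds_per_factor(seconds_elapsed, factors_found):
    """Average sieve time spent per factor over a finished range."""
    if factors_found <= 0:
        raise ValueError("need at least one factor to estimate a rate")
    return seconds_elapsed / factors_found

def keep_sieving(sec_per_factor, llr_sec_per_test):
    """True while removing a candidate by sieving beats testing it with LLR."""
    return sec_per_factor < llr_sec_per_test

# Hypothetical figures, for illustration only.
rate = seconds_per_factor(seconds_elapsed=3600, factors_found=40)
print("seconds per factor:", rate)
print("keep sieving?", keep_sieving(rate, llr_sec_per_test=1200))
```

Once the seconds-per-factor figure on the reference GPU climbs to match the LLR test time on the reference CPU, the sieve has reached the depth described above.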

Let's see what those GPUs can do!


P-range         reserved by     status       est. completion date
    0G-135G     gd_barnes       complete
  135G-500G     gd_barnes       in progress  ?

Last fiddled with by mdettweiler on 2010-10-05 at 16:02 Reason: see new thread