mersenneforum.org GPU sieving drive for k<=1001 n=1M-2M
 Register FAQ Search Today's Posts Mark Forums Read

 2010-09-26, 18:42 #1 mdettweiler A Sunny Moo     Aug 2007 USA (GMT-5) 3·2,083 Posts GPU sieving drive for k<=1001 n=1M-2M Update (10/5/10): in expanding this sieve to improve efficiency, we decided to start a new reservation thread rather than continuing this one. See here. Hi all, With the recent advances in the development of sieve programs for CUDA and OpenCL GPUs, and Gary's recent purchase of a GTX 460 GPU for the testing and use of such, we got to thinking it's high time to start putting all those GPUs to work that many of you have been waiting to help out at NPLB with for a while. We were originally going to start our next big sieving effort, that of k=400-1001 for n=1M-2M, some months down the road, but after seeing the wealth of GPU resources currently available for sieving, we decided to move it up to now. The program we'll be using is ppsieve, written by Ken Brazier, which is a sieve for k*2^n+-1 numbers optimized for the k-heavy ranges that projects like NPLB and PrimeGrid do. PrimeGrid has already been using this program to great effect on their Proth Prime Search Extended (k<10000) subproject, as it is rather faster than sr2sieve for k-heavy sieves. And that's just on CPUs; GPU versions of ppsieve for CUDA (nVidia) and OpenCL (ATI/AMD) have been developed, and they're many times faster than the CPU version of ppsieve (and thus even faster relative to sr2sieve)! One interesting consequence of using ppsieve is that since it scales based on the highest k in the sieve, rather than the number of k's as does sr2sieve. (For you geekheads out there who understand Big-O notation, ppsieve is O((nmax-nmin)/log2(p/kmax)) and sr2sieve is O(k_count*sqrt(nmax-nmin)).) This means that we can include k<400 in the sieve effectively for free. So we'll sieve the entire range of k=3-1001, n=1M-2M. For k<300, which is primarily searched by Riesel Prime Search (RPS) and individual searchers within both RPS and NPLB, the plan is to make the sieve file publicly available once we've reached optimal depth. For k=300-400, we'll continue on past the p=140T sieve depth that range is currently at and switch to the better-sieved file for the individual-k drive. (Again, this is included effectively for free, so it's easier just to sieve the entire range from scratch rather than skipping k=300-400 until p=140T and merging at that point.) To get started: -Download ppsieve-CUDA (if you have an nVidia GPU) or ppsieve-OpenCL (if you have an ATI/AMD GPU) from https://sites.google.com/site/kenscode/prime-programs. Note that you will need to have the respective CUDA or OpenCL toolkits installed on your computer before these will work. If you've done GPU crunching in the past, then you're probably all set; if not, post in this thread or email/PM me and we'll be glad to help you get set up. (Note that I personally have no experience with crunching on ATI GPUs, only nVidia ones; so if you have a question about the OpenCL app, you'll be better off posting it in this thread so someone else can answer it.) -Open the file ppconfig.txt and add a line like this somewhere in the file: mthreads=2048 This is equivalent to -m on the command line and sets the number of multiprocessor threads that ppsieve will use on your GPU. The optimal setting for this on a particular sieve varies from GPU to GPU and may require a bit of experimentation to determine. If you have an nVidia GTX 460, I've already done this part for you: 2048 is what you want. For other GPUs, I'd recommend starting at 2048 and playing around with it from there. As we try this sieve on more GPUs, I'll summarize the determined optimal settings for various GPUs in a table in the next post. -Also add a line somewhere in ppconfig.txt such that the word "riesel" is on a line all by itself. This is very important; if you forget this, you will be cranking out totally useless factors on the Proth (k*2^n+1) side. You can also tweak checkpoint= and report= as desired. Additionally, early in the sieve factors will be produced very quickly; if they are coming at a rate of more than ~1 factor/sec., you'll want to uncomment the line "quiet" farther down the file to keep factors from being printed. (Lots of factors being printed to the screen can slow down the program.) -Download the sieve file from http://www.noprimeleftbehind.net/sie...M_135G.txt.bz2 and extract it to the same folder you put ppsieve in. Disclaimer: this file is very big, especially when uncompressed (~80 MB). -Reserve a range in this thread. A p=50G range is probably a good size to start with to get a feel for how fast your GPU will crunch (GPU speeds can vary quite a bit from low- to high-end models). -Run the range with a command line like this: ppsieve-cuda-x86_64-* -i sieve_k3-1001_n1M-2M_135G.txt -p 135G -P 500G Replace * with windows or linux as appropriate. If you are on a 32-bit operating system, use "ppsieve-cuda-x86-*" instead. In the above example, the range is p=135G-500G; modify this as necessary for your range. K=10e3, M=10e6, G=10e9, T=10e12, and P=10e15 are all accepted by the program as abbreviations, as is XeY notation (for X*10^Y). -When the range completes, zip up ppfactors.txt and email it to me at max@noprimeleftbehind.net. It also wouldn't hurt to CC in Gary (gbarnes017 at gmail dot com or gary@noprimeleftbehind.net) so we have a backup copy of the factors. -Rinse, lather, and repeat! FYI, the GPU versions of ppsieve do use a little bit of CPU time for the small prime sieve portion of the algorithm alongside the main algorithm running on the GPU; for this sieve, the amount of CPU used is very small (around 5-6% for a fast GTX 460, less if you have a slower GPU). You can therefore continue to run other crunching applications (such as LLRnet or PRPnet) on your CPU as usual alongside ppsieve on the GPU. We'll be aiming for GPU-optimal sieve depth here; that is, the point at which it takes the same amount of time to find a factor on a GPU (Gary's GTX 460 will be used as reference) as it takes to run an LLR test on a CPU (again, an Intel C2Q Q6600 of Gary's, the same machine the GPU is on). Note that because of this, it is EXTREMELY INEFFICIENT to run a CPU on this sieve. If you really want to, you can download the CPU version of ppsieve and chip in, but be forewarned: your CPU would be put to much better use doing LLR tests. It will sieve so much more slowly than a GPU that it's a total waste of time to use it when we're going for GPU-optimal depth. Let's see what those GPUs can do! Max Reservations: Code: P-range reserved by status est. completion date 0G-135G gd_barnes complete 135G-500G gd_barnes in progress ? 9/27: RESERVATIONS CLOSED Last fiddled with by mdettweiler on 2010-10-05 at 16:02 Reason: see new thread
 2010-09-26, 18:45 #2 mdettweiler A Sunny Moo     Aug 2007 USA (GMT-5) 3×2,083 Posts Here's a summary of the empirically determined optimal -m values for various GPUs on this sieve: Code: Model Optimal -m value ----------------------------------- nVidia GTX 460 2048 If you have a GPU that's not in this list, please experiment with various values (2048 is probably a good starting point) and let us know what works best! Note that all -m values must be multiples of 128.
 2010-09-26, 18:51 #3 MooMoo2     Aug 2010 56410 Posts As you may already know, I'm testing k=161. If I use the sieve file that you will provide for that k, should I include NPLB in my prover code (along with RPS, of course)? Last fiddled with by MooMoo2 on 2010-09-26 at 18:52
 2010-09-26, 18:56 #4 mdettweiler A Sunny Moo     Aug 2007 USA (GMT-5) 3×2,083 Posts Heads-up: ppsieve-CUDA 0.2.0 alpha The main "stable" version of ppsieve-CUDA, which you'll get if you click on the link at https://sites.google.com/site/kenscode/prime-programs, is version 0.6.1a. However, just a couple of days ago Ken released a new version, 0.2.0 alpha, which is much faster--sometimes as much as 2 times faster in testing over at PrimeGrid. 0.2.0 alpha can be downloaded at: https://sites.google.com/site/kensco...edirects=0&d=1 Note that as of now, the only binaries available for it are the "BOINC version". It can still be used for standalone sieves like here, but note that it prints all of its stderr output to the file stderr.txt instead of to the screen. It also does not print the estimated time of completion, since under BOINC that is automatically calculated by the BOINC software. I'm using this version right now for Gary's currently reserved range of 135G-500G; hence why the est. completion date is marked as "?". ATI GPU owners stay tuned: the improvements of this version are expected to be implemented for OpenCL soon as well.
2010-09-26, 18:59   #5
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

3·2,083 Posts

Quote:
 Originally Posted by MooMoo2 As you may already know, I'm testing k=161. If I use the sieve file that you will provide for that k, should I include NPLB in my prover code (along with RPS, of course)?
Yes--when we finish the sieve and post an announcement of the sieve file's availability, I was going to go into more detail about this. The only requirement of using the sieve file is that you include NPLB alongside any other projects when reporting primes; and we'd also prefer if you could send us your residuals at the end, since it will save us some work down the road when we doublecheck this range.

Note that for now, you can of course still keep reporting primes without NPLB; it's only once you start using the sieve file that the stipulation takes effect.

 2010-09-26, 20:11 #6 gd_barnes     May 2007 Kansas; USA 236768 Posts k<300 is intended for future doublecheck efforts. How such k's are originally reported to the top-5000 site by whomever found them is a non-issue. It's up to the person who tested the k. Since it takes no extra time to sieve it, we may as well save ourselves the future effort. Last fiddled with by gd_barnes on 2010-09-26 at 20:13
 2010-09-27, 06:38 #7 mdettweiler A Sunny Moo     Aug 2007 USA (GMT-5) 3·2,083 Posts Reservations closed (for now) Hi all, We discovered today that we can gain quite a bit of efficiency by doing this drive a little diferently. I can't go into much detail for now since various details are yet to be finalized; all I can tell you for now is that it will be bigger and better. The new sieve has been started and in fact it passed this one's depth rather quickly--so reservations are closed for now, since any work done here at this point would be wasted. Stay tuned...hopefully we'll have something concrete within a week or so. Max
2010-10-02, 12:52   #8
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT)

22×1,423 Posts

Quote:
 Originally Posted by mdettweiler For you geekheads out there who understand Big-O notation, ppsieve is O((nmax-nmin)/log2(p/kmax)) and sr2sieve is O(k_count*sqrt(nmax-nmin)).
Are you sure that the runtime for sr2sieve is not O(sqrt(k_count)*sqrt(nmax-nmin))? Doubling the number of ks in sr2sieve doesn't multiply the runtime by 2. I am pretty certain I have read sqrt(k_count) before.

 2010-10-02, 15:12 #9 Ken_g6     Jan 2005 Caught in a sieve 6128 Posts Having looked into the math, I'm convinced that Max is correct. (Though he may have gotten that figure from me.) Although going from one to two k's in sr2sieve doesn't double the runtime, going from two to three should increase the runtime as much as going from one to two did. Or maybe I'm wrong and Geoff found a better way?
2010-10-02, 19:26   #10
henryzz
Just call me Henry

"David"
Sep 2007
Cambridge (GMT)

22×1,423 Posts

Quote:
 Originally Posted by Ken_g6 Having looked into the math, I'm convinced that Max is correct. (Though he may have gotten that figure from me.) Although going from one to two k's in sr2sieve doesn't double the runtime, going from two to three should increase the runtime as much as going from one to two did. Or maybe I'm wrong and Geoff found a better way?
Well with testing sr2sieve seems to be sqrt(k_count)
I sieved ks(1,3,5 etc excluding squares because of algebraics) to 1e8 with srsieve and then sieved a range of 2e8 starting from 1e12. A spreadsheet gave me a regression line with formula 23.76*k_count^0.51 which is very close to 1/2 which is sqrt(k_count).
Code:
k_count      seconds
1              27
4              42
9              67
12             80
16             94
18             101
20             118
24             129
30             148

Last fiddled with by henryzz on 2010-10-02 at 19:27

2010-10-02, 21:30   #11
mdettweiler
A Sunny Moo

Aug 2007
USA (GMT-5)

624910 Posts

Quote:
 Originally Posted by Ken_g6 Having looked into the math, I'm convinced that Max is correct. (Though he may have gotten that figure from me.) Although going from one to two k's in sr2sieve doesn't double the runtime, going from two to three should increase the runtime as much as going from one to two did. Or maybe I'm wrong and Geoff found a better way?
Yes, I got that figure from you. I haven't ever actually looked at the sr*sieve source code myself.

Of course, note that Big-O notation gives only the worst case runtime; often the actual runtime will be lower. I suspect that something of this sort is happening here. From the various large sieves that NPLB's done with sr2sieve, we've empirically determined that there is a significant speed benefit as you add additional k's to a sieve.

 Similar Threads Thread Thread Starter Forum Replies Last Post mdettweiler No Prime Left Behind 9 2014-09-02 01:21 gd_barnes No Prime Left Behind 127 2011-07-15 14:25 mdettweiler No Prime Left Behind 266 2010-12-21 01:41 gd_barnes No Prime Left Behind 675 2009-02-24 16:37 gd_barnes No Prime Left Behind 154 2008-03-31 02:59

All times are UTC. The time now is 07:50.

Wed Aug 5 07:50:22 UTC 2020 up 19 days, 3:37, 1 user, load averages: 1.21, 1.36, 1.30