GPU sieving drive for k<=1001, n=1M-2M
[B]Update (10/5/10): in expanding this sieve to improve efficiency, we decided to start a new reservation thread rather than continuing this one. See [url=http://www.mersenneforum.org/showthread.php?p=232614]here[/url].[/B]
Hi all,

With the recent advances in the development of sieve programs for CUDA and OpenCL GPUs, and Gary's recent purchase of a GTX 460 GPU for testing and use of such, we got to thinking it's high time to start putting to work all those GPUs that many of you have been waiting to help out NPLB with for a while. We were originally going to start our next big sieving effort, that of k=400-1001 for n=1M-2M, some months down the road, but after seeing the wealth of GPU resources currently available for sieving, we decided to move it up to now.

The program we'll be using is ppsieve, written by Ken Brazier, which is a sieve for k*2^n+/-1 numbers optimized for the k-heavy ranges that projects like NPLB and PrimeGrid do. PrimeGrid has already been using this program to great effect on their Proth Prime Search Extended (k<10000) subproject, as it is rather faster than sr2sieve for k-heavy sieves. And that's just on CPUs; GPU versions of ppsieve for CUDA (nVidia) and OpenCL (ATI/AMD) have been developed, and they're many times faster than the CPU version of ppsieve (and thus even faster relative to sr2sieve)!

One interesting consequence of using ppsieve is that it scales based on the highest k in the sieve, rather than the number of k's as sr2sieve does. (For you geekheads out there who understand Big-O notation, ppsieve is O((nmax-nmin)/log2(p/kmax)) and sr2sieve is O(k_count*sqrt(nmax-nmin)).) This means that we can include k<400 in the sieve effectively for free. So we'll sieve the entire range of k=3-1001, n=1M-2M.

For k<300, which is primarily searched by Riesel Prime Search (RPS) and individual searchers within both RPS and NPLB, the plan is to make the sieve file publicly available once we've reached optimal depth. For k=300-400, we'll continue on past the p=140T sieve depth that range is currently at and switch to the better-sieved file for the individual-k drive.
(Again, this is included effectively for free, so it's easier just to sieve the entire range from scratch rather than skipping k=300-400 until p=140T and merging at that point.)

[B]To get started:[/B]

Download ppsieve-CUDA (if you have an nVidia GPU) or ppsieve-OpenCL (if you have an ATI/AMD GPU) from [URL]https://sites.google.com/site/kenscode/primeprograms[/URL]. Note that you will need to have the respective CUDA or OpenCL toolkit installed on your computer before these will work. If you've done GPU crunching in the past, then you're probably all set; if not, post in this thread or email/PM me and we'll be glad to help you get set up. (Note that I personally have no experience with crunching on ATI GPUs, only nVidia ones; so if you have a question about the OpenCL app, you'll be better off posting it in this thread so someone else can answer it.)

Open the file ppconfig.txt and add a line like this somewhere in the file:

[I]mthreads=2048[/I]

This is equivalent to -m on the command line and sets the number of multiprocessor threads that ppsieve will use on your GPU. The optimal setting for this on a particular sieve varies from GPU to GPU and may require a bit of experimentation to determine. If you have an nVidia GTX 460, I've already done this part for you: 2048 is what you want. For other GPUs, I'd recommend starting at 2048 and playing around with it from there. As we try this sieve on more GPUs, I'll summarize the determined optimal settings for various GPUs in a table in the next post.

Also add a line somewhere in ppconfig.txt such that the word "riesel" is on a line all by itself. This is very important; if you forget this, you will be cranking out totally useless factors on the Proth (k*2^n+1) side. You can also tweak checkpoint= and report= as desired.
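Putting those pieces together, here is a sketch of what the relevant lines of ppconfig.txt might look like. Only mthreads= and riesel are described above; the checkpoint= and report= values are my own illustrative guesses, not recommended settings:

```
mthreads=2048
riesel
checkpoint=60
report=60
```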
Additionally, early in the sieve factors will be produced very quickly; if they are coming at a rate of more than ~1 factor/sec., you'll want to uncomment the line "quiet" farther down the file to keep factors from being printed. (Lots of factors being printed to the screen can slow down the program.)

Download the sieve file from [URL]http://www.noprimeleftbehind.net/sieve/sieve_k31001_n1M2M_135G.txt.bz2[/URL] and extract it to the same folder you put ppsieve in. Disclaimer: this file is very big, especially when uncompressed (~80 MB).

Reserve a range in this thread. A p=50G range is probably a good size to start with to get a feel for how fast your GPU will crunch (GPU speeds can vary quite a bit from low-end to high-end models).

Run the range with a command line like this:

[I]ppsieve-cuda-x86_64-* -i sieve_k31001_n1M2M_135G.txt -p 135G -P 500G[/I]

Replace * with windows or linux as appropriate. If you are on a 32-bit operating system, use "ppsieve-cuda-x86-*" instead. In the above example, the range is p=135G-500G; modify this as necessary for your range. K=1e3, M=1e6, G=1e9, T=1e12, and P=1e15 are all accepted by the program as abbreviations, as is XeY notation (for X*10^Y).

When the range completes, zip up ppfactors.txt and email it to me at [EMAIL="max@noprimeleftbehind.net"]max@noprimeleftbehind.net[/EMAIL]. It also wouldn't hurt to CC Gary (gbarnes017 at gmail dot com or [EMAIL="gary@noprimeleftbehind.net"]gary@noprimeleftbehind.net[/EMAIL]) so we have a backup copy of the factors. Rinse, lather, and repeat! :smile:

FYI, the GPU versions of ppsieve do use a little bit of CPU time for the small-prime sieve portion of the algorithm alongside the main algorithm running on the GPU; for this sieve, the amount of CPU used is very small (around 5-6% for a fast GTX 460, less if you have a slower GPU). You can therefore continue to run other crunching applications (such as LLRnet or PRPnet) on your CPU as usual alongside ppsieve on the GPU.
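Incidentally, if you want to script your range bookkeeping, the p-value abbreviations above (K/M/G/T/P and XeY notation) can be parsed with a few lines of Python. This is my own helper sketch, not code from ppsieve:

```python
# Parse p-range values like "135G", "500G", or "5e9" into integers,
# mimicking the suffixes ppsieve accepts (K, M, G, T, P) and XeY notation.
SUFFIXES = {"K": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15}

def parse_p(value: str) -> int:
    value = value.strip()
    suffix = value[-1].upper()
    if suffix in SUFFIXES:
        return int(float(value[:-1]) * SUFFIXES[suffix])
    return int(float(value))  # plain integers and XeY notation

print(parse_p("135G"))  # 135000000000
```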
We'll be aiming for GPU-optimal sieve depth here; that is, the point at which it takes the same amount of time to find a factor on a GPU (Gary's GTX 460 will be used as reference) as it takes to run an LLR test on a CPU (again, an Intel C2Q Q6600 of Gary's, the same machine the GPU is on). Note that because of this, it is EXTREMELY INEFFICIENT to run a CPU on this sieve. If you really want to, you can download the CPU version of ppsieve and chip in, but be forewarned: your CPU would be put to much better use doing LLR tests. It will sieve so much more slowly than a GPU that it's a total waste of time to use it when we're going for GPU-optimal depth.

Let's see what those GPUs can do! :grin:

Max :smile:

Reservations:
[code]
P-range      reserved by   status        est. completion date
0G-135G      gd_barnes     complete
135G-500G    gd_barnes     in progress   ?

[B]9/27: RESERVATIONS CLOSED[/B]
[/code]
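As an aside, the GPU-optimal depth criterion can be sketched numerically. Using the standard Mertens-style estimate that sieving primes in [p1, p2] removes about candidates*(1 - log(p1)/log(p2)) composites, you can check whether pushing the sieve deeper still pays. Every number in this sketch is made up for illustration; none of them are the actual figures for this drive:

```python
import math

def expected_factors(candidates: int, p1: float, p2: float) -> float:
    """Mertens-style estimate of factors found sieving primes in [p1, p2]."""
    return candidates * (1.0 - math.log(p1) / math.log(p2))

def past_optimal_depth(candidates, p1, p2, gpu_seconds, llr_seconds) -> bool:
    """True once a factor costs more GPU time than one LLR test costs on CPU."""
    seconds_per_factor = gpu_seconds / expected_factors(candidates, p1, p2)
    return seconds_per_factor > llr_seconds

# Illustrative numbers only: 1M remaining candidates, sieving p=1T..2T in
# 100k GPU-seconds, versus a hypothetical 5000-second LLR test.
print(past_optimal_depth(1_000_000, 1e12, 2e12, 100_000, 5000))  # False: keep sieving
```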
Here's a summary of the empirically determined optimal m values for various GPUs on this sieve:
[code]
Model             Optimal m value
---------------------------------
nVidia GTX 460    2048
[/code]

If you have a GPU that's not in this list, please experiment with various values (2048 is probably a good starting point) and let us know what works best! Note that all m values must be [b]multiples of 128[/b].
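Since m must be a multiple of 128, a tiny helper can snap an experimental value to the nearest valid one. This is just my own convenience sketch, not part of ppsieve:

```python
def snap_mthreads(m: int) -> int:
    """Round m to the nearest positive multiple of 128, as ppsieve's m requires."""
    return max(128, round(m / 128) * 128)

print(snap_mthreads(2000))  # 2048
```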
As you may already know, I'm testing k=161.
If I use the sieve file that you will provide for that k, should I include NPLB in my prover code (along with RPS, of course)? 
Heads-up: ppsieve-CUDA 0.2.0 alpha
The main "stable" version of ppsieve-CUDA, which you'll get if you click on the link at [URL]https://sites.google.com/site/kenscode/primeprograms[/URL], is version 0.6.1a. However, just a couple of days ago Ken released a new version, 0.2.0 alpha, which is much faster, sometimes as much as 2 times faster in testing over at PrimeGrid.
0.2.0 alpha can be downloaded at: [URL]https://sites.google.com/site/kenscode/primeprograms/ppscudaa1.zip?attredirects=0&d=1[/URL]

Note that as of now, the only binaries available for it are the "BOINC version". It can still be used for standalone sieves like here, but note that it prints all of its stderr output to the file stderr.txt instead of to the screen. It also does not print the estimated time of completion, since under BOINC that is automatically calculated by the BOINC software. I'm using this version right now for Gary's currently reserved range of 135G-500G; hence the est. completion date is marked as "?".

ATI GPU owners, stay tuned: the improvements of this version are expected to be implemented for OpenCL soon as well.
[QUOTE=MooMoo2;231540]As you may already know, I'm testing k=161.
If I use the sieve file that you will provide for that k, should I include NPLB in my prover code (along with RPS, of course)?[/QUOTE]

Yes; when we finish the sieve and post an announcement of the sieve file's availability, I was going to go into more detail about this. The only requirement of using the sieve file is that you include NPLB alongside any other projects when reporting primes; we'd also prefer that you send us your residuals at the end, since that will save us some work down the road when we double-check this range. Note that for now, you can of course still keep reporting primes without NPLB; it's only once you start using the sieve file that the stipulation takes effect.
k<300 is intended for future double-check efforts. How such k's are originally reported to the top-5000 site by whoever found them is a non-issue. It's up to the person who tested the k. Since it takes no extra time to sieve it, we may as well save ourselves the future effort.

Reservations closed (for now)
Hi all,
We discovered today that we can gain quite a bit of efficiency by doing this drive a little differently. I can't go into much detail for now since various details are yet to be finalized; all I can tell you for now is that it will be bigger and better. :smile: The new sieve has been started, and in fact it passed this one's depth rather quickly, so [b]reservations are closed for now[/b], since any work done here at this point would be wasted. Stay tuned...hopefully we'll have something concrete within a week or so.

Max :smile:
[QUOTE=mdettweiler;231536]For you geekheads out there who understand Big-O notation, ppsieve is O((nmax-nmin)/log2(p/kmax)) and sr2sieve is O([COLOR=Red]k_count[/COLOR]*sqrt(nmax-nmin)).[/QUOTE]
Are you sure that the runtime for sr2sieve is not O([COLOR=Red]sqrt(k_count)[/COLOR]*sqrt(nmax-nmin))? Doubling the number of k's in sr2sieve doesn't multiply the runtime by 2. I am pretty certain I have read sqrt(k_count) before.
Having looked into [url=http://www.free-dc.org/forum/showthread.php?10262-Math-in-programs-i-would-like-to-learn]the math[/url], I'm convinced that Max is correct. (Though he may have gotten that figure from me.)
Although going from one to two k's in sr2sieve doesn't double the runtime, going from two to three should increase the runtime as much as going from one to two did. Or maybe I'm wrong and Geoff found a better way? 
[QUOTE=Ken_g6;232336]Having looked into [URL="http://www.free-dc.org/forum/showthread.php?10262-Math-in-programs-i-would-like-to-learn"]the math[/URL], I'm convinced that Max is correct. (Though he may have gotten that figure from me.)
Although going from one to two k's in sr2sieve doesn't double the runtime, going from two to three should increase the runtime as much as going from one to two did. Or maybe I'm wrong and Geoff found a better way?[/QUOTE]

Well, with testing, sr2sieve does seem to be sqrt(k_count). I sieved k's (1, 3, 5, etc., excluding squares because of algebraic factors) to 1e8 with srsieve and then sieved a range of 2e8 starting from 1e12. A spreadsheet gave me a regression line with formula 23.76*k_count^0.51; the exponent 0.51 is very close to 1/2, i.e. sqrt(k_count).

[CODE]
k_count  seconds
      1       27
      4       42
      9       67
     12       80
     16       94
     18      101
     20      118
     24      129
     30      148
[/CODE]
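That regression line can be reproduced from the posted timing table with an ordinary least-squares fit in log-log space; here is a pure-Python sketch (no spreadsheet needed):

```python
import math

# (k_count, seconds) timings from the table above
data = [(1, 27), (4, 42), (9, 67), (12, 80), (16, 94),
        (18, 101), (20, 118), (24, 129), (30, 148)]

# Fit seconds = a * k_count^b via least squares on log-transformed data
xs = [math.log(k) for k, _ in data]
ys = [math.log(t) for _, t in data]
n = len(data)
mx, my = sum(xs) / n, sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - b * mx)

print(f"seconds ~= {a:.2f} * k_count^{b:.2f}")  # matches the quoted 23.76 * k_count^0.51
```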
[QUOTE=Ken_g6;232336]Having looked into [URL="http://www.free-dc.org/forum/showthread.php?10262-Math-in-programs-i-would-like-to-learn"]the math[/URL], I'm convinced that Max is correct. (Though he may have gotten that figure from me.)
Although going from one to two k's in sr2sieve doesn't double the runtime, going from two to three should increase the runtime as much as going from one to two did. Or maybe I'm wrong and Geoff found a better way?[/QUOTE]

Yes, I got that figure from you; I haven't ever actually looked at the sr*sieve source code myself. Of course, note that Big-O notation gives only the [i]worst-case[/i] runtime; often the actual runtime will be lower. I suspect that something of this sort is happening here. From the various large sieves that NPLB has done with sr2sieve, we've empirically determined that there is a significant speed benefit as you add additional k's to a sieve.
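Treating the two expressions in this thread as rough cost models (constants ignored, so only ratios are meaningful), a few lines of Python show why extending the sieve from k=400-1001 down to k=3 is essentially free for ppsieve but would not be for sr2sieve. The p value and k-counts below are made up for illustration:

```python
import math

def ppsieve_cost(n_min, n_max, p, k_max):
    # O((nmax - nmin) / log2(p / kmax)): depends on the largest k only,
    # not on how many k's are in the sieve
    return (n_max - n_min) / math.log2(p / k_max)

def sr2sieve_cost(n_min, n_max, k_count, exponent=0.5):
    # O(k_count^e * sqrt(nmax - nmin)); e ~ 0.5 per the timings in this
    # thread, e = 1.0 in the worst-case expression quoted above
    return (k_count ** exponent) * math.sqrt(n_max - n_min)

# kmax is 1001 whether or not k=3..399 is included, so ppsieve's cost is
# unchanged by adding those k's (illustrative p = 1e12):
print(ppsieve_cost(1_000_000, 2_000_000, 1e12, 1001))

# sr2sieve's cost grows with k_count; with made-up counts of 150 vs. 500
# k's, the larger sieve costs sqrt(500/150) ~ 1.83x as much:
ratio = sr2sieve_cost(1_000_000, 2_000_000, 500) / sr2sieve_cost(1_000_000, 2_000_000, 150)
print(f"{ratio:.2f}x")  # 1.83x
```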