Bigger and better GPU sieving drive: Discussion
[B]Admin edit (Max): split off from the main [url=http://www.mersenneforum.org/showthread.php?t=14022]Bigger and better GPU sieving drive: k<10000 n<2M[/url] thread. Normally we prefer to keep discussion and reservations in a central location for each drive, but in this case the hoopla surrounding the drive was seriously overwhelming the actual drive content and making it a little hard to sort out. :smile:[/B]
Why can't we start sieving now for >10T here without a sieve file? There would be no difference compared to doing it then.
[QUOTE=henryzz;232624]Why can't we start sieving now for >10T here without a sieve file? There would be no difference compared to doing it then.[/QUOTE]
Good question. I'll drop a note in the PST forum asking if we can do this. (Since their reservation thread is the "primary" one, we can't open it up until they put the range on their books.)
I don't own a CUDA-ready GPU, so unfortunately I can't efficiently participate in this. Out of curiosity, about how fast (for this or other ppsieve sieves) is a GPU compared to a quad on a 32-bit OS? Does the GPU's speed differ significantly between 32- and 64-bit versions?
Also: Wow, GPUs are really changing the landscape of sieving. That is an incredibly [I]enormous[/I] range. It will take quite a while to test, unless a program is released to run LLR tests on a GPU, but it looks like before too long, my n~=1.3M primes, currently ranked ~200, won't be nearly as impressive.
[QUOTE=Mini-Geek;232638]I don't own a CUDA-ready GPU. Out of curiosity, about how fast is a GPU compared to a quad on a 32-bit OS?[/QUOTE]
It depends on the sieve. I haven't run a comparison on this particular sieve, but I'd expect a GPU to be a number of times faster than 4 CPU cores working together. I'll run some comparisons shortly--stay tuned.

[quote]Does the GPU's speed differ significantly between 32- and 64-bit versions?[/quote]

On the earlier k<=1001 sieve, ppsieve was mostly GPU bound; that is, it needed very little CPU (0.06 cores) to keep the GPU busy. I haven't tried it (yet) with this sieve, but I'd expect it to be similarly nonreliant on the CPU. Thus, while 64-bit makes the CPU portion of the code run faster (in the above example, I would expect 32-bit to take 0.12 CPU cores), it doesn't affect the GPU's speed at all. So it won't make the sieve run faster, but it will leave more free CPU for other applications. :smile:

Note that, again, this depends on the sieve. For instance, on TPS's variable-n sieve with tpsieve-CUDA, the program is CPU bound on a fast GPU like Gary's; thus 64-bit runs significantly faster than 32-bit.
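To illustrate the GPU-bound reasoning above with a quick sketch (the 0.06-core figure is the one reported for the k<=1001 sieve; the gpu_rate value is a hypothetical placeholder, and the 2x slowdown for 32-bit is the post's own estimate):

```python
# Toy model of a GPU-bound sieve: wall-clock speed is set by the GPU,
# so a slower (32-bit) CPU build only consumes more of one CPU core --
# it does not change the sieve rate.

def cpu_cores_needed(cpu_fraction_64bit, slowdown=2.0):
    """CPU cores consumed when the CPU-side code runs `slowdown`x slower."""
    return cpu_fraction_64bit * slowdown

gpu_rate = 5.0e6      # p/sec -- hypothetical, fixed by the GPU when GPU-bound
cores_64 = 0.06       # CPU cores needed with a 64-bit build (reported figure)
cores_32 = cpu_cores_needed(cores_64)

print(cores_32)       # 0.12 cores used under 32-bit; gpu_rate is unchanged
```

The point is simply that while GPU-bound, 64-bit vs. 32-bit only changes how much spare CPU is left for other work.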
Tim,
To be more specific: from what Max indicated to me by email, sieving on a GPU is likely > 10 times faster than sieving on a 32-bit machine with sr2sieve. BUT...the sieve must cover all k's below a certain limit and must be for base 2. There is no advantage to sieving one or a few k's at a time.

Also, it made no sense to sieve only k=300-1001, which is what we originally started doing until Lennart's offer. We could add k<300 for free, so we did k<=1001. The speed is based on the highest k in the sieve and the P-range and virtually nothing else.

At this moment in time, the extreme speed gain that you get from the sieve is very restrictive. Fortunately, such a sieve is highly effective for NPLB because of the way we search across large swaths of k. But even if GPU sieving allowed non-base-2 sieving, it still would not be effective at CRUS.

We are grateful to Lennart for picking up the effort at PrimeGrid. And to think: it was all spearheaded by Max after I bought my first GPU, which I knew little about. :smile:

Gary
[QUOTE=gd_barnes;232657]To be more specific. By what Max indicated to me by Email, sieving on a GPU would be likely > 10 times faster than sieving on a 32-bit machine with sr2sieve.[/QUOTE]
Hm, ok. For comparison purposes, I benchmarked my quad (on 32-bit) with ppsieve on this sieving drive, at 10T. (I used "ppsieve -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -t4" and the default blocksize, etc. settings; it reported that it was searching "3 <= k <= 10001, 72 <= n <= 2000000".) It runs at just under 100K p/sec per core and takes about 3 core-seconds per factor. That means it would take about 10,000 sec per billion p per core, or 167 minutes per G. It would take around 29 days with 4 cores to finish 1T. I don't know how ppsieve-CUDA compares to 32-bit sr2sieve or ppsieve, but that would probably put a GPU's time for the range anywhere from 2.9 to 14 days. Can we just get a p/sec number on some GPU? :smile:

[QUOTE=gd_barnes;232657]BUT...the sieve must be for all k's below a certain limit and must be for base 2. ... At this moment in time, the extreme speed gain that you get from the sieve is very restrictive. Fortunately such a sieve is highly effective for NPLB because of the way we search across large swaths of k. But even if GPU sieving allowed non-base-2 sieving, it still would not be effective at CRUS.[/QUOTE]

Oh, I didn't realize how restricted ppsieve/ppsieve-CUDA is. It seems that NPLB (and the like, on both the Riesel and Sierp sides) is one of the very few projects that can really use it efficiently. For small enough k's, and before CUDA versions of other sievers are released, it may still be faster to use ppsieve-CUDA; I wonder just where the cutoff is for that...probably too low to be useful. But then, if your GPU would otherwise idle and the CPU usage is low enough, it might be useful to higher bounds. And if this sieving effort brings it high enough and you're working on the Riesel side, then you go right back to it not being useful. Hopefully, more CUDA sievers (sr2sieve and sr1sieve or equivalents) will be released eventually, and then CRUS could take advantage of GPUs.
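The benchmark arithmetic above can be double-checked in a few lines (a sketch using only the ~100K p/sec per-core figure reported in this post):

```python
# Verify the CPU throughput arithmetic from the quad benchmark.
rate_per_core = 100_000                       # p/sec per core (reported)
cores = 4

sec_per_G_per_core = 1e9 / rate_per_core      # 10,000 s per billion p per core
min_per_G = sec_per_G_per_core / 60           # ~167 minutes per G
days_per_T_quad = 1e12 / (rate_per_core * cores) / 86_400

print(round(min_per_G))                       # 167
print(round(days_per_T_quad))                 # 29
```

Both numbers match the figures quoted in the post.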
[QUOTE=Mini-Geek;232661]Oh, I didn't realize how restricted ppsieve/ppsieve-CUDA is. It seems that NPLB (and the like, both for Riesel and Sierp sides) is one of the very few projects that can really (efficiently) use it. For small enough k's, and before CUDA versions of other sievers are released, it may still be faster to use ppsieve-CUDA. I wonder just where the cutoff is for that...probably too low to be useful. But then, if your GPU would idle otherwise, and the CPU usage is low enough, it might be useful to higher bounds. But if this sieving effort brings it high enough, and you're working on the Riesel side, then you go right back to it not being useful.
Hopefully, more CUDA sievers (sr2sieve and sr1sieve or equivalents) will be released eventually, and then CRUS could take advantage of GPUs.[/QUOTE]

ppsieve was originally designed by PrimeGrid to be optimized for the very big base-2 sieves they (and we) do. It uses an entirely different algorithm from the sr*sieve programs and as such does not work well with few k's.

I actually did try ppsieve-CUDA on the Prime Sierpinski Project's sieve a couple of weeks back, and it did NOT work well. Firstly, ppsieve couldn't even allocate a bitmap big enough to hold an n<50M sieve with a kmax as high as theirs. I then tried to split it up into smaller files (5M at a time, IIRC), and while it worked, it was painfully slow--a few orders of magnitude slower than sr2sieve.

Gary mentioned to me a little while back that it might be worthwhile to try Riesel or Sierp. base 256 (one of those, I forget which) with ppsieve-CUDA. Its kmax is very low relative to the number of k's remaining, so it might well be in ppsieve's "sweet spot".
[QUOTE=Mini-Geek;232661]Hm, ok. For comparison purposes, I benchmarked my quad (on 32-bit) on ppsieve on this sieving drive, at 10T. (I used "ppsieve -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -t4" and the default blocksize, etc. settings; it reported that it was searching "3 <= k <= 10001, 72 <= n <= 2000000") It is just under 100K p/sec per core. It takes about 3 core seconds per factor. That means it would take about 10,000 sec per billion p per core, or 167 minutes per G. [/QUOTE]
The GTX 480 is about 50x faster than a single core:

[CODE]./ppsieve-cuda-x86_64-linux -R -k3 -K10000 -N2e6 -p10000e9 -P10001e9 -q
ppsieve version cuda-0.2.1a (testing)
Compiled Oct 5 2010 with GCC 4.3.3
nstart=72, nstep=30
ppsieve initialized: 3 <= k <= 10001, 72 <= n < 2000000
Sieve started: 10000000000000 <= p < 10001000000000
Thread 0 starting
Detected GPU 0: GeForce GTX 480
Detected compute capability: 2.0
Detected 15 multiprocessors.
nstep changed to 22
p=10000918814721, 5.103M p/sec, 0.07 CPU cores, 91.9% done. ETA 05 Oct 18:27
Thread 0 completed
Waiting for threads to exit
Sieve complete: 10000000000000 <= p < 10001000000000
Found 3617 factors
count=33405006,sum=0x1c1b8d0e01e3e0ea
Elapsed time: 196.30 sec. (0.01 init + 196.29 sieve) at 5094920 p/sec.[/CODE]
[QUOTE=frmky;232666]The GTX 480 is about 50x faster than a single core:[/QUOTE]
Wow! :surprised That makes it 55 hours/T (i.e. 2.3 days/T) with the GPU vs. 29 days/T with the quad--and that was with an entire quad working on it. Granted, this is an expensive GPU ($450 at a quick glance at Newegg), but that's still a huge difference in speed: roughly 12.5 times the throughput of a 32-bit quad. On price/performance alone (ignoring electricity, motherboard, and other overhead), the quad would have to cost about $36 to compare.
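For anyone who wants to retrace the comparison, here is the arithmetic using the two benchmark figures from this thread (5,094,920 p/sec from the GTX 480 run, and 4 x 100K p/sec from the 32-bit quad):

```python
# Compare the GTX 480 benchmark against the 32-bit quad benchmark.
gpu_rate = 5_094_920                # p/sec (GTX 480 run above)
quad_rate = 4 * 100_000             # p/sec (32-bit quad, 4 cores)

hours_per_T_gpu = 1e12 / gpu_rate / 3600   # ~54.5 h, i.e. ~2.3 days per T
speedup_vs_quad = gpu_rate / quad_rate     # ~12.7x the whole quad
price_parity = 450 / speedup_vs_quad       # quad would need to cost ~$35

print(round(hours_per_T_gpu))       # 55
print(round(speedup_vs_quad, 1))    # 12.7
```

The exact ratio comes out nearer 12.7x than 12.5x, which puts the break-even quad price at about $35 rather than $36; either way, the conclusion is the same.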
[QUOTE=Mini-Geek;232670]Wow! :surprised That makes it 55 hours/T (or 2.3 days/T; with GPU) vs 29 days/T (with quad). And that was with an entire quad working on it... Granted, this is an expensive GPU ($450 at a quick glance at Newegg), but that's still a huge difference in speed. Roughly 12.5 times the throughput of a 32-bit quad. The effectiveness for the cost (ignoring electricity, MB, and other overhead) would mean the quad would have to be $36 to compare.[/QUOTE]
FYI, from what I've heard the GTX 460 is only a little slower than the GTX 480; however, it's a lot cheaper ($170). Gary has one that's slightly factory overclocked (Newegg has these [URL="http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=100007709%20600030348%20600007323&IsNodeId=1&name=GeForce%20GTX%20400%20series"]mixed in[/URL] the same price range as the non-overclocked GTX 460s--go figure). I'll get a range reserved and started on Gary's GPU sometime today or tomorrow; at that point I'll be able to give some exact p/sec figures.
From what I can tell, a stock-clocked GTX 460 will only be about half as fast as a stock-clocked GTX 480. It seems to have to do with the 460 needing instruction-level parallelism. The only two good ways I can think of to provide that are either to recompile the client with the latest CUDA SDK or to vectorize the work like I did for AMD. Neither option is all that easy, but recompiling with the latest SDK on a clean VM is probably easier.
The good news is that you're only at 10T right now. Somewhere between 21T and 40T you'll get a significant speed boost. :smile: