#1
Tribal Bullet
Oct 2004
5×23×31 Posts
Q: What is this?
The original prime-crunching-on-dedicated-hardware FAQ was written in the middle of 2008, and the state of the art in high-performance graphics cards appears to be advancing at a rate that exceeds Moore's law. At the same time, the libraries for running non-graphics code on those cards have matured to the point where there is a fairly large community of programmers working and playing with Nvidia hardware (see here). Hell, even I'm doing it. So the statements made in the original FAQ about where things are going need a few modifications.

Q: So I can look for prime numbers on a GPU now?

Indeed you can. There is an active development effort underway to modify one of the standard Lucas-Lehmer test tools to use the FFT code made available by Nvidia in their cufft library. If you have a latter-day card, i.e. a GTX 260 or better, you can do double-precision floating-point arithmetic in hardware, at 1/8 the rate the card manages in single precision. Even so, such a card has so much floating-point firepower that it manages respectable performance despite the handicap.

Q: So how fast does it go?

It's a work in progress, but with a top-of-the-line card the current speed seems to be around what one core of a high-end PC can achieve.

Q: That result is not very exciting. What about the next generation of high-end hardware from Nvidia?

The next generation of GPU from Nvidia promises much better double-precision performance (whitepaper here). Fermi will be quite interesting from another viewpoint: 32-bit integer multiplication looks like it will be a very fast operation on that architecture, which makes integer FFTs with respectable performance a possibility.

Q: Does this mean you'll stop being a naysayer on this subject?

If you read the first few followup posts to the original FAQ, and read into my tone in this one, I'm somewhat skeptical that the overall progress of any prime-finding project stands to benefit from porting the computational software to a graphics card. Much of the interest in doing so stems from three observations:

- other projects benefit greatly from GPUs, far out of proportion to the number of GPUs they use
- most otherwise-idle PCs also have an otherwise-idle graphics card, so using it would amount to a 'free lunch'
- if a super-powered-by-GPU version of the code existed, buying a super-powered card would make your individual contribution more valuable

In the case of projects like GIMPS, what other projects do is immaterial. It's not that other projects have smart programmers and we don't; it's that their hardware needs are different. Further, GPU code is only a free lunch as long as resources are not diverted away from the mainstream project to tap into it. As long as somebody volunteers to do the work, there's no harm in trying; but in all the years the crunch-on-special-hardware argument has raged, only in the last few months have GPU programming environments stabilized to the point where someone actually stepped forward. As for your individual contribution, unless you have a big cluster of your own (thousands of machines) to play with, no amount of dedicated hardware will change the fact that 1000 strangers running Prime95 in the background will contribute more than you can ever hope to. It wouldn't be distributed computing otherwise.

So, long story short, I'm still a buzzkill on this subject.

Last fiddled with by jasonp on 2011-01-02 at 19:38 Reason: add link to old FAQ
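For readers wondering what "use the FFT code in the cufft library" looks like in practice, here is a minimal sketch, not the code from the port described above: one FFT-based squaring pass of the Lucas-Lehmer iteration s <- s^2 - 2 (mod 2^p - 1), using cuFFT's double-precision transforms. The transform length and the zeroed placeholder data are assumptions, and the digit encoding, 1/n scaling, rounding, carry propagation and the final "-2" are omitted entirely.

Code:
/* Sketch only: one FFT-based squaring pass of the Lucas-Lehmer iteration
 * s <- s^2 - 2 (mod 2^p - 1), done with cuFFT double-precision transforms.
 * The digit encoding, 1/n scaling, rounding, carry propagation and the
 * final "-2" that a real LL program needs are all omitted. */
#include <stdio.h>
#include <cuda_runtime.h>
#include <cufft.h>

__global__ void pointwise_square(cufftDoubleComplex *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        double re = x[i].x, im = x[i].y;
        x[i].x = re * re - im * im;   /* (re + i*im)^2, real part */
        x[i].y = 2.0 * re * im;       /* imaginary part */
    }
}

int main(void)
{
    const int n = 2048 * 1024;        /* a "2048k" transform length */
    cufftDoubleComplex *d_s;
    cudaMalloc((void **)&d_s, n * sizeof(cufftDoubleComplex));
    cudaMemset(d_s, 0, n * sizeof(cufftDoubleComplex)); /* placeholder for the encoded residue */

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_Z2Z, 1);

    cufftExecZ2Z(plan, d_s, d_s, CUFFT_FORWARD);        /* digits -> frequency domain */
    pointwise_square<<<(n + 255) / 256, 256>>>(d_s, n); /* squaring = pointwise product */
    cufftExecZ2Z(plan, d_s, d_s, CUFFT_INVERSE);        /* back, still unnormalized */
    cudaDeviceSynchronize();

    printf("one squaring pass done (normalization and carries omitted)\n");
    cufftDestroy(plan);
    cudaFree(d_s);
    return 0;
}

The hard parts of a real port live in the steps this sketch leaves out; the two cufft calls and the small kernel are the easy bit (build against the cufft library, e.g. nvcc sketch.cu -lcufft).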
#2
Jul 2009
Tokyo
11428 Posts
Hi, jasonp
Thank you for everything.
#3
Jul 2006
Calgary
5²·17 Posts
Quote:
Last fiddled with by lfm on 2009-11-16 at 04:08
#4
Dec 2008
Boycotting the Soapbox
2⁴×3²×5 Posts
Quote:
I think large registers with complex instructions are a mistake. Top-level algorithms can only be chunked into power-of-two sub-problems until you hit the register size. Need 17-bit integers? Waste ~50%. Need floats with a 12-bit exponent and 12-bit mantissa? Waste ~50%.

Therefore, I'd rather have something really simple: say, a 64x64 bit-matrix (with instructions to read/write rows or columns), an accumulator and maybe two other registers like on the 6502 (now, that was fun programming!), a 4096-bit instruction cache and 2-cycle 8-bit instructions (Load/Store + 8 logical + 8 permute instructions + control flow), so that neighboring units can peek/poke each other conflict-free, one clock out of sync. Then put 16K of these on a single chip, with a couple of F21s sitting on the edges for I/O control.
#5
Mar 2003
Melbourne
5×103 Posts
Quote:
Taking some figures from the forum (and my own machine, a core i7 930), I get the following measurements:

PS3
  2048k fft: 0.084 sec/iter
  4096k fft: 0.194 sec/iter

GTX260
  2048k fft: 0.0106 sec/iter
  4096k fft: 0.0218 sec/iter

core i7 930 (single core)
  2048k fft: 0.0363 sec/iter
  4096k fft: 0.0799 sec/iter

GTX480, CUDA 3.0
  2048k fft: 0.00547 sec/iter
  4096k fft: 0.0104 sec/iter

Dual-socket hex-core 3.33 GHz Xeon
  2048k fft: 0.00470 sec/iter

GTX480, CUDA 3.1
  2048k fft: 0.00466 sec/iter
  4096k fft: 0.00937 sec/iter

A top-of-the-line video card appears to be roughly 8 times a single core (depending on which video card/CPU combination is compared). Or you could say a single GTX480 is similar to using the full CPU cycles of a dual-processor hex-core 3.33 GHz Xeon. I'm extremely positive this is only going to get better.

-- Craig
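To put those per-iteration figures in perspective, here is a trivial back-of-the-envelope sketch; the 40M exponent paired with a 2048k FFT is an assumption for illustration, not something taken from the measurements above:

Code:
/* Rough LL-test duration from a per-iteration timing.  Testing 2^p - 1
 * takes p - 2 squarings, so total time is roughly p * sec_per_iter.
 * The exponent chosen here is an assumption for illustration only. */
#include <stdio.h>

int main(void)
{
    double p = 40e6;               /* assumed exponent in the 2048k-FFT range */
    double sec_per_iter = 0.00466; /* GTX480, CUDA 3.1, 2048k fft (from above) */

    double total_sec = p * sec_per_iter;
    printf("~%.1f days per LL test\n", total_sec / 86400.0);
    return 0;
}

At the GTX480's 2048k rate that works out to a little over two days per test; plug in any of the other sec/iter numbers to compare.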
#6
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
3·23·89 Posts
Do any of those figures for GPUs use any CPU cycles as well?
#7
Jul 2003
So Cal
2,663 Posts
Very little. About 2%.
#9
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
I am slated to get a new laptop at work by year's end ... it would be cool if it offered the possibility of doing some GPU code-dev work on the side. The 2 different GPUs on offer are:

512 MB NVidia NVS 3100M
512 MB NVidia Quadro FX 1800M

Note that the latter is only available in the "ultra high performance" notebook, which requires justification and manager approval. Here are the minimal requirements for me to spend any nontrivial time reading-docs/installing-shit/playing-with-code, along with some questions:

1. The software development environment (SWDE) needs to be somewhat portable, in the sense that I don't want to write a whole bunch of GPU-oriented code which then only works on one model or family of GPUs. Q: Does this mean OpenCL?

2. All systems run Windows 7 Professional edition. Is OpenCL available here? If so, is it integrated with Visual Studio, with a Linux-emulation environment (e.g. Wine), or what?

3. The SWDE must support double-precision floating-point arithmetic, even if the GPU hardware does not. If DP support is via emulation, I need reasonable assurance that the timings of code run this way at least correlate decently well with true GPU-hardware-double performance.

4. It is preferred that the GPU have DP support - how do the above 2 GPUs compare in this regard?

And yes, I *know* "all this information is available" out there on the web somewhere, but I've found no convenient FAQ-style page on the nVidia website which answers more than one of the above questions, so I figured I'd ask the experts. I simply do not have time to read multiple hundred-page PDFs in order to glean answers to a few basic questions.

Thanks in advance,
-Ernst
#10
Jul 2003
So Cal
5147₈ Posts
According to the CUDA programmer's guide, Appendix A, these are both Compute Capability 1.2 devices, which per Appendix G means they do not support double precision. To address your questions...
1. Non-DP CUDA code will work on any current nVidia device. The goal of OpenCL is that the same code can be compiled for multiple devices, but for now it usually requires some tweaking.

2. nVidia releases SDKs for both Windows (Visual Studio) and Linux.

3. All DP calculations are demoted to SP on Compute 1.2 devices and below. New to version 3 of the CUDA toolkit is the elimination of DP emulation in software. This is a good thing, as it wasn't very reliable anyway and it did not give realistic timings.

4. As above, no and no. Neither card, nor any of their mobile cards for that matter, can be used to develop DP code.

For (relatively) inexpensive DP code development, pick up a GTX 460 or GTX 465. The GTX 460 is faster for DP code and less expensive, but generates more heat. The GTX 465 is faster for SP code.
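If you want to check a particular machine yourself, a short sketch along these lines (plain CUDA runtime calls, error handling omitted) lists each device and its compute capability; 1.3 or higher is what you need for hardware double precision:

Code:
/* List CUDA devices and report whether they have hardware double precision
 * (compute capability 1.3 or higher).  Error checking omitted for brevity. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        int has_dp = (prop.major > 1) || (prop.major == 1 && prop.minor >= 3);
        printf("device %d: %s, compute capability %d.%d, %s\n",
               i, prop.name, prop.major, prop.minor,
               has_dp ? "hardware double precision"
                      : "no hardware double precision");
    }
    return 0;
}

Anything that reports 1.2 or below, like the two mobile parts above, falls into the demoted-to-SP category.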
#11
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
Quote: