
 jasonp 2009-11-15 23:16

The prime-crunching on dedicated hardware FAQ (II)

[B]Q: What is this?[/B]

[URL="http://mersenneforum.org/showthread.php?t=10275"]The original[/URL] prime-crunching-on-dedicated-hardware FAQ was written in the middle of 2008, and since then the state of the art in high-performance graphics cards appears to have been advancing at a rate that exceeds Moore's law. At the same time, the libraries for running non-graphics code on those cards have advanced to the point where there's a fairly large community of programmers involved, both working and playing with Nvidia cards (see [URL="http://forums.nvidia.com"]here[/URL]). Hell, [URL="http://mersenneforum.org/showthread.php?t=12562"]even I'm doing it[/URL]. So we're going to need a few modifications to the statements made in the original FAQ about where things are going.

[B]Q: So I can look for prime numbers on a GPU now?[/B]

Indeed you can. There is [URL="http://mersenneforum.org/showthread.php?t=12576"]an active development effort underway[/URL] to modify one of the standard Lucas-Lehmer test tools to use the FFT code made available by Nvidia in their cufft library. If you have a latter-day card, i.e. [URL="http://www.nvidia.com/object/product_geforce_gtx_260_us.html"]a GTX 260[/URL] or better, then you can do double-precision floating point arithmetic in hardware at a rate of 1/8 what the card can do in single precision. Even that card has so much floating point firepower that it can manage respectable performance despite the handicap.
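
For readers who have not seen it, the Lucas-Lehmer test itself is very short. Here is a minimal sketch in Python, using plain big-integer arithmetic rather than the FFT-based squaring that the GPU code relies on; the function name is just illustrative and this is not taken from any of the tools mentioned above.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime.

    Assumes p itself is prime; the test says nothing useful otherwise.
    """
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p > 2
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # standard starting value
    for _ in range(p - 2):
        # One iteration: square, subtract 2, reduce mod 2**p - 1.
        # For multi-million-bit numbers this squaring is the expensive
        # step, and it is the part an FFT library would take over.
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    print([p for p in (3, 5, 7, 11, 13, 17, 19, 23, 31) if lucas_lehmer(p)])
```

Each "iteration" quoted in timings for this kind of code corresponds to one pass through that loop, and a full test of 2^p - 1 needs p - 2 of them.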

[B]Q: So how fast does it go?[/B]

It's a work in progress, but with a top-of-the-line card the current speed seems to be around what one core of a high-end PC can achieve.

[B]Q: That result is not very exciting. What about the next generation of high-end hardware from Nvidia?[/B]

The next generation of GPU from Nvidia promises much better double-precision performance (whitepaper [URL="http://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIAFermiComputeArchitectureWhitepaper.pdf"]here[/URL]). Fermi will be quite interesting from another viewpoint: 32-bit integer multiplication looks like it will be a very fast operation on that architecture, which makes integer FFTs with respectable performance a possibility.

[B]Q: Does this mean you'll stop being a naysayer on this subject?[/B]

If you read the first few followup posts to the original FAQ, and read into my tone in this one, I'm somewhat skeptical that the overall progress of any prime-finding project stands to benefit from a porting of the computational software to a graphics card. Much of the interest in doing so stems from three observations:

- other projects benefit greatly from it, far out of proportion to the number of GPUs that they use

- most otherwise-idle PCs will also have an otherwise-idle graphics card, so using it would amount to essentially a 'free lunch'

- if a super-powered-by-GPU version of the code existed, buying a super-powered card would make [I]your individual contribution[/I] more valuable

In the case of projects like GIMPS, what 'other projects do' is immaterial. It's not that other projects have smart programmers but we don't, it's that their hardware needs are different. Further, GPU code is a free lunch as long as resources are not diverted away from the mainstream project to tap into those resources. As long as somebody volunteers to do so, there's no harm in trying. But in all the years the crunch-on-special-hardware argument has raged, only in the last few months have GPU programming environments stabilized to the point where someone actually stepped forward to do so.

As to your individual contribution, unless you have a big cluster of your own (thousands of machines) to play with, no amount of dedicated hardware is going to change the fact that 1000 strangers running Prime95 in the background are going to contribute more than you can ever hope to. It wouldn't be distributed computing if it were otherwise.

So, long story short, I'm still a buzzkill on this subject.

 msft 2009-11-16 02:41

Hi, jasonp

Thank you for everything.

 lfm 2009-11-16 03:44

[QUOTE=jasonp;195982]
[b]Q: So I can look for prime numbers on a GPU now?[/b]

Indeed you can. There is [url="http://mersenneforum.org/showthread.php?t=12576"]an active development effort underway[/url] to modify one of the standard Lucas-Lehmer test tools to use the FFT code made available by Nvidia in their cufft library. If you have a latter-day card, i.e. [url="http://www.nvidia.com/object/product_geforce_gtx_260_us.html"]a GTX 260[/url] or better, then you can do double-precision floating point arithmetic in hardware at a rate of 1/8 what the card can do in single precision. Even that card has so much floating point firepower that it can manage respectable performance despite the handicap.
[/QUOTE]

Note that the GTX 260M and 280M (mostly for laptops; the M is important) do not have double precision and are NOT supported.

 __HRB__ 2009-11-18 05:35

[QUOTE=xkey;196257]I'd really like to see a few 8192 (or bigger) bit registers in the upcoming incarnations from Intel/Amd/IBM. I know Intel is slowly headed there with AVX, but not fast enough for some problems I need solved in a quad or octo chip box.[/QUOTE]

What kind of problems are those?

I think large registers with complex instructions are a mistake. Top level algorithms can only be chunked into power-of-two sub-problems until you hit register size. Need 17-bit integers? Waste ~50%. Need floats with 12-bit exponent and 12-bit mantissa? Waste ~50%.

Therefore, I'd rather have something really simple, say a 64x64 bit-matrix (with instructions to read/write from rows or columns), an accumulator and maybe two other registers like on the 6502 (now, that was fun programming!), a 4096-bit instruction cache and 2 cycle 8-bit instructions (Load/Store + 8 logical + 8 permute instructions + control flow), so that neighboring units can conflict-free peek/poke each other one clock out of sync.

Then put 16K of these on a single chip, with a couple of F21s sitting on the edges for I/O control.

 nucleon 2010-05-29 01:12

[QUOTE=jasonp;195982]
[b]Q: So how fast does it go?[/b]

It's a work in progress, but with a top-of-the-line card the current speed seems to be around what one core of a high-end PC can achieve.

[/QUOTE]

Umm this needs to be updated.

Taking some figures on the forum (and my own machine - core i7 930), I get the following measurements:

PS3
2048k fft sec/iter = 0.084
4096k fft sec/iter = 0.194

GTX260
2048k fft sec/iter = 0.0106
4096k fft sec/iter = 0.0218

core i7 930 (single core)
2048k fft sec/iter = 0.0363
4096k fft sec/iter = 0.0799

GTX480 cuda 3.0
2048k fft sec/iter = 0.00547
4096k fft sec/iter = 0.0104

Dual Socket hex-core 3.33GHz
2048k fft sec/iter = 0.00470

GTX480 cuda 3.1
2048k fft sec/iter = 0.00466
4096k fft sec/iter = 0.00937

A top-of-the-line video card appears to be roughly 8 times as fast as a single core (depending on which video card/CPU combination is compared).

Or you could say a single GTX480 is similar to using the full CPU cycles of a dual-processor hex-core 3.33GHz Xeon.
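
To show where the "roughly 8 times" comes from, here is a small Python sketch that just divides the sec/iter figures listed above. The dictionary keys and the helper name are my own labels; the numbers are the ones quoted in this post.

```python
# sec/iter figures copied from the list above, keyed by FFT length in K.
SEC_PER_ITER = {
    "core i7 930 (single core)": {2048: 0.0363, 4096: 0.0799},
    "GTX260": {2048: 0.0106, 4096: 0.0218},
    "GTX480 cuda 3.1": {2048: 0.00466, 4096: 0.00937},
    "Dual Socket hex-core 3.33GHz": {2048: 0.00470},
}


def speedup(baseline: str, candidate: str, fft_k: int) -> float:
    """How many times faster `candidate` is than `baseline` at one FFT size."""
    return SEC_PER_ITER[baseline][fft_k] / SEC_PER_ITER[candidate][fft_k]


if __name__ == "__main__":
    cpu = "core i7 930 (single core)"
    for gpu in ("GTX260", "GTX480 cuda 3.1"):
        for fft_k in (2048, 4096):
            print(f"{gpu} vs {cpu} at {fft_k}k: {speedup(cpu, gpu, fft_k):.2f}x")
    # GTX480 against the dual-socket box, 2048k only (no 4096k figure listed)
    print(f"{speedup('Dual Socket hex-core 3.33GHz', 'GTX480 cuda 3.1', 2048):.2f}x")
```

With these figures the GTX480 (cuda 3.1) comes out at about 7.8x the single i7 core at 2048k and about 8.5x at 4096k, and within about 1% of the dual-socket machine at 2048k.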

I'm extremely positive this is only going to get better.

-- Craig

 henryzz 2010-06-01 19:13

Do any of those figures for GPUs use any CPU cycles as well?

 frmky 2010-06-01 22:58

 Commaster 2010-06-10 02:29

Speaking of GPU crunching... Has everybody forgotten AMD?
According to [URL="http://www.geeks3d.com/public/jegx/200910/amd_opencl_supported_devices.jpg"]this[/URL], new cards do support the required DP-FP.

 ewmayer 2010-08-09 20:15

I am slated to get a new laptop at work by year's end ... it would be cool if it offered the possibility of doing some GPU code-dev work on the side. The 2 different GPUs on offer are

512 MB NVidia NVS 3100M

Note that the latter is only available in the "ultra high performance" notebook, which requires justification and manager approval.

Here are the minimal requirements for me to spend any nontrivial time reading-docs/installing-shit/playing-with-code, along with some questions:

1. The software development environment (SWDE) needs to be somewhat portable, in the sense that I don't want to write a whole bunch of GPU-oriented code which then only works on one model or family of GPUs.

[b]Q: Does this mean OpenCL?[/b]

2. All systems run Windows v7 professional edition. Is OpenCL available here? If so, is it integrated with Visual Studio or a linux-emulation environment (e.g. Wine), what?

3. The SWDE must support double-precision floating-point arithmetic, even if the GPU hardware does not. If DP support is via emulation, I need to have reasonable assurance that the timings of code run this way at least correlate decently well with true GPU-hardware-double performance.

4. It is preferred that the GPU have DP support - how do the above 2 GPUs compare in this regard?

And yes, I *know* "all this information is available" out there on the web somewhere, but I've found no convenient FAQ-style page on the nVidia website which answers more than one of the above questions, so I figured I'd ask the experts. I simply do not have time to read multiple hundred-page PDFs in order to try to glean answers to a few basic questions.

-Ernst

 frmky 2010-08-09 22:27

[QUOTE=ewmayer;224634]
512 MB NVidia NVS 3100M
[/QUOTE]

According to the [URL="http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf"]CUDA programmer's guide[/URL], Appendix A, these are both Compute Capability 1.2 devices, which per Appendix G means they do not support double precision. To address your questions...

1. Non-DP CUDA code will work on any current nVidia device. The goal of OpenCL code is that the same code can be compiled for multiple devices, but for now it usually will require some tweaking.

2. nVidia releases SDKs for both Windows (Visual Studio) and Linux.

3. All DP calculations are demoted to SP for Compute 1.2 devices and below. New to version 3 of the CUDA toolkit is the elimination of DP emulation in software. This is a good thing as it wasn't very reliable anyway and it did not give realistic timings.

4. As above, no and no. Neither card, nor any of their mobile cards for that matter, can be used to develop DP code. For (relatively) inexpensive DP code development, pick up a GTX 460 or GTX 465. The GTX 460 is faster for DP code and less expensive, but generates more heat. The GTX 465 is faster for SP code.
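
As a quick illustration of the compute-capability rule, here is a minimal Python sketch. It assumes the cutoff is Compute Capability 1.3 (the first level with hardware double precision, as I read the guide's appendices), and the function and table names are made up for this example. Apart from the NVS 3100M value given above, the capability numbers in the table are my own recollection of Appendix A, so check the guide for your own card.

```python
# Assumption: hardware double precision starts at Compute Capability 1.3.
MIN_DP_CAPABILITY = (1, 3)


def supports_double_precision(major: int, minor: int) -> bool:
    """True if a device of the given compute capability has hardware DP."""
    # Tuple comparison orders by major first, then minor.
    return (major, minor) >= MIN_DP_CAPABILITY


# Illustrative values only; verify against Appendix A of the CUDA guide.
EXAMPLE_DEVICES = {
    "NVS 3100M": (1, 2),
    "GTX 260": (1, 3),
    "GTX 480": (2, 0),
}

if __name__ == "__main__":
    for name, (major, minor) in EXAMPLE_DEVICES.items():
        verdict = "DP" if supports_double_precision(major, minor) else "SP only"
        print(f"{name}: compute {major}.{minor} -> {verdict}")
```

On a real system the major/minor pair would come from the device properties reported by the CUDA runtime rather than from a hand-written table.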

 ewmayer 2010-08-09 23:09

[QUOTE=frmky;224667]4. As above, no and no. Neither card, nor any of their mobile cards for that matter, can be used to develop DP code. For (relatively) inexpensive DP code development, pick up a GTX 460 or GTX 465. The GTX 460 is faster for DP code and less expensive, but generates more heat. The GTX 465 is faster for SP code.[/QUOTE]

Many thanks, frmky. I've been told that one needs a full-sized case system with sufficient-wattage power supply to run one of these high-end cards, but I wonder: can one also get one in a standalone external format? I have 2 laptops at home: A 3-year-old Thinkpad running XP with VS2005 installed and a 1-year-old MacBook running Linux 4.2 ... I would be happy to get one of the above cards in an external add-on format, but really don't fancy the idea of buying a full-sized PC system anymore ... trying to keep the amount of compute hardware in my home restricted to a small footprint.
