CUDA integer FFTs
I will probably get access to a compute capability 3.0+ Nvidia GPU in the near future, and have been toying with the idea of porting the integer FFT convolution framework I've been building off and on over the last few years to CUDA GPUs.

Do consumer GPUs still have their double precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster? Using integer arithmetic on these cards has a shot at enabling convolutions that are faster than using cufft in double precision, if there is much more headroom on consumer GPUs for arithmetic compared to the memory bandwidth available. Are current double precision LL programs bandwidth limited on consumer GPUs even with the cap on double precision floating point throughput?

The latest source is [url="https://gforge.inria.fr/scm/viewvc.php/ecm/branches/nttdisk/libntt/"]here[/url] and includes a lot of stuff already, including optimizations for a range of x86 instruction sets. Would there be interest in another LL tester for CUDA? I've never worked with OpenCL, so CUDA is what I know.
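To make the integer-arithmetic idea concrete, here is a minimal sketch of the kind of kernel step an NTT convolution would run on a GPU without touching the DP units. This is not taken from the libntt source; the prime, the reduction by hardware remainder, and the function names are purely illustrative.

[CODE]#include <cstdint>

// Minimal sketch (not libntt code): the inner step of an NTT-based convolution
// is a butterfly over a word-sized prime, done entirely in integer units.
// Assumes p is a prime below 2^31, so the sum of two residues never overflows 32 bits.

__device__ __forceinline__ uint32_t mul_mod(uint32_t a, uint32_t b, uint32_t p)
{
    // full 32x32 -> 64-bit product, then reduce; a real kernel would use
    // Montgomery or Barrett reduction instead of the hardware remainder
    return (uint32_t)(((uint64_t)a * b) % p);
}

__device__ void ntt_butterfly(uint32_t &x, uint32_t &y, uint32_t w, uint32_t p)
{
    uint32_t t = mul_mod(y, w, p);        // twiddle multiply
    uint32_t u = x;
    uint32_t s = u + t;
    x = (s >= p) ? s - p : s;             // x' = (x + w*y) mod p
    y = (u >= t) ? u - t : u + p - t;     // y' = (x - w*y) mod p
}[/CODE]

Integer multiply throughput on consumer cards is not capped the way DP is (though it varies a lot by architecture), which is exactly the headroom the question above is about; the open question is whether a few such word-sized transforms, combined via CRT, can beat one double-precision cufft transform per iteration.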
[QUOTE=jasonp;482945]Do consumer GPUs still have their double precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster? [/quote]
Yes.

[QUOTE=jasonp;482945]Would there be interest in another LL tester for CUDA?[/QUOTE]
Can't speak for others, but, yes, assuming it is significantly faster (2x?) than cudaLucas.

EDIT:- You should look at implementing a Gerbicz error check PRP version rather than LL. It is all the rage around here :-)
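For anyone who hasn't seen it, the Gerbicz check verifies a simple algebraic invariant of the PRP squaring chain rather than trusting every iteration. Below is a host-side toy (a sketch only: the modulus, block length, and loop counts are illustrative, and a real test squares modulo 2^p - 1 with an FFT): after every L squarings the block-end residue is folded into a running product d, and d can be re-verified from its previous value with L extra squarings, so an error in the squaring chain breaks the identity with overwhelming probability.

[CODE]#include <cstdint>
#include <cstdio>

// Toy Gerbicz-check sketch. u is the PRP residue (u <- u^2 mod N), d is the
// running product of block-end residues. Invariant: d_k = 3 * d_{k-1}^(2^L) mod N.
static uint64_t mul_mod(uint64_t a, uint64_t b, uint64_t n)
{
    return (uint64_t)((unsigned __int128)a * b % n);   // 128-bit product, then reduce
}

int main()
{
    const uint64_t N = (1ULL << 61) - 1;   // illustrative small modulus
    const int L = 1000;                    // squarings per block
    const int blocks = 20;

    uint64_t u = 3, d = 3;                 // u_0 = 3, d_0 = u_0
    for (int k = 1; k <= blocks; k++) {
        uint64_t d_prev = d;
        for (int i = 0; i < L; i++)        // the "real" work: L squarings
            u = mul_mod(u, u, N);
        d = mul_mod(d, u, N);              // fold block-end residue into d

        // Verify d_k == 3 * d_{k-1}^(2^L) mod N. Done every block here for
        // clarity; real programs verify much less often, so overhead is small.
        uint64_t chk = d_prev;
        for (int i = 0; i < L; i++)
            chk = mul_mod(chk, chk, N);
        chk = mul_mod(chk, 3, N);
        if (chk != d) { printf("mismatch in block %d\n", k); return 1; }
    }
    printf("all %d blocks verified\n", blocks);
    return 0;
}[/CODE]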
1 Attachment(s)
[QUOTE=jasonp;482945]I will probably get access to a compute capability 3.0+ Nvidia GPU in the near future, and have been toying with the idea of porting the integer FFT convolution framework I've been building off and on over the last few years to CUDA GPUs,
Do consumer GPUs still have their double precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster? Using integer arithmetic on these cards has a shot at enabling convolutions that are faster than using cufft in double precision, if there is much more headroom on consumer GPUs for arithmetic compared to the memory bandwidth available. Are current double precision LL programs bandwidth limited on consumer GPUs even with the cap on double precision floating point throughput? The latest source is [URL="https://gforge.inria.fr/scm/viewvc.php/ecm/branches/nttdisk/libntt/"]here[/URL] and includes a lot of stuff already, including optimizations for a range of x86 instruction sets. Would there be interest in another LL tester for CUDA? I've never worked with OpenCL so CUDA is what I know.[/QUOTE]

I have several CC 2.x gpus. Please don't go out of your way to make it incompatible with them. On the other hand, don't let compatibility with them get in the way of performance on newer designs.

If you come up with something that improves throughput, there will be interest. Are you acquainted with Preda's efforts on OpenCL, with multiple transform types? [URL]http://www.mersenneforum.org/showpost.php?p=471318&postcount=224[/URL]

Memory occupancy and memory controller load are very different depending on what type of calculation is being performed. The left half of the attachment shows an AMD RX550 running mfakto trial factoring; the right half shows a PRP 3 test in gpuowl on a 77M exponent on another RX550. I don't think I've ever seen an AMD or NVIDIA gpu's memory controller maxed out, whether TF, P-1, LL, or PRP3 crunching. (P-1 not available in OpenCL.) (Maybe while moving data from gpu to cpu and back in a P-1 gcd handoff.)

Some of the existing software could use some maintenance & enhancement.
[QUOTE=axn;482947]Can't speak for others, but, yes, assuming it is significantly faster (2x?) than cudaLucas.
[/QUOTE]

It wouldn't necessarily have to be faster. Our cards slow down to about half speed if we set them to "SP only" (the default), compared with "enable DP" (which can be done in the nvidia control panel**). Moreover, there are lots of gaming cards out there that are of little use for LL testing because of their "almost SP-only" design. An "SP-only" or integer LL tester comparable in speed to cudaLucas, or at least in the same ballpark (say 80%-90% as fast, but without needing DP), would be, if not a "revolution", at least very useful.

--------
** In this mode the cards spit fire, yet running mfaktc on them spits even more fire, and its output goes the other way, i.e. it doubles when "SP only". From that we can see that cudaLucas still has a lot of margin to improve on the "more SP, less DP" side...
I’d be interested in a good integer convolution, in part because it would be easier to reference/port into HDL on the FPGAs I’ve been playing with as of late.
[QUOTE=airsquirrels;482961]I’d be interested in a good integer convolution, in part because it would be easier to reference/port into HDL on the FPGAs I’ve been playing with as of late.[/QUOTE]
Have you looked at the very recent [URL="https://www.anandtech.com/show/12509/xilinx-announces-project-everest-fpga-soc-hybrid"]ACAP announcement from Xilinx[/URL] mentioned in the Science News thread, and would it be more promising than FPGA for this purpose? The [URL="https://www.xilinx.com/news/press/2018/xilinx-unveils-revolutionary-adaptable-computing-product-category.html"]press release[/URL] claimed a 20× improvement for deep learning applications, not sure if there's any applicable improvement for number crunching.
Deep learning applications mean half-precision floating point, fixed point, or support vector machines, all of which map very nicely to programmable logic.
George asked back in 2012 about the feasibility of NTT arithmetic to get around the lack of DP throughput on GPUs, and I figured at the time that there was no point because the design and implementation effort would be rendered obsolete when competition in the GPU space forced the major players to support double precision at higher rates. Six years later the joke is on me.
[QUOTE=GP2;482980]Have you looked at the very recent [URL="https://www.anandtech.com/show/12509/xilinx-announces-project-everest-fpga-soc-hybrid"]ACAP announcement from Xilinx[/URL] mentioned in the Science News thread, and would it be more promising than FPGA for this purpose?
The [URL="https://www.xilinx.com/news/press/2018/xilinx-unveils-revolutionary-adaptable-computing-product-category.html"]press release[/URL] claimed a 20× improvement for deep learning applications, not sure if there's any applicable improvement for number crunching.[/QUOTE] I find those very interesting, but the marketing hype is too abstract to fully grok capabilities. I’ll see what my Xilinx rep can do but I’ll be surprised if I can get access to a sample any time soon. |
All this "low-precision deep learning" stuff is in the opposite direction of what number crunching needs. But I'm not at all surprised at the direction the industry is going.
Double precision (high precision in general) requires too much area and too much power because of the implicit O(N^2) cost of supporting an N-bit multiplier. (And we're too small to benefit from the sub-quadratic algorithms.) A lot of applications don't need such dense hardware anyway. (A number is a number; once you have enough precision, the rest is a waste.)

So in an effort to increase throughput and efficiency, they (the hardware manufacturers) are trying to push everyone away from an "inefficient" use of DP and high precision towards more efficient and specialized hardware. But by doing this, they're screwing over all the applications that legitimately need as much precision as possible. For example: large multiplication by FFT, which benefits superlinearly with increased precision of the native type.

When you look at this from a broader picture, much of the scientific computing space that wants this dense hardware is the same space that's being hit the hardest by the memory bottleneck. So it almost makes sense for the industry to backpedal on DP/dense hardware until they fix the memory problem - which I doubt will happen anytime soon.
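To put rough numbers on that superlinear claim (illustrative figures only, not measurements): a double-precision FFT carries roughly 18 bits of the number per word at current wavefront sizes (e.g. ~77M bits in a 4M-word transform), while single precision could safely carry only a small fraction of that. Halving the bits per word doubles the transform length N, and the N log N cost then grows by more than 2x because of the extra log factor, so the work per squaring rises superlinearly as precision falls.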
The memory bandwidth problem is very real. We’re at the upper limits of reasonable caches and more parallelism has nowhere to drop the data unless it’s going to be used by the same core and thus cache oriented.
In thinking about building the FFTW (fastest in the West) LL/large-FFT convolution hardware, I wondered if, instead of dealing with memory, you could just have two or more hardware chips with a super-wide IO bus playing catch between iterations. You can get actual (vs. theoretical) 100s of GB/second that way. An 8K FFT needs what, 130MB of throughput per iteration? 1500 iterations/second @ 200 GB/s bandwidth. Could do a 100M exponent in less than a day.

Ultimately we just need an isPrime circuit: 32 bits of exponent input wires and an answer in 1 clock cycle. Or just pipe those 32 bits, or the bits of the whole number, into a neural net and train it to recognize primes. Surely someone has at least experimented with that.
[QUOTE=airsquirrels;483024] An 8K FFT needs what, 130MB of throughput per iteration? [/QUOTE]
A 4M FFT needs ~135 MB of bandwidth per iteration. A 4M FFT is 32MB of data and requires two passes over it. Thus: 32MB read + 32MB write + 32MB read + 32MB write + somewhere between 5MB and 10MB of read-only sin/cos/weights data.
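Rough arithmetic on those figures (illustrative only, and ignoring that a 100M exponent actually needs a transform larger than 4M): ~135 MB per iteration at 200 GB/s is about 200,000 / 135 ≈ 1,480 iterations per second, so the ~10^8 squarings of a 100M exponent would take roughly 10^8 / 1,480 ≈ 68,000 seconds, or about 19 hours, consistent with the "less than a day" estimate above, provided the hardware can actually sustain that bandwidth.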