#1
Tribal Bullet
Oct 2004
5·23·31 Posts
I will probably get access to a compute capability 3.0+ Nvidia GPU in the near future, and have been toying with the idea of porting the integer FFT convolution framework I've been building off and on over the last few years to CUDA GPUs.

Do consumer GPUs still have their double precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster? Using integer arithmetic on these cards has a shot at enabling convolutions that are faster than using cuFFT in double precision, if consumer GPUs have much more arithmetic headroom relative to the memory bandwidth available. Are current double precision LL programs bandwidth limited on consumer GPUs even with the cap on double precision floating point throughput?

The latest source is here and includes a lot of stuff already, including optimizations for a range of x86 instruction sets. Would there be interest in another LL tester for CUDA? I've never worked with OpenCL, so CUDA is what I know.
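To make the integer-arithmetic idea concrete, here is a toy Python sketch of a number-theoretic transform (NTT) convolution: the exact-integer analogue of a floating-point FFT convolution, needing no double-precision units at all. This is an illustration only, not jasonp's framework; the modulus 998244353 and the radix-2 structure are choices made for the example.

```python
# Toy NTT convolution: exact integer convolution via modular arithmetic.
MOD = 998244353   # prime with 2^23 dividing MOD-1; primitive root is 3
ROOT = 3

def ntt(a, invert=False):
    """In-place iterative radix-2 NTT; len(a) must be a power of 2."""
    n = len(a)
    # Bit-reversal permutation.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Cooley-Tukey butterflies over increasing block lengths.
    length = 2
    while length <= n:
        w = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w = pow(w, MOD - 2, MOD)   # modular inverse root
        for start in range(0, n, length):
            wn = 1
            for k in range(start, start + length // 2):
                u, v = a[k], a[k + length // 2] * wn % MOD
                a[k] = (u + v) % MOD
                a[k + length // 2] = (u - v) % MOD
                wn = wn * w % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        for i in range(n):
            a[i] = a[i] * n_inv % MOD
    return a

def convolve(x, y):
    """Exact integer convolution of x and y via forward/inverse NTT."""
    n = 1
    while n < len(x) + len(y) - 1:
        n <<= 1
    fx = ntt(x + [0] * (n - len(x)))
    fy = ntt(y + [0] * (n - len(y)))
    prod = [a * b % MOD for a, b in zip(fx, fy)]
    return ntt(prod, invert=True)[: len(x) + len(y) - 1]

print(convolve([1, 2, 3], [4, 5, 6]))   # [4, 13, 28, 27, 18]
```

Every operation here is an integer multiply-add plus a reduction, which is exactly the kind of work that is not throttled on 1:32-DP consumer cards.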
#2
Jun 2003
2³·683 Posts
Quote:
Can't speak for others, but yes, assuming it is significantly faster (2x?) than cudaLucas. EDIT: you should look at implementing the Gerbicz error check PRP version rather than LL. It is all the rage around here :-)

Last fiddled with by axn on 2018-03-21 at 03:08
#3
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
2⁴×3×163 Posts
Quote:
If you come up with something that improves throughput, there will be interest. Are you acquainted with Preda's efforts in OpenCL, with multiple transform types? http://www.mersenneforum.org/showpos...&postcount=224

Memory occupancy and memory controller load are very different depending on what type of calculation is being performed. The left half of the attachment is an AMD RX550 running mfakto trial factoring; the right half is a PRP3 test in gpuowl on a 77M exponent on another RX550. I don't think I've ever seen an AMD or NVIDIA gpu's memory controller maxed out, whether TF, P-1, LL, or PRP3 crunching. (P-1 is not available in OpenCL.) (Maybe while moving data from gpu to cpu and back in a P-1 gcd handoff.) Some of the existing software could use some maintenance & enhancement.

Last fiddled with by kriesel on 2018-03-21 at 03:38
#4
Romulan Interpreter
"name field"
Jun 2011
Thailand
41×251 Posts
Quote:
** In this mode the cards spit fire, but yet, using mfaktc on them spits even more fire, and the output is "viceversa", i.e. doubled when "SP-only"; from which we can see that cudaLucas still has a lot of margin to improve on the "more-SP, less-DP" side...

Last fiddled with by LaurV on 2018-03-21 at 07:09
#5
"David"
Jul 2015
Ohio
517₁₀ Posts
I'd be interested in a good integer convolution, in part because it would be easier to reference/port into HDL on the FPGAs I've been playing with as of late.
#6
Sep 2003
2×5×7×37 Posts
Quote:
The press release claimed a 20× improvement for deep learning applications; I'm not sure if there's any applicable improvement for number crunching.
#7
Tribal Bullet
Oct 2004
6755₈ Posts
Deep learning applications mean half-precision floating point, fixed point, or support vector machines, all of which map very nicely to programmable logic.
George asked back in 2012 about the feasibility of NTT arithmetic to get around the lack of DP throughput on GPUs, and I figured at the time that there was no point, because the design and implementation effort would be rendered obsolete when competition in the GPU space forced the major players to support double precision at higher rates. Six years later the joke is on me.
#8
"David"
Jul 2015
Ohio
11×47 Posts
Quote:
I find those very interesting, but the marketing hype is too abstract to fully grok the capabilities. I'll see what my Xilinx rep can do, but I'll be surprised if I can get access to a sample any time soon.
#9
Sep 2016
2²·5·19 Posts
All this "low-precision deep learning" stuff is in the opposite direction of what number crunching needs. But I'm not at all surprised at the direction the industry is going.
Double precision (high precision in general) requires too much area and too much power because of the implicit O(N²) cost of supporting an N-bit multiplier (and our operand sizes are too small to benefit from the sub-quadratic algorithms). A lot of applications don't need such dense hardware anyway: a number is a number, and once you have enough precision, the rest is a waste.

So in an effort to increase throughput and efficiency, the hardware manufacturers are trying to push everyone away from an "inefficient" use of DP and high precision towards more efficient and specialized hardware. But by doing this, they're screwing over all the applications that legitimately need as much precision as possible. For example: large multiplication by FFT, which benefits superlinearly from increased precision of the native type.

When you look at this from a broader picture, much of the scientific computing space that wants this dense hardware is the same space that's being hit the hardest by the memory bottleneck. So it almost makes sense for the industry to backpedal on DP/dense hardware until they fix the memory problem, which I doubt will happen anytime soon.
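A crude back-of-envelope model shows just how superlinear that precision benefit is. If w-bit words are convolved in a length-n floating-point transform, the accumulated products must fit in the mantissa, so roughly 2w + log₂(n) ≤ mantissa bits. (Real implementations use tighter round-off bounds; this inequality is an illustrative simplification, not anyone's production error analysis.)

```python
import math

# Rough payload (bits of bignum data carried per word) of a
# floating-point FFT multiply: solve 2*w + log2(n) <= mantissa bits.
def payload_bits(mantissa_bits, fft_len):
    return (mantissa_bits - math.log2(fft_len)) / 2

for name, mant in [("single (24-bit mantissa)", 24),
                   ("double (53-bit mantissa)", 53)]:
    w = payload_bits(mant, 2**22)   # a 4M-point transform
    print(f"{name}: ~{w:.1f} payload bits/word")
```

For a 4M-point transform this gives about 1 payload bit per single-precision word versus about 15.5 per double: a bit more than double the precision buys roughly 15× the payload, which is why "low-precision" hardware is such a poor trade for this workload.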
#10
"David"
Jul 2015
Ohio
205₁₆ Posts
The memory bandwidth problem is very real. We're at the upper limits of reasonable caches, and more parallelism has nowhere to drop the data unless it's going to be used by the same core and is thus cache-oriented.

In thinking about building the FFTW ("fastest in the West") LL/large-FFT convolution hardware, I wondered if, instead of dealing with memory, you could just have two or more hardware chips with a super-wide I/O bus playing catch between iterations. You can get actual (vs. theoretical) hundreds of GB/second that way. An 8K FFT needs what, 130MB of throughput per iteration? That's 1500 iterations/second @ 200 GB/s bandwidth. You could do a 100M exponent in less than a day.

Ultimately we just need an isPrime circuit: 32-bit exponent input wires, and it gives the answer in one clock cycle. Or just pipe those 32 bits, or the bits of the whole number, into a neural net and train it to recognize primes. Surely someone has at least experimented with that.
#11
P90 years forever!
Aug 2002
Yeehaw, FL
10000001010111₂ Posts
A 4M FFT needs ~135 MB of bandwidth per iteration. A 4M FFT is 32MB of data requiring two passes over the data. Thus, 32MB read + 32MB write + 32MB read + 32MB write + somewhere between 5MB and 10MB of read-only sin/cos/weights data.
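Those figures also let us sanity-check the iteration-rate arithmetic upthread. A quick sketch (the 7 MB table figure below is just the midpoint of the 5-10 MB range quoted above):

```python
# Bandwidth per LL iteration for a 4M-point FFT of 8-byte doubles:
# two passes, each reading and writing the full dataset, plus tables.
FFT_LEN = 4 * 2**20
data_mb = FFT_LEN * 8 / 2**20            # 32 MB of residue data
per_iter_mb = 2 * (data_mb + data_mb) + 7  # 2 passes * (read + write) + tables
print(f"~{per_iter_mb:.0f} MB per iteration")

# If the computation were purely bandwidth-bound:
for bw_gb in (200, 500, 1000):
    iters = bw_gb * 1024 / per_iter_mb
    print(f"{bw_gb} GB/s -> ~{iters:.0f} iterations/s")
```

At 200 GB/s this comes out to roughly 1500 iterations/second, matching the estimate in post #10; arithmetic throughput, latency, and pass structure keep real programs below the bandwidth-bound ceiling.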
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
|---|---|---|---|---|
| 128 bit integer division in CUDA? | cseizert | GPU Computing | 8 | 2016-11-27 15:41 |
| Non-power-of-two FFTs | jasonp | Computer Science & Computational Number Theory | 15 | 2014-06-10 14:49 |
| P95 PrimeNet causes BSOD; small FFTs, large FFTs, and blend test don't | KarateF22 | PrimeNet | 16 | 2013-10-28 00:34 |
| In Place Large FFTs Failure | nwb | Information & Answers | 2 | 2011-07-08 16:04 |
| gmp-ecm and FFTs | dave_dm | Factoring | 9 | 2004-09-04 11:47 |