#1 - "Eric", Jan 2018, USA, 223 Posts
Hi there. After using numerous hardware configurations to run LL tests with huge FFT sizes, I am curious how exactly the amount of data transferred between memory and cache can be calculated from the number of FFT iterations per second at a given size, and how much memory bandwidth an FFT at, say, 4096K would actually consume.
#2 - "∂²ω=0" (ewmayer), Sep 2002, República de California, 11,756 Posts
A typical high-performance FFT implementation will do effectively 2 passes through memory per iteration. Each pass consists of effectively one full residue-array read and one write, and will access data in approximately cache-sized (L1/L2) chunks, consisting of contiguous data in pass 1 and of "overlapping combs" of strips of widely strided data in pass 2. In addition to the residue-array accesses, one also has reads of auxiliary data tables of FFT twiddles (complex roots of unity) and DWT weights; the bandwidth needed for those is very implementation-dependent, and I am not sure what a "typical" figure for their memory traffic is as a percentage of main-data-array bandwidth, since there are various tradeoffs available here between memory bandwidth and on-the-fly computation.
There will also be traffic between cache levels, which is probably a decent-sized multiple of the main-memory traffic, reflecting the higher bandwidth available at the between-cache and cache-to-CPU levels. But the main-memory accesses are usually the gating factor. It also depends on what you mean by "huge FFT sizes": much larger than the current GIMPS wavefront and you'll be looking at a minimum of 3 read/write passes through memory.
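For anyone curious what those two passes look like in practice, here is a toy numpy sketch of the classic four-step (transpose) FFT decomposition, which has roughly the structure described above; it is purely illustrative and is not the actual Mlucas/Prime95 code. Each of the two np.fft.fft calls corresponds to one full pass over the array, one touching strided columns and the other contiguous rows, with a twiddle multiply in between:

Code:
import numpy as np

def four_step_fft(x, N1, N2):
    # Treat the length-N signal as an N2 x N1 matrix: m[n2, n1] = x[n1 + N1*n2]
    N = N1 * N2
    m = x.reshape(N2, N1)
    # Pass 1: N1 independent length-N2 FFTs down the (strided) columns
    a = np.fft.fft(m, axis=0)
    # Twiddle multiply by W_N^(n1*k2) between the two passes
    k2 = np.arange(N2).reshape(N2, 1)
    n1 = np.arange(N1).reshape(1, N1)
    a *= np.exp(-2j * np.pi * k2 * n1 / N)
    # Pass 2: N2 independent length-N1 FFTs along the (contiguous) rows
    b = np.fft.fft(a, axis=1)
    # b[k2, k1] = X[k2 + N2*k1]; transpose and flatten back to natural order
    return b.T.reshape(N)

rng = np.random.default_rng(1)
x = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
print(np.allclose(four_step_fft(x, 64, 64), np.fft.fft(x)))   # True

A real implementation does each of those two passes in cache-sized blocks, which is where the contiguous-chunks versus overlapping-combs access pattern comes from.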
#3 - "P90 years forever!", Aug 2002, Yeehaw, FL, 8,279 Posts
A 4M FFT at 8 bytes per FFT element is 32MB. As Ernst said, we must read and write that data twice, so 128MB of bandwidth is needed for each iteration. Prime95 also reads about 5MB of read-only data each iteration, for a grand total of about 133MB of bandwidth per iteration.
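A quick sanity check of that arithmetic (a sketch only; the 5MB figure for Prime95's read-only tables is the one quoted in this post):

Code:
fft_len = 4 * 1024 * 1024                  # 4M FFT elements
array_mb = fft_len * 8 / 2**20             # 8-byte doubles -> 32 MB residue array
passes = 2                                 # two passes, each a full read plus a full write
aux_mb = 5                                 # Prime95's read-only tables (figure from this post)
total_mb = passes * 2 * array_mb + aux_mb  # 2 x (read + write) x 32MB, plus 5MB
print(total_mb)                            # 133.0 MB of main-memory traffic per iteration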
#4 - "Eric", Jan 2018, USA, 223 Posts
This is very interesting information. I have always been a bit suspicious about the memory requirement, since I noticed that only around 900 iters/sec is enough to max out the memory controller on my Titan V, whose HBM is overclocked to 1040MHz for a theoretical bandwidth of around 800GB/s; despite raising the core clock from around 1200MHz to 1770MHz, the GPU stayed at about the same speed. The CPU seems less memory-dependent than the GPU (in terms of system RAM, maybe because more of the workload stays in the L1-L3 caches?). I also ran a quick benchmark at 4096K FFT again and got around 1185 iters/sec, which according to the 133MB/iter figure works out to about 154GB/s, far below the theoretical 800GB/s the HBM2 provides. Is there overhead in CUDA that causes the memory to be used less efficiently than with x86 assembly?
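For reference, the implied-bandwidth arithmetic behind that ~154GB/s figure (a sketch; it assumes Prime95's 133MB/iteration estimate carries over unchanged to the CUDA code):

Code:
iters_per_sec = 1185           # measured 4096K benchmark rate
mb_per_iter = 133              # per-iteration traffic estimate from the previous post
gib_per_sec = iters_per_sec * mb_per_iter / 1024
print(round(gib_per_sec, 1))   # ~153.9 GiB/s achieved, versus ~800 GB/s theoretical HBM2 peak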
#5 - "P90 years forever!", Aug 2002, Yeehaw, FL, 8,279 Posts
CudaLucas uses Nvidia's FFT library, which is closed source. I expect it makes more than two passes over main memory. The reasons for this are three-fold:
1) A GPU's internal cache is much smaller than a CPU's, so it likely takes 3 or maybe 4 passes over memory to do the forward FFT, and the same again for the inverse FFT.
2) Prime95/Mlucas/GPUowl have special code to finish the forward FFT, do the pointwise squaring, and start the inverse FFT with just one full access to main memory (CUDALucas will require three).
3) Prime95/Mlucas/GPUowl have special code to finish the inverse FFT, do the carry propagation, and start the next forward FFT with just one full access to main memory (CUDALucas will require three).
Summary: Prime95 requires about 133MB of bandwidth per iteration; CUDALucas likely requires 520MB or 650MB.
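One plausible way to arrive at those 520MB-650MB figures (a back-of-the-envelope sketch; the pass counts and the one-extra-access-per-join accounting are assumptions, not measurements of Nvidia's library):

Code:
array_mb = 32                               # 4M FFT of 8-byte doubles
for fft_passes in (3, 4):                   # assumed passes per forward (and per inverse) FFT
    # forward passes + inverse passes, plus one extra full read/write each for the
    # pointwise-squaring and carry steps that CUDALucas cannot fuse into the FFT passes
    full_read_writes = 2 * fft_passes + 2
    print(fft_passes, full_read_writes * 2 * array_mb)   # 512 MB and 640 MB per iteration

Adding a few MB of twiddle/weight reads on top of that lands in the 520MB-650MB range quoted above.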
#6 - "Eric", Jan 2018, USA, 223 Posts
That explains a lot about the bottlenecking, since I am seeing Pascal cards with terrible DP performance using only about half of the memory controller. If Nvidia's FFT library were open source, I would expect HUGE performance gains on GPUs that aren't bottlenecked by DP performance (Kepler-series Titan, Volta, and Pascal-series Quadro/Tesla P100). From what I have heard, OpenCL inherently carries a lot of overhead by itself; does that mean that under the same circumstances AMD GPUs will perform a lot worse than Nvidia GPUs or CPUs simply due to that overhead?
#8 - "Eric", Jan 2018, USA, 223 Posts
Also, I am noticing that overclocking the HBM on Vega still helps the speed a lot even though its DP performance is quite weak (below 1 TFLOPS). Does that mean GPUOwl also uses significantly more bandwidth than Prime95 and CUDALucas?
#9 - "Mihai Preda", Apr 2015, 1,452 Posts
Examples of relatively good OpenCL implementations are the LLVM-based ones (ROCm, amdgpu-pro). Examples of poor OpenCL implementations come from Nvidia -- that is where OpenCL is slower than CUDA and has many more bugs than CUDA. This, mind you, is because of Nvidia, not because of OpenCL. (Also, it is tricky to write good OpenCL code, just as it is tricky to write good CUDA code.)
#10 - "Eric", Jan 2018, USA, 223 Posts
Thank you! That makes a lot of sense. Is there a specific amount of memory transfer needed per iteration for GPUOwl's FFT code at a given FFT size?
#11 - "Mihai Preda", Apr 2015, 1,452 Posts
E.g. for FFT 4608K, the buffer size is 4608K × 8 bytes (because one DP double is 8 bytes), about 37MB, and that buffer is read and written about 7x per iteration, so about 260MB of R+W per iteration. For the smaller 4M FFT, only about 5x R+W of the buffer is needed, about 160MB per iteration. To explain: about 4x (or 6x for the larger FFT) are actual reads/writes of the residue data, and 1x covers the remaining reads of constant buffers: the weights and the various trig "wiggles".
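A quick sketch of that bookkeeping (illustrative only; the 5x and 7x read+write multipliers are the figures quoted in this post, and the MB values in the thread are loosely rounded):

Code:
def gpuowl_traffic_mb(fft_kelems, rw_multiplier):
    # residue buffer: fft_kelems * 1024 doubles of 8 bytes each, expressed in MiB
    buf_mb = fft_kelems * 1024 * 8 / 2**20
    return rw_multiplier * buf_mb

print(gpuowl_traffic_mb(4608, 7))   # 252.0 -> the "about 260MB" figure (6x data R/W + 1x constants)
print(gpuowl_traffic_mb(4096, 5))   # 160.0 -> the "160MB" figure (4x data R/W + 1x constants)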