20190314, 02:33  #1 
"Eric"
Jan 2018
USA
2^{4}·13 Posts 
Cache and Memory Bandwidth Requirements for FFT
Hi there, after using numerous hardware configurations to run LL tests with huge FFT sizes, I am curious how exactly the amount of data transferred between memory and cache can be calculated from the number of FFT iterations per second at a given size. For example, how much memory bandwidth would an FFT of, say, 4096K consume?

20190314, 02:46  #2 
∂^{2}ω=0
Sep 2002
República de California
10011000111111_{2} Posts 
A typical high-performance FFT implementation will do effectively 2 passes through memory per iteration. Each pass consists of effectively one full residue-array read and one write, and will access data in approximately cache-sized (L1/L2) chunks: contiguous data in pass 1, and "overlapping combs" of strips of widely-strided data in pass 2. In addition to the residue-array accesses, one also has reads of auxiliary data tables of FFT twiddles (complex roots of unity) and DWT weights; the bandwidth needed for those is very implementation-dependent. I'm not sure what a "typical" figure for those accesses is as a percentage of main-data-array bandwidth, since there are various trade-offs available here between memory bandwidth and on-the-fly computation.
There will also be traffic between cache levels, probably a decent-sized multiple of the main-memory traffic, reflecting the higher bandwidth available at the between-cache and cache-to-CPU levels. But the main-memory accesses are usually the gating factor. It also depends on what you mean by "huge FFT sizes": much larger than the current GIMPS wavefront and you'll be looking at a minimum of 3 read/write passes through memory. Last fiddled with by ewmayer on 20190314 at 02:52 
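As a toy sketch (my own illustration, not actual Mlucas code) of the two access patterns described above, on a small list standing in for the FFT residue array:

```python
# Toy model of a two-pass, cache-blocked traversal. Pass 1 touches
# contiguous cache-sized chunks; pass 2 touches "combs" of widely-strided
# elements. Sizes are illustrative assumptions, not real cache/FFT sizes.
N = 1 << 12            # toy residue-array length
B = 1 << 4             # "cache-sized" block (toy value)
x = list(range(N))

# pass 1: contiguous cache-sized chunks (one full read + write)
for base in range(0, N, B):
    for i in range(base, base + B):
        x[i] *= 2              # read-modify-write, contiguous

# pass 2: strided combs; each comb holds B elements, stride N//B apart
stride = N // B
for start in range(stride):
    for i in range(start, N, stride):
        x[i] //= 2             # read-modify-write, widely strided

# every element was touched exactly once per pass, so x is unchanged
```

Each pass reads and writes every element exactly once, which is why the per-iteration traffic below is counted as multiples of the full array size.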
20190314, 02:53  #3 
P90 years forever!
Aug 2002
Yeehaw, FL
2×43×83 Posts 
A 4M FFT at 8 bytes per FFT element is 32MB. As Ernst said, we must read and write that data twice. Thus 128MB of bandwidth is needed for each iteration. Prime95 also reads about 5MB of read-only data each iteration, for a grand total of 133MB of bandwidth per iteration.
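The arithmetic above can be sketched as follows (a back-of-the-envelope check using only the figures in this post):

```python
# Per-iteration memory traffic for a 4M-element FFT, Prime95-style.
fft_elements = 4 * 1024 * 1024          # 4M FFT
bytes_per_element = 8                   # one double per element
residue_mb = fft_elements * bytes_per_element // 2**20   # 32 MB

# two passes, each a full read plus a full write of the residue array
main_mb = 2 * 2 * residue_mb            # 128 MB
readonly_mb = 5                         # auxiliary read-only tables (approx.)
total_mb = main_mb + readonly_mb        # 133 MB per iteration
print(total_mb)                         # -> 133
```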

20190314, 03:05  #4 
"Eric"
Jan 2018
USA
2^{4}×13 Posts 
This is very interesting information. I had always been a bit suspicious about the memory requirement, since I noticed that only around 900 iters/sec was enough to max out the memory controller on my Titan V, whose HBM is overclocked to 1040MHz for a theoretical bandwidth of around 800GB/s; despite raising the core clock from around 1200MHz to 1770MHz, the GPU stayed at about the same speed. The CPU seems less memory-dependent than the GPU (in terms of system RAM, maybe because the workload is absorbed by the L1-L3 caches?). I also ran a quick benchmark at 4096K FFT and got around 1185 iters/sec, which at the 133MB/iter figure works out to about 154GB/s, far below the theoretical 800GB/s provided by the HBM2 stacks. Is there overhead in CUDA that causes memory to be utilized less efficiently than with x86 assembly?
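For what it's worth, the utilization figure quoted above follows directly from these numbers (all assumed from this post, not independently measured):

```python
# Achieved bandwidth implied by the benchmark, vs. the theoretical peak.
iters_per_sec = 1185
mb_per_iter = 133                        # Prime95-style traffic estimate
achieved_gb_s = iters_per_sec * mb_per_iter / 1024   # ~154 GB/s
theoretical_gb_s = 800                   # overclocked HBM2, approx.
utilization = achieved_gb_s / theoretical_gb_s       # ~0.19
print(round(achieved_gb_s), round(utilization * 100))  # -> 154 19
```

So if CUDALucas really needed only 133MB per iteration, it would be using under a fifth of the card's theoretical bandwidth.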

20190314, 03:58  #5  
P90 years forever!
Aug 2002
Yeehaw, FL
1101111100010_{2} Posts 
CUDALucas uses nVidia's FFT library, which is closed source. I expect it makes more than two passes over main memory, for three reasons. 1) A CPU's internal cache is much larger than a GPU's, so the GPU likely needs 3 or maybe 4 passes over memory for the forward FFT, and the same for the inverse FFT. 2) Prime95/Mlucas/GPUowl have special code to finish the forward FFT, do the pointwise squaring, and start the inverse FFT with just one full access to main memory (CUDALucas will require three). 3) Prime95/Mlucas/GPUowl have special code to finish the inverse FFT, do carry propagation, and start the next forward FFT with just one full access to main memory (CUDALucas will require three). Summary: Prime95 requires 133MB of bandwidth per iteration; CUDALucas likely requires 520MB to 650MB. 
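To make the comparison concrete, here is a rough pass-counting sketch (my own accounting under the assumptions above, not code or figures from either project):

```python
# Per-iteration traffic as a function of full read+write passes over the
# 32 MB residue array of a 4M FFT, plus a few MB of read-only tables.
def traffic_mb(full_passes, array_mb=32, tables_mb=5):
    # each "pass" = one full read plus one full write of the residue array
    return full_passes * 2 * array_mb + tables_mb

print(traffic_mb(2))    # fused Prime95/Mlucas/GPUowl style -> 133
print(traffic_mb(8))    # ~4 passes each way -> 517, near the 520MB figure
print(traffic_mb(10))   # with the extra junction accesses -> 645, near 650MB
```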

20190314, 04:25  #6  
"Eric"
Jan 2018
USA
2^{4}·13 Posts 
That explains a lot about the bottlenecking, since I am seeing Pascal cards with terrible DP performance eating half the memory controller. If Nvidia's FFT library were open source, I would expect HUGE performance gains on GPUs that aren't bottlenecked by DP performance (Kepler-series Titan, Volta, and Pascal-series Quadro/Tesla P100). From what I have heard, OpenCL inherently has a lot of overhead by itself; does that mean that under the same circumstances AMD GPUs will perform much worse than Nvidia GPUs or CPUs simply due to that overhead? Last fiddled with by xx005fs on 20190314 at 04:25 

20190317, 10:59  #7  
"Mihai Preda"
Apr 2015
1,291 Posts 


20190317, 17:09  #8  
"Eric"
Jan 2018
USA
208_{10} Posts 
Also, I notice that overclocking the HBM on Vega still benefits the speed a lot even though its DP performance is quite weak (below 1 TFLOPS). Does that mean GPUOWL also uses significantly more bandwidth than Prime95 and CUDALucas?

20190317, 17:43  #9  
"Mihai Preda"
Apr 2015
1,291 Posts 
Examples of relatively good OpenCL implementations are the LLVM-based ones (ROCm, amdgpu-pro). Examples of poor OpenCL implementations come from Nvidia: that is where OpenCL is slower than CUDA and has many more bugs than CUDA. This, mind you, is because of Nvidia, not because of OpenCL. (Also, it is tricky to write good OpenCL code, just the same as it's tricky to write good CUDA code.)


20190317, 20:44  #10  
"Eric"
Jan 2018
USA
2^{4}×13 Posts 
Thank you! That makes a lot of sense. Is there a specific amount of memory transfer per iteration for GPUOWL's FFT code at a given FFT size? 

20190319, 10:26  #11  
"Mihai Preda"
Apr 2015
1,291 Posts 
E.g. for FFT 4608K, the buffer size is 4608K × 8 bytes (because one DP (double) has 8 bytes), about 37MB. So about 7× that, roughly 260MB of R+W, per iteration. For the smaller FFT 4M, only about 5× R+W of the data buffer is needed, 160MB per iteration. To explain: about 4× (or 6× for the larger FFT) are actual reads/writes of the data, and 1× covers the remaining reads of constant buffers: weights, and various trig "wiggles". Last fiddled with by preda on 20190319 at 10:31 
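The estimate above can be written as a small formula (the pass counts and the MiB-based accounting are my own reading of this post, not GPUowl code):

```python
# Per-iteration memory traffic for GPUowl-style code: data_passes full
# read+writes of the FFT buffer, plus ~1x for constant buffers
# (weights, trig "wiggles").
def gpuowl_traffic_mb(fft_k, data_passes):
    buffer_mb = fft_k * 1024 * 8 / 2**20   # fft_k kilo-doubles, 8 bytes each
    return (data_passes + 1) * buffer_mb

print(gpuowl_traffic_mb(4608, 6))   # -> 252.0 (post quotes ~260 MB)
print(gpuowl_traffic_mb(4096, 4))   # -> 160.0
```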
