mersenneforum.org  

Old 2019-03-14, 02:33   #1
xx005fs
 
"Eric"
Jan 2018
USA

2⁴·13 Posts
Cache and Memory Bandwidth Requirements for FFT

Hi there, after using numerous hardware configurations to run LL tests with huge FFT sizes, I am curious how exactly the amount of data transferred between memory and cache can be calculated from the number of FFT iterations per second at a given size, and how much memory bandwidth an FFT of, say, 4096K would consume.
Old 2019-03-14, 02:46   #2
ewmayer
2ω=0
 
 
Sep 2002
República de California

10011000111111₂ Posts

A typical high-performance FFT implementation will do effectively 2 passes through memory per iteration. Each pass consists of effectively one full residue-array read and one write, and will access data in approximately cache-sized (L1/L2) chunks: contiguous data in pass 1, and "overlapping combs" of strips of wide-strided data in pass 2. In addition to the residue-array accesses, one also has reads of auxiliary data tables of FFT twiddles (complex roots of unity) and DWT weights. The bandwidth needed for those is very implementation-dependent; I'm not sure what a "typical" figure for them is, as a percentage of main-data-array bandwidth, since there are various tradeoffs available here between memory bandwidth and on-the-fly computation.

There will also be traffic between cache levels, which is probably a decent-sized multiple of the main-memory traffic, reflecting the higher available bandwidth at the between-cache and cache-to-CPU levels. But the main-memory accesses are usually the gating factor.

It also depends on what you mean by "huge FFT sizes" - much larger than the current GIMPS wavefront and you'll be looking at a minimum of 3 read/write passes through memory.
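
To make that two-pass access pattern concrete, here is a minimal illustrative sketch (not code from Mlucas, Prime95, or any other program mentioned in this thread; N1, N2, and sub_fft are made-up names, and the butterflies are stubbed out): pass 1 walks contiguous rows, pass 2 walks wide-strided combs of a length N = N1*N2 transform.

Code:
/* Sketch of the memory-access pattern of a 2-pass, length N = N1*N2 FFT.
   Only the traversal order is shown; the actual butterflies are stubbed. */
#include <stdlib.h>

#define N1 2048            /* length of the contiguous sub-FFTs  */
#define N2 2048            /* number of wide-strided "comb" FFTs */

static void sub_fft(double *x, size_t len, size_t stride) {
    /* placeholder for a sub-transform on len elements */
    (void)x; (void)len; (void)stride;
}

int main(void) {
    double *a = malloc((size_t)N1 * N2 * sizeof *a);
    if (!a) return 1;

    /* Pass 1: each sub-FFT touches one contiguous, cache-sized row. */
    for (size_t r = 0; r < N2; ++r)
        sub_fft(a + r * N1, N1, 1);

    /* Pass 2: each sub-FFT walks a wide-strided "comb", one element per
       row; in practice several combs are interleaved so that a fetched
       cache line is fully consumed before being evicted. */
    for (size_t c = 0; c < N1; ++c)
        sub_fft(a + c, N2, N1);

    free(a);
    return 0;
}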

Last fiddled with by ewmayer on 2019-03-14 at 02:52
Old 2019-03-14, 02:53   #3
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

2×43×83 Posts

A 4M FFT at 8-bytes per FFT element is 32MB. As Ernst said, we must read and write that data twice. Thus 128MB of bandwidth is needed for each iteration. Prime95 uses about 5MB of read-only data each iteration. Grand total is 133MB bandwidth per iteration.
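
For reference, that arithmetic as a tiny sketch (the constants are just the figures quoted in this post, not anything read out of Prime95):

Code:
/* Per-iteration main-memory traffic for a 4M FFT, per the figures above:
   4M doubles read and written twice, plus ~5MB of read-only tables. */
#include <stdio.h>

int main(void)
{
    const double fft_mb       = 4.0 * 1024 * 1024 * 8 / (1024.0 * 1024); /* 32 MB */
    const double passes       = 2.0;   /* each pass = one full read + one write */
    const double read_only_mb = 5.0;   /* twiddles, weights, etc.               */

    double per_iter = passes * 2.0 * fft_mb + read_only_mb;  /* = 133 MB */
    printf("~%.0f MB of memory traffic per iteration\n", per_iter);
    return 0;
}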
Old 2019-03-14, 03:05   #4
xx005fs
 
"Eric"
Jan 2018
USA

2⁴×13 Posts

This is very interesting information. I have always been a bit suspicious about the memory requirement, since I realized that doing only around 900 iters/sec maxes out the memory controller on my Titan V, which has HBM overclocked to 1040MHz for a theoretical bandwidth of around 800GB/s; despite raising the core clock from around 1200MHz to 1770MHz, the GPU stayed at about the same speed. The CPU seems not to be as memory-dependent as the GPU (in terms of system RAM, maybe because the workload is offloaded to the L1-L3 caches?). I also ran a quick benchmark at 4096K FFT and got around 1185 iters/sec, which at 133MB/iter works out to about 154GB/s, far lower than the theoretical 800GB/s provided by the HBM2. Is there any overhead on CUDA that causes the memory to be less efficiently utilized compared to x86 Assembly?
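
That utilization estimate, written out as a small sketch (all numbers are the ones quoted in this thread; the 133MB/iter is the estimate above, not a measured figure):

Code:
/* Effective memory bandwidth implied by an iteration rate, compared
   with the card's theoretical peak. Purely illustrative constants. */
#include <stdio.h>

int main(void)
{
    const double iters_per_sec   = 1185.0;   /* 4096K FFT on a Titan V   */
    const double mb_per_iter     = 133.0;    /* Prime95-style estimate   */
    const double peak_gb_per_sec = 800.0;    /* overclocked HBM2         */

    double gb_per_sec = iters_per_sec * mb_per_iter / 1024.0; /* ~154 if "MB" means MiB */
    printf("implied bandwidth: ~%.0f GB/s (%.0f%% of the %.0f GB/s peak)\n",
           gb_per_sec, 100.0 * gb_per_sec / peak_gb_per_sec, peak_gb_per_sec);
    return 0;
}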
Old 2019-03-14, 03:58   #5
Prime95
P90 years forever!
 
 
Aug 2002
Yeehaw, FL

1101111100010₂ Posts

Quote:
Originally Posted by xx005fs
Is there any overhead on CUDA that causes the memory to be less efficiently utilized compared to x86 Assembly?
Yes.

CUDALucas uses nVidia's FFT library, which is closed source. I expect it makes more than two passes over main memory. The reasons for this are three-fold. 1) The CPU's internal cache is much larger than a GPU's. This means it takes 3 or maybe 4 passes over memory to do the forward FFT, and the same for the inverse FFT. 2) Prime95/Mlucas/GPUowl have special code to finish the forward FFT, do the point-wise squaring, and start the inverse FFT with just one full access to main memory (CUDALucas will require three). 3) Prime95/Mlucas/GPUowl have special code to finish the inverse FFT, do the carry propagation, and start the next forward FFT with just one full access to main memory (CUDALucas will require three).

Summary: Prime95 requires 133MB of bandwidth, CUDALucas likely requires 520MB or 650MB of bandwidth per iteration.
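
One plausible way to reconstruct those round numbers, as a hedged sketch (the pass counts are an assumption read off the reasoning above, not measurements from CUDALucas):

Code:
/* Reconstructing the 520MB / 650MB estimates, assuming each "pass" is one
   full read plus one full write of the 32MB residue buffer, with a few MB
   of constant tables on top. Pass counts are a reading of the post above:
   forward 3-4 + inverse 3-4 + one extra each for squaring and carry. */
#include <stdio.h>

int main(void)
{
    const double buf_mb = 32.0;             /* 4M FFT, 8 bytes per element */
    const int pass_counts[2] = { 8, 10 };

    for (int i = 0; i < 2; ++i) {
        int p = pass_counts[i];
        double traffic = p * 2.0 * buf_mb;  /* read + write per pass       */
        traffic += p;                       /* rough allowance for tables  */
        printf("%2d passes -> ~%.0f MB per iteration\n", p, traffic);
    }
    return 0;                               /* prints ~520 and ~650 MB     */
}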
Old 2019-03-14, 04:25   #6
xx005fs
 
"Eric"
Jan 2018
USA

2⁴·13 Posts

Quote:
Originally Posted by Prime95
CUDALucas uses nVidia's FFT library, which is closed source. I expect it makes more than two passes over main memory. The reasons for this are three-fold. 1) The CPU's internal cache is much larger than a GPU's. This means it takes 3 or maybe 4 passes over memory to do the forward FFT, and the same for the inverse FFT. 2) Prime95/Mlucas/GPUowl have special code to finish the forward FFT, do the point-wise squaring, and start the inverse FFT with just one full access to main memory (CUDALucas will require three). 3) Prime95/Mlucas/GPUowl have special code to finish the inverse FFT, do the carry propagation, and start the next forward FFT with just one full access to main memory (CUDALucas will require three).

Summary: Prime95 requires 133MB of bandwidth, CUDALucas likely requires 520MB or 650MB of bandwidth per iteration.

That explains a lot about the bottlenecking, since I am seeing Pascal cards with terrible DP performance still eating half the memory controller's bandwidth. If Nvidia's FFT library were open source, I would expect HUGE performance gains on GPUs that aren't bottlenecked by DP performance (Kepler-series Titans, Volta, and Pascal-series Quadro/Tesla P100). From what I have heard before, OpenCL inherently has a lot of overhead by itself; does that mean that under the same circumstances AMD GPUs will perform a lot worse than Nvidia GPUs or CPUs simply due to the overhead?

Last fiddled with by xx005fs on 2019-03-14 at 04:25
Old 2019-03-17, 10:59   #7
preda
 
 
"Mihai Preda"
Apr 2015

1,291 Posts

Quote:
Originally Posted by xx005fs
That explains a lot about the bottlenecking, since I am seeing Pascal cards with terrible DP performance still eating half the memory controller's bandwidth. If Nvidia's FFT library were open source, I would expect HUGE performance gains on GPUs that aren't bottlenecked by DP performance (Kepler-series Titans, Volta, and Pascal-series Quadro/Tesla P100). From what I have heard before, OpenCL inherently has a lot of overhead by itself; does that mean that under the same circumstances AMD GPUs will perform a lot worse than Nvidia GPUs or CPUs simply due to the overhead?
OpenCL has no overhead whatsoever by itself. It allows you to use the hardware close to its limits.
Old 2019-03-17, 17:09   #8
xx005fs
 
"Eric"
Jan 2018
USA

208₁₀ Posts

Quote:
Originally Posted by preda
OpenCL has no overhead whatsoever by itself. It allows you to use the hardware close to its limits.
Interesting. Does that mean that all the other apps that use OpenCL (such as Blender and V-Ray) are just poorly coded, so that they don't leverage the hardware as well as CUDA does?
Also, I am noticing that overclocking the HBM on Vega still benefits the speed a lot even though the DP performance is quite weak (below 1 TFLOPS). Does that mean GPUOWL also uses significantly more bandwidth than Prime95 and CUDALucas?
Old 2019-03-17, 17:43   #9
preda
 
 
"Mihai Preda"
Apr 2015

1,291 Posts

Quote:
Originally Posted by xx005fs
Interesting. Does that mean that all the other apps that use OpenCL (such as Blender and V-Ray) are just poorly coded, so that they don't leverage the hardware as well as CUDA does?
OK, a good OpenCL implementation does not have to add a penalty. OpenCL is just an API, pretty much equivalent to CUDA in expressive power; it has no inherent overhead.

Examples of relatively good OpenCL implementations are the LLVM-based ones (ROCm, amdgpu-pro). Examples of poor OpenCL implementations come from Nvidia -- that is where OpenCL is slower than CUDA and has many more bugs than CUDA. This, mind you, is because of Nvidia, not because of OpenCL.

(also, it is tricky to write good OpenCL code, just the same as it's tricky to write good CUDA code.)

Quote:
Originally Posted by xx005fs
Also, I am noticing that overclocking the HBM on Vega still benefits the speed a lot even though the DP performance is quite weak (below 1 TFLOPS). Does that mean GPUOWL also uses significantly more bandwidth than Prime95 and CUDALucas?
This might be because GpuOwl splits the work into a few steps (kernel runs), and a few of these steps are more memory-intensive (doing the "transpose" in memory). These memory-bound kernels benefit from faster RAM.
Old 2019-03-17, 20:44   #10
xx005fs
 
"Eric"
Jan 2018
USA

2⁴×13 Posts

Quote:
Originally Posted by preda
OK, a good OpenCL implementation does not have to add a penalty. OpenCL is just an API, pretty much equivalent to CUDA in expressive power; it has no inherent overhead.

Examples of relatively good OpenCL implementations are the LLVM-based ones (ROCm, amdgpu-pro). Examples of poor OpenCL implementations come from Nvidia -- that is where OpenCL is slower than CUDA and has many more bugs than CUDA. This, mind you, is because of Nvidia, not because of OpenCL.

(also, it is tricky to write good OpenCL code, just the same as it's tricky to write good CUDA code.)

This might be because GpuOwl splits the work into a few steps (kernel runs), and a few of these steps are more memory-intensive (doing the "transpose" in memory). These memory-bound kernels benefit from faster RAM.

Thank you! That makes a lot of sense. Is there a specific amount of memory transfer needed for each individual iteration for GPUOWL's FFT code for a certain FFT size?
Old 2019-03-19, 10:26   #11
preda
 
 
"Mihai Preda"
Apr 2015

1,291 Posts

Quote:
Originally Posted by xx005fs
Thank you! That makes a lot of sense. Is there a specific amount of memory transfer needed for each individual iteration for GPUOWL's FFT code for a certain FFT size?
It's not very simple, but as an approximation, in one iteration (i.e. forward FFT + mul + inverse FFT) GpuOwl does about 7x R+W of the full "bits" buffer.

E.g. for a 4608K FFT, the buffer size is 4608K * 8 bytes (because one DP (double) is 8 bytes), which is about 37MB. So about 260MB of R+W per iteration.

For the smaller 4M FFT, only about 5x R+W of the bits buffer is needed, about 160MB per iteration.

To explain: about 4x (or 6x) are actual reads/writes of the bits, and the remaining 1x covers reads of constant buffers: the weights and various trig "wiggles".
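
The same estimate as a small sketch (the 7x/5x factors and buffer sizes are the ones quoted above; nothing here comes from the GpuOwl source):

Code:
/* GpuOwl per-iteration traffic estimate: buffer size times the
   approximate number of full-buffer reads+writes quoted above. */
#include <stdio.h>

static double traffic_mb(double fft_k, double rw_factor)
{
    double buffer_mb = fft_k * 1024 * 8 / (1024.0 * 1024); /* doubles */
    return rw_factor * buffer_mb;
}

int main(void)
{
    printf("4608K FFT: ~%.0f MB/iter\n", traffic_mb(4608, 7)); /* ~252, close to the ~260 above (decimal MB) */
    printf("4M FFT:    ~%.0f MB/iter\n", traffic_mb(4096, 5)); /* ~160 */
    return 0;
}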

Last fiddled with by preda on 2019-03-19 at 10:31