Originally Posted by henryzz
Memory bandwidth is often the issue. More recently some CPUs have had enough L3 cache that memory won't be needed. This Threadripper is likely to be one of them.
For small FFTs it's definitely true that memory bandwidth is irrelevant. For large FFTs, memory bandwidth should come into play, as you probably want to increase the worker count to decrease CCX-to-CCX communication; 16 discrete chunks of L3 is not ideal and needs to be worked around. The larger the FFT, the more likely memory bandwidth comes into play, but as long as you tune to the point that CCX communication, cache contention, memory bandwidth and compute are in check, you should be in the ballpark of optimal.
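To put rough numbers on that, here's a back-of-envelope sketch in Python. Everything in it is an assumption for illustration: 8 bytes per FFT element (double precision), 16 MiB per L3 chunk, and a working set equal to just the FFT data itself (no twiddle tables or scratch buffers, so real footprints will be larger). The function names are made up for this post, not taken from any program.

```python
MIB = 1024 * 1024

def fft_working_set_bytes(fft_len, bytes_per_element=8):
    """Approximate data footprint of one FFT (assumes doubles, data only)."""
    return fft_len * bytes_per_element

def fits_in_l3_chunk(fft_len, l3_chunk_bytes, workers_per_chunk=1):
    """True if every worker sharing one L3 chunk can keep its FFT resident."""
    needed = fft_working_set_bytes(fft_len) * workers_per_chunk
    return needed <= l3_chunk_bytes

if __name__ == "__main__":
    l3_chunk = 16 * MIB  # assumed size of one of the 16 L3 chunks
    for k in (256, 1024, 2048, 4096, 8192):
        fft_len = k * 1024
        size_mib = fft_working_set_bytes(fft_len) / MIB
        verdict = "fits" if fits_in_l3_chunk(fft_len, l3_chunk) else "spills to RAM"
        print(f"{k}K FFT: {size_mib:.0f} MiB per worker -> {verdict}")
```

Under these assumptions, once a single worker's FFT no longer fits in one chunk you are either spanning CCXs or going out to memory, which is the point where tuning the worker count starts to matter.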
