I worry you are leaving too much performance on the table. Are you planning on running multithreaded FFTs? The gwnum library's multithreading implementation is not great for smaller FFTs. See the benchmark below: every implementation of the 128K FFT delivers roughly 21% to 37% less total throughput running as one 4-threaded worker than as four single-threaded workers. This is from a non-OC'ed Skylake box.
Code:
Prime95 64-bit version 29.2, RdtscTiming=1
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 1 worker): 0.15 ms. Throughput: 6828.90 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 4 workers): 0.46, 0.46, 0.46, 0.46 ms. Throughput: 8680.52 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 1 worker): 0.15 ms. Throughput: 6774.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 4 workers): 0.46, 0.45, 0.46, 0.46 ms. Throughput: 8786.84 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 1 worker): 0.16 ms. Throughput: 6238.53 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 4 workers): 0.46, 0.46, 0.46, 0.46 ms. Throughput: 8616.51 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 1 worker): 0.18 ms. Throughput: 5605.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 4 workers): 0.45, 0.45, 0.45, 0.45 ms. Throughput: 8911.29 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 1 worker): 0.17 ms. Throughput: 6036.93 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 4 workers): 0.43, 0.43, 0.43, 0.43 ms. Throughput: 9346.56 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 1 worker): 0.17 ms. Throughput: 5885.71 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 4 workers): 0.44, 0.44, 0.44, 0.44 ms. Throughput: 9054.45 iter/sec.
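
If you want to check the comparison yourself, here is a rough sketch that pairs up the 1-worker and 4-worker lines for each FFT implementation and computes how much throughput the multithreaded run gives up. The regex and the function name `multithread_loss` are my own, not anything from Prime95 or gwnum, and the parsing assumes the line format shown in the output above.

```python
import re

# Benchmark lines copied verbatim from the Prime95 output above.
BENCH = """\
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 1 worker): 0.15 ms. Throughput: 6828.90 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=4 (4 cpus, 4 workers): 0.46, 0.46, 0.46, 0.46 ms. Throughput: 8680.52 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 1 worker): 0.15 ms. Throughput: 6774.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=2 (4 cpus, 4 workers): 0.46, 0.45, 0.46, 0.46 ms. Throughput: 8786.84 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 1 worker): 0.16 ms. Throughput: 6238.53 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=128, Pass2=1024, clm=1 (4 cpus, 4 workers): 0.46, 0.46, 0.46, 0.46 ms. Throughput: 8616.51 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 1 worker): 0.18 ms. Throughput: 5605.04 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=4 (4 cpus, 4 workers): 0.45, 0.45, 0.45, 0.45 ms. Throughput: 8911.29 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 1 worker): 0.17 ms. Throughput: 6036.93 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=2 (4 cpus, 4 workers): 0.43, 0.43, 0.43, 0.43 ms. Throughput: 9346.56 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 1 worker): 0.17 ms. Throughput: 5885.71 iter/sec.
FFTlen=128K, Type=3, Arch=4, Pass1=512, Pass2=256, clm=1 (4 cpus, 4 workers): 0.44, 0.44, 0.44, 0.44 ms. Throughput: 9054.45 iter/sec.
"""

# Hypothetical pattern matching the fields needed from each benchmark line.
LINE_RE = re.compile(
    r"Pass1=(\d+), Pass2=(\d+), clm=(\d+) \(\d+ cpus, (\d+) workers?\):"
    r".*Throughput: ([\d.]+) iter/sec"
)


def multithread_loss(text):
    """Map (Pass1, Pass2, clm) to the fraction of throughput lost by the
    1-worker (multithreaded) run relative to the run with the most workers."""
    by_config = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # skip header lines or anything in a different format
        pass1, pass2, clm, workers = (int(g) for g in m.groups()[:4])
        rate = float(m.group(5))
        by_config.setdefault((pass1, pass2, clm), {})[workers] = rate

    losses = {}
    for config, rates in by_config.items():
        single_worker = rates[1]           # one worker using all threads
        many_workers = rates[max(rates)]   # one single-threaded worker per core
        losses[config] = 1.0 - single_worker / many_workers
    return losses


losses = multithread_loss(BENCH)
for (pass1, pass2, clm), loss in sorted(losses.items()):
    print(f"Pass1={pass1} Pass2={pass2} clm={clm}: {loss:.1%} lower throughput multithreaded")
```

The loss is measured against the 4-worker figure, so it reads as "how much total throughput you give up by running one multithreaded worker instead of four independent ones".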