CUDALLR is available and, in my experience, stable. It only uses power-of-2 FFT sizes, and speed improves with larger exponents. The main FFT jump we care about is just over 3M for k=69, so your Teslas would be most useful in the upper 2M range, or over 5M (relative to CPU workers, that is).
Check the hardware/GPU computing forum. I didn't see the thread when I glanced, but I've been running the program for over a year and even found a prime for k=5 with it in the 3-megabit range.
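To illustrate why the power-of-2 restriction matters near an FFT jump, here is a rough Python sketch. It compares the next power-of-2 length with the next length of the form m * 2^e. The multiplier set (1, 3, 5, 7) and both function names are assumptions made up for the illustration; they are not taken from CUDALLR or from any CPU library.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def next_flexible_length(n: int, multipliers=(1, 3, 5, 7)) -> int:
    """Smallest m * 2**e >= n, with m drawn from a hypothetical multiplier set."""
    best = None
    for m in multipliers:
        needed = -(-n // m)                      # ceil(n / m)
        candidate = m * next_power_of_two(needed)
        if best is None or candidate < best:
            best = candidate
    return best


if __name__ == "__main__":
    # A required length just past a power of two forces the
    # power-of-2-only code to double its FFT size.
    for required in (1024, 1025, 1500):
        p2 = next_power_of_two(required)
        flex = next_flexible_length(required)
        print(required, p2, flex, round(p2 / flex, 2))
```

Just past a power of two, the power-of-2-only code carries the most padding, which is why the position of an exponent relative to the jump decides where the GPU looks best against CPU workers.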
Thanks Curtis, I downloaded it. Will try to get it to work!
Is the power-of-2 restriction the only 'disadvantage' compared to the SSE2 IBDWT I have running currently?
I seem to remember that my own FFT implementation, which also used power-of-2 sizes, had a few other disadvantages (to put it politely) :)
The Teslas I have here are 0.5 Tflop in theory. Of course that's always 2x more than they can do in terms of instructions, since the quoted figure assumes you can use multiply-add, and I'm not sure whether this FFT can. Looking forward to benchmarking them with this code!
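As a back-of-the-envelope version of that 2x remark, here is a small Python sketch. It assumes the quoted peak counts every instruction as a multiply-add worth 2 flops; the function name and the fma_fraction parameter are hypothetical, and the result is an upper bound, not a benchmark.

```python
def effective_tflops(quoted_peak_tflops: float, fma_fraction: float) -> float:
    """Upper bound on Tflop/s when only some instructions are multiply-adds.

    Assumption: the quoted peak counts each instruction as a multiply-add
    (2 flops), so the instruction rate is quoted_peak_tflops / 2.
    """
    if not 0.0 <= fma_fraction <= 1.0:
        raise ValueError("fma_fraction must be between 0 and 1")
    instruction_rate = quoted_peak_tflops / 2.0   # tera-instructions per second
    flops_per_instruction = 1.0 + fma_fraction    # average of 2 (FMA) and 1 (plain op)
    return instruction_rate * flops_per_instruction


if __name__ == "__main__":
    # 0.5 Tflop is the theoretical figure quoted in the post.
    for fraction in (0.0, 0.5, 1.0):
        print(fraction, effective_tflops(0.5, fraction))
```

With no multiply-adds at all the bound drops to half the quoted figure, and it only reaches the full 0.5 Tflop if every instruction is a multiply-add.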
Note that on Nvidia it should be possible to run a different code stream on each SIMD. I don't know whether it can still deliver 0.5 Tflop doing that, but if it can, it should be easier to get rid of the power-of-2 FFT size restriction? Maybe?