2011-02-03, 02:38   #5
Andrew Thall
Dec 2010

23 Posts

Thanks, all. I've had a few volunteers by email. It's mainly the CUDALucas timings I need, but I think we've got it covered for now. Just looking at msec per Lucas iteration for given FFT sizes on Fermi-architecture cards.
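
For anyone unsure what "msec per Lucas iteration" means, here is a minimal sketch in Python. It assumes the plain Lucas-Lehmer recurrence s -> s^2 - 2 (mod 2^p - 1) and uses Python's built-in big integers, so it has no FFT multiply and no adjustable FFT size. It is not CUDALucas and not my GPU code, and the function name `lucas_lehmer_timed` is made up for illustration. It only shows how the per-iteration figure is taken.

```python
import time


def lucas_lehmer_timed(p):
    """Lucas-Lehmer test for M_p = 2**p - 1, where p is an odd prime.

    Returns (is_prime, mean_msec_per_iteration).
    Illustrative only: Python's big-integer multiply is used instead of
    an FFT-based squaring, so the timings are not comparable to GPU runs.
    """
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # standard starting value
    iterations = p - 2        # the test needs exactly p - 2 squarings
    start = time.perf_counter()
    for _ in range(iterations):
        s = (s * s - 2) % m   # one "Lucas iteration"
    elapsed = time.perf_counter() - start
    return s == 0, 1000.0 * elapsed / iterations


if __name__ == "__main__":
    for p in (521, 607, 1279, 2203):
        is_prime, msec = lucas_lehmer_timed(p)
        print(f"p={p}: prime={is_prime}, {msec:.4f} msec/iteration")
```

In the real programs the squaring is done with a floating-point FFT of some chosen length, and that length is the "FFT size" the timings are reported against.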

I pulled some single-CPU times off the benchmark pages but would be interested in the sorts of speedups you get with multiple cores. I did some early experiments with multicore FFTW that left me less than thrilled, but that was a few years ago, and I'd like to hear how your well-tuned FFTs perform. Too late to make it into this paper, though.

Finally had a chance to dig into the CUDALucas source code. It's a very different method from my approach, which is your academic, massively data-parallel, digit-per-thread sort of technique. It's likely faster than mine for large N, but without non-power-of-two transforms, it's not going to be as fast for any given Mersenne number. I suspect its performance won't scale as rapidly with a higher multiprocessing level (more cores), but we'll have to see when we get some better cards.

Last fiddled with by Andrew Thall on 2011-02-03 at 02:40