2009-10-16, 11:16  #1
Jul 2009
Tokyo
610_{10} Posts 
CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)
Hi,
I have converted MacLucasFFTW to CUDA/CUFFTW (single precision) and am running it on an ION/Atom 330. Quote:
Thank you, 

2009-10-17, 05:13  #2
Jul 2003
So Cal
2^{2}·23^{2} Posts 
Wow, that's slow! For fun, I thought I'd try it on the C1060. A 64-bit compile didn't work, but a 32-bit build works fine, although it runs at about the same speed. It's completely bandwidth-limited by all of the transfers on and off the device.
Last fiddled with by frmky on 2009-10-17 at 05:14
2009-10-17, 05:34  #3
Jul 2009
Tokyo
262_{16} Posts 
Hi, Mr. frmky.
It depends on PCI bus bandwidth. (So my choice of ION was correct...) Right now there are 4 CPU <-> GPU data transfers per iteration. Reducing that to 2 is easy, but getting to 0 is very difficult. Putting every routine on the GPU... is a nightmare.

ION$ time ./MacLucasFFTW 859433
...
859001 262144
M( 859433 )P, n = 262144, MacLucasFFTW v8.1 Ballester

real 553m57.624s
user 486m37.685s
sys 63m24.838s
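For readers unfamiliar with the program: each "iteration" here is one squaring step of the Lucas-Lehmer test, and a line such as `M( 859433 )P` appears to report that 2^859433 - 1 passed it. The following is a minimal Python sketch of the test using built-in big integers. It is not the FFT-based squaring that MacLucasFFTW performs, and the function name `lucas_lehmer` is only illustrative.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if the Mersenne number 2**p - 1 is prime.

    Assumes p is itself a prime exponent. Uses plain Python integers,
    so it is only practical for small p.
    """
    if p == 2:
        return True  # 2**2 - 1 = 3 is prime; the loop below needs p > 2
    m = (1 << p) - 1          # the Mersenne number being tested
    s = 4                     # standard starting value
    for _ in range(p - 2):    # p - 2 squaring iterations
        s = (s * s - 2) % m   # the step the FFT code accelerates
    return s == 0


if __name__ == "__main__":
    for p in (7, 11, 13):
        print(p, lucas_lehmer(p))
```

The squaring `s * s` is where an FFT-based multiply is used for large exponents, which is why the CPU <-> GPU transfers around the FFT matter on every iteration.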
2009-10-18, 21:56  #4
Jul 2009
Tokyo
2·5·61 Posts 
It depends on latency, not bandwidth.

11211 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester
real 130m0.106s
user 108m8.678s
sys 11m52.809s
Using a 4194304-point FFT: 130 min / 11213 iter = 0.70 sec/iter

11211 8388608
M( 11213 )P, n = 8388608, MacLucasFFTW v8.1 Ballester
real 263m41.280s
user 217m26.123s
sys 25m9.138s
Using an 8388608-point FFT: 263 min / 11213 iter = 1.41 sec/iter
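The sec/iter figures come from dividing the wall-clock ("real") time by the iteration count. A small Python sketch of that arithmetic follows. The helper names are hypothetical, and the iteration count is taken as 11213 as in the post (the strict Lucas-Lehmer count of p - 2 = 11211 changes the result only slightly).

```python
import re


def parse_time(text: str) -> float:
    """Convert a `time` field such as '130m0.106s' into seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", text)
    if match is None:
        raise ValueError(f"unrecognised time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def seconds_per_iteration(real_time: str, iterations: int) -> float:
    """Wall-clock seconds spent per iteration."""
    return parse_time(real_time) / iterations


if __name__ == "__main__":
    runs = (("130m0.106s", 4194304), ("263m41.280s", 8388608))
    for real, fft_points in runs:
        rate = seconds_per_iteration(real, 11213)
        print(f"{fft_points}-point FFT: {rate:.2f} sec/iter")
```

Run on the two "real" times above, this gives roughly 0.70 and 1.41 sec/iter, so doubling the FFT length roughly doubles the time per iteration.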
2009-10-19, 08:34  #5
Jul 2009
Tokyo
2×5×61 Posts 
I have made a MacLucasFFTW / Ubuntu 9.04 (32-bit) / CUDA 2.3 / CUFFTW (double precision) version.

2009-10-21, 11:58  #6
Jul 2009
Tokyo
2·5·61 Posts 
GTX260 result.
Hi,
I got a GTX260 from Akihabara.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester
real 4m14.854s
user 2m53.103s
sys 1m21.725s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester
real 62m52.755s
user 44m54.440s
sys 17m58.311s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester
real 22m30.253s
user 20m19.644s
sys 2m10.604s
2048k fft sec/iter = 0.12

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester
real 44m57.511s
user 40m37.008s
sys 4m20.572s
4096k fft sec/iter = 0.24
2009-10-21, 23:48  #7
Jul 2009
Tokyo
2×5×61 Posts 
GTX260 CUFFT double precision benchmark.
base source from http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/
CUFFT double precision Complex to Complex fft.
calculated 4 point fft 1000 times in 0.008224 seconds = 4.863870 mflops
calculated 8 point fft 1000 times in 0.011834 seconds = 10.140262 mflops
calculated 16 point fft 1000 times in 0.022916 seconds = 13.964068 mflops
calculated 32 point fft 1000 times in 0.018881 seconds = 42.370436 mflops
calculated 64 point fft 1000 times in 0.021270 seconds = 90.268460 mflops
calculated 128 point fft 1000 times in 0.033173 seconds = 135.049269 mflops
calculated 256 point fft 1000 times in 0.040438 seconds = 253.227750 mflops
calculated 512 point fft 1000 times in 0.039472 seconds = 583.703807 mflops
calculated 1024 point fft 1000 times in 0.053949 seconds = 949.044903 mflops
calculated 2048 point fft 1000 times in 0.063347 seconds = 1778.141443 mflops
calculated 4096 point fft 1000 times in 0.073652 seconds = 3336.771353 mflops
calculated 8192 point fft 1000 times in 0.072037 seconds = 7391.749264 mflops
calculated 16384 point fft 1000 times in 0.095094 seconds = 12060.484847 mflops
calculated 32768 point fft 1000 times in 0.168577 seconds = 14578.497099 mflops
calculated 65536 point fft 1000 times in 0.290185 seconds = 18067.368993 mflops
calculated 131072 point fft 1000 times in 0.541373 seconds = 20579.382834 mflops
calculated 262144 point fft 1000 times in 1.012113 seconds = 23310.597859 mflops
calculated 524288 point fft 1000 times in 1.930565 seconds = 25799.369189 mflops
calculated 1048576 point fft 1000 times in 3.874456 seconds = 27063.825242 mflops
calculated 2097152 point fft 1000 times in 7.998022 seconds = 27531.927191 mflops
calculated 4194304 point fft 1000 times in 16.420866 seconds = 28096.778989 mflops
calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops

CUFFT double precision Complex to Complex fft with memory transfer.
calculated 4 point fft 1000 times in 0.039361 seconds = 1.016236 mflops
calculated 8 point fft 1000 times in 0.043003 seconds = 2.790504 mflops
calculated 16 point fft 1000 times in 0.054218 seconds = 5.902111 mflops
calculated 32 point fft 1000 times in 0.051855 seconds = 15.427609 mflops
calculated 64 point fft 1000 times in 0.053721 seconds = 35.740155 mflops
calculated 128 point fft 1000 times in 0.065968 seconds = 67.911776 mflops
calculated 256 point fft 1000 times in 0.075842 seconds = 135.017605 mflops
calculated 512 point fft 1000 times in 0.076387 seconds = 301.622209 mflops
calculated 1024 point fft 1000 times in 0.096225 seconds = 532.085824 mflops
calculated 2048 point fft 1000 times in 0.116388 seconds = 967.797693 mflops
calculated 4096 point fft 1000 times in 0.149064 seconds = 1648.687601 mflops
calculated 8192 point fft 1000 times in 0.188968 seconds = 2817.832906 mflops
calculated 16384 point fft 1000 times in 0.297410 seconds = 3856.224820 mflops
calculated 32768 point fft 1000 times in 0.541233 seconds = 4540.742277 mflops
calculated 65536 point fft 1000 times in 1.005135 seconds = 5216.095639 mflops
calculated 131072 point fft 1000 times in 1.937994 seconds = 5748.790044 mflops
calculated 262144 point fft 1000 times in 3.775305 seconds = 6249.285866 mflops
calculated 524288 point fft 1000 times in 7.420558 seconds = 6712.077396 mflops
calculated 1048576 point fft 1000 times in 14.830570 seconds = 7070.368855 mflops
calculated 2097152 point fft 1000 times in 29.878380 seconds = 7369.909602 mflops
calculated 4194304 point fft 1000 times in 60.144817 seconds = 7671.042373 mflops
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./bandwidthTest
Running on......
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2529.5

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2173.9

Without memory transfer:
calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops
With memory transfer:
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

Estimated transfer time:
8388608*16/1024/1024 (Mbyte) / 2100 (MB/s) * 2 * 1000 (times) = 121.9 sec
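The two calculations behind these numbers can be reproduced with a short Python sketch. It assumes the benchmark uses the common 5·N·log2(N) flop-count convention for a complex FFT (this is an assumption, although it does reproduce the figures printed above), and it uses the rounded 2100 MB/s bandwidth and 16 bytes per double-precision complex point from the estimate in the post. The function names are illustrative only.

```python
import math


def fft_mflops(n: int, seconds: float, repeats: int = 1000) -> float:
    """MFLOPS for `repeats` complex FFTs of length n.

    Assumes the conventional 5 * n * log2(n) operation count.
    """
    flops = 5.0 * n * math.log2(n) * repeats
    return flops / seconds / 1e6


def transfer_seconds(n: int,
                     bandwidth_mb_s: float = 2100.0,
                     bytes_per_point: int = 16,
                     directions: int = 2,
                     repeats: int = 1000) -> float:
    """Estimated time to copy the data to the device and back `repeats` times."""
    mbytes = n * bytes_per_point / 1024 / 1024
    return mbytes / bandwidth_mb_s * directions * repeats


if __name__ == "__main__":
    n = 8388608
    print(f"FFT only: {fft_mflops(n, 34.149594):.1f} mflops")
    print(f"Transfer estimate: {transfer_seconds(n):.1f} sec")
```

For the 8388608-point case the transfer estimate alone (about 121.9 sec) is nearly equal to the measured 121.56 sec with memory transfer, which suggests the copies, not the FFT, set the run time in that benchmark.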
2009-10-22, 15:50  #8
Jul 2009
Tokyo
2·5·61 Posts 
Hi,
New result on GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester
real 2m16.778s
user 1m22.469s
sys 0m54.311s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester
real 31m40.768s
user 19m31.713s
sys 12m9.150s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester
real 15m2.462s
user 10m15.066s
sys 4m47.262s
2048k fft sec/iter = 0.08

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester
real 30m14.297s
user 20m37.141s
sys 9m41.688s
4096k fft sec/iter = 0.16

Thank you,
2009-10-23, 13:24  #9
Jul 2009
Tokyo
2×5×61 Posts 
Hi,
New result on GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester
real 1m42.692s
user 1m2.768s
sys 0m39.906s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester
real 23m8.920s
user 14m3.845s
sys 9m5.094s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester
real 8m35.896s
user 5m5.511s
sys 3m30.349s
2048k fft sec/iter = 0.046

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester
real 17m14.207s
user 10m0.930s
sys 7m12.587s
4096k fft sec/iter = 0.092

Thank you,
2009-10-24, 13:39  #10
Jul 2009
Tokyo
2×5×61 Posts 
Hi,
New result on GTX260.

M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester
real 1m39.794s
user 1m7.864s
sys 0m31.934s

M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester
real 20m33.342s
user 14m1.825s
sys 6m31.548s

M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester
real 7m27.026s
user 5m8.783s
sys 2m18.257s
2048k fft sec/iter = 0.040

M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester
real 14m54.153s
user 10m14.254s
sys 4m39.897s
4096k fft sec/iter = 0.080

Thank you,
2009-10-25, 11:17  #11
Mar 2003
Melbourne
1003_{8} Posts 
Good work.
My Core i7 920 clocked at default settings gives:
Best time for 2048K FFT length: 39.869 ms.
Best time for 4096K FFT length: 87.849 ms.
So you're getting into the realm of what's theoretically expected. Can't wait until the 3xx series comes out with a 5-fold increase in 64-bit floats.
-- Craig
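To put the two sets of figures side by side, the GTX260 sec/iter values from post #10 can be converted to milliseconds and divided by the Core i7 920 times quoted here. This is a rough sketch only: it compares the numbers exactly as posted, and the two programs may not measure precisely the same work. The dictionary and function names are illustrative.

```python
# GTX260 timings from post #10, in seconds per iteration
GTX260_SEC_PER_ITER = {"2048K": 0.040, "4096K": 0.080}

# Core i7 920 best times from this post, in milliseconds
CORE_I7_MS = {"2048K": 39.869, "4096K": 87.849}


def gpu_to_cpu_ratio(fft_length: str) -> float:
    """GPU time divided by CPU time; below 1.0 means the GPU is faster."""
    gpu_ms = GTX260_SEC_PER_ITER[fft_length] * 1000.0
    return gpu_ms / CORE_I7_MS[fft_length]


if __name__ == "__main__":
    for fft_length in ("2048K", "4096K"):
        print(f"{fft_length}: GPU/CPU = {gpu_to_cpu_ratio(fft_length):.3f}")
```

On these numbers the two are about level at 2048K, and the GTX260 is roughly 9% faster at 4096K.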