|
|
#45 |
|
Jul 2009
Tokyo
2·5·61 Posts |
Hi,
New version on GTX260. This version uses the CUDA matrix transpose (see NVIDIA_GPU_Computing_SDK/C/src/transpose). It supports only the 4096k FFT.
MacLucasFFTW.cuda.t.tar.gz
$ time ./MacLucasFFTW 216091
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 4194304, MacLucasFFTW v8.1 Ballester
...
Iteration 210000 M( 216091 )C, 0xcfe091c8f59f8a7b, n = 4194304, MacLucasFFTW v8.1 Ballester
M( 216091 )P, n = 4194304, MacLucasFFTW v8.1 Ballester
real 99m5.087s
user 42m45.096s
sys 0m0.272s
$ time ./MacLucasFFTW 63333333
Iteration 10000 M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester
M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester
real 4m20.656s
user 1m41.750s
sys 0m0.112s
4096k fft sec/iter = 0.026
Thank you, |
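For anyone checking the sec/iter figure: it appears to be the `real` wall-clock time divided by the iteration count. A minimal sketch, assuming the timed run was stopped after 10,000 iterations (the log suggests this but does not say so); `wall_seconds` and `sec_per_iter` are hypothetical helper names, not part of MacLucasFFTW.

```python
import re


def wall_seconds(real: str) -> float:
    """Convert a bash `time` value such as '4m20.656s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if m is None:
        raise ValueError(f"unrecognised time format: {real!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def sec_per_iter(real: str, iterations: int) -> float:
    """Average wall-clock seconds per iteration."""
    return wall_seconds(real) / iterations


# Assumption: the 63333333 run covered 10,000 iterations.
print(f"{sec_per_iter('4m20.656s', 10_000):.4f} sec/iter")
```

The same division applied to the full 216091 test (99m5.087s over 216,091 iterations) gives a slightly higher per-iteration figure, presumably because it includes startup and checkpoint output.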
|
|
|
|
|
#46 |
|
Jul 2003
So Cal
101001100111₂ Posts |
I just retested the 4096K FFT for the recent versions on the C1060. Version k is the fastest.
Version k: 0.0264 sec/iter
Version o: 0.0278 sec/iter
Version t: 0.0324 sec/iter |
|
|
|
|
|
#47 | |
|
Jul 2009
Tokyo
2×5×61 Posts |
Hi, frmky
Quote:
Arguments 63333333
Max.Execution Time:30
Thank you, |
|
|
|
|
|
|
#48 |
|
Jul 2003
So Cal
2,663 Posts |
Attached is the csv. I'll upload the graph in the next post.
|
|
|
|
|
|
#49 |
|
Jul 2003
So Cal
2,663 Posts |
Here's the graph:
|
|
|
|
|
|
#50 |
|
Jul 2009
Tokyo
610₁₀ Posts |
|
|
|
|
|
|
#51 |
|
Jul 2003
So Cal
2,663 Posts |
Not sure if it'll help, but...
Code:
[childers release]$ ./transpose
Transposing a 256 by 4096 matrix of floats...
Naive transpose average time: 2.286 ms
Optimized transpose average time: 0.327 ms
Test PASSED
Press ENTER to exit... |
|
|
|
|
|
#52 |
|
Jul 2009
Tokyo
2×5×61 Posts |
Code:
$ ./transpose
Transposing a 256 by 4096 matrix of floats...
Naive transpose average time: 1.281 ms
Optimized transpose average time: 0.184 ms |
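To compare these millisecond timings with GB/s figures, one can convert them to effective bandwidth. A rough sketch, assuming each float is read once and written once and that 1 GB = 10^9 bytes (the SDK sample may count differently); `transpose_bandwidth_gb_s` is a hypothetical helper.

```python
def transpose_bandwidth_gb_s(rows: int, cols: int, ms: float,
                             bytes_per_elem: int = 4) -> float:
    """Effective bandwidth of a transpose, in GB/s (1 GB = 1e9 bytes)."""
    # Assumption: every element is read once and written once.
    moved_bytes = 2 * rows * cols * bytes_per_elem
    return moved_bytes / (ms * 1e-3) / 1e9


# Optimized-transpose timings for the 256 by 4096 float matrix in the two runs above.
for card, ms in [("C1060", 0.327), ("GTX260", 0.184)]:
    print(card, f"{transpose_bandwidth_gb_s(256, 4096, ms):.1f} GB/s")
```

Under these assumptions the optimized kernel moves roughly 26 GB/s on the C1060 and roughly 46 GB/s on the GTX260.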
|
|
|
|
|
#53 |
|
Jul 2003
So Cal
101001100111₂ Posts |
Try using transposeDiagonal from transposeNew.cu.
Code:
[TransposeNew]
> Device 0: "Tesla C1060"
> SM Capability 1.3 detected:
> CUDA device has 30 Multi-Processors
> SM performance scaling factor = 1.00
Matrix size: 2048x2048 (64x64 tiles), tile size: 32x32, block size: 32x8
Kernel                   Loop over kernel   Loop within kernel
------                   ----------------   ------------------
simple copy              73.02 GB/s         74.76 GB/s
shared memory copy       70.46 GB/s         71.84 GB/s
naive transpose          2.13 GB/s          2.14 GB/s
coalesced transpose      17.53 GB/s         18.25 GB/s
no bank conflict trans   17.70 GB/s         18.34 GB/s
coarse-grained           17.70 GB/s         18.33 GB/s
fine-grained             69.51 GB/s         72.12 GB/s
diagonal transpose       63.91 GB/s         69.27 GB/s
Test PASSED
Press ENTER to exit... |
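For readers unfamiliar with the diagonal variant: the idea is to remap block indices so that blocks scheduled together work on tiles along a diagonal, which spreads global-memory traffic across partitions. The following is a CPU-side sketch in Python of that remapping as I understand it, not a copy of transposeNew.cu; `diagonal_block` and `tiled_transpose` are hypothetical names, and the matrix dimensions are assumed to be multiples of the tile size.

```python
def diagonal_block(bx, by, grid_x, grid_y):
    """Map a launched block (bx, by) to the tile (x, y) it should process."""
    if grid_x == grid_y:
        # Square grid: consecutive bx values walk down a diagonal.
        return (bx + by) % grid_x, bx
    # Non-square grid: linearise the block id, then apply the same shift.
    bid = bx + grid_x * by
    tile_y = bid % grid_y
    tile_x = (bid // grid_y + tile_y) % grid_x
    return tile_x, tile_y


def tiled_transpose(a, tile):
    """Transpose a list-of-lists matrix tile by tile, in diagonal block order."""
    rows, cols = len(a), len(a[0])
    grid_x, grid_y = cols // tile, rows // tile  # assumes exact multiples
    out = [[0] * rows for _ in range(cols)]
    for by in range(grid_y):
        for bx in range(grid_x):
            tx, ty = diagonal_block(bx, by, grid_x, grid_y)
            for r in range(ty * tile, (ty + 1) * tile):
                for c in range(tx * tile, (tx + 1) * tile):
                    out[c][r] = a[r][c]
    return out


matrix = [[r * 8 + c for c in range(8)] for r in range(4)]
assert tiled_transpose(matrix, 2) == [list(col) for col in zip(*matrix)]
```

The remapping only changes the order in which tiles are visited; every tile is still processed exactly once, so the result matches a plain transpose.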
|
|
|
|
|
#54 |
|
Jul 2009
Tokyo
2·5·61 Posts |
I now use the diagonal transpose code.
Only the 2048k and 4096k FFT sizes are supported.
$ time ./MacLucasFFTW 216091
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 2097152, MacLucasFFTW v8.1 Ballester
...
Iteration 210000 M( 216091 )C, 0xcfe091c8f59f8a7b, n = 2097152, MacLucasFFTW v8.1 Ballester
M( 216091 )P, n = 2097152, MacLucasFFTW v8.1 Ballester
real 48m8.792s
user 0m22.289s
sys 0m9.177s
$ time ./MacLucasFFTW 33333333
Iteration 10000 M( 33333333 )C, 0xd717246f501c7d94, n = 2097152, MacLucasFFTW v8.1 Ballester
M( 33333333 )C, 0xd717246f501c7d94, n = 2097152, MacLucasFFTW v8.1 Ballester
real 2m9.478s
user 0m1.248s
sys 0m0.528s
2048k fft sec/iter = 0.0130
$ time ./MacLucasFFTW 63333333
Iteration 10000 M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester
M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester
real 4m6.204s
user 0m1.348s
sys 0m0.496s
4096k fft sec/iter = 0.0246
Thank you, |
|
|
|
|
|
#55 |
|
Jul 2003
So Cal
5147₈ Posts |
Version u is much better, but still about 5% slower than version k.
Version k: 0.0264 sec/iter
Version o: 0.0278 sec/iter
Version u: 0.0277 sec/iter
Here's the updated profiler graph. The transpose no longer dominates the runtime. |
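The "about 5% slower" figure follows directly from the sec/iter numbers. A small sketch of the arithmetic, using only the timings quoted in this post; `slowdown_pct` is a hypothetical helper.

```python
# sec/iter for the 4096K FFT on the C1060, as quoted in this post
timings = {"k": 0.0264, "o": 0.0278, "u": 0.0277}


def slowdown_pct(t: float, baseline: float) -> float:
    """Percentage by which t is slower than baseline."""
    return (t / baseline - 1.0) * 100.0


for version in ("o", "u"):
    pct = slowdown_pct(timings[version], timings["k"])
    print(f"version {version}: {pct:.1f}% slower than version k")
```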
|
|
|