mersenneforum.org  

Old 2009-11-10, 06:06   #45
msft
 
Jul 2009
Tokyo


Hi,

New version, run on a GTX260.

This version uses the CUDA matrix transpose (see NVIDIA_GPU_Computing_SDK/C/src/transpose).

This version supports only the 4096k FFT.

MacLucasFFTW.cuda.t.tar.gz

$ time ./MacLucasFFTW 216091
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 4194304, MacLucasFFTW v8.1 Ballester
...
Iteration 210000 M( 216091 )C, 0xcfe091c8f59f8a7b, n = 4194304, MacLucasFFTW v8.1 Ballester
M( 216091 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 99m5.087s
user 42m45.096s
sys 0m0.272s

$ time ./MacLucasFFTW 63333333
Iteration 10000 M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester
M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester

real 4m20.656s
user 1m41.750s
sys 0m0.112s

4096k fft sec/iter = 0.026
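
The 0.026 figure appears to follow from the second timing: that run reports only 10000 iterations, so dividing its `real` time by 10000 gives seconds per iteration. A minimal sketch of the arithmetic, assuming the whole wall-clock time is spent iterating (startup cost is ignored); `parse_real` and `sec_per_iter` are illustrative helper names, not part of MacLucasFFTW:

```python
import re


def parse_real(stamp: str) -> float:
    """Convert a `time`-style stamp such as '4m20.656s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def sec_per_iter(real: str, iterations: int) -> float:
    """Wall-clock seconds per iteration for a run of `iterations` steps."""
    return parse_real(real) / iterations


# Figures taken from the M( 63333333 ) run above.
print(f"{sec_per_iter('4m20.656s', 10000):.4f} sec/iter")
```

The same helper can be pointed at any of the other `real` lines in this thread, as long as the iteration count for that run is known.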

Thank you,
Attached Files
File Type: gz MacLucasFFTW.cuda.t.tar.gz (31.0 KB, 528 views)
Old 2009-11-10, 07:15   #46
frmky
 
Jul 2003
So Cal


I just retested the 4096K FFT for the recent versions on the C1060. Version k is the fastest.

Version k: 0.0264 sec/iter
Version o: 0.0278 sec/iter
Version t: 0.0324 sec/iter
Old 2009-11-10, 07:45   #47
msft
 

Hi, frmky
Quote:
Originally Posted by frmky
I just retested the 4096K FFT for the recent versions on the C1060.
Version k is the fastest.

Version k: 0.0264 sec/iter
Version o: 0.0278 sec/iter
Version t: 0.0324 sec/iter
If you can run version "t" under the CUDA Visual Profiler, please send me the .csv result. Profiler settings:

Arguments: 63333333
Max. Execution Time: 30

Thank you,
Attached thumbnail: t.png (96.7 KB)
Old 2009-11-10, 08:15   #48
frmky
 

Attached is the csv. I'll upload the graph in the next post.
Attached Files
File Type: zip MacLucasFFTW_t_Session1.zip (80.5 KB, 518 views)
Old 2009-11-10, 08:16   #49
frmky
 

Here's the graph:
Attached thumbnail: profile_t.PNG (21.5 KB)
Old 2009-11-10, 08:46   #50
msft
 

Quote:
Originally Posted by frmky
Here's the graph:
Thank you, frmky.
transpose_kernel is copied from NVIDIA_GPU_Computing_SDK/C/src/transpose/transpose_kernel.cu; I have not touched it.
Hmm...
Old 2009-11-10, 09:17   #51
frmky
 

Not sure if it'll help, but...
Code:
[childers release]$ ./transpose 
Transposing a 256 by 4096 matrix of floats...
Naive transpose average time:     2.286 ms
Optimized transpose average time: 0.327 ms

Test PASSED

Press ENTER to exit...
Old 2009-11-10, 09:25   #52
msft
 

Code:
$ ./transpose
Transposing a 256 by 4096 matrix of floats...
Naive transpose average time:     1.281 ms
Optimized transpose average time: 0.184 ms
The C1060 has 512 bank; something is wrong.
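
The two `./transpose` runs are easier to compare as effective bandwidth. A rough sketch, assuming each float is read once and written once and taking 1 GB as 10^9 bytes (the SDK's own GB/s figures may use a different convention); the function name is illustrative, and the second set of timings is presumed to come from the GTX260 mentioned earlier:

```python
def effective_bandwidth_gb_s(rows, cols, time_ms, bytes_per_element=4):
    """Effective bandwidth of a transpose that reads and writes every element once."""
    bytes_moved = 2 * rows * cols * bytes_per_element  # one read + one write
    return bytes_moved / (time_ms * 1e-3) / 1e9


# Timings (ms) copied from the two ./transpose outputs for the 256 x 4096 matrix.
timings_ms = {
    "frmky naive": 2.286,
    "frmky optimized": 0.327,
    "msft naive": 1.281,
    "msft optimized": 0.184,
}

for label, ms in timings_ms.items():
    print(f"{label:16s} {effective_bandwidth_gb_s(256, 4096, ms):6.2f} GB/s")
```

Under these assumptions the optimized kernel reaches roughly 26 GB/s in frmky's run and roughly 46 GB/s in msft's run, which is the gap being puzzled over here.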
Old 2009-11-10, 09:58   #53
frmky
 

Try using transposeDiagonal from transposeNew.cu.

Code:
[TransposeNew]
> Device 0: "Tesla C1060"
> SM Capability 1.3 detected:
> CUDA device has 30 Multi-Processors
> SM performance scaling factor = 1.00

Matrix size: 2048x2048 (64x64 tiles), tile size: 32x32, block size: 32x8

Kernel                  Loop over kernel        Loop within kernel
------                  ----------------        ------------------
simple copy             73.02 GB/s              74.76 GB/s
shared memory copy      70.46 GB/s              71.84 GB/s
naive transpose          2.13 GB/s               2.14 GB/s
coalesced transpose     17.53 GB/s              18.25 GB/s
no bank conflict trans  17.70 GB/s              18.34 GB/s
coarse-grained          17.70 GB/s              18.33 GB/s
fine-grained            69.51 GB/s              72.12 GB/s
diagonal transpose      63.91 GB/s              69.27 GB/s

Test PASSED

Press ENTER to exit...
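
For readers wondering what transposeDiagonal changes: as far as can be reconstructed from the SDK sample, the kernel body is the usual tiled transpose, but each block first remaps its own index so that blocks launched together work on tiles along a diagonal rather than along one row. The sketch below models only that remapping on the CPU, for a square grid. The formula is written from memory of transposeNew.cu and should be checked against the file; `diagonal_block` is an illustrative name.

```python
def diagonal_block(bx, by, grid_dim_x):
    """Map a launch-order block index (bx, by) to the tile it processes.

    Square-grid case only (assumed formula): the new row is the old column,
    and the new column is the old column shifted by the old row, modulo the
    grid width.
    """
    new_by = bx
    new_bx = (bx + by) % grid_dim_x
    return new_bx, new_by


grid = 4  # a 4 x 4 grid of tiles, small enough to print

for by in range(grid):
    print([diagonal_block(bx, by, grid) for bx in range(grid)])

# The remapping is a permutation: every tile is still processed exactly once.
tiles = {diagonal_block(bx, by, grid) for by in range(grid) for bx in range(grid)}
print(len(tiles) == grid * grid)
```

Every tile is still visited exactly once; only the order changes. The usual explanation is that this spreads simultaneous accesses across the card's memory partitions, which would fit the jump from about 18 GB/s to about 64 GB/s in the table above.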
Old 2009-11-10, 15:03   #54
msft
 

Quote:
Originally Posted by frmky
Code:
diagonal transpose      63.91 GB/s              69.27 GB/s
I now use the diagonal transpose code.

This version supports only 2048k and 4096k.

$ time ./MacLucasFFTW 216091
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 2097152, MacLucasFFTW v8.1 Ballester
...
Iteration 210000 M( 216091 )C, 0xcfe091c8f59f8a7b, n = 2097152, MacLucasFFTW v8.1 Ballester
M( 216091 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 48m8.792s
user 0m22.289s
sys 0m9.177s

$ time ./MacLucasFFTW 33333333
Iteration 10000 M( 33333333 )C, 0xd717246f501c7d94, n = 2097152, MacLucasFFTW v8.1 Ballester
M( 33333333 )C, 0xd717246f501c7d94, n = 2097152, MacLucasFFTW v8.1 Ballester

real 2m9.478s
user 0m1.248s
sys 0m0.528s

2048k fft sec/iter = 0.0130

$ time ./MacLucasFFTW 63333333
Iteration 10000 M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester
M( 63333333 )C, 0xa5d7a917d728239a, n = 4194304, MacLucasFFTW v8.1 Ballester

real 4m6.204s
user 0m1.348s
sys 0m0.496s

4096k fft sec/iter = 0.0246
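
To put the sec/iter figures in perspective: a Lucas-Lehmer test of M(p) needs p - 2 squarings, so a full run can be estimated by multiplying. A back-of-the-envelope sketch, assuming the measured rate stays constant for the whole test and ignoring startup and checkpoint costs; `full_test_days` is an illustrative name:

```python
SECONDS_PER_DAY = 86400


def full_test_days(exponent, sec_per_iter):
    """Estimated wall-clock days for a complete Lucas-Lehmer test of M(exponent)."""
    iterations = exponent - 2  # the LL test performs p - 2 squarings
    return iterations * sec_per_iter / SECONDS_PER_DAY


# Exponents and rates taken from the benchmark runs above.
for exponent, rate in [(33333333, 0.0130), (63333333, 0.0246)]:
    print(f"M({exponent}): about {full_test_days(exponent, rate):.1f} days")
```

With these inputs the estimate is about 5 days at the 2048k FFT size and about 18 days at 4096k.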

Thank you,
Attached Files
File Type: gz MacLucasFFTW.cuda.u.tar.gz (31.4 KB, 532 views)
Old 2009-11-10, 21:23   #55
frmky
 

Version u is much better, but still about 5% slower than version k.

Version k: 0.0264 sec/iter
Version o: 0.0278 sec/iter
Version u: 0.0277 sec/iter
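
The percentages can be checked directly from the C1060 timings quoted in this thread. A small sketch, with version k as the baseline; `slowdown_percent` is an illustrative name:

```python
def slowdown_percent(candidate, baseline):
    """How much slower `candidate` is than `baseline`, in percent."""
    return (candidate / baseline - 1.0) * 100.0


baseline_k = 0.0264  # sec/iter, version k on the C1060

# sec/iter for the other versions on the same card, from the posts above
others = {"o": 0.0278, "t": 0.0324, "u": 0.0277}

for version, rate in others.items():
    print(f"version {version}: {slowdown_percent(rate, baseline_k):.1f}% slower than k")
```

Version u comes out at about 4.9% behind version k, matching the "about 5%" above, while version t was about 23% behind.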

Here's the updated profiler graph. transpose no longer dominates the runtime.
Attached thumbnail: profile_u.PNG (43.2 KB)
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.
