mersenneforum.org  

2009-10-16, 11:16   #1
msft
Jul 2009
Tokyo
2×5×61 Posts
CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)

Hi,

I converted MacLucasFFTW to CUDA/CUFFTW (single precision).

Results on an ION/Atom 330:
Quote:
ION$ mkdir mers
ION$ cd mers
ION$ wget http://www.garlic.com/%7Ewedgingt/mers.tar.gz
ION$ tar -zxvf mers.tar.gz
ION$ patch -p0 -d . < MacLucasFFTW.cuda.0.patch
ION$ cd mers
ION$ /usr/local/cuda/bin/nvcc -DMERS_PACKAGE -DBIT_SIEVE -DTESTING_SMALL_EXPONENTS \
    -DSIEVE_SIZE_IN_BYTES=32 -DNUM_SMALL_PRIMES=32768 -O3 -DDO_NOT_USE_LONG_DOUBLE \
    -I/usr/local/include MacLucasFFTW.c setup.c rw.c balance.c zero.c \
    -L/usr/local/lib -c
ION$ g++ -fPIC -o MacLucasFFTW MacLucasFFTW.o setup.o rw.o balance.o zero.o \
    -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib \
    -L/NVIDIA_GPU_Computing_SDK/C/common/common/lib/linux -lcudart \
    -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib \
    -L/NVIDIA_GPU_Computing_SDK/C/common/lib/linux -lcufft -lcutil -lm

ION$ time ./MacLucasFFTW 11213
1 2048
...
11001 2048
M( 11213 )P, n = 2048, MacLucasFFTW v8.1 Ballester

real 0m4.945s
user 0m4.200s
sys 0m0.744s

ION$ time ./MacLucasFFTW 216091
1 32768
1 32768
1 32768
1 65536
1001 65536
...
216001 65536
M( 216091 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 35m14.453s
user 30m41.439s
sys 4m32.585s
Cannot resume.

Thank you,
Attached Files
File Type: gz MacLucasFFTW.cuda.0.patch.gz (2.6 KB, 578 views)
2009-10-17, 05:13   #2
frmky
Jul 2003
So Cal
2,039 Posts

Quote:
Originally Posted by msft View Post
ION$ time ./MacLucasFFTW 216091
...
real 35m14.453s
Wow, that's slow! For fun, I thought I'd try it on the C1060. A 64-bit compile didn't work, but 32-bit works fine although it runs at about the same speed. It's completely bandwidth limited with all of the transfers on and off the device.

2009-10-17, 05:34   #3
msft
Jul 2009
Tokyo
262₁₆ Posts

Hi, Mr. frmky.

It depends on PCI bus bandwidth. (So my choice of ION was correct...)
Right now there are 4 CPU <-> GPU data transfers per iteration.
Reducing that to 2 is easy, but getting to 0 is very difficult.
Running every routine on the GPU... is a nightmare.
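
To see why the transfer count matters, here is a minimal Python sketch of a per-iteration transfer cost model. The function name and the size, bandwidth and latency values are hypothetical placeholders chosen for illustration; none of them were measured in this thread.

```python
def transfer_cost_per_iter(n_transfers, bytes_per_transfer,
                           bandwidth_bytes_per_s, latency_s):
    """Estimated seconds per iteration spent on CPU <-> GPU copies.

    Simple model: every copy pays a fixed latency plus size / bandwidth.
    """
    per_copy = latency_s + bytes_per_transfer / bandwidth_bytes_per_s
    return n_transfers * per_copy


if __name__ == "__main__":
    # Hypothetical values for illustration only (not measured on the ION).
    size = 65536 * 8      # 65536 doubles
    bandwidth = 1.0e9     # 1 GB/s, assumed
    latency = 50e-6       # 50 microseconds per copy, assumed
    for k in (4, 2, 0):
        cost = transfer_cost_per_iter(k, size, bandwidth, latency)
        print(f"{k} transfers/iteration: {cost * 1e3:.3f} ms")
```

In this model, going from 4 transfers to 2 halves the copy overhead, and only the all-on-GPU case removes it entirely.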

ION$ time ./MacLucasFFTW 859433
...
859001 262144
M( 859433 )P, n = 262144, MacLucasFFTW v8.1 Ballester

real 553m57.624s
user 486m37.685s
sys 63m24.838s
2009-10-18, 21:56   #4
msft
Jul 2009
Tokyo
2·5·61 Posts

It depends on latency, not bandwidth.

11211 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 130m0.106s
user 108m8.678s
sys 11m52.809s

With a 4194304-point FFT: 130 min / 11213 iter = 0.70 sec/iter

11211 8388608
M( 11213 )P, n = 8388608, MacLucasFFTW v8.1 Ballester

real 263m41.280s
user 217m26.123s
sys 25m9.138s

With an 8388608-point FFT: 263 min / 11213 iter = 1.41 sec/iter
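
As a quick check of the per-iteration arithmetic, here is a minimal Python sketch that converts the "real" times reported by `time` into seconds and divides by the iteration count. The helper names are made up for this example, and the divisor 11213 follows the calculation in this post.

```python
import re


def time_to_seconds(text):
    """Convert a `time`-style duration such as '130m0.106s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def seconds_per_iteration(real_time, iterations):
    """Wall-clock seconds per iteration."""
    return time_to_seconds(real_time) / iterations


if __name__ == "__main__":
    runs = ((4194304, "130m0.106s"), (8388608, "263m41.280s"))
    for points, real in runs:
        per_iter = seconds_per_iteration(real, 11213)
        print(f"{points}-point FFT: {per_iter:.2f} sec/iter")
```

Doubling the FFT length roughly doubles the time per iteration in these two runs.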
2009-10-19, 08:34   #5
msft
Jul 2009
Tokyo
2×5×61 Posts

I made a MacLucasFFTW / Ubuntu 9.04 (32-bit) / CUDA 2.3 / CUFFTW (double precision) version.
Attached Files
File Type: gz MacLucasFFTW.cuda.1.patch.gz (2.4 KB, 344 views)
2009-10-21, 11:58   #6
msft
Jul 2009
Tokyo
2·5·61 Posts
GTX260 result.

Hi,

I got a GTX260 in Akihabara.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 4m14.854s
user 2m53.103s
sys 1m21.725s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 62m52.755s
user 44m54.440s
sys 17m58.311s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 22m30.253s
user 20m19.644s
sys 2m10.604s

2048k fft sec/iter = 0.12

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 44m57.511s
user 40m37.008s
sys 4m20.572s

4096k fft sec/iter = 0.24
2009-10-21, 23:48   #7
msft
Jul 2009
Tokyo
2·5·61 Posts
GTX260 CUFFT double precision benchmark.

The base source is from http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/

CUFFT double-precision complex-to-complex FFT:

calculated 4 point fft 1000 times in 0.008224 seconds = 4.863870 mflops
calculated 8 point fft 1000 times in 0.011834 seconds = 10.140262 mflops
calculated 16 point fft 1000 times in 0.022916 seconds = 13.964068 mflops
calculated 32 point fft 1000 times in 0.018881 seconds = 42.370436 mflops
calculated 64 point fft 1000 times in 0.021270 seconds = 90.268460 mflops
calculated 128 point fft 1000 times in 0.033173 seconds = 135.049269 mflops
calculated 256 point fft 1000 times in 0.040438 seconds = 253.227750 mflops
calculated 512 point fft 1000 times in 0.039472 seconds = 583.703807 mflops
calculated 1024 point fft 1000 times in 0.053949 seconds = 949.044903 mflops
calculated 2048 point fft 1000 times in 0.063347 seconds = 1778.141443 mflops
calculated 4096 point fft 1000 times in 0.073652 seconds = 3336.771353 mflops
calculated 8192 point fft 1000 times in 0.072037 seconds = 7391.749264 mflops
calculated 16384 point fft 1000 times in 0.095094 seconds = 12060.484847 mflops
calculated 32768 point fft 1000 times in 0.168577 seconds = 14578.497099 mflops
calculated 65536 point fft 1000 times in 0.290185 seconds = 18067.368993 mflops
calculated 131072 point fft 1000 times in 0.541373 seconds = 20579.382834 mflops
calculated 262144 point fft 1000 times in 1.012113 seconds = 23310.597859 mflops
calculated 524288 point fft 1000 times in 1.930565 seconds = 25799.369189 mflops
calculated 1048576 point fft 1000 times in 3.874456 seconds = 27063.825242 mflops
calculated 2097152 point fft 1000 times in 7.998022 seconds = 27531.927191 mflops
calculated 4194304 point fft 1000 times in 16.420866 seconds = 28096.778989 mflops
calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops
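
The mflops column above is consistent with the conventional 5·N·log2(N) operation count for a complex FFT of N points. That the benchmark uses exactly this formula is an assumption (the benchmark source is not shown here), but this minimal Python sketch, with a made-up function name, agrees closely with the reported figures:

```python
import math


def fft_mflops(points, repetitions, seconds):
    """MFLOPS assuming 5 * N * log2(N) floating-point operations per FFT."""
    flops_per_fft = 5 * points * math.log2(points)
    return flops_per_fft * repetitions / seconds / 1e6


if __name__ == "__main__":
    # (points, seconds for 1000 FFTs) copied from the table above.
    rows = ((4096, 0.073652), (1048576, 3.874456), (8388608, 34.149594))
    for points, seconds in rows:
        print(f"{points} points: {fft_mflops(points, 1000, seconds):.1f} mflops")
```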

CUFFT double-precision complex-to-complex FFT, including host <-> device memory transfers:

calculated 4 point fft 1000 times in 0.039361 seconds = 1.016236 mflops
calculated 8 point fft 1000 times in 0.043003 seconds = 2.790504 mflops
calculated 16 point fft 1000 times in 0.054218 seconds = 5.902111 mflops
calculated 32 point fft 1000 times in 0.051855 seconds = 15.427609 mflops
calculated 64 point fft 1000 times in 0.053721 seconds = 35.740155 mflops
calculated 128 point fft 1000 times in 0.065968 seconds = 67.911776 mflops
calculated 256 point fft 1000 times in 0.075842 seconds = 135.017605 mflops
calculated 512 point fft 1000 times in 0.076387 seconds = 301.622209 mflops
calculated 1024 point fft 1000 times in 0.096225 seconds = 532.085824 mflops
calculated 2048 point fft 1000 times in 0.116388 seconds = 967.797693 mflops
calculated 4096 point fft 1000 times in 0.149064 seconds = 1648.687601 mflops
calculated 8192 point fft 1000 times in 0.188968 seconds = 2817.832906 mflops
calculated 16384 point fft 1000 times in 0.297410 seconds = 3856.224820 mflops
calculated 32768 point fft 1000 times in 0.541233 seconds = 4540.742277 mflops
calculated 65536 point fft 1000 times in 1.005135 seconds = 5216.095639 mflops
calculated 131072 point fft 1000 times in 1.937994 seconds = 5748.790044 mflops
calculated 262144 point fft 1000 times in 3.775305 seconds = 6249.285866 mflops
calculated 524288 point fft 1000 times in 7.420558 seconds = 6712.077396 mflops
calculated 1048576 point fft 1000 times in 14.830570 seconds = 7070.368855 mflops
calculated 2097152 point fft 1000 times in 29.878380 seconds = 7369.909602 mflops
calculated 4194304 point fft 1000 times in 60.144817 seconds = 7671.042373 mflops
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./bandwidthTest
Running on......
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2529.5

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2173.9

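For the 8388608-point case, compare the run without memory transfer to the run with memory transfer:

calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

Estimated transfer time, at 16 bytes per point, 2 transfers per FFT, and roughly 2100 MB/s:

8388608*16/1024/1024 (MB) / 2100 (MB/s) * 2 * 1000 (times) = 121.9 sec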
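
The same estimate as a minimal Python sketch. The function and parameter names are made up for this example, and 2100 MB/s is the rounded bandwidth used in the calculation above, not an exact bandwidthTest reading.

```python
def transfer_seconds(points, bytes_per_point, bandwidth_mb_per_s,
                     copies_per_fft, repetitions):
    """Estimated total time spent copying data between host and device."""
    megabytes = points * bytes_per_point / 1024 / 1024
    return megabytes / bandwidth_mb_per_s * copies_per_fft * repetitions


if __name__ == "__main__":
    estimate = transfer_seconds(8388608, 16, 2100, 2, 1000)
    print(f"estimated transfer time: {estimate:.1f} sec")
    print("measured with transfers: 121.557722 sec, without: 34.149594 sec")
```

The estimate is close to the measured time with transfers, which suggests that the copies dominate the cost at this size.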
2009-10-22, 15:50   #8
msft
Jul 2009
Tokyo
2×5×61 Posts

Hi,

New results on the GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 2m16.778s
user 1m22.469s
sys 0m54.311s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 31m40.768s
user 19m31.713s
sys 12m9.150s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 15m2.462s
user 10m15.066s
sys 4m47.262s

2048k fft sec/iter = 0.08

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 30m14.297s
user 20m37.141s
sys 9m41.688s

4096k fft sec/iter = 0.16

Thank you,
Attached Files
File Type: gz MacLucasFFTW.cuda.a.patch.gz (106 Bytes, 331 views)
2009-10-23, 13:24   #9
msft
Jul 2009
Tokyo
2×5×61 Posts

Hi,

New results on the GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 1m42.692s
user 1m2.768s
sys 0m39.906s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 23m8.920s
user 14m3.845s
sys 9m5.094s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 8m35.896s
user 5m5.511s
sys 3m30.349s

2048k fft sec/iter = 0.046

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 17m14.207s
user 10m0.930s
sys 7m12.587s

4096k fft sec/iter = 0.092

Thank you,
Attached Files
File Type: gz MacLucasFFTW.cuda.b.tar.gz (23.3 KB, 315 views)
2009-10-24, 13:39   #10
msft
Jul 2009
Tokyo
2×5×61 Posts

Hi,

New results on the GTX260.

M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 1m39.794s
user 1m7.864s
sys 0m31.934s

M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 20m33.342s
user 14m1.825s
sys 6m31.548s

M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 7m27.026s
user 5m8.783s
sys 2m18.257s

2048k fft sec/iter = 0.040

M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 14m54.153s
user 10m14.254s
sys 4m39.897s

4096k fft sec/iter = 0.080

Thank you,
Attached Files
File Type: gz MacLucasFFTW.cuda.c.tar.gz (24.9 KB, 334 views)
2009-10-25, 11:17   #11
nucleon
Mar 2003
Melbourne
5·103 Posts

Good work.

My Core i7 920 clocked at default settings gives:

Best time for 2048K FFT length: 39.869 ms.
Best time for 4096K FFT length: 87.849 ms.


So you're getting into the realm of what's theoretically expected.

Can't wait until the 3xx series comes out with a 5-fold increase in 64-bit floats.

-- Craig
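
For a side-by-side view, here is a minimal Python sketch comparing the Core i7 920 times in this post with the GTX260 sec/iter figures from post #10. It assumes that the 2048K and 4096K FFT lengths correspond to the 2097152-point and 4194304-point runs and that both sets of numbers are per-iteration times. The names are made up for this example.

```python
# GTX260 figures from post #10 (sec/iter) and Core i7 920 figures (ms).
GPU_SEC_PER_ITER = {"2048K": 0.040, "4096K": 0.080}
CPU_MS_PER_ITER = {"2048K": 39.869, "4096K": 87.849}


def cpu_to_gpu_ratio(fft_length):
    """CPU time divided by GPU time; values above 1 mean the GPU is faster."""
    gpu_ms = GPU_SEC_PER_ITER[fft_length] * 1000.0
    return CPU_MS_PER_ITER[fft_length] / gpu_ms


if __name__ == "__main__":
    for length in ("2048K", "4096K"):
        print(f"{length}: CPU/GPU time ratio = {cpu_to_gpu_ratio(length):.3f}")
```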

All times are UTC.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.