mersenneforum.org

-   -   CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW) (https://www.mersenneforum.org/showthread.php?t=12576)

msft 2009-10-16 11:16

CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)
 
1 Attachment(s)
Hi,

I converted MacLucasFFTW to CUDA/CUFFTW (single precision).

Results on an ION/Atom 330:
[quote]
ION$mkdir mers
ION$cd mers
ION$wget http://www.garlic.com/%7Ewedgingt/mers.tar.gz
ION$tar -zxvf mers.tar.gz
ION$patch -p0 -d . < MacLucasFFTW.cuda.0.patch
ION$cd mers
ION$/usr/local/cuda/bin/nvcc -DMERS_PACKAGE -DBIT_SIEVE -DTESTING_SMALL_EXPONENTS \
  -DSIEVE_SIZE_IN_BYTES=32 -DNUM_SMALL_PRIMES=32768 -O3 -DDO_NOT_USE_LONG_DOUBLE \
  -I/usr/local/include MacLucasFFTW.c setup.c rw.c balance.c zero.c \
  -L/usr/local/lib -c
ION$g++ -fPIC -o MacLucasFFTW MacLucasFFTW.o setup.o rw.o balance.o zero.o \
  -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib \
  -L/NVIDIA_GPU_Computing_SDK/C/common/common/lib/linux -lcudart \
  -L/usr/local/cuda/lib -L/NVIDIA_GPU_Computing_SDK/C/lib \
  -L/NVIDIA_GPU_Computing_SDK/C/common/lib/linux -lcufft -lcutil -lm

ION$ time ./MacLucasFFTW 11213
1 2048
...
11001 2048
M( 11213 )P, n = 2048, MacLucasFFTW v8.1 Ballester

real 0m4.945s
user 0m4.200s
sys 0m0.744s

ION$ time ./MacLucasFFTW 216091
1 32768
1 32768
1 32768
1 65536
1001 65536
...
216001 65536
M( 216091 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 35m14.453s
user 30m41.439s
sys 4m32.585s
[/quote]The program cannot resume an interrupted run.

Thank you,
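
For readers new to this output: as far as I understand the mers package format, a line such as `M( 11213 )P` reports that the Mersenne number 2^11213 - 1 passed the Lucas-Lehmer test. The sketch below shows that test in plain Python. It relies on Python's built-in big integers instead of the FFT-based squaring that MacLucasFFTW performs, and the function name `lucas_lehmer` is chosen here for illustration only.

```python
def lucas_lehmer(p: int) -> bool:
    """Return True if 2**p - 1 is prime, assuming p itself is prime."""
    if p == 2:
        return True  # M2 = 3 is prime; the recurrence below needs p >= 3
    m = (1 << p) - 1  # the Mersenne number 2**p - 1
    s = 4
    # p - 2 squarings modulo m; this squaring is the step the GPU code
    # accelerates with an FFT-based multiplication.
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0


if __name__ == "__main__":
    for p in (7, 11, 13):
        print(p, lucas_lehmer(p))
```

Each loop pass corresponds to one "iteration" in the timings quoted in this thread.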

frmky 2009-10-17 05:13

[QUOTE=msft;192993]
ION$ time ./MacLucasFFTW 216091
...
real 35m14.453s
[/QUOTE]

Wow, that's slow! For fun, I thought I'd try it on the C1060. A 64-bit compile didn't work, but 32-bit works fine although it runs at about the same speed. It's completely bandwidth limited with all of the transfers on and off the device.

msft 2009-10-17 05:34

Hi, Mr. frmky.

It depends on PCI bus bandwidth. (My choice of ION was correct...)
Currently there are 4 CPU <-> GPU data transfers per iteration.
Reducing that to 2 is easy, but getting to 0 is very difficult.
Moving every routine onto the GPU... is a nightmare.

ION$ time ./MacLucasFFTW 859433
...
859001 262144
M( 859433 )P, n = 262144, MacLucasFFTW v8.1 Ballester

real 553m57.624s
user 486m37.685s
sys 63m24.838s

msft 2009-10-18 21:56

It depends on latency, not bandwidth.

11211 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 130m0.106s
user 108m8.678s
sys 11m52.809s

Using a 4194304-point FFT: 130 min / 11213 iter = 0.70 sec/iter

11211 8388608
M( 11213 )P, n = 8388608, MacLucasFFTW v8.1 Ballester

real 263m41.280s
user 217m26.123s
sys 25m9.138s

Using an 8388608-point FFT: 263 min / 11213 iter = 1.41 sec/iter
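
The per-iteration figures in this thread come from dividing the `real` time reported by `time` by the iteration count. A small Python sketch of that arithmetic follows. The helper names are made up for illustration, and the iteration count is taken to be the exponent 11213, as in the post (the Lucas-Lehmer test strictly runs p - 2 squarings, which barely changes the result).

```python
import re


def time_to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '263m41.280s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp.strip())
    if match is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def seconds_per_iteration(real_time: str, iterations: int) -> float:
    """Average wall-clock seconds spent per iteration."""
    return time_to_seconds(real_time) / iterations


if __name__ == "__main__":
    # The 8388608-point run above: real 263m41.280s over 11213 iterations.
    print(round(seconds_per_iteration("263m41.280s", 11213), 2))
```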

msft 2009-10-19 08:34

1 Attachment(s)
[B]I made a MacLucasFFTW / Ubuntu 9.04 (32-bit) / CUDA 2.3 / CUFFTW (double precision) version.[/B]

msft 2009-10-21 11:58

GTX260 result.
 
Hi,

I bought a GTX260 in Akihabara.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 4m14.854s
user 2m53.103s
sys 1m21.725s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 62m52.755s
user 44m54.440s
sys 17m58.311s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 22m30.253s
user 20m19.644s
sys 2m10.604s

2048k fft sec/iter = 0.12

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 44m57.511s
user 40m37.008s
sys 4m20.572s

4096k fft sec/iter = 0.24

msft 2009-10-21 23:48

GTX260 CUFFT double precision benchmark.
 
The base source is from [URL="http://www.science.uwaterloo.ca/%7Ehmerz/CUDA_benchFFT/"]http://www.science.uwaterloo.ca/~hmerz/CUDA_benchFFT/[/URL]

CUFFT double precision Complex to Complex fft.

calculated 4 point fft 1000 times in 0.008224 seconds = 4.863870 mflops
calculated 8 point fft 1000 times in 0.011834 seconds = 10.140262 mflops
calculated 16 point fft 1000 times in 0.022916 seconds = 13.964068 mflops
calculated 32 point fft 1000 times in 0.018881 seconds = 42.370436 mflops
calculated 64 point fft 1000 times in 0.021270 seconds = 90.268460 mflops
calculated 128 point fft 1000 times in 0.033173 seconds = 135.049269 mflops
calculated 256 point fft 1000 times in 0.040438 seconds = 253.227750 mflops
calculated 512 point fft 1000 times in 0.039472 seconds = 583.703807 mflops
calculated 1024 point fft 1000 times in 0.053949 seconds = 949.044903 mflops
calculated 2048 point fft 1000 times in 0.063347 seconds = 1778.141443 mflops
calculated 4096 point fft 1000 times in 0.073652 seconds = 3336.771353 mflops
calculated 8192 point fft 1000 times in 0.072037 seconds = 7391.749264 mflops
calculated 16384 point fft 1000 times in 0.095094 seconds = 12060.484847 mflops
calculated 32768 point fft 1000 times in 0.168577 seconds = 14578.497099 mflops
calculated 65536 point fft 1000 times in 0.290185 seconds = 18067.368993 mflops
calculated 131072 point fft 1000 times in 0.541373 seconds = 20579.382834 mflops
calculated 262144 point fft 1000 times in 1.012113 seconds = 23310.597859 mflops
calculated 524288 point fft 1000 times in 1.930565 seconds = 25799.369189 mflops
calculated 1048576 point fft 1000 times in 3.874456 seconds = 27063.825242 mflops
calculated 2097152 point fft 1000 times in 7.998022 seconds = 27531.927191 mflops
calculated 4194304 point fft 1000 times in 16.420866 seconds = 28096.778989 mflops
calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops
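
The mflops column above appears to follow the common 5 N log2 N operation count for a complex FFT. That is an inference from the printed numbers, not something taken from the benchmark source. A minimal Python sketch under that assumption, with a function name chosen for illustration:

```python
import math


def fft_mflops(n_points: int, repetitions: int, elapsed_seconds: float) -> float:
    """MFLOPS for repeated complex FFTs, assuming 5*N*log2(N) flops per FFT."""
    flops_per_fft = 5.0 * n_points * math.log2(n_points)
    return flops_per_fft * repetitions / elapsed_seconds / 1e6


if __name__ == "__main__":
    # Last row of the table above: 8388608 points, 1000 runs, 34.149594 s.
    print(fft_mflops(8388608, 1000, 34.149594))
```

Under this convention the figure is a nominal rate: it counts a fixed number of operations per transform regardless of how the library actually computes it.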

CUFFT double precision Complex to Complex fft with memory transfer.

calculated 4 point fft 1000 times in 0.039361 seconds = 1.016236 mflops
calculated 8 point fft 1000 times in 0.043003 seconds = 2.790504 mflops
calculated 16 point fft 1000 times in 0.054218 seconds = 5.902111 mflops
calculated 32 point fft 1000 times in 0.051855 seconds = 15.427609 mflops
calculated 64 point fft 1000 times in 0.053721 seconds = 35.740155 mflops
calculated 128 point fft 1000 times in 0.065968 seconds = 67.911776 mflops
calculated 256 point fft 1000 times in 0.075842 seconds = 135.017605 mflops
calculated 512 point fft 1000 times in 0.076387 seconds = 301.622209 mflops
calculated 1024 point fft 1000 times in 0.096225 seconds = 532.085824 mflops
calculated 2048 point fft 1000 times in 0.116388 seconds = 967.797693 mflops
calculated 4096 point fft 1000 times in 0.149064 seconds = 1648.687601 mflops
calculated 8192 point fft 1000 times in 0.188968 seconds = 2817.832906 mflops
calculated 16384 point fft 1000 times in 0.297410 seconds = 3856.224820 mflops
calculated 32768 point fft 1000 times in 0.541233 seconds = 4540.742277 mflops
calculated 65536 point fft 1000 times in 1.005135 seconds = 5216.095639 mflops
calculated 131072 point fft 1000 times in 1.937994 seconds = 5748.790044 mflops
calculated 262144 point fft 1000 times in 3.775305 seconds = 6249.285866 mflops
calculated 524288 point fft 1000 times in 7.420558 seconds = 6712.077396 mflops
calculated 1048576 point fft 1000 times in 14.830570 seconds = 7070.368855 mflops
calculated 2097152 point fft 1000 times in 29.878380 seconds = 7369.909602 mflops
calculated 4194304 point fft 1000 times in 60.144817 seconds = 7671.042373 mflops
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

/NVIDIA_GPU_Computing_SDK/C/bin/linux/release$ ./bandwidthTest
Running on......
device 0:GeForce GTX 260
Quick Mode
Host to Device Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2529.5

Quick Mode
Device to Host Bandwidth for Pageable memory
.
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 2173.9

calculated 8388608 point fft 1000 times in 34.149594 seconds = 28248.942537 mflops
calculated 8388608 point fft 1000 times in 121.557722 seconds = 7936.064479 mflops

8388608*16/1024/1024 (Mbyte) /2100 (MB/s) * 2 * 1000 (times) = 121.9 sec
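
The estimate above can be reproduced with a few lines of Python. The sketch takes the values from the post: 16 bytes per double-precision complex point, roughly 2100 MB/s of pageable-memory bandwidth, and two transfers (to the device and back) per FFT. The function and parameter names are illustrative.

```python
def transfer_seconds(
    n_points: int,
    bytes_per_point: int = 16,       # one double-precision complex value
    bandwidth_mb_s: float = 2100.0,  # rounded bandwidthTest figure from the post
    transfers_per_fft: int = 2,      # host to device, then device to host
    repetitions: int = 1000,
) -> float:
    """Estimated seconds spent only on host <-> device copies."""
    megabytes = n_points * bytes_per_point / 1024 / 1024
    return megabytes / bandwidth_mb_s * transfers_per_fft * repetitions


if __name__ == "__main__":
    print(round(transfer_seconds(8388608), 1))
```

The estimate lands close to the 121.557722 seconds measured for the 8388608-point run with memory transfer.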

msft 2009-10-22 15:50

1 Attachment(s)
Hi,

New result on GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 2m16.778s
user 1m22.469s
sys 0m54.311s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 31m40.768s
user 19m31.713s
sys 12m9.150s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 15m2.462s
user 10m15.066s
sys 4m47.262s

2048k fft sec/iter = 0.08

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 30m14.297s
user 20m37.141s
sys 9m41.688s

4096k fft sec/iter = 0.16

Thank you,

msft 2009-10-23 13:24

1 Attachment(s)
Hi,

New result on GTX260.

216001 16384
M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 1m42.692s
user 1m2.768s
sys 0m39.906s

859001 65536
M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 23m8.920s
user 14m3.845s
sys 9m5.094s

11001 2097152
M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 8m35.896s
user 5m5.511s
sys 3m30.349s

2048k fft sec/iter = 0.046

11001 4194304
M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 17m14.207s
user 10m0.930s
sys 7m12.587s

4096k fft sec/iter = 0.092

Thank you,

msft 2009-10-24 13:39

1 Attachment(s)
Hi,

New result on GTX260.

M( 216091 )P, n = 16384, MacLucasFFTW v8.1 Ballester

real 1m39.794s
user 1m7.864s
sys 0m31.934s

M( 859433 )P, n = 65536, MacLucasFFTW v8.1 Ballester

real 20m33.342s
user 14m1.825s
sys 6m31.548s

M( 11213 )P, n = 2097152, MacLucasFFTW v8.1 Ballester

real 7m27.026s
user 5m8.783s
sys 2m18.257s

2048k fft sec/iter = 0.040

M( 11213 )P, n = 4194304, MacLucasFFTW v8.1 Ballester

real 14m54.153s
user 10m14.254s
sys 4m39.897s

4096k fft sec/iter = 0.080

Thank you,

nucleon 2009-10-25 11:17

Good work.

My Core i7 920 clocked at default settings gives:

Best time for 2048K FFT length: 39.869 ms.
Best time for 4096K FFT length: 87.849 ms.


So you're getting into the realm of what's theoretically expected.

Can't wait until the 3xx series comes out with a 5-fold increase in 64-bit float performance.

-- Craig

