CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)
 2009-10-30, 22:59 #23 msft I understand, lycorn. Thank you.
 2009-10-30, 23:43 #24 fivemack Hi msft. This is great work: many thanks! I had to copy the .h files from a separate install of MacLucasFFTW into the directory and modify some of the paths in the makefile to get it to work, but it works now. It's a bit slower than I expected, 3m40s on a GTX275 to test 216091, but that's probably because 131072 is a very large FFT size to use in double precision for so small a number. Unfortunately my computer crashed the second time I tried testing 216091; I think the graphics card is a bit flaky.
 2009-10-31, 01:55 #25 frmky This is getting interesting! I decided to try the exponent 24036583. On a Tesla C1060 GPU, CUDA MacLucasFFTW runs at 0.0153 sec/iter using a 2048K FFT. Using one thread on a 2GHz Opteron K10 CPU, on the same exponent Prime95 runs at 0.055 sec/iter using a 1280K FFT. So, comparing a top-of-the-line GPU with a CPU that is notoriously slow for Prime95, the GPU version runs about 3.5x faster. Also interesting is that after adding cudaSetDeviceFlags(cudaDeviceBlockingSync); cudaSetDevice(0); near the top of the main() function, MacLucasFFTW uses only about 5% of a CPU core. Assuming the computer doesn't get reset in the next 5 days or that restarting works, I'll let this run to completion.
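As a rough check on the figures in this post, the sketch below recomputes the speedup and the estimated wall-clock time for the full test. It assumes that a Lucas-Lehmer test of M(p) needs p - 2 iterations and that the per-iteration time stays constant for the whole run. The function names are illustrative and are not part of MacLucasFFTW.

```python
def speedup(cpu_sec_per_iter, gpu_sec_per_iter):
    """Ratio of CPU time per iteration to GPU time per iteration."""
    return cpu_sec_per_iter / gpu_sec_per_iter


def estimated_days(exponent, sec_per_iter):
    """Estimated run time in days, assuming p - 2 iterations at a constant rate."""
    iterations = exponent - 2
    return iterations * sec_per_iter / 86400.0


if __name__ == "__main__":
    # Figures quoted in the post: Tesla C1060 vs. one Opteron K10 core.
    print(f"speedup: {speedup(0.055, 0.0153):.2f}x")
    print(f"GPU run time: {estimated_days(24036583, 0.0153):.2f} days")
    print(f"CPU run time: {estimated_days(24036583, 0.055):.2f} days")
```

Under these assumptions the GPU run comes out at a little over four days, which is in line with the "next 5 days" mentioned in the post.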
2009-10-31, 09:27   #26
msft

Jul 2009
Tokyo

610₁₀ Posts

Thank you for testing this program, fivemack.
Quote:
 Originally Posted by fivemack I had to copy the .h files from a separate install of MacLucasFFTW into the directory and modify some of the paths in the makefile to get it to work, but it works now.
Sorry, it is my first Makefile.
Quote:
 Originally Posted by fivemack It's a bit slower than I expected, 3m40s on a GTX275 to test 216091, but that's probably because 131072 is a very large FFT size to use in double precision for so small a number.
This is a side effect of parallelization: the GPU needs more threads, so the tuning target is an FFT size of 2048K or higher.
Quote:
 Originally Posted by fivemack Unfortunately my computer crashed the second time I tried testing 216091; I think the graphics card is a bit flaky.
Exactly. What is the GTX275 made of?
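
fivemack's point about FFT size can be made concrete by looking at how many bits of the number each FFT element carries. This is a minimal sketch, assuming the usual scheme in which a p-bit Mersenne number is spread across n double-precision elements, so each holds about p/n bits on average. The helper name is hypothetical.

```python
def bits_per_element(exponent, fft_length):
    """Average number of bits of M(exponent) stored in each FFT element."""
    return exponent / fft_length


if __name__ == "__main__":
    # (exponent, FFT length) pairs reported in this thread.
    for p, n in [(216091, 131072), (24036583, 2097152)]:
        print(f"M({p}), n = {n}: {bits_per_element(p, n):.2f} bits per element")
```

With fewer than two bits per element, the 131072-length transform is much larger than the exponent 216091 needs, which would be consistent with the slow timing reported above.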

2009-10-31, 09:52   #27
msft

Jul 2009
Tokyo

262₁₆ Posts

Hi, frmky.
Quote:
 Originally Posted by frmky I decided to try the exponent 24036583. On a Tesla C1060 GPU, CUDA MacLucasFFTW runs at 0.0153 sec/iter using a 2048K FFT.
I immediately checked the 2000-iteration checksum for 24036583. It is correct.
Quote:
 Originally Posted by frmky cudaSetDeviceFlags(cudaDeviceBlockingSync); cudaSetDevice(0); near the top of the main() function, MacLucasFFTW uses only about 5% of a cpu core.
The CPU was spending 95% of its time in a spin loop. I will add this function call, thank you.

Quote:
 Originally Posted by frmky Assuming the computer doesn't get reset in the next 5 days or that restarting works, I'll let this run to completion.
Thank you for all the work.

2009-11-04, 08:30   #28
frmky

Jul 2003
So Cal

A25₁₆ Posts

Quote:
 Originally Posted by msft I immediately checked the 2000-iteration checksum for 24036583. It is correct.
And so were the next 24 million.

M( 24036583 )P, n = 2097152, MacLucasFFTW v8.1 Ballester
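
For readers unfamiliar with the test behind that output line, here is a minimal sketch of the Lucas-Lehmer iteration using Python's built-in big integers. It is not how MacLucasFFTW works internally: the program discussed in this thread does the squaring with floating-point FFTs on the GPU, while this sketch uses plain integer arithmetic and is only practical for small exponents.

```python
def lucas_lehmer(p):
    """Return True if 2**p - 1 is prime. p is expected to be an odd prime."""
    m = (1 << p) - 1          # the Mersenne number M(p)
    s = 4                     # starting value of the sequence
    for _ in range(p - 2):    # p - 2 squaring steps
        s = (s * s - 2) % m
    return s == 0             # M(p) is prime exactly when the residue is zero


if __name__ == "__main__":
    for p in (7, 11, 13, 127):
        print(p, lucas_lehmer(p))
```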

2009-11-04, 11:03   #29
msft
msft

Jul 2009
Tokyo

2·5·61 Posts

Quote:
 Originally Posted by frmky M( 24036583 )P, n = 2097152, MacLucasFFTW v8.1 Ballester
Prima!!! Thank you.

 2009-11-04, 12:33 #30 lycorn Congrats, msft. The code seems to be running fine. 0.0153 sec/iter for a 1280K FFT is better than I can get on a Core2 duo T8300, with BOTH cores crunching the same exponent (best result is ~ 0.017).
2009-11-04, 14:45   #31
msft

Jul 2009
Tokyo

2·5·61 Posts

Thank you, lycorn

New version on GTX260.

$ tar -zxvf MacLucasFFTW.cuda.k.tar.gz
$ make
$ time ./MacLucasFFTW 216091

M( 216091 )P, n = 131072, MacLucasFFTW v8.1 Ballester

real 6m34.691s
user 0m10.025s
sys 0m0.188s

$ time ./MacLucasFFTW 2976221

M( 2976221 )P, n = 262144, MacLucasFFTW v8.1 Ballester

real 129m52.509s
user 19m27.337s
sys 0m1.232s

$ time ./MacLucasFFTW 33333333
10001 2097152

real 2m44.702s
user 0m22.469s
sys 0m1.136s

2048k fft sec/iter = 0.0165

$ time ./MacLucasFFTW 63333333
10001 4194304

real 7m0.095s
user 1m43.026s
sys 0m1.160s

4096k fft sec/iter = 0.042
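
The sec/iter figures can be reproduced from the `time` output. The sketch below assumes that the 10001 shown in the session is the number of iterations run, which the post does not state explicitly. The helper names are illustrative.

```python
import re


def parse_real_time(text):
    """Convert a `time` value such as '2m44.702s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def sec_per_iter(real_time, iterations):
    """Wall-clock seconds per iteration for a timed run."""
    return parse_real_time(real_time) / iterations


if __name__ == "__main__":
    # Assumption: 10001 is the iteration count for both timed runs.
    for label, real in (("2048k", "2m44.702s"), ("4096k", "7m0.095s")):
        print(f"{label} fft sec/iter = {sec_per_iter(real, 10001):.4f}")
```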

For M131101 through M1548619, I compared the 1000-iteration checksums against Glucas. They are correct.

Thank you,
Attached Files
 MacLucasFFTW.cuda.k.tar.gz (30.0 KB, 503 views)

 2009-11-05, 09:46 #32 nucleon Is there any advantage in doing multiple FFTs at the same time on the GPU? I.e., can we get 2x prime checks at the same time in, say, <50% of the time of doing one check? -- Craig
2009-11-05, 15:06   #33
msft

Jul 2009
Tokyo

2·5·61 Posts

Hi, nucleon

Quote:
 Originally Posted by nucleon Is there any advantage in doing multiple FFTs at the same time on the GPU? I.e., can we get 2x prime checks at the same time in, say, <50% of the time of doing one check?
Unfortunately, PCI was too slow for the LL test.

Thank you,

