CUDALucas (a.k.a. MaclucasFFTW/CUDA 2.3/CUFFTW)
 2009-11-06, 06:27 #34 frmky     Jul 2003 So Cal 5044₈ Posts Version k runs at .0141 sec/iter for the 2048K FFT and .0264 sec/iter for the 4096K FFT on the C1060.
 2009-11-06, 10:05 #35 nucleon     Mar 2003 Melbourne 5×103 Posts Cool. Can't wait for the 3xx series - Dec 2009 release date apparently (well according to wikipedia). -- Craig
 2009-11-06, 10:20 #36 BigBrother   Feb 2005 The Netherlands 2×109 Posts
After some fiddling, I managed to compile and run this under Windows. I had to replace two memalign() functions with malloc(), because memalign() is apparently obsolete. Here are two results:
Code:
D:\Code\MaclucasFFTW.cuda.k>a 11213
too small Exponent
Code:
D:\Code\MaclucasFFTW.cuda.k>a 216091
1 131072
10001 131072
20001 131072
30001 131072
40001 131072
50001 131072
60001 131072
70001 131072
80001 131072
90001 131072
100001 131072
110001 131072
120001 131072
130001 131072
140001 131072
150001 131072
160001 131072
170001 131072
180001 131072
190001 131072
200001 131072
210001 131072
M( 216091 )C, 0xfffffffffffffffd, n = 131072, MacLucasFFTW v8.1 Ballester
This last one ran for only +- 30 seconds. Other exponents that are big enough have the same results: 0xfffffffffffffffd after a very short while. My video card is a 9600 M GS, and is capable of running Folding@Home.
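A note on the memalign() replacement: plain malloc() does not promise the alignment that memalign() was asked for. Whether this program actually depends on aligned buffers is not confirmed in this thread, but if it does, a small wrapper can keep the alignment on both platforms. A minimal sketch in C, assuming posix_memalign() on POSIX systems and _aligned_malloc() on Windows; the helper names aligned_alloc_portable() and aligned_free_portable() are made up for illustration and are not from the MacLucasFFTW source:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* makes posix_memalign() visible on strict compilers */
#endif

#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(_WIN32)
#include <malloc.h>               /* _aligned_malloc, _aligned_free */
#endif

/* Hypothetical helper: allocate `size` bytes aligned to `alignment`.
   `alignment` must be a power of two and a multiple of sizeof(void *).
   Returns NULL on failure. */
static void *aligned_alloc_portable(size_t alignment, size_t size)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#else
    void *ptr = NULL;
    if (posix_memalign(&ptr, alignment, size) != 0)
        return NULL;
    return ptr;
#endif
}

/* Hypothetical helper: release memory from aligned_alloc_portable().
   On Windows this must be _aligned_free(), not free(). */
static void aligned_free_portable(void *ptr)
{
#if defined(_WIN32)
    _aligned_free(ptr);
#else
    free(ptr);
#endif
}
```

Each former memalign(alignment, size) call would then become aligned_alloc_portable(alignment, size), with the matching free() replaced by aligned_free_portable().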
Quote:
 Originally Posted by BigBrother My video card is a 9600 M GS, and is capable of running Folding@Home.
Are you sure this card supports double precision? If it doesn't, it will use single precision FP internally, and generate completely wrong answers.
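To illustrate why the precision matters: FFT-based multiplication rounds floating-point results back to exact integers, so the width of the significand sets a hard limit on what can be recovered. A small C sketch (the function names are illustrative only, not taken from the program) showing where single and double precision stop representing integers exactly:

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the integer v survives a round trip through float.
   volatile forces the value to be stored at float precision. */
static int exact_in_float(uint64_t v)
{
    volatile float f = (float)v;
    return (uint64_t)f == v;
}

/* Same check for double. */
static int exact_in_double(uint64_t v)
{
    volatile double d = (double)v;
    return (uint64_t)d == v;
}

/* float has a 24-bit significand, double a 53-bit one, so the first
   integers that cannot be stored are 2^24 + 1 and 2^53 + 1. */
static void precision_demo(void)
{
    assert(exact_in_float(16777216ULL));             /* 2^24     */
    assert(!exact_in_float(16777217ULL));            /* 2^24 + 1 */
    assert(exact_in_double(16777217ULL));
    assert(exact_in_double(9007199254740992ULL));    /* 2^53     */
    assert(!exact_in_double(9007199254740993ULL));   /* 2^53 + 1 */
}
```

With less than half the significand bits available, a single-precision transform of this size cannot hold the convolution sums exactly, which is consistent with a wrong residue appearing almost immediately.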

Hi, frmky
Quote:
 Originally Posted by frmky .0141 sec/iter for the 2048K FFT and .0264 sec/iter for the 4096K FFT on the C1060.
The 4096K FFT performance is reasonable; my GTX260's 4096K FFT performance is not.

Hi, nucleon
Quote:
 Originally Posted by nucleon Cool. Can't wait for the 3xx series - Dec 2009 release date apparently (well according to wikipedia).
Me too.

Hi, BigBrother
Quote:
 Originally Posted by BigBrother Code: D:\Code\MaclucasFFTW.cuda.k>a 11213 too small Exponent
The exponent needs to be larger than 131072: the aint() function needs an exponent larger than the FFT size.
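One plausible reading of this limit, assuming the program splits the p-bit number across the n FFT words in the usual variable-base way (the thread does not confirm this): once p drops below n, some words receive no bits at all. A short C sketch of that bit layout; all function names are illustrative only:

```c
#include <assert.h>
#include <stdint.h>

/* Ceiling of a / b for unsigned integers. */
static uint64_t ceil_div(uint64_t a, uint64_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Bits assigned to word i (0 <= i < n) when a p-bit number is spread
   over n words: ceil(p*(i+1)/n) - ceil(p*i/n). Assumed layout. */
static unsigned word_bits(uint64_t p, uint64_t n, uint64_t i)
{
    return (unsigned)(ceil_div(p * (i + 1), n) - ceil_div(p * i, n));
}

/* Smallest number of bits that any of the n words receives. */
static unsigned min_word_bits(uint64_t p, uint64_t n)
{
    unsigned lowest = word_bits(p, n, 0);
    for (uint64_t i = 1; i < n; i++) {
        unsigned b = word_bits(p, n, i);
        if (b < lowest)
            lowest = b;
    }
    return lowest;
}

/* Sum of the bits over all n words; always equals p. */
static uint64_t total_word_bits(uint64_t p, uint64_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i < n; i++)
        total += word_bits(p, n, i);
    return total;
}
```

With n = 131072, the exponent 216091 gives every word 1 or 2 bits, while 11213 leaves most words with 0 bits.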
Quote:
 Originally Posted by BigBrother My video card is a 9600 M GS, and is capable of running Folding@Home.
Sorry, only the 2xx series supports DP.

Hi, jasonp
Quote:
 Originally Posted by jasonp Are you sure this card supports double precision? If it doesn't, it will use single precision FP internally, and generate completely wrong answers.
Nice support, thank you.

Hi,

Version o runs at .0134 sec/iter for the 2048K FFT and .0320 sec/iter for the 4096K FFT on the GTX260.
 2009-11-07, 07:30 #40 frmky     Jul 2003 So Cal 2596₁₀ Posts Excellent! I have another calculation running now, so I won't be able to bench it on the C1060 for a few days. Two questions... First, can this be adapted to use non-power-of-2 FFT's, and if so would there be speed gains using comparable FFT sizes to those used by Prime95? Secondly, can this be multithreaded with the calculation split over multiple GPU's, or as the devices can't talk directly to each other will the required memory transfers from/to the host kill the speed? I ask this last question since I'm actually using an S1070 with four C1060's.
Hi, frmky
Quote:
 Originally Posted by frmky First, can this be adapted to use non-power-of-2 FFT's, and if so would there be speed gains using comparable FFT sizes to those used by Prime95?
I will consider it.
Quote:
 Originally Posted by frmky Secondly, can this be multithreaded with the calculation split over multiple GPU's, or as the devices can't talk directly to each other will the required memory transfers from/to the host kill the speed? I ask this last question since I'm actually using an S1070 with four C1060's.
My question is: "How is a 1D FFT supported on the S1070?"

Quote:
 Originally Posted by msft My question is: "How is a 1D FFT supported on the S1070?"
The S1070 is really just four discrete C1060's, just housed in a separate unit. It is no different than installing four GTX260's in your computer. Each card must be addressed individually from a different program thread, and the cards cannot directly communicate with each other.

Hi,
Quote:
 Originally Posted by frmky can this be adapted to use non-power-of-2 FFT's,
I made a non-power-of-2 FFT version with cufftExecD2Z(), but cufftExecD2Z() is two times slower than cufftExecZ2Z().
Can someone tell me how to use the complex FFT method?

Thank you,

Hi,
Quote:
 Originally Posted by frmky the cards cannot directly communicate with each other.
Nobody reports 1D FFT performance on the S1070, and this is the reason.

