20130314, 21:18  #232 
Einyen
Dec 2003
Denmark
2×1,439 Posts 

20130314, 21:22  #233  
ἀβουλία
"Mr. Meeseeks"
Jan 2012
California, USA
4157_{8} Posts 
Quote:
Last fiddled with by kracker on 20130314 at 21:23 

20130314, 21:42  #234 
Dec 2009
Peine, Germany
331 Posts 
Efficient FFT length
In preparation for a new table of efficient FFT lengths, I ran the CUDALucas cufftbench on an 810MHz/2500MHz Titan. This is the raw output, which I will process and filter over the next few days. Keep in mind that the bench timings are about 4x shorter than real LL test iteration times.
For those who are interested: I warmed the Titan up beforehand, so the boost effect shouldn't be too significant. Code:
rem 1048576 8388608 16777216
rem 8421376 16809984 16842752
e:
cd E:\Eigene Dateien\Computing\CUDALucas\2.03\D0
CUDALucas2.035.0sm_35x64.exe cufftbench 1048576 16842752 65536
CUDALucas2.035.0sm_35x64.exe cufftbench 1048576 8421376 32768 > bench_1M8M.txt
CUDALucas2.035.0sm_35x64.exe cufftbench 8388608 16809984 32768 > bench_8M16M.txt
pause
20130315, 04:52  #235  
Jun 2003
4,663 Posts 
Quote:
I assume this is because only the cuFFT portion is being benchmarked? This is a problem. It could mean that a larger FFT that looks faster in the bench gives poorer iteration times than a slightly slower but smaller FFT, because the non-FFT work takes less time at the smaller length. Realistic benching would use the full iteration logic and give the correct iteration times as well.
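To make the concern concrete, here is a minimal C sketch. It assumes, purely for illustration, that the non-FFT work per iteration grows linearly with the FFT length; the struct, the function names, and the per-element cost are hypothetical and are not taken from CUDALucas.

```c
#include <stddef.h>

/* Hypothetical candidate: an FFT length plus its FFT-only bench time. */
struct fft_candidate {
    int length;     /* FFT length in elements */
    double fft_ms;  /* time reported by an FFT-only benchmark, in ms */
};

/* ASSUMPTION: non-FFT work per iteration is proportional to the length. */
static double full_iteration_ms(struct fft_candidate c, double nonfft_ms_per_elem)
{
    return c.fft_ms + nonfft_ms_per_elem * (double)c.length;
}

/* Index of the candidate with the smallest FFT-only time (0 if n == 0). */
size_t best_by_fft_only(const struct fft_candidate *c, size_t n)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (c[i].fft_ms < c[best].fft_ms)
            best = i;
    return best;
}

/* Index of the candidate with the smallest modelled full-iteration time. */
size_t best_by_full_iteration(const struct fft_candidate *c, size_t n,
                              double nonfft_ms_per_elem)
{
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (full_iteration_ms(c[i], nonfft_ms_per_elem) <
            full_iteration_ms(c[best], nonfft_ms_per_elem))
            best = i;
    return best;
}
```

With made-up inputs such as a 1048576 length at 1.02 ms and a 1179648 length at 1.00 ms, the FFT-only ranking prefers the larger length, while the modelled full iteration prefers the smaller one once the assumed per-element cost is large enough.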

20130315, 05:22  #236 
Romulan Interpreter
Jun 2011
Thailand
5×11×157 Posts 
I benched them many times for 1k steps (for gtx580 only, or mostly, and very few for 570 and Tesla C2xxx), and posted the results repeatedly. I still tune them, not for every exponent, but for every small range where I get exponents. The reality is that the best FFT lengths are those with high powers of 2 and 3, and real life (i.e. doing LL tests, not FFT benching) favors multiples of 8k/16k/32k/64k, depending on the card; they fit better to the number of drawers inside of your card. See msft's posts too.
Theoretically, there is no reason why an FFT length could not be a non-multiple of 8k, 16k, etc. Generally, the time taken for a CL iteration is linear in the time displayed by the cufftbench switch. For my cards, this is about 2.6*t, where t is the time from the bench. That is, if the bench tells me that some FFT needs 2.5ms, then each LL iteration at that FFT size will take about 2.5*2.6=6.5ms. But this is not valid for ALL sizes: a few of them stand out with a larger constant in front, looking like they need different times to perform the multiplications or the IFFT, effectively
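The rule of thumb above can be written as a small C sketch. The function names are hypothetical, and the ratio is card-dependent: 2.6 is the figure quoted for these cards, while the Titan post earlier in the thread mentions roughly 4x, so the ratio is passed in rather than hard-coded.

```c
/* Ratio quoted in the post above for its author's cards; other cards differ. */
#define LL_TO_BENCH_RATIO_EXAMPLE 2.6

/* Estimate the LL iteration time from a cufftbench time, assuming the
 * linear relation described in the post (iteration = ratio * bench). */
double estimate_ll_iteration_ms(double bench_ms, double ratio)
{
    return bench_ms * ratio;
}

/* Derive the ratio for a given card from one measured pair of timings.
 * Returns 0.0 if the bench time is not positive. */
double calibrate_ratio(double measured_iteration_ms, double bench_ms)
{
    if (bench_ms <= 0.0)
        return 0.0;
    return measured_iteration_ms / bench_ms;
}
```

Calling `estimate_ll_iteration_ms(2.5, LL_TO_BENCH_RATIO_EXAMPLE)` reproduces the 6.5 ms figure from the post. As noted there, some FFT sizes do not follow the common ratio, so a ratio calibrated at one length is only a first guess for another.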
20130315, 08:54  #237  
Jun 2003
4,663 Posts 
Quote:
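Attached is a list of "sensible" FFT lengths (at 8K intervals) that should be sufficient for getting optimal/near optimal benching results. The rule used is that the length must be a product of powers of 2, 3, 5, and/or 7, or a small prime (<= 31) times a power of 2. If CuLu benching can be modified to add a "sensible length only" flag, the following code can be used: Code:
int isSensible(int fftlen) {
    /* Guard: zero or negative input would never leave the loop below. */
    if (fftlen <= 0) return 0;
    /* Strip all factors of 2. */
    while (!(fftlen & 1)) fftlen >>= 1;
    /* Odd part <= 31: either 7-smooth or a small prime times a power of 2. */
    if (fftlen <= 31) return 1;
    /* Divide out 3, 5 and 7; only 7-smooth lengths reduce to 1. */
    for (int p = 3; p <= 7; p += 2)
        while (fftlen % p == 0) fftlen /= p;
    return (fftlen == 1);
}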

20130315, 09:55  #238 
Romulan Interpreter
Jun 2011
Thailand
5·11·157 Posts 
What you say is right: it makes no sense to test all the sizes, only "some". "All" were tested once in the beginning, a "few" were selected, and those are "tuned" every time the range changes, because the final result will still depend on the system (GPU/CPU/running applications combination).
Last fiddled with by LaurV on 20130315 at 09:56 Reason: long storry cancelled 
20130315, 12:12  #239  
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
3^{2}×5×7 Posts 
Quote:
That would be easy to implement. Is there any reason why you are looking at 8K intervals? The actual test can tolerate intervals as small as whatever you have "threads" set to in the ini file. 

20130315, 13:45  #240  
Jun 2003
4,663 Posts 
Quote:
EDIT: Counts of sensible candidates <= 2^24 for various strides: Code:
stride  #cands
2^15    143
2^14    187
2^13    241
2^12    306
2^11    382
2^10    472
2^9     577
2^8     700
2^7     842
2^6     1004
Last fiddled with by axn on 20130315 at 13:53
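The counts in the table can be cross-checked with a short C sketch. It assumes a "candidate" is any positive multiple of the stride up to and including 2^24 that passes the sensible-length rule posted earlier in the thread; `is_sensible` restates that rule so the block stands alone, and `count_sensible` is a hypothetical helper name.

```c
/* Sensible-length rule from the thread: 7-smooth, or a prime <= 31 times
 * a power of 2. Non-positive input is rejected. */
static int is_sensible(int fftlen)
{
    if (fftlen <= 0)
        return 0;
    while ((fftlen & 1) == 0)      /* strip factors of 2 */
        fftlen >>= 1;
    if (fftlen <= 31)              /* small odd part is always accepted */
        return 1;
    for (int p = 3; p <= 7; p += 2)
        while (fftlen % p == 0)    /* divide out 3, 5 and 7 */
            fftlen /= p;
    return fftlen == 1;
}

/* Count sensible lengths among stride, 2*stride, ... up to max_len. */
int count_sensible(int stride, int max_len)
{
    int count = 0;
    if (stride <= 0)
        return 0;
    /* long long avoids overflow when len + stride would exceed INT_MAX */
    for (long long len = stride; len <= max_len; len += stride)
        if (is_sensible((int)len))
            count++;
    return count;
}
```

Under that assumption, `count_sensible(1 << 15, 1 << 24)` gives 143 and `count_sensible(1 << 14, 1 << 24)` gives 187, matching the first two rows of the table.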

20130315, 15:27  #241 
"Marv"
May 2009
near the Tannhäuser Gate
2^{2}·5^{3} Posts 
Did anyone ever try mmff to see if/how much it speeds up?
Also, since the board has to have its clock lowered, getting a "superclocked" version would be superfluous, or would the faster GPU clock still be usable? Interestingly, the production Tesla K20 & K20X boards have their memory clock set at 2.6 GHz.
20130315, 17:23  #242 
Apr 2012
Berlin Germany
3×17 Posts 
180M/s MM127 184bit
Last fiddled with by Redarm on 20130315 at 17:24