[QUOTE=kracker;333364]In one way it still makes sense to do them on CPU... I mean, 22 ms for a 30M exp.?? On a $1000 card?[/QUOTE]
Look again, it's a 332M exp. It was a test for how long a 100M digit would take on the Titan.
[QUOTE=ATH;333367]Look again, it's a 332M exp. It was a test for how long a 100M digit would take on the Titan.[/QUOTE]
Ahh, I see. Misread, sorry. :failed:
Efficient FFT length
1 Attachment(s)
In preparation for a new table with efficient FFT lengths I ran the CUDALucas cufftbench on a 810MHz/2500MHz Titan. This is the raw output, which I will process and filter over the next few days. Note that the bench timings are about 4x shorter than real LL test iteration times.
For those who are interested: I heated the Titan up beforehand, so the boost effect shouldn't be too significant. [CODE]rem 1048576 8388608 16777216
rem 8421376 16809984 16842752
e:
cd E:\Eigene Dateien\Computing\CUDALucas\2.03\D0
CUDALucas-2.03-5.0-sm_35-x64.exe -cufftbench 1048576 16842752 65536
CUDALucas-2.03-5.0-sm_35-x64.exe -cufftbench 1048576 8421376 32768 > bench_1M-8M.txt
CUDALucas-2.03-5.0-sm_35-x64.exe -cufftbench 8388608 16809984 32768 > bench_8M-16M.txt
pause[/CODE]
[QUOTE=Brain;333372]In preparation for a new table with efficient FFT lengths I ran the CUDALucas cufftbench on a 810MHz/2500MHz Titan. [/quote]
Is there a technical reason that the CuLu FFTs must be a multiple of 32K? I fear that many excellent FFT lengths are being overlooked; it would help if this restriction could be relaxed a bit (perhaps to multiples of 8K). [QUOTE=Brain;333372]Remind that the bench timings are about 4x shorter than real LL test iteration times.[/quote] I assume this is because only the cuFFT portion is being benchmarked? This is a problem: a larger FFT that benches faster might give poorer iteration times than a slightly slower but smaller FFT, because the non-FFT portion takes less time at the smaller size. Realistic benchmarking would use the full iteration logic, and would give the correct iteration times as well.
I did bench them many times in 1K steps (for the GTX 580 only, or mostly; very few for the 570 and a Tesla C2xxx). And posted the results repeatedly. I still tune them, not for every exponent, but for every small range where I get exponents. The reality is that the best FFTs in the bench go toward high powers of 2 and 3, while real life (i.e. doing an LL test, not FFT benching) favors the 8/16/32/64K multiples, depending on the card; they fit better with the number of "drawers" inside your card. See msft's posts too.
Theoretically, there is no reason why an FFT length could not be a non-multiple of 8K, 16K, etc. Generally, the time taken for a CL iteration is proportional to the time displayed by the -cufftbench switch. For my cards, the factor is about 2.6: if the bench tells me that some FFT needs 2.5 ms to be performed, then each LL iteration at that FFT size will take about 2.5*2.6 = 6.5 ms. But this is not valid for ALL sizes; a few of them stand out with a larger constant in front, looking like they effectively need different times to perform the multiplications or the IFFT. :shock:
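As a rough illustration of that linear relationship, here is a minimal sketch. The 2.6 factor is LaurV's card-specific constant (other cards will differ), and the function names are mine, chosen for illustration:

```c
/* Card-specific constant from the post above: an LL iteration takes
   roughly 2.6x the raw cufftbench timing. Measure your own card. */
#define BENCH_TO_ITER 2.6

/* Estimated LL iteration time (ms) from a cufftbench timing (ms). */
double est_iter_ms(double bench_ms)
{
    return BENCH_TO_ITER * bench_ms;
}

/* Estimated full LL test duration in days for exponent p, since an
   LL test of M(p) needs about p iterations. */
double est_test_days(double bench_ms, double p)
{
    return est_iter_ms(bench_ms) * p / (1000.0 * 86400.0);
}
```

With the example from the post, est_iter_ms(2.5) gives 6.5 ms per iteration, and a 332M exponent at that speed works out to roughly 25 days for the whole test.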
1 Attachment(s)
[QUOTE=LaurV;333409]I did bench them many times in 1K steps (for the GTX 580 only, or mostly; very few for the 570 and a Tesla C2xxx). [/QUOTE]
It is prohibitively expensive to test all 1K steps on all GPU/cuFFT combinations. However, we do know better. Looking at the factorization of an FFT length can tell us that some lengths have no hope of good performance. Attached is a list of "sensible" FFT lengths (at 8K intervals) that should be sufficient for getting optimal/near-optimal benching results. The rule used is that the length must be a product of powers of 2, 3, 5, and/or 7, or a small prime (<= 31) times a power of 2. If CuLu benching can be modified to add a "sensible length only" flag, the following code can be used: [code]int isSensible(int fftlen)
{
    while (!(fftlen & 1)) fftlen >>= 1;
    if (fftlen <= 31) return 1;
    for (int p = 3; p <= 7; p += 2)
        while (fftlen % p == 0) fftlen /= p;
    return (fftlen == 1);
}[/code]
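A few spot checks of the rule (the function is repeated here, with comments added, so the snippet is self-contained). Note one subtlety: the small-prime escape only applies to the whole odd cofactor, so 33*2^15 is rejected even though 33 = 3*11 is a product of two allowed-looking factors:

```c
/* Rule from the post: after stripping the power of 2, the odd part
   must be <= 31, or a product of powers of 3, 5, and 7. */
int isSensible(int fftlen)
{
    while (!(fftlen & 1)) fftlen >>= 1;   /* strip the power of 2 */
    if (fftlen <= 31) return 1;           /* small odd cofactor: OK */
    for (int p = 3; p <= 7; p += 2)       /* divide out 3, 5, 7 */
        while (fftlen % p == 0) fftlen /= p;
    return (fftlen == 1);                 /* sensible iff 7-smooth */
}
```

For example, isSensible(16777216) (2^24) and isSensible(360448) (11*2^15) return 1, while isSensible(1081344) (33*2^15) returns 0, because dividing 33 by 3 leaves an 11 that is not caught by the <= 31 test.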
What you say is right; it makes no sense to test all the sizes, only "some". "All" were tested once in the beginning, "few" were selected, and those are "tuned" every time the range changes, because the final result will still depend on the system (GPU/CPU/running-applications combination).
[QUOTE=axn;333415]It is prohibitively expensive to test all 1K steps on all GPU/cuFFT combinations. However, we do know better. Looking at the factorization of FFT length can tell us that some FFT lengths will have no hope for good performance.
Attached is a list of "sensible" FFT lengths (at 8K intervals) that should be sufficient for getting optimal/near optimal benching results. The rule used is that the length must be a product of powers of 2,3,5, and/or 7, or a small prime (<= 31) times a power of 2. If CuLu benching can be modified to add a "sensible length only" flag, the following code can be used: [code]int isSensible(int fftlen)
{
    while (!(fftlen & 1)) fftlen >>= 1;
    if (fftlen <= 31) return 1;
    for (int p = 3; p <= 7; p += 2)
        while (fftlen % p == 0) fftlen /= p;
    return (fftlen == 1);
}[/code][/QUOTE] That would be easy to implement. Is there any reason why you are looking at 8K intervals? The actual test can tolerate intervals as small as whatever you have "threads" set to in the ini file.
[QUOTE=owftheevil;333424]That would be easy to implement. Is there any reason why you are looking at 8K intervals? The actual test can tolerate intervals as small as whatever you have "threads" set to in the ini file.[/QUOTE]
Not particularly. Reducing that to whatever is "technically allowed" is just fine. Surprisingly, since we're essentially looking for smooth numbers, there are rather few good candidates even with really small strides (like, say, 128). EDIT: Counts of sensible candidates <= 2^24 for various strides: [CODE]stride   #cands
------   ------
2^15        143
2^14        187
2^13        241
2^12        306
2^11        382
2^10        472
2^9         577
2^8         700
2^7         842
2^6        1004[/CODE]
Did anyone ever try mmff to see if/how much it speeds up?
Also, since the board has to have its clock lowered, getting a "superclocked" version would be superfluous; or would the faster GPU clock still be usable? Interestingly, it should be noted that the production Tesla K20 & K20X boards have their memory clock set at 2.6 GHz.
[QUOTE=tServo;333458]Did anyone ever try mmff to see if/how much it speeds up?
[/QUOTE] 180M/s MM127 184-bit