mersenneforum.org  

Old 2013-03-14, 21:18   #232
ATH
Einyen
 
 
Dec 2003
Denmark

2×1,439 Posts

Quote:
Originally Posted by kracker View Post
In one way it still makes sense to do them on CPU... I mean, 22 ms for a 30M exp.?? On a $1000 card?
Look again, it's a 332M exp. It was a test for how long a 100M digit would take on the Titan.
Old 2013-03-14, 21:22   #233
kracker
ἀβουλία
 
 
"Mr. Meeseeks"
Jan 2012
California, USA

4157₈ Posts

Quote:
Originally Posted by ATH View Post
Look again, it's a 332M exp. It was a test for how long a 100M digit would take on the Titan.
Ahh, I see. Misread, sorry..


Last fiddled with by kracker on 2013-03-14 at 21:23
Old 2013-03-14, 21:42   #234
Brain
 
 
Dec 2009
Peine, Germany

331 Posts
Efficient FFT length

In preparation for a new table of efficient FFT lengths, I ran the CUDALucas cufftbench on an 810MHz/2500MHz Titan. Attached is the raw output, which I will process and filter over the next few days. Keep in mind that the bench timings are about 4x shorter than real LL test iteration times.
For those who are interested.

I warmed the Titan up beforehand, so the boost effect shouldn't be too significant.

Code:
rem 1048576    8388608    16777216
rem         8421376    16809984    16842752
e:
cd E:\Eigene Dateien\Computing\CUDALucas\2.03\D0
CUDALucas-2.03-5.0-sm_35-x64.exe -cufftbench 1048576 16842752 65536
CUDALucas-2.03-5.0-sm_35-x64.exe -cufftbench 1048576 8421376 32768 > bench_1M-8M.txt
CUDALucas-2.03-5.0-sm_35-x64.exe -cufftbench 8388608 16809984 32768 > bench_8M-16M.txt
pause
Attached Files
File Type: zip cufftbench_titan_810MHz_2500MHz.zip (5.1 KB, 59 views)
Old 2013-03-15, 04:52   #235
axn
 
 
Jun 2003

4,663 Posts

Quote:
Originally Posted by Brain View Post
In preparation for a new table of efficient FFT lengths, I ran the CUDALucas cufftbench on an 810MHz/2500MHz Titan.
Is there a technical reason that the CuLu FFT lengths must be a multiple of 32K? I fear that many excellent FFT lengths are being overlooked, and that they could be found if this restriction were relaxed a bit (perhaps to multiples of 8K).

Quote:
Originally Posted by Brain View Post
Keep in mind that the bench timings are about 4x shorter than real LL test iteration times.
I assume this is because only the cuFFT portion is being benchmarked? This is a problem. It could mean that a larger FFT that looks faster in the bench gives poorer iteration times than a slightly slower but smaller FFT, because the non-FFT work takes less time at the smaller size. Realistic benching would use the full iteration logic, and give the correct iteration times as well.
Old 2013-03-15, 05:22   #236
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

5×11×157 Posts

I benched them many times at 1k steps (for the GTX 580 only, or mostly, and a very few for the 570 and Tesla C2xxx), and posted the results repeatedly. I still tune them, not for every exponent, but for every small range where I get exponents. The reality is that the best FFT lengths have high powers of 2 and 3, and real life (i.e. doing LL tests, not FFT benching) favors multiples of 8/16/32/64k, depending on the card; they fit better to the number of drawers inside your card. See msft's posts too.

Theoretically, there is no reason why an FFT length could not be a non-multiple of 8k, 16k, etc.

Generally, the time taken for a CL iteration is proportional to the time displayed by the -cufftbench switch. For my cards, this is about 2.6*t, where t is the time from the bench. That is, if the bench tells me that some FFT needs 2.5ms to be performed, then each LL iteration at that FFT size will take about 2.5*2.6=6.5ms. But this is not valid for ALL sizes; a few of them stand out with a larger constant in front, as if they effectively need different times to perform the multiplications or the inverse FFT.
Old 2013-03-15, 08:54   #237
axn
 
 
Jun 2003

4,663 Posts

Quote:
Originally Posted by LaurV View Post
I benched them many times at 1k steps (for the GTX 580 only, or mostly, and a very few for the 570 and Tesla C2xxx).
It is prohibitively expensive to test all 1K steps on all GPU/cuFFT combinations. However, we do know better: looking at the factorization of an FFT length can tell us that some lengths have no hope of good performance.

Attached is a list of "sensible" FFT lengths (at 8K intervals) that should be sufficient for getting optimal or near-optimal benching results. The rule used is that the length must be a product of powers of 2, 3, 5, and/or 7, or a small prime (<= 31) times a power of 2.

If CuLu benching can be modified to add a "sensible length only" flag, the following code can be used:
Code:
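int isSensible(int fftlen)
{
	/* Guard: zero would loop forever below; negatives are not lengths. */
	if (fftlen <= 0) return 0;

	/* Strip all factors of 2, leaving the odd part. */
	while (!(fftlen & 1)) fftlen >>= 1;

	/* Any odd part up to 31 (a small prime or a 7-smooth number) passes. */
	if(fftlen <= 31) return 1;

	/* Otherwise the odd part must be a product of powers of 3, 5 and 7. */
	for(int p=3; p<=7; p+=2)
		while (fftlen%p == 0) fftlen /= p;

	return (fftlen == 1);
}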
Attached Files
File Type: txt sensible.txt (3.0 KB, 100 views)
Old 2013-03-15, 09:55   #238
LaurV
Romulan Interpreter
 
 
Jun 2011
Thailand

5·11·157 Posts

What you say is right: it makes no sense to test all the sizes, only "some". "All" were tested once in the beginning, a "few" were selected, and those are "tuned" every time the range changes, because the final result will still depend on the system (GPU/CPU/running applications combination).

Last fiddled with by LaurV on 2013-03-15 at 09:56 Reason: long storry cancelled
Old 2013-03-15, 12:12   #239
owftheevil
 
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

3²×5×7 Posts

Quote:
Originally Posted by axn View Post
It is prohibitively expensive to test all 1K steps on all GPU/cuFFT combinations. However, we do know better: looking at the factorization of an FFT length can tell us that some lengths have no hope of good performance.

Attached is a list of "sensible" FFT lengths (at 8K intervals) that should be sufficient for getting optimal or near-optimal benching results. The rule used is that the length must be a product of powers of 2, 3, 5, and/or 7, or a small prime (<= 31) times a power of 2.

If CuLu benching can be modified to add a "sensible length only" flag, the following code can be used:
Code:
int isSensible(int fftlen)
{
    while (!(fftlen & 1)) fftlen >>= 1;

    if(fftlen <= 31) return 1;

    for(int p=3; p<=7; p+=2)
        while (fftlen%p == 0) fftlen /= p;

    return (fftlen == 1);
}

That would be easy to implement. Is there any reason why you are looking at 8K intervals? The actual test can tolerate intervals as small as whatever you have "threads" set to in the ini file.
Old 2013-03-15, 13:45   #240
axn
 
 
Jun 2003

4,663 Posts

Quote:
Originally Posted by owftheevil View Post
That would be easy to implement. Is there any reason why you are looking at 8K intervals? The actual test can tolerate intervals as small as whatever you have "threads" set to in the ini file.
Not particularly. Reducing that to whatever is "technically allowed" is just fine. Surprisingly, since we're essentially looking for smooth numbers, the total number of good candidates is very small even with really small strides (like, say, 128).

EDIT:- Counts of sensible candidates <= 2^24 for various strides:

Code:
stride #cands
---- ---
2^15 143
2^14 187
2^13 241
2^12 306
2^11 382
2^10 472
 2^9 577
 2^8 700
 2^7 842
 2^6 1004

Last fiddled with by axn on 2013-03-15 at 13:53
Old 2013-03-15, 15:27   #241
tServo
 
 
"Marv"
May 2009
near the Tannhäuser Gate

2²·53 Posts

Did anyone ever try mmff to see if/how much it speeds up?

Also, since the board has to have its clock lowered, getting a "superclocked"
version would be superfluous, or would the faster GPU clock still be usable?

Interestingly, it should be noted that the production Tesla K20 & K20X boards
have their memory clock set @ 2.6 GHz.
Old 2013-03-15, 17:23   #242
Redarm
 
 
Apr 2012
Berlin Germany

3×17 Posts

Quote:
Originally Posted by tServo View Post
Did anyone ever try mmff to see if/how much it speeds up?

180M/s MM127 184-bit

Last fiddled with by Redarm on 2013-03-15 at 17:24

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.