mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2013-10-02, 12:37   #1
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

3·137 Posts
Default How to pick the best FFT size: a CUDALucas guide

Greetings.

I wrote this guide after thinking of a semi-automated method for best FFT size generation.
Nothing new to be honest, just a systematic approach.
The method, of course, could be fully scripted, but I neither possess the knowledge to do that or feel the need to, it's easy already.

The only pre-requesites are a perl interpreter and unixutils/Cygwin.

Let's begin!



0. Pick an exponent
1. Run
Code:
CUDALucas -t -c 10000 $exp$
2. Note the failed FFT($starting_FFT) size and the one that works(working_FFT).
2.1a If the $starting_FFT$ did not fail, then the new $starting_FFT$ = 2^n, where the old $starting_FFT$ > 2^n.
3. Run the following command, where working_FFT < ending_FFT >= 2^n.
Code:
CUDALucas -cufftbench $starting_FFT$ $ending_FFT$ 512 > data.txt
4. After the command has finished, delete all non-FFT related text from the data.txt file (usually it's 2-3 lines).
5. Run
Code:
perl -ne "chomp; my $a=substr ($_, 24); my $b=substr ($_, 0, 24); print \"$a $b\n\"" < data.txt > data_flipped.txt
6. Run
Code:
sort -n $../data_flipped.txt$ -o $../FFT_data.txt$
7. Run the following command, if at first 500 iterations you see that error size >= 0.25, pick the next best FFT size and run the command with it again.
Code:
CUDALucas -t -c 500 -f $best FFT size$ $exp$
8. If the error is < 2.5, you're good to go.



Example № 1:
0. Let's pick 6972593 as our exponent.
1. With my GTX Titan, the starting FFT was chosen as 393216, which is 2^17 * 3.
2.1a The new starting FFT should be 2^18 (262144), since 2^18 < 2^17 * 3.
3. Ending FFT was chosen as 524800, since 2^17 * 3 < 2^19 + 512 > 2^19. Full command now looks like
Code:
CUDALucas -cufftbench 262144 524800 512 > data.txt
4. Done, sample data.txt attached.
5. Done, sample data_flipped.txt attached.
6. Done, sorted by efficiency FFT sizes are in the attached FFT_data.txt file.
7. First seven FFT sizes produced error rates > 0.25, no good.
8. FFT size of 401408, which is 2^13 * 7^2 was picked as the fastest. As a result, it produces timings of 0.3095ms per LL iteration vs 0.3520ms per LL iteration with the default(393216) FFT size.

Example № 2:
0. Let's pick 68321161 as our exponent.
1. With GTX titan, the starting FFT was chosen as 3670016, which is 2^19 * 7.
2. Working FFT was chosen as 3932160, which is 2^18 * 3 * 5.
3. Ending FFT was chosen as 4194816, since 2^19 * 7 < 2^22 + 512 > 2^22. Full command now looks like
Code:
CUDALucas -cufftbench 3670016 4194816 512 > data.txt
4. Done, sample data.txt attached.
5. Done, sample data_flipped.txt attached.
6. Done, sorted by efficiency FFT sizes are in the attached FFT_data.txt file.
7. First FFT size was golden.
8. FFT size of 4096000, which is 2^15 * 5^3, was picked as the fastest. As a result, it produces timings of 2.9670ms per LL iteration vs. 4.2085ms per LL iteration with the default (3932160) FFT size. This would save me 20h of work, should I decide to do a LL test on 68321161.



Thoughs and assumptions:

Cufftbench can be reworked to test numbers only in the following forms:
2^n
2^n * 3^m
2^n * 5^m
2^n * 7^m
2^n * 3^m * 5^l
2^n * 3^m * 7^l
2^n * 5^m * 7^l
2^n * 3^m * 5^l * 7^k
where n,m,l,k are within all the numbers within some sane boundary ([1;32] ?)
Once the numbers are generated, another algorithm applies, which discards values that are too big and too small (this is the current cufftbench's start and end values).
This way, we don't waste our time on numbers, which will yield bad timings, and the FFT size selection happens a lot faster.
Of course, I may be missing something here, but as far as I know, current high-end NV gpus benefit from FFT sizes in some of those forms.
Other forms may or may not be beneficial for future CUDA/OpenCL GPUs.

This guide should be useful for huge exponents.

I'm currently doing a LL test again 59473411, and the best FFT size for it is 3359232, which is 2^9 * 3^8.
Interesting combination, ehh?



I'd like to hear your comments, suggestions and corrections regarding all of this.
Attached Files
File Type: zip examples.zip (37.9 KB, 121 views)

Last fiddled with by Karl M Johnson on 2013-10-02 at 12:39 Reason: yes
Karl M Johnson is offline   Reply With Quote
Old 2013-10-02, 22:30   #2
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32×5×7 Posts
Default

Be careful with the fft size. Last I looked, in order to give correct results, the fft size must be a multiple of the threads parameter. I doubt you are using threads > 512, but it would be worth checking just to make sure.

Also, the pointwise multiplication kernel requires fft size be a multiple of 512 for correct output, so don't go lower than 2^9 on the power of two part.

Last fiddled with by owftheevil on 2013-10-02 at 22:33
owftheevil is offline   Reply With Quote
Old 2013-10-02, 22:41   #3
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

24×5×7×17 Posts
Default

Quote:
Originally Posted by owftheevil View Post
Also, the pointwise multiplication kernel requires fft size be a multiple of 512 for correct output, so don't go lower than 2^9 on the power of two part.
That sounds like something from Python!
chalsall is offline   Reply With Quote
Old 2013-10-03, 09:17   #4
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

3·137 Posts
Default

Quote:
Originally Posted by owftheevil View Post
Be careful with the fft size. Last I looked, in order to give correct results, the fft size must be a multiple of the threads parameter. I doubt you are using threads > 512, but it would be worth checking just to make sure.

Also, the pointwise multiplication kernel requires fft size be a multiple of 512 for correct output, so don't go lower than 2^9 on the power of two part.
When you pick a FFT size that is indivisible by thread count, CUDALucas crashes.
No big deal.
I assume cufftbench will not produce such FFT sizes (since it depends on the threads parameter), so that changes nothing.

As for pointwise multiplication, if the FFT size is indivisible by 2^9, while the threads are set to < 2^9, it just throws an error rate of 0.5 at you.
Again, no big deal.
If it is even possible to stumble across such a FFT size, it can safely be skipped.

Any other comments?
Am I the only one picking golden FFT sizes before commencing a LL test (it's somewhat akin to GNFS's polynomial selection)?
Karl M Johnson is offline   Reply With Quote
Old 2013-10-03, 12:50   #5
kladner
 
kladner's Avatar
 
"Kieren"
Jul 2011
In My Own Galaxy!

27AE16 Posts
Default

Quote:
Originally Posted by Karl M Johnson View Post
Any other comments?
Am I the only one picking golden FFT sizes before commencing a LL test (it's somewhat akin to GNFS's polynomial selection)?
I have to study your method. I have tried to follow some of LaurV's advice in choosing FFT size, but I have not been as systematic as you or he. I have not put as much effort into CuLu, but I like to at least figure out how to run the various GPU programs. Being neither a programmer, nor a mathematician means that it sometimes takes me a while.

I do appreciate, and try to take advantage of the efforts of those more thoughtful and skilled.
kladner is offline   Reply With Quote
Old 2013-10-03, 13:48   #6
owftheevil
 
owftheevil's Avatar
 
"Carl Darby"
Oct 2012
Spring Mountains, Nevada

32×5×7 Posts
Default

Quote:
Originally Posted by Karl M Johnson View Post
When you pick a FFT size that is indivisible by thread count, CUDALucas crashes.
No big deal.
I assume cufftbench will not produce such FFT sizes (since it depends on the threads parameter), so that changes nothing.

As for pointwise multiplication, if the FFT size is indivisible by 2^9, while the threads are set to < 2^9, it just throws an error rate of 0.5 at you.
Again, no big deal.
If it is even possible to stumble across such a FFT size, it can safely be skipped.

Any other comments?
Am I the only one picking golden FFT sizes before commencing a LL test (it's somewhat akin to GNFS's polynomial selection)?

So in the rare cases that it might show up, the problem is simply annoying rather than critical, since the program gives immediate feedback that something is wrong rather than quietly producing an incorrect result many hours later. That's good to know.
owftheevil is offline   Reply With Quote
Old 2013-10-10, 08:20   #7
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

3×137 Posts
Default

Here's another guide which utilizes an easier approach.
I've done a cufftbench on a [25M;100M] exponent range, which was identified as FFT sizes of [1048576;8388608], with a step of 512.

Now, there are three lists: The raw results of FFT benchmark, FFT sizes sorted by efficiency and factored FFT sizes, sorted by efficiency or download all from sendspace.
The devs may need all 3 lists, users of Cudalucas should stick with the second.

Now, to pick a golden FFT size, note the FFT size Cudalucas selects, and search for a similar (or even a smaller one, if Culu doesn't fail) size.

Here's an example:
Say I want to commence a LL test on 2^85123453 - 1.
Here's what I get when first launching Cudalucas:
Code:
Starting M85123453 fft length = 4718592
iteration = 33 < 1000 && err = 0.253906 >= 0.25, increasing n from 4718592
Starting M85123453 fft length = 5242880
Iteration 1000 M( 85123453 )C, 0xdf4efecbfda348f5, n = 5242880, CUDALucas v2.03 err = 0.0244 (0:05 real, 4.5829 ms/iter, ETA 108:21:49)
So the starting FFT was 4.718M, and we can't go lower as the error rate is already > 0.25.
Search the sorted FFT size list for "CUFFT_Z2Z size= 4\d\d\d\d\d\d" (you'll need a text editor that supports regxp, Notepad++ with TextFX plugin does).
First results will be 4096000, which is too small for us.

After a dozen of clicks, you should find 4741632.
Even though it's bigger, here's what it produces:
Code:
Starting M85123453 fft length = 4741632
Iteration 1000 M( 85123453 )C, 0xdf4efecbfda348f5, n = 4741632, CUDALucas v2.03 err = 0.2969 (0:04 real, 4.1280 ms/iter, ETA 97:36:25)
Iteration 2000 M( 85123453 )C, 0x8898eec48732d5cc, n = 4741632, CUDALucas v2.03 err = 0.3140 (0:04 real, 4.0200 ms/iter, ETA 95:03:02)
Iteration 3000 M( 85123453 )C, 0x1fc79db79df64352, n = 4741632, CUDALucas v2.03 err = 0.3140 (0:04 real, 4.0224 ms/iter, ETA 95:06:28)
Iteration 4000 M( 85123453 )C, 0x7c85fa5ae6a60809, n = 4741632, CUDALucas v2.03 err = 0.3140 (0:04 real, 4.0204 ms/iter, ETA 95:03:34)
No idea why Cudalucas ignored the error, but it's bigger than 0.25, so we shouldn't use it.

Soon, you should find 4816896.
Looks like it's what we need:
Code:
Iteration 1000 M( 85123453 )C, 0xdf4efecbfda348f5, n = 4816896, CUDALucas v2.03 err = 0.2188 (0:05 real, 4.1924 ms/iter, ETA 99:07:43)
Iteration 2000 M( 85123453 )C, 0x8898eec48732d5cc, n = 4816896, CUDALucas v2.03 err = 0.2188 (0:04 real, 4.0760 ms/iter, ETA 96:22:35)
Iteration 3000 M( 85123453 )C, 0x1fc79db79df64352, n = 4816896, CUDALucas v2.03 err = 0.2227 (0:04 real, 4.1040 ms/iter, ETA 97:02:15)
Error is < 2.5, timings look better than for FFT size of 5242880.

A note to Cudalucas developers: as you can see from the third list, a lot of FFT sizes are not entirely made of classic 2/3/5/7 integers, yet they still remain faster than some others.

Conclusion(s):
1. Cufftbenchmark is perfect, no need to change it, as it revealed FFT sizes composed of interesting factors. No connection found.
2. Testing FFT sizes on [100M;200M] range would take a while.

Comments, ideas, suggestions are welcome.

Last fiddled with by Karl M Johnson on 2013-10-10 at 08:22
Karl M Johnson is offline   Reply With Quote
Old 2013-10-10, 17:45   #8
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

24×5×7×17 Posts
Default

Quote:
Originally Posted by Karl M Johnson View Post
Comments, ideas, suggestions are welcome.
Interesting...

For those who do CUDALucas work, could a table be derived which would show where the "sweat spots" (and the related FFT length) are for the ranges currently being worked?

This thought comes out of discussions Kracker and LaurV recently had as to where their kit's best (read: most efficient) work area was for DCing.
chalsall is offline   Reply With Quote
Old 2013-10-11, 04:15   #9
Karl M Johnson
 
Karl M Johnson's Avatar
 
Mar 2010

19B16 Posts
Default

Such a table will be highly beneficial, but I could only compile it by using Cudalucas with constantly increasing exponents.
I've seen people estimate FFT lengths based on exponent size.
Is there a formula for approximate FFT size, which is based on exponent?
Or is it a rule of thumb?
Karl M Johnson is offline   Reply With Quote
Old 2013-10-11, 07:33   #10
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

222248 Posts
Default

The numbers are system-dependent (i.e. GPU mostly). A table will only have guidance character, but the user still need to bench its own system for the range of the expos he is doing. You will be very surprised!

All this discussion we already had for cudaLucas, when Dubslow repeatedly changed (see following posts there) the table used for FFT selection in that program. We got a list with those best FFT's for different cards, including for Tesla, or for Titan, (and if you dig around those posts, there is a lot of info, say read 4-5 pages back and forth, from about page 150, in the cudaLucas thread), which table is partially embedded into cudaLucas 2.04, and it is doing well for my gtx580, but still has some "small flicks" here and there. I always do my tuning when I switch exponents/ranges.

I mean, the just-written tutorial is nice, and kudos/kotgw to Karl for putting all together, but re-negotiating the same things every 4 months is a bit bothering, don't you think?

The idea is that if you want to squeeze the last bit of speed from the GPU, you need to tune, anyhow. If not, then let cudaLucas to do its job, he is (almost) very good in doing it. With a "table", you won't be able to beat it (except for few very particular cards/systems), unless you make this table a "book" (one table for one GPU, or so). Of course, there are "guides" about the "granularity" of the FFT (again, see cuLu thread), finer granularities doing better than "some small factors times a big prime", but that is not a rule. There are cases when (depending of the number of threads, card, exponent), for example, x*y*2^2*5 and x*y*2*3^2 are both worse than x*y*19. See concrete example in those lists.

Last fiddled with by LaurV on 2013-10-11 at 07:43
LaurV is offline   Reply With Quote
Old 2013-10-11, 20:23   #11
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

1162510 Posts
Default

Quote:
Originally Posted by Karl M Johnson View Post
I've seen people estimate FFT lengths based on exponent size.
Is there a formula for approximate FFT size, which is based on exponent?
Or is it a rule of thumb?
See here, which is based on the heuristic I developed here. That may need a bit of tweaking if your code gives highly disparate ROEs depending on the odd component of the FFT length, but that could be easily added to the basic functionality.
ewmayer is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
PRP:- Pick A Range Citrix Prime Sierpinski Project 2 2014-02-16 18:47
Guide OmbooHankvald Operation Billion Digits 4 2009-08-26 08:18
Help me pick a math course. jasong Math 9 2005-03-11 21:04
Pick and Choose Wacky Puzzles 5 2003-07-16 20:02
Pick a stone, or two, .... or three Wacky Puzzles 5 2003-06-24 16:11

All times are UTC. The time now is 15:46.

Sun Apr 11 15:46:23 UTC 2021 up 3 days, 10:27, 1 user, load averages: 2.03, 1.98, 1.89

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.