[QUOTE=msft;293660]I'll merge to CUDALucas.
$ ./CUDALucas -d 1 -cufftbench 1048576,2097152,65536[/QUOTE] Nice, thanks for working on it. |
[QUOTE=msft;293659]Understand.[/QUOTE]
Whooh! When I saw msft's post my heart started to rattle... I was hoping for v1.68 (the one with the interactive "-aggressive" switch). From your post #1020 it seemed like the code was already modified... and (see my post #1022) "you sir, made my day", but that was back on the 19th of March...hehe... anyhow, an interactive "-t" and/or "-s" would be nicer, and "-t nnnnn" the nicest! :bow: (edit: of course, all interactive, menu lines to turn variables on/off only, same as for the -aggressive switch). kotgw! |
So I ran cufftbench and it gave different timings for different FFT sizes.
So, to actually extract some useful info from that, you need to look at the FFT sizes with the lowest timings? |
[QUOTE=Karl M Johnson;293668]So I ran cufftbench and it gave different timings for different FFT sizes.
So, to actually extract some useful info from that, you need to look at the FFT sizes with the lowest timings?[/QUOTE] Exactly! You have to select the FastestFT for which the errors can still be kept "under control", depending on your exponent and hardware. This does not necessarily mean the "shortest", because the "smoothness" (or "compositeness") of the length plays an important role too. For example, on my GTX 580, assuming I run a 26M exponent with 512 threads, the best value is 1474560, for which I get a 2.7 ms average iteration time (without -t it would be even shorter) and around 0.15 max rounding error (shown lower here, but it increases later).
[CODE]e:\-99-Prime\CudaLucas>cl1673213x64 -d 1 -threads 512 -c 100000 -f 1474560 -s backup1 -t -aggressive 26236981
DEVICE:1------------------------
name GeForce GTX 580
totalGlobalMem 1610612736
sharedMemPerBlock 49152
regsPerBlock 32768
warpSize 32
memPitch 2147483647
maxThreadsPerBlock 1024
maxThreadsDim[3] 1024,1024,64
maxGridSize[3] 65535,65535,65535
totalConstMem 65536
major.minor 2.0
clockRate 1564000
textureAlignment 512
deviceOverlap 1
multiProcessorCount 16
mkdir: cannot create directory `backup1': File exists
start M26236981 fft length = 1474560
Iteration 100000 M( 26236981 )C, 0xd41bed59d73c8128, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:32 real, 2.7172 ms/iter, ETA 19:41:58)
Iteration 200000 M( 26236981 )C, 0x0baa7d32840e44b4, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:33 real, 2.7258 ms/iter, ETA 19:41:11)[/CODE]
[CODE]start M26251807 fft length = 1474560
Iteration 100000 M( 26251807 )C, 0x78ba7538d4751989, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:31 real, 2.7152 ms/iter, ETA 19:41:06)
Iteration 200000 M( 26251807 )C, 0x6cf1b6e6d1ec5072, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:32 real, 2.7180 ms/iter, ETA 19:37:47)
Iteration 300000 M( 26251807 )C, 0xf7ec22c7f2c69750, n = 1474560, CUDALucas v1.67 err = 0.125 (4:32 real, 2.7243 ms/iter, ETA 19:35:59)
Iteration 400000 M( 26251807 )C, 0x69da500199e2849a, n = 1474560, CUDALucas v1.67 err = 0.125 (4:33 real, 2.7322 ms/iter, ETA 19:34:50)
Iteration 500000 M( 26251807 )C, 0xb6430dea49299dc2, n = 1474560, CUDALucas v1.67 err = 0.125 (4:32 real, 2.7245 ms/iter, ETA 19:27:00)
Iteration 600000 M( 26251807 )C, 0x84340ee28836b57d, n = 1474560, CUDALucas v1.67 err = 0.125 (4:33 real, 2.7260 ms/iter, ETA 19:23:06)[/CODE]
Remark that 1474560 is 32768*[B]45[/B]: the first factor is the "granularity" (the length must be a multiple of it) and the second is the multiplier. Note that 45 is "smooth", as 3^2*5. This most probably helps to "fill in" the threads when the "butterflies" of the FFT are computed. Other values such as 1409024 (multiplier 43) and 1343488 (multiplier 41) give much longer times, almost as long as the full power of 2 (multiplier 64). *44=1441792 gave almost the same ETA as *45, a bit higher, with higher error too, and *40 cannot be used (the error is over .45). I also tried larger multipliers, up to the full power of two. The default value of *48=1572864 takes longer (about 30-50% more) and the error is in the 0.07 range. Good ETAs can also be obtained for *49 and *54, and they are generally worse for all the other multipliers.
For different cards (different number of threads, etc.) the results may vary. Generally you have to tune the threads and FFT lengths not for every exponent, but for every range, say a million or even finer. This depends on how good your card is and how "comfortable" you are with risk. Increasing the error increases the chance of getting bad results (without -t) or of aborting the test (with -t) and having to repeat some percentage of the iterations. |
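The multiplier arithmetic in the post above can be checked with a few lines of Python. This is only a sketch: the granularity of 32768 and the 40..64 multiplier range come from the post, while the function name `prime_factors` is made up for illustration. It lists each candidate length with the factorization of its multiplier; it does not predict timings, since the link between smoothness and speed is the poster's experimental guess.

```python
GRANULARITY = 32768  # FFT lengths in the post are multiples of this value


def prime_factors(n):
    """Return the prime factors of n (n >= 2) in ascending order, with repeats."""
    factors = []
    divisor = 2
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    if n > 1:
        factors.append(n)
    return factors


# Multipliers 40..64 cover the range discussed for the 26M exponents.
for multiplier in range(40, 65):
    factors = prime_factors(multiplier)
    print(multiplier, multiplier * GRANULARITY, factors, "largest prime:", max(factors))
```

A multiplier whose largest prime factor is small (45 = 3*3*5, 48 = 2^4*3, 54 = 2*3^3) is "smooth" in the sense used above; 41, 43 and 47 are prime. The output is only a shortlist of lengths worth benchmarking on the actual card.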
[QUOTE=Karl M Johnson;293668]So I ran cufftbench and it gave different timings for different FFT sizes.
So, to actually extract some useful info from that, you need to look at the FFT sizes with the lowest timings?[/QUOTE]CUFFT is a black box, capricious, an enigma. :lol: |
1 Attachment(s)
Ver 1.68
1) print depend -d.
2) change fft length code.
3) change round-off err check:
not -f option && iter < 1000 && err >= 0.25: increasing fft length
not -f option && iter >= 1000 && err >= 0.35: exit program
-f option && err >= 0.35: exit program
4) add -cufftbench
5) add -k
6) change -aggressive to -polite
7) change checkpoint file format !
[code]
$ ./CUDALucas
$ CUDALucas [-d device_number] [-threads 32|64|128|256|512|1024] [-c checkpoint_iteration] [-f fft_length] [-s folder] [-t] [-polite iteration] [-m] exponent|input_filename
$ CUDALucas [-d device_number] [-threads 32|64|128|256|512|1024] [-t] [-polite iteration] -r
$ CUDALucas [-d device_number] -cufftbench start end distance
-threads set threads number(default=256)
-f set fft length(if round off error then exit)
-s save all checkpoint files
-t check round off error all iterations
-polite GPU polite per iteration(default -polite 1)
-polite 0 GPU aggressive
-cufftbench exec CUFFT benchmark (Ex. $ ./CUDALucas -d 1 -cufftbench 1179648 6291456 32768 )
-r exec residue test.
-k enable keys (p change -polite,t change -t,s change -s)
[/code]
[code]
$ ./CUDALucas -r
Iteration 10000 M( 86243 )C, 0x23992ccd735a03d9, n = 4608, CUDALucas v1.68 err = 0.01074 (0:02 real, 0.2310 ms/iter, ETA 0:16)
Iteration 10000 M( 132049 )C, 0x4c52a92b54635f9e, n = 7168, CUDALucas v1.68 err = 0.02881 (0:02 real, 0.1701 ms/iter, ETA 0:20)
Iteration 10000 M( 216091 )C, 0x30247786758b8792, n = 12288, CUDALucas v1.68 err = 0.004395 (0:02 real, 0.1482 ms/iter, ETA 0:29)
Iteration 10000 M( 756839 )C, 0x5d2cbe7cb24a109a, n = 40960, CUDALucas v1.68 err = 0.03125 (0:02 real, 0.2535 ms/iter, ETA 3:07)
Iteration 10000 M( 859433 )C, 0x3c4ad525c2d0aed0, n = 49152, CUDALucas v1.68 err = 0.008789 (0:03 real, 0.2963 ms/iter, ETA 4:08)
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3411 ms/iter, ETA 7:02)
Iteration 10000 M( 1398269 )C, 0xa4a6d2f0e34629db, n = 73728, CUDALucas v1.68 err = 0.08789 (0:03 real, 0.3753 ms/iter, ETA 8:37)
Iteration 10000 M( 2976221 )C, 0x2a7111b7f70fea2f, n = 163840, CUDALucas v1.68 err = 0.04688 (0:08 real, 0.7523 ms/iter, ETA 37:06)
Iteration 10000 M( 3021377 )C, 0x6387a70a85d46baf, n = 163840, CUDALucas v1.68 err = 0.06445 (0:07 real, 0.7562 ms/iter, ETA 37:56)
Iteration 10000 M( 6972593 )C, 0x88f1d2640adb89e1, n = 393216, CUDALucas v1.68 err = 0.04688 (0:18 real, 1.7246 ms/iter, ETA 3:20:03)
Iteration 10000 M( 13466917 )C, 0x9fdc1f4092b15d69, n = 786432, CUDALucas v1.68 err = 0.02734 (0:32 real, 3.1939 ms/iter, ETA 11:55:57)
Iteration 10000 M( 20996011 )C, 0x5fc58920a821da11, n = 1179648, CUDALucas v1.68 err = 0.08984 (0:47 real, 4.7056 ms/iter, ETA 27:25:24)
Iteration 10000 M( 24036583 )C, 0xcbdef38a0bdc4f00, n = 1310720, CUDALucas v1.68 err = 0.1875 (0:52 real, 5.1515 ms/iter, ETA 34:22:19)
iteration < 1000 && err = 0.28125 >= 0.25, increasing n from 1310720
Iteration 10000 M( 25964951 )C, 0x62eb3ff0a5f6237c, n = 1572864, CUDALucas v1.68 err = 0.01758 (1:04 real, 6.3293 ms/iter, ETA 45:37:26)
iteration < 1000 && err = 0.4375 >= 0.25, increasing n from 1572864
Iteration 10000 M( 30402457 )C, 0x0b8600ef47e69d27, n = 1835008, CUDALucas v1.68 err = 0.04688 (1:12 real, 7.1573 ms/iter, ETA 60:25:09)
Iteration 10000 M( 32582657 )C, 0x02751b7fcec76bb1, n = 1835008, CUDALucas v1.68 err = 0.2422 (1:12 real, 7.1500 ms/iter, ETA 64:41:14)
iteration < 1000 && err = 0.34375 >= 0.25, increasing n from 1966080
Iteration 10000 M( 37156667 )C, 0x67ad7646a1fad514, n = 2097152, CUDALucas v1.68 err = 0.1055 (1:21 real, 8.0733 ms/iter, ETA 83:17:23)
Iteration 10000 M( 42643801 )C, 0x8f90d78d5007bba7, n = 2359296, CUDALucas v1.68 err = 0.1953 (1:31 real, 9.0537 ms/iter, ETA 107:12:40)
iteration < 1000 && err = 0.25 >= 0.25, increasing n from 2359296
Iteration 10000 M( 43112609 )C, 0xe86891ebf6cd70c4, n = 2621440, CUDALucas v1.68 err = 0.02148 (1:44 real, 10.3703 ms/iter, ETA 124:09:19)
[/code]
[code]
$ ./CUDALucas -k -s test -polite 2 1257787
start M1257787 fft length = 65536
Iteration 10000 M( 1257787 )C, 0x3f45bf9bea7213ea, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3413 ms/iter, ETA 7:03)
Iteration 20000 M( 1257787 )C, 0x961d384390291afd, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3432 ms/iter, ETA 7:02)
Iteration 30000 M( 1257787 )C, 0x53fec009337bd2d5, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3413 ms/iter, ETA 6:56)
-polite 0
Iteration 40000 M( 1257787 )C, 0x195c41704df8f7f0, n = 65536, CUDALucas v1.68 err = 0.1055 (0:02 real, 0.2717 ms/iter, ETA 5:28)
Iteration 50000 M( 1257787 )C, 0x6b3f73933bd773df, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.2540 ms/iter, ETA 5:04)
Iteration 60000 M( 1257787 )C, 0x9b92e660ca8b91d3, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.2543 ms/iter, ETA 5:02)
-polite 2
Iteration 70000 M( 1257787 )C, 0xb63c17041a4b7a76, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3146 ms/iter, ETA 6:11)
Iteration 80000 M( 1257787 )C, 0xd0960e64d43d7a0e, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3469 ms/iter, ETA 6:45)
Iteration 90000 M( 1257787 )C, 0xdce2b6e8ca6914b1, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3393 ms/iter, ETA 6:33)
s desable -s
Iteration 100000 M( 1257787 )C, 0x5df3c10927093cb3, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3402 ms/iter, ETA 6:31)
Iteration 110000 M( 1257787 )C, 0x056d4b3c0ecb57f5, n = 65536, CUDALucas v1.68 err = 0.1055 (0:03 real, 0.3349 ms/iter, ETA 6:21)
-polite 0
Iteration 120000 M( 1257787 )C, 0xa8c4e198df143da9, n = 65536, CUDALucas v1.68 err = 0.1055 (0:04 real, 0.3247 ms/iter, ETA 6:06)
Iteration 130000 M( 1257787 )C, 0x9c7abf9d886c48fb, n = 65536, CUDALucas v1.68 err = 0.1055 (0:02 real, 0.2540 ms/iter, ETA 4:44)
t enable -t
Iteration 140000 M( 1257787 )C, 0x0780de24fcca7557, n = 65536, CUDALucas v1.68 err = 0.1172 (0:03 real, 0.3055 ms/iter, ETA 5:39)
Iteration 150000 M( 1257787 )C, 0x82d2707bb75c4bde, n = 65536, CUDALucas v1.68 err = 0.1172 (0:04 real, 0.3360 ms/iter, ETA 6:09)
Iteration 160000 M( 1257787 )C, 0x6608965c5d475691, n = 65536, CUDALucas v1.68 err = 0.1211 (0:03 real, 0.3355 ms/iter, ETA 6:05)
[/code]
[code]
$ ./CUDALucas -cufftbench 1024 2048 256
CUFFT bench start = 1024 end = 2048 distance = 256
CUFFT_Z2Z size= 1024 time= 0.026302 msec
CUFFT_Z2Z size= 1280 time= 0.033568 msec
CUFFT_Z2Z size= 1536 time= 0.032268 msec
CUFFT_Z2Z size= 1792 time= 0.031069 msec
[/code] |
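The three round-off rules from item 3 of the changelog can be written out as a small decision function. This is a reading of the rules as posted, not the CUDALucas source: the function name, argument names and return strings are all hypothetical, and only the thresholds (0.25, 0.35, iteration 1000) come from the post.

```python
def roundoff_action(err, iteration, fft_forced):
    """Decide what to do after a round-off check, following the posted v1.68 rules.

    err        -- round-off error measured for this iteration
    iteration  -- current iteration number
    fft_forced -- True when the FFT length was fixed by the user with -f
    """
    if fft_forced:
        # -f option && err >= 0.35: exit program
        return "exit" if err >= 0.35 else "continue"
    if iteration < 1000:
        # not -f option && iter < 1000 && err >= 0.25: increasing fft length
        return "increase_fft" if err >= 0.25 else "continue"
    # not -f option && iter >= 1000 && err >= 0.35: exit program
    return "exit" if err >= 0.35 else "continue"


# Values taken from the residue-test log above.
print(roundoff_action(0.28125, 500, fft_forced=False))  # early, automatic length
print(roundoff_action(0.2422, 10000, fft_forced=False))  # later, below 0.35
```

With a forced length there is no automatic fallback, so an early error of 0.28125 would simply continue there, which is one reason a hand-picked -f value deserves a trial run first.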
1 Attachment(s)
[QUOTE=LaurV;293671]Exactly! You have to select the FastestFT for which the errors can still be kept "under control", depending on your exponent and hardware. This does not necessarily means the "shortest", because the "smoothness" (or "compositeness") of it plays an important role too. For example, for my gtx580, assuming I run a 26M exponent and 512 threads, then the best value is 1474560, for which I get 2.7ms average iteration time (without -t it will be even shorter), and around 0.15 max rounding error (here shown lower, but it will increase later).
[CODE]e:\-99-Prime\CudaLucas>cl1673213x64 -d 1 -threads 512 -c 100000 -f 1474560 -s backup1 -t -aggressive 26236981 DEVICE:1------------------------ name GeForce GTX 580 totalGlobalMem 1610612736 sharedMemPerBlock 49152 regsPerBlock 32768 warpSize 32 memPitch 2147483647 maxThreadsPerBlock 1024 maxThreadsDim[3] 1024,1024,64 maxGridSize[3] 65535,65535,65535 totalConstMem 65536 major.minor 2.0 clockRate 1564000 textureAlignment 512 deviceOverlap 1 multiProcessorCount 16 mkdir: cannot create directory `backup1': File exists start M26236981 fft length = 1474560 Iteration 100000 M( 26236981 )C, 0xd41bed59d73c8128, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:32 real, 2.7172 ms/iter, ETA 19:41:58) Iteration 200000 M( 26236981 )C, 0x0baa7d32840e44b4, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:33 real, 2.7258 ms/iter, ETA 19:41:11)[/CODE] [CODE]start M26251807 fft length = 1474560 Iteration 100000 M( 26251807 )C, 0x78ba7538d4751989, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:31 real, 2.7152 ms/iter, ETA 19:41:06) Iteration 200000 M( 26251807 )C, 0x6cf1b6e6d1ec5072, n = 1474560, CUDALucas v1.67 err = 0.1133 (4:32 real, 2.7180 ms/iter, ETA 19:37:47) Iteration 300000 M( 26251807 )C, 0xf7ec22c7f2c69750, n = 1474560, CUDALucas v1.67 err = 0.125 (4:32 real, 2.7243 ms/iter, ETA 19:35:59) Iteration 400000 M( 26251807 )C, 0x69da500199e2849a, n = 1474560, CUDALucas v1.67 err = 0.125 (4:33 real, 2.7322 ms/iter, ETA 19:34:50) Iteration 500000 M( 26251807 )C, 0xb6430dea49299dc2, n = 1474560, CUDALucas v1.67 err = 0.125 (4:32 real, 2.7245 ms/iter, ETA 19:27:00) Iteration 600000 M( 26251807 )C, 0x84340ee28836b57d, n = 1474560, CUDALucas v1.67 err = 0.125 (4:33 real, 2.7260 ms/iter, ETA 19:23:06)[/CODE] For different cards (different number of threads, etc) the results may vary. Generally you have to tune the threads and FFT lengths not for every exponent, but for every range, say a million or even finer. This depends how good is your card and how "confortable" you are to risk. 
Because increasing the error increases the possibility to get bad results (if no -t) or abort the test (with -t) and having to repeat some percent iterations.[/QUOTE] With your info in mind, I see a few things:
1) Select the Fastest FFT for which the errors can still be kept "under control"
2) The FFT will never be larger than the default FFT selected by CUDALucas
3) This does not necessarily mean the "shortest", because the "smoothness" (or "compositeness") of it plays an important role too
For #3, I'm not sure how to make the best selection. Once you select the 'best candidates', what's the method for finding "smoothness" (or "compositeness")? Then, based on this: [QUOTE]Remark that 1474560 is 32768*[B]45[/B]: the first factor is the "granularity" (the length must be a multiple of it) and the second is the multiplier. Note that 45 is "smooth", as 3^2*5. This most probably helps to "fill in" the threads when the "butterflies" of the FFT are computed. Other values such as 1409024 (multiplier 43) and 1343488 (multiplier 41) give much longer times, almost as long as the full power of 2 (multiplier 64). *44=1441792 gave almost the same ETA as *45, a bit higher, with higher error too, and *40 cannot be used (the error is over .45). I also tried larger multipliers, up to the full power of two. The default value of *48=1572864 takes longer (about 30-50% more) and the error is in the 0.07 range. Good ETAs can also be obtained for *49 and *54, and they are generally worse for all the other multipliers.[/QUOTE] What is the way to select the best one? How do you find the smoothness? Will any selected FFT find all the factors if the error is not too high? Attached is a cufft test run from my card. |
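For reading a cufftbench run like the attached one, a short script can pull out the sizes and sort them by time. This is a sketch that assumes the output lines look exactly like the four-line sample posted with v1.68 (`CUFFT_Z2Z size= ... time= ... msec`); the helper names are made up, and the attached file itself is not reproduced here.

```python
import re

# Matches lines such as: CUFFT_Z2Z size= 1024 time= 0.026302 msec
BENCH_LINE = re.compile(r"CUFFT_Z2Z size=\s*(\d+)\s+time=\s*([0-9.]+)\s+msec")


def parse_cufftbench(text):
    """Return a list of (fft_size, milliseconds) tuples found in the text."""
    return [(int(size), float(ms)) for size, ms in BENCH_LINE.findall(text)]


def rank_by_time(results):
    """Sort benchmark results from fastest to slowest."""
    return sorted(results, key=lambda item: item[1])


# Sample output copied from the v1.68 announcement in this thread.
sample = """CUFFT bench start = 1024 end = 2048 distance = 256
CUFFT_Z2Z size= 1024 time= 0.026302 msec
CUFFT_Z2Z size= 1280 time= 0.033568 msec
CUFFT_Z2Z size= 1536 time= 0.032268 msec
CUFFT_Z2Z size= 1792 time= 0.031069 msec"""

for size, ms in rank_by_time(parse_cufftbench(sample)):
    print(size, ms)
```

The ranking only covers speed. A length that benchmarks fast still has to be long enough for the exponent, which is what the err value printed during a real run shows.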
1 Attachment(s)
[QUOTE=msft;293677]Ver 1.68[/QUOTE] Attached CUDALucas 1.68 x64 binaries:
- CUDA 4.0 | SM 2.0
- CUDA 4.1 | SM 2.0
- CUDA 4.1 | SM 2.1 |
1 Attachment(s)
Attached CUDALucas 1.68 x64 binaries:
- CUDA 3.2 | SM 1.3 |
[QUOTE=flashjh;293678]With your info in mind, I see a few things:
1) Select the Fastest FFT for which the errors can still be kept "under control" [/QUOTE] Number 1 is clear. As a general rule, if the FFT length increases, then the time to compute it increases, as more data is involved. This is common sense: a shorter FFT means a shorter time. But a shorter FFT also means a larger error. We increase the accuracy by increasing the length of the FFT, just as we increase the accuracy of a measurement by increasing the number of decimals. As real numbers are used instead of integers, the calculation is never exact; there is always a small error, which is later cleared by rounding back to integers. If the FFT is too short, the error can become bigger than 0.5, and the rounding will point to a neighboring integer instead of the correct (expected) one.
The FFT is automatically chosen to be the smallest (fastest) for which the errors are "reasonable". But this depends on many things and it is a difficult choice to make, so the "automatically chosen" FFT size is never the "optimum" value. It is always a "safe" value, that is, a higher one (a longer FFT, for which the error is smaller). To get the "optimum" time one has to "tune down" the FFT according to the exponent and the hardware at hand. That is why there was such a big fuss about having the "-f" parameter. For example, I can spend half an hour playing with the exponent and FFT, but then get a 19-hour ETA instead of 24.
[QUOTE] 2) The FFT will never be larger than the default FFT selected by CUDALucas [/QUOTE]This is not true. The story told above at point 1 (shorter FFT = faster = shorter time) is the "ideal" case. The real case is when the resources are limited. Imagine you have a frying pan in which you can only fit two donuts at a time, and it takes a minute to cook a donut on one side. When you have 2 donuts, you will need 2 minutes, as there are 4 sides. If you have 4 donuts you will need 4 minutes, for 6 donuts 6 minutes, and so on. Now imagine you have to cook 3 donuts. In an "ideal" case you still need 3 minutes, as there are 6 sides and you can cook 2 sides in a minute. But to do that, you must be able to cook donuts 1 and 2 on one side for the first minute, then take out donut 1, put in donut 3 and turn over donut 2, and cook them for a second minute. Now take out donut 2, which is finished, put back donut 1 to be cooked on the other side, and flip donut 3. Cook them for the third minute and all is finished. If you are not able to do that, for whatever reason, like "donut 1 cannot be kept half cooked", then you still need 4 minutes to cook 3 donuts: cook 2 for 2 minutes, and cook the third in half of the pan for another 2 minutes. In this case half of the pan is wasted for two minutes, and you could have cooked 4 donuts in the same time.
The same story happens with FFT butterflies and threads. You cross-multiply them for a while, wait for this, multiply with that, wait for those wings, etc. You have 4 frying pans. When you have a power of 2 donuts, all the frying pans are always full. If you have only 12 donuts, you can cook them in pairs, or by 3, by 4, by 6, etc. You cannot always fill the pans (cores, threads, etc.) optimally. This is where the "granularity" of the FFT length comes into play. You will need the longest time to cook 11 or 13 donuts, because you can't really split them evenly and some pans will always be empty, especially when you must split them in halves and quarters like the FFT does. OK, this is a stupid example, but it is just to give an idea without too much math. FFT works like "divide et impera".
So, coming back to the question, a counterexample can easily be found: increase the exponent little by little, until you find a "default FFT" with a bad multiplier (like a prime, or one with big prime factors), and you will see that some "higher" FFTs get shorter times. For the 26M case, the default is *48, but let's imagine the default were *47. You can try and see that *48 and *49 get shorter times than *47. In fact, *47 and *43 are quite bad; even *64 (the full power of 2) is faster. I don't know how to explain this; the only explanation that comes to my mind is related to the "bad fitting" of the primes to the number of threads/cores, etc. As msft said, cuFFT is a black box, a total enigma.
[QUOTE] 3) This does not necessarily mean the "shortest", because the "smoothness" (or "compositeness") of it plays an important role too
For #3, I'm not sure how to make the best selection. Once you select the 'best candidates', what's the method for finding "smoothness" (or "compositeness")? Then, based on this: What is the way to select the best one? How do you find the smoothness? Will any selected FFT find all the factors if the error is not too high? [/QUOTE]I don't know how to select the best, besides experimenting for each exponent and hardware. What I said is [B]experimental evidence and guessing[/B]. What factors are you talking about? I just got two matches with
[CODE]>cl1673213x64 -d 0 -threads 512 -c 100000 -f 1474560 -s backup0 -t -aggressive 26236183
>cl1673213x64 -d 1 -threads 512 -c 100000 -f 1474560 -s backup1 -t -aggressive 26251031
[/CODE]They ran smoothly, for only 19 hours each (the name of the program means v1.67, drv 3.2, cc 1.3, the one compiled by you; I renamed it to my taste, and drv 3.2 is a bit faster than 4.0 and 4.1). |
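The rounding argument from point 1 can be tried on a toy scale. The sketch below squares one integer with a plain floating-point FFT, once split into 8-bit words (long transform) and once into 20-bit words (short transform), and reports how far the results were from integers before rounding. It is an illustration only: it is not the transform CUDALucas uses, all names are hypothetical, and the test number is arbitrary.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n)
        out[k] = even[k] + twiddle * odd[k]
        out[k + n // 2] = even[k] - twiddle * odd[k]
    return out


def fft_square(x, bits_per_word):
    """Square a positive integer with a floating-point FFT.

    Returns (result, max_roundoff), where max_roundoff is the largest
    distance between a convolution output and its nearest integer.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    mask = (1 << bits_per_word) - 1
    words = []
    while x:
        words.append(x & mask)
        x >>= bits_per_word
    n = 1
    while n < 2 * len(words):  # room for the full linear convolution
        n *= 2
    spectrum = fft([complex(w) for w in words] + [0j] * (n - len(words)))
    conv = fft([v * v for v in spectrum], inverse=True)
    result, max_roundoff = 0, 0.0
    for i, value in enumerate(conv):
        real = value.real / n  # the inverse transform above is unnormalized
        nearest = round(real)
        max_roundoff = max(max_roundoff, abs(real - nearest))
        result += nearest << (bits_per_word * i)
    return result, max_roundoff


x = 3 ** 2500 + 7  # arbitrary test number of roughly 4000 bits
small_result, small_err = fft_square(x, 8)   # many small words, long FFT
large_result, large_err = fft_square(x, 20)  # fewer big words, shorter FFT
print("8-bit words:  err =", small_err, "exact:", small_result == x * x)
print("20-bit words: err =", large_err, "exact:", large_result == x * x)
```

Packing more bits into each word shortens the transform but pushes the error toward 0.5, which is the trade-off being tuned with -f. Once the error passes 0.5, rounding lands on the wrong integer and the result is silently wrong.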
Too small an FFT length makes round-off errors,
but too big an FFT length makes unstable results with this version (>1.58?). Narrow launch window, it is stimulating! |