#1904
"Svein Johansen"
May 2013
Norway
3·67 Posts
Quote:
You are testing exponent 57885161, which is a lot bigger than 3276800, so the array and the time to do the FFT are smaller on the smaller exponent. It makes sense.
#1905
"Svein Johansen"
May 2013
Norway
11001001₂ Posts
I tested the 2.05 alpha today, and I found that the FFT test does something with the FFT length which apparently slows down the iterations tremendously.
2.05 alpha with automatically chosen FFT length:
Code:
Iteration 40000 M( 61787581 )C, 0x00000000e3305715, n = 3360K, CUDALucas v2.05 Alpha err = 0.21875 (0:48 real, 4.8966 ms/iter, ETA 83:58:36)
2.05 alpha with the FFT length manually set to what 2.03 uses:
Code:
Iteration 70000 M( 61787581 )C, 0x00000000435d0c3f, n = 3584K, CUDALucas v2.05 Alpha err = 0.04297 (0:45 real, 4.4737 ms/iter, ETA 76:41:13)
That is a 7 h difference between how 2.03 chooses the FFT length and how 2.05 chooses it. Running the same exponent on 2.03 with a manual FFT length of 3584K gives this:
Code:
Iteration 60000 M( 61787581 )C, 0x026b17031f430ab1, n = 3670016, CUDALucas v2.03 err = 0.0469 (0:44 real, 4.4753 ms/iter, ETA 76:43:38)
#1906
Jun 2012
17 Posts
Quote:
On a GTX Titan with the same exponent as yours, for an LL with FFT len = 3072K I get timings of 3.8636 ms per iteration and an ETA for the whole test of 66:26 hours. I compiled CUDALucas 2.05 alpha with CUDA 5.0 and sm_30, with the GPU kernels changed to 256-thread launches. From:
Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 512, 128 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<<n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 128, 128 >>>
(g_x, n, threads, g_data, g_ttp1);
To:
Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<<n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>>
(g_x, n, threads, g_data, g_ttp1);
#1907
"Svein Johansen"
May 2013
Norway
201₁₀ Posts
Quote:
while 2.03 gives me 0.43 ms per iteration.
#1908
Jun 2012
17₁₀ Posts
Quote:
Al
#1909
"Svein Johansen"
May 2013
Norway
3×67 Posts
Quote:
My 2.03 was also compiled on the same machine.
#1910
"Carl Darby"
Oct 2012
Spring Mountains, Nevada
473₈ Posts
Hi Manpowre, prime7989,
If you want to run those kernels with 256 threads per block and you also want correct results, please make the following changes. From:
Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<< n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>>
(g_x, n, threads, g_data, g_ttp1);
To:
Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 1024, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<< n / threads, threads >>>
(g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 255) / 256, 256 >>>
(g_x, n, threads, g_data, g_ttp1);
Last fiddled with by owftheevil on 2013-05-28 at 22:31
#1911
Mar 2010
3×137 Posts
I recall sm_13 being the best target architecture for most GPUs for CuLu, just saying.
Last fiddled with by Karl M Johnson on 2013-05-28 at 23:05 |
#1912
Romulan Interpreter
Jun 2011
Thailand
25C0₁₆ Posts
What are you guys talking about, "0.5" and "0.43" ms/iteration? I only see 4.xxx ms with the VS-compiled version, and 3.8xxx with the original, which is FASTER, not the other way around. What is the problem? I can't understand. For the record, we get about 5.xx ms with our OC'd GTX 580, and even our i7-2600K gets 26 ms/iter/core, which (compounded over all 4 cores, running 4 workers) is equivalent to 6-7 ms/iter. My opinion is still "wait for Maxwell". The performance difference for Titans is not worth the difference in money spent, unless of course you bought one for other purposes (gaming, bitcoins, folding proteins, whatever, I don't know). If your main goal is primes/factoring, then the 570 is still the best buy, followed by the 580.
LaurV grumpy...
#1913
"Mr. Meeseeks"
Jan 2012
California, USA
2³·271 Posts
Never mind, I can't "officially" buy a 570 or 580, I think...
Just out of curiosity, how many GHz-days (yes, TF... wrong thread?) do you/can you get on a 680 per $? Last fiddled with by kracker on 2013-05-29 at 02:46
#1914
"Kieren"
Jul 2011
In My Own Galaxy!
27AE₁₆ Posts
Quote: