[QUOTE=James Heinrich;341532]I'm confused about benchmark timings vs production timings. For example, on my GTX 670 I get this:[code]Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K,
CUDALucas v2.04 Beta err = 0.1076 (1:21 real, 8.0405 ms/iter, ETA 129:15:02)[/code]And indeed, 57885161 * 0.0080405s = 129.28 hours, so I believe the 8ms/it. However, when running a benchmark on 3200K I get this:[code]cudalucas -cufftbench 3276800 3276800 32768 CUFFT bench start = 3276800 end = 3276800 distance = 32768 CUFFT_Z2Z size= 3276800 time= 3.704131 msec[/code]Why do I get 3.7ms on the benchmark but 8.0ms when testing an exponent?[/QUOTE] Reading the CUDALucas code, several things happen in each iteration: the data goes through a transform, a pointwise squaring kernel, a second transform back, and then normalization with carry propagation and round-off checking. The -cufftbench figure times only a single CUFFT_Z2Z transform at that length, so an iteration costs roughly two transform times plus the kernel overhead. Two transforms at 3.7 ms already account for about 7.4 of the observed 8.0 ms, so the numbers make sense.
2.05 alpha and FFT lengths
I tested the 2.05 alpha today, and found that its automatic FFT-length selection apparently slows down the iterations tremendously.
2.05 alpha with automatically chosen FFT length:[code]Iteration 40000 M( 61787581 )C, 0x00000000e3305715, n = 3360K, CUDALucas v2.05 Alpha err = 0.21875 (0:48 real, 4.8966 ms/iter, ETA 83:58:36)[/code]2.05 alpha with the FFT length chosen manually to match what 2.03 uses:[code]Iteration 70000 M( 61787581 )C, 0x00000000435d0c3f, n = 3584K, CUDALucas v2.05 Alpha err = 0.04297 (0:45 real, 4.4737 ms/iter, ETA 76:41:13)[/code]That's a 7-hour difference between how 2.03 chooses the FFT length and how 2.05 chooses it. Running the same exponent on 2.03 with a manual FFT length of 3584K gives this:[code]Iteration 60000 M( 61787581 )C, 0x026b17031f430ab1, n = 3670016, CUDALucas v2.03 err = 0.0469 (0:44 real, 4.4753 ms/iter, ETA 76:43:38)[/code]
[QUOTE=Manpowre;341637]I tested the 2.05 alpha today, and found that its automatic FFT-length selection apparently slows down the iterations tremendously.
2.05 alpha with automatically chosen FFT length:[code]Iteration 40000 M( 61787581 )C, 0x00000000e3305715, n = 3360K, CUDALucas v2.05 Alpha err = 0.21875 (0:48 real, 4.8966 ms/iter, ETA 83:58:36)[/code]2.05 alpha with the FFT length chosen manually to match what 2.03 uses:[code]Iteration 70000 M( 61787581 )C, 0x00000000435d0c3f, n = 3584K, CUDALucas v2.05 Alpha err = 0.04297 (0:45 real, 4.4737 ms/iter, ETA 76:41:13)[/code]That's a 7-hour difference between how 2.03 chooses the FFT length and how 2.05 chooses it. Running the same exponent on 2.03 with a manual FFT length of 3584K gives this:[code]Iteration 60000 M( 61787581 )C, 0x026b17031f430ab1, n = 3670016, CUDALucas v2.03 err = 0.0469 (0:44 real, 4.4753 ms/iter, ETA 76:43:38)[/code][/QUOTE] Which GPU do you get these timings on? On a GTX Titan with the same exponent as yours, for an LL test with FFT length 3072K, I get timings of 3.8636 ms per iteration and an ETA of 66:26 hours. I compiled CUDALucas 2.05 alpha with CUDA 5.0 and sm_30, with the kernel launches changed to 256-thread blocks. From:[CODE]cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 512, 128 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<< n / threads, threads >>> (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 128, 128 >>> (g_x, n, threads, g_data, g_ttp1);[/CODE]to:[CODE]cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<< n / threads, threads >>> (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>> (g_x, n, threads, g_data, g_ttp1);[/CODE]
[QUOTE=prime7989;341749]Which GPU do you get these timings on?
On a GTX Titan with the same exponent as yours, for an LL test with FFT length 3072K, I get timings of 3.8636 ms per iteration and an ETA of 66:26 hours.[/QUOTE] It's a GTX Titan too. I made the same change you did, and still get 0.5ms per iteration (84h) with the same FFT length you ran your test with, while 2.03 gives me 0.43ms per iteration.
For the GTX Titan, change the nvcc option in the Makefile to sm_30
[QUOTE=Manpowre;341794]It's a GTX Titan too. I made the same change you did, and still get 0.5ms per iteration (84h) with the same FFT length you ran your test with,
while 2.03 gives me 0.43ms per iteration.[/QUOTE] For the GTX Titan, change the nvcc option in the Makefile to sm_30 instead of sm_13 or sm_35. Do you get timings similar to mine? Al
[QUOTE=prime7989;341795]For the GTX Titan, change the nvcc option in the Makefile to sm_30 instead of sm_13 or sm_35. Do you get timings similar to mine?
Al[/QUOTE] I didn't compile it with the Makefile; I made a VS2010 project and mapped in all the .h files. I also tried sm_30 and compute_30, with the same result. Strange. My 2.03 was also compiled on the same machine.
Hi Manpowre, prime7989,
If you want to run those kernels with 256 threads per block and also get correct results, please make the following changes. From:[CODE]cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<< n / threads, threads >>> (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>> (g_x, n, threads, g_data, g_ttp1);[/CODE]to:[CODE]cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 1024, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
normalize_kernel <<< n / threads, threads >>> (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 255) / 256, 256 >>> (g_x, n, threads, g_data, g_ttp1);[/CODE]When you double the threads per block, the grid size has to halve to keep the total thread count the same, and the ceil-divide for 256-thread blocks needs + 255 rather than + 127 so the last partial block isn't dropped. The timings are somewhat meaningless otherwise.
I recall sm_13 being the best target architecture for most GPUs for CuLu, just saying.
What are you guys talking about with "0.5" and "0.43" ms/iteration? I only see 4.xxx ms with the VS-compiled version, and 3.8xxx with the original, which is FASTER, and [B]not the other way around[/B]. What is the problem? I can't understand. For the record, we get about 5.xx ms with our OC'd GTX 580, and even our i7-2600K gets 26 ms/iter/core, which (compounded over all 4 cores, running 4 workers) is equivalent to 6-7 ms/iter. My opinion is still "wait for Maxwell". The performance difference on the Titans isn't worth the difference in money spent, unless of course you bought one for other purposes (gaming, bitcoins, folding proteins, whatever, I don't know). If your main goal is primes/factoring, then the 570 is still the best buy, followed by the 580.
LaurV grumpy...
[QUOTE=LaurV;341829]then the 570 is still the best buy, followed by the 580.
LaurV grumpy...[/QUOTE] Never mind, I can't "officially" buy a 570 or 580, I think... Just out of curiosity, how many GHz-days (yes, TF; wrong thread?) do you/can you get on a 680 per $?
[QUOTE]If your main goal is primes/factoring, then the 570 is still the best buy, followed by the 580.
LaurV grumpy...[/QUOTE] Very grumpy here, too, that Gigabyte is trying to foist a 660 Ti on me as a "replacement" for my 570.