Old 2013-05-26, 20:05   #1904
Manpowre ("Svein Johansen")

Quote:
Originally Posted by James Heinrich
I'm confused about benchmark timings vs production timings. For example, on my GTX 670 I get this:
Code:
Iteration 10000 M( 57885161 )C, 0x76c27556683cd84d, n = 3200K, CUDALucas v2.04 Beta err = 0.1076 (1:21 real, 8.0405 ms/iter, ETA 129:15:02)
And indeed, 57885161 * 0.0080405s = 129.28 hours, so I believe the 8ms/it.

However, when running a benchmark on 3200K I get this:
Code:
cudalucas -cufftbench 3276800 3276800 32768
CUFFT bench start = 3276800 end = 3276800 distance = 32768
CUFFT_Z2Z size= 3276800 time= 3.704131 msec
Why do I get 3.7ms on the benchmark but 8.0ms when testing an exponent?
When I read the CUDALucas code, there are several things happening in each iteration: first a memcopy to the GPU, then the FFT work on the copied arrays, then normalizing the array and rounding. The benchmark only times the cuFFT call itself.

You are testing exponent 57885161, which is a lot bigger than 3276800, so the array, and therefore the time to do the FFT, is larger than for a smaller exponent. So it makes sense.
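For what it's worth, James's ETA arithmetic also checks out; here is a minimal back-of-envelope sketch (plain C, illustrative only, not CUDALucas code) using the numbers quoted above:

Code:
/* Illustrative only: an LL test of M(p) needs about p iterations
 * (p - 2, to be exact), so the ETA is just the iteration count
 * times the measured per-iteration time. Each iteration is two
 * transforms plus pointwise and normalization kernels, which is
 * why it costs more than the one bare cuFFT that -cufftbench times. */
#include <stdio.h>

int main(void)
{
    double ms_per_iter = 8.0405;      /* production timing quoted above  */
    double iterations  = 57885161.0;  /* exponent p ~ iteration count    */
    double hours = iterations * ms_per_iter / 1000.0 / 3600.0;
    printf("ETA: %.2f hours\n", hours);  /* ~129.28, matching 129:15:02 */
    return 0;
}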
Old 2013-05-26, 21:46   #1905
Manpowre ("Svein Johansen")

2.05 alpha and FFT lengths

I tested the 2.05 alpha today, and I found that the automatic FFT selection picks an FFT length that apparently slows down the iterations tremendously.

2.05 alpha with automatically chosen FFT length:
Iteration 40000 M( 61787581 )C, 0x00000000e3305715, n = 3360K, CUDALucas v2.05 Alpha err = 0.21875 (0:48 real, 4.8966 ms/iter, ETA 83:58:36)

2.05 alpha with manually chosen FFT length equal to what 2.03 uses:
Iteration 70000 M( 61787581 )C, 0x00000000435d0c3f, n = 3584K, CUDALucas v2.05 Alpha err = 0.04297 (0:45 real, 4.4737 ms/iter, ETA 76:41:13)

That's a 7 h difference between how 2.03 chooses the FFT length and how 2.05 chooses it.

Running the same exponent on 2.03 with a manual FFT length of 3584K gives this:
Iteration 60000 M( 61787581 )C, 0x026b17031f430ab1, n = 3670016, CUDALucas v2.03 err = 0.0469 (0:44 real, 4.4753 ms/iter, ETA 76:43:38)
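The 7 h figure follows directly from the two timings above; a quick check (plain C, illustrative only):

Code:
/* Illustrative only: the ETA gap between the auto-chosen 3360K FFT
 * and the manually forced 3584K FFT at this exponent. */
#include <stdio.h>

int main(void)
{
    double p     = 61787581.0;       /* exponent under test          */
    double delta = 4.8966 - 4.4737;  /* ms/iter, 2.05 auto vs manual */
    printf("extra: %.2f hours\n", p * delta / 1000.0 / 3600.0);
    /* prints ~7.26, consistent with the 83:58:36 vs 76:41:13 ETAs */
    return 0;
}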
Old 2013-05-28, 09:02   #1906
prime7989

Quote:
Originally Posted by Manpowre
I tested the 2.05 alpha today, and I found that the automatic FFT selection picks an FFT length that apparently slows down the iterations tremendously.

2.05 alpha with automatically chosen FFT length:
Iteration 40000 M( 61787581 )C, 0x00000000e3305715, n = 3360K, CUDALucas v2.05 Alpha err = 0.21875 (0:48 real, 4.8966 ms/iter, ETA 83:58:36)

2.05 alpha with manually chosen FFT length equal to what 2.03 uses:
Iteration 70000 M( 61787581 )C, 0x00000000435d0c3f, n = 3584K, CUDALucas v2.05 Alpha err = 0.04297 (0:45 real, 4.4737 ms/iter, ETA 76:41:13)

That's a 7 h difference between how 2.03 chooses the FFT length and how 2.05 chooses it.

Running the same exponent on 2.03 with a manual FFT length of 3584K gives this:
Iteration 60000 M( 61787581 )C, 0x026b17031f430ab1, n = 3670016, CUDALucas v2.03 err = 0.0469 (0:44 real, 4.4753 ms/iter, ETA 76:43:38)
Which GPU do you get these timings on?
On a GTX Titan with the same exponent as yours, for an LL test with FFT length 3072K, I get timings of 3.8636 ms per iteration and a total ETA of 66:26 hours.
I compiled CUDALucas 2.05 alpha with CUDA 5.0 and sm_30, and changed the GPU kernel launches to use 256 threads per block, i.e. /256 and 256, as shown below.

From:

Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 512, 128 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));

normalize_kernel <<< n / threads, threads >>>
                  (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 128, 128 >>>
                   (g_x, n, threads, g_data, g_ttp1);
TO:

Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));

normalize_kernel <<< n / threads, threads >>>
                  (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>>
                   (g_x, n, threads, g_data, g_ttp1);
Old 2013-05-28, 20:21   #1907
Manpowre ("Svein Johansen")

Quote:
Originally Posted by prime7989
Which GPU do you get these timings on?
On a GTX Titan with the same exponent as yours, for an LL test with FFT length 3072K, I get timings of 3.8636 ms per iteration and a total ETA of 66:26 hours.
It's a GTX Titan too. I made the same change as you did, and I still get 0.5ms per iteration (84h) with the same FFT length you ran your test with,

while 2.03 gives me 0.43ms per iteration.
Old 2013-05-28, 20:48   #1908
prime7989

For GTX Titan, in the Makefile change the nvcc option to sm_30

Quote:
Originally Posted by Manpowre
It's a GTX Titan too. I made the same change as you did, and I still get 0.5ms per iteration (84h) with the same FFT length you ran your test with, while 2.03 gives me 0.43ms per iteration.
For the GTX Titan, change the nvcc option in the Makefile to sm_30 instead of sm_13 or sm_35. Tell me if you get similar timings to mine?
Al
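For reference, the Makefile change boils down to the architecture flag handed to nvcc. An illustrative invocation only: the source file name and the -lcufft link step here are assumptions, check your own tree (and note the Titan's GK110 is natively compute capability 3.5, sm_30 is what's being suggested above):

Code:
# Illustrative only; file names are assumed, not taken from the real Makefile.
nvcc -O2 -gencode arch=compute_30,code=sm_30 -o CUDALucas CUDALucas.cu -lcufft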
Old 2013-05-28, 21:43   #1909
Manpowre ("Svein Johansen")

Quote:
Originally Posted by prime7989
For the GTX Titan, change the nvcc option in the Makefile to sm_30 instead of sm_13 or sm_35. Tell me if you get similar timings to mine?
Al
I didn't compile it with the Makefile; I made a VS2010 project and mapped in all the .h files and everything. I also tried sm_30, and compute_30. Same result. Strange.
My 2.03 was also compiled on the same machine.
Old 2013-05-28, 22:30   #1910
owftheevil ("Carl Darby")

Hi Manpowre, prime7989,

If you want to run those kernels with 256 threads per block and you also want correct results, please make the following changes:

From:

Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 256, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));

normalize_kernel <<< n / threads, threads >>>
                  (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 127) / 256, 256 >>>
                   (g_x, n, threads, g_data, g_ttp1);
To:

Code:
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));
rftfsub_kernel <<< n / 1024, 256 >>> (n, g_x, g_ct);
cufftSafeCall (cufftExecZ2Z (plan, (cufftDoubleComplex *) g_x, (cufftDoubleComplex *) g_x, CUFFT_INVERSE));

normalize_kernel <<< n / threads, threads >>>
                  (g_x, g_data, g_ttp, g_ttmp, g_numbits, digit, bit, g_err, *maxerr, error_flag || t_f);
normalize2_kernel <<< ((n + threads - 1) / threads + 255) / 256, 256 >>>
                   (g_x, n, threads, g_data, g_ttp1);
The timings are somewhat meaningless otherwise.
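To spell out the arithmetic behind that fix: a launch <<< blocks, threads >>> runs blocks * threads total threads, so you cannot bump the block size from 128 to 256 without also halving the block count. A host-side sketch (plain C, illustrative only, not CUDALucas code):

Code:
#include <stdio.h>

int main(void)
{
    int n = 3584 * 1024;   /* an FFT length from the runs above */
    int threads = 256;

    /* rftfsub_kernel was written to cover n/4 total threads:      */
    printf("%d\n", (n / 512) * 128);   /* original launch: n/4     */
    printf("%d\n", (n / 256) * 256);   /* broken port: n, 4x too many */
    printf("%d\n", (n / 1024) * 256);  /* corrected: n/4 again     */

    /* normalize2_kernel needs ceil(x / block_size) blocks, i.e.   */
    /* (x + block_size - 1) / block_size, so the +127 that paired  */
    /* with /128 must become +255 once the block size is 256.      */
    int x = (n + threads - 1) / threads;
    printf("%d blocks\n", (x + 255) / 256);
    return 0;
}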

Old 2013-05-28, 23:04   #1911
Karl M Johnson

I recall sm_13 being the best target architecture for most GPUs for CuLu, just saying.

Old 2013-05-29, 02:39   #1912
LaurV (Romulan Interpreter)

What are you guys talking about, "0.5" and "0.43" ms/iteration? I only see 4.xxx ms with the VS-compiled version, and 3.8xxx with the original, which is FASTER, and not the other way around. What is the problem? I can't understand it. For the record, we get about 5.xx ms with our OC'd GTX 580, and even our i7-2600K gets 26 ms/iter/core, which (compounded over all 4 cores, running 4 workers) is equivalent to 6-7 ms/iter. My opinion is still "wait for Maxwell". The performance difference of the Titan is not worth the difference in money spent. Of course, that's unless you bought it for other purposes (gaming, bitcoins, folding proteins, whatever, I don't know). If your main goal is primes/factoring, then the 570 is still the best buy, followed by the 580.

LaurV grumpy...
Old 2013-05-29, 02:46   #1913
kracker ("Mr. Meeseeks")

Quote:
Originally Posted by LaurV
then the 570 is still the best buy, followed by the 580.

LaurV grumpy...
Never mind, I can't "officially" buy a 570 or 580 anymore, I think...

Just out of curiosity, how many GHz-days (yes, TF... wrong thread?) do you/can you get on a 680, per $?

Old 2013-05-29, 03:08   #1914
kladner ("Kieren")

Quote:
Originally Posted by LaurV
If your main goal is primes/factoring, then the 570 is still the best buy, followed by the 580.

LaurV grumpy...
Very grumpy here, too, that Gigabyte is trying to foist a 660 Ti on me as a "replacement" for my 570.