mersenneforum.org  

Old 2011-01-10, 06:49   #34
msft
 

v0.13 checks for errors once every 0xff iterations, which is a big jump.
v0.12 checked for errors every iteration and reported them once every 0xff iterations.
Old 2011-01-10, 07:35   #35
em99010pepe
 

Quote:
Originally Posted by Ralf Recker View Post

real 15m36.000s
user 4m27.065s
sys 6m40.837s
What's the meaning of real, user and sys? Does the GPU client also use CPU time?
Old 2011-01-10, 12:35   #36
msft
 

Quote:
Originally Posted by em99010pepe View Post
What's the meaning of real, user and sys? Does the GPU client also use CPU time?
"real" means elapsed time, i.e. wall-clock time, ...
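For readers who want the three numbers side by side: in the output of `time`, "real" is wall-clock time, "user" is CPU time spent in user mode, and "sys" is CPU time spent in the kernel on behalf of the process. The following is only a minimal sketch of how a POSIX program could sample the same three clocks itself; `run_times` and `sample_times` are hypothetical names, not part of llrCUDA.

```c
#define _XOPEN_SOURCE 700   /* exposes clock_gettime and getrusage in strict -std= modes */
#include <sys/resource.h>
#include <sys/time.h>
#include <time.h>

/* Hypothetical helper type: one snapshot of the three clocks that `time` reports. */
typedef struct {
    double real;  /* wall-clock seconds (monotonic clock, arbitrary origin) */
    double user;  /* CPU seconds this process has spent in user mode */
    double sys;   /* CPU seconds the kernel has spent on behalf of this process */
} run_times;

static double timeval_seconds(struct timeval tv)
{
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Take one snapshot; subtract two snapshots to time the work in between. */
run_times sample_times(void)
{
    run_times t;
    struct timespec now;
    struct rusage ru;

    clock_gettime(CLOCK_MONOTONIC, &now);
    getrusage(RUSAGE_SELF, &ru);

    t.real = (double)now.tv_sec + (double)now.tv_nsec / 1e9;
    t.user = timeval_seconds(ru.ru_utime);
    t.sys  = timeval_seconds(ru.ru_stime);
    return t;
}
```

Taking a snapshot before and after a run and subtracting gives the same kind of triple as `time`. In the figures quoted above, user + sys is about 11m08s against 15m36s real, so the process kept roughly 71% of one CPU core busy.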
Old 2011-01-10, 13:42   #37
em99010pepe
 

Quote:
Originally Posted by msft View Post
"real" means elapsed time, i.e. wall-clock time, ...
By wall clock you are assuming that the GPU needs CPU time, correct?
Old 2011-01-10, 13:54   #38
msft
 

Quote:
Originally Posted by em99010pepe View Post
By wall clock you are assuming that the GPU needs CPU time, correct?
The program logic does not need CPU time, but the Nvidia driver currently uses 100% of a CPU core. (von Neumann said, "Electronics need this tube.")
CUDALucas needs only 4% CPU time.
I cannot understand the reason for this.
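One possible explanation, offered only as an assumption since nothing in this thread confirms it, is that the host side waits for the GPU by polling in a loop instead of sleeping until it is woken. The sketch below uses no GPU at all; it only compares how much CPU time a busy-poll and a sleep of the same wall-clock length consume. All function names are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* exposes clock_gettime and nanosleep in strict -std= modes */
#include <time.h>

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Processor time (user + sys) consumed by this process so far. */
static double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Wait by polling the clock: the CPU stays busy for the whole interval. */
double cpu_cost_of_spin(double seconds)
{
    double cpu_start = cpu_seconds();
    double deadline = wall_seconds() + seconds;
    while (wall_seconds() < deadline) {
        /* busy-poll: nothing to do except check the clock again */
    }
    return cpu_seconds() - cpu_start;
}

/* Wait by sleeping: the kernel parks the process and the CPU is free. */
double cpu_cost_of_sleep(double seconds)
{
    double cpu_start = cpu_seconds();
    struct timespec req;
    req.tv_sec = (time_t)seconds;
    req.tv_nsec = (long)((seconds - (double)req.tv_sec) * 1e9);
    nanosleep(&req, NULL);  /* may return early if a signal arrives; ignored here */
    return cpu_seconds() - cpu_start;
}
```

On an otherwise idle machine, `cpu_cost_of_spin(0.1)` should come out close to 0.1 s while `cpu_cost_of_sleep(0.1)` stays near zero, which is the same pattern as a process that shows high user + sys time even though its own logic does little work.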

Last fiddled with by msft on 2011-01-10 at 14:01
Old 2011-01-10, 16:49   #39
Ken_g6
 

Quote:
Originally Posted by msft View Post
v0.13 checks for errors once every 0xff iterations, which is a big jump.
v0.12 checked for errors every iteration and reported them once every 0xff iterations.
I don't believe I changed the rate of error checking. I changed it to check with floats instead of doubles; though that didn't seem to improve the speed much.
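To make "check with floats instead of doubles" concrete, here is a rough host-side sketch. It assumes the check is the usual round-off test, where each value coming out of the inverse transform is compared with its nearest integer and the largest distance is kept. The function names are hypothetical and this is not the code from either version.

```c
#include <stddef.h>

/* Distance from v to its nearest integer (ties round away from zero). */
static double distance_to_integer(double v)
{
    double nearest = (double)(long long)(v + (v >= 0.0 ? 0.5 : -0.5));
    double d = v - nearest;
    return d < 0.0 ? -d : d;
}

/* Largest round-off error, tracked in double precision. */
double max_roundoff_double(const double *x, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double e = distance_to_integer(x[i]);
        if (e > worst) worst = e;
    }
    return worst;
}

/* Same reduction, but each per-element error is narrowed to float first.
 * The error is at most 0.5, so float keeps enough precision for a threshold test. */
float max_roundoff_float(const double *x, size_t n)
{
    float worst = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        float e = (float)distance_to_integer(x[i]);
        if (e > worst) worst = e;
    }
    return worst;
}
```

Both variants do the same number of comparisons and only the width of the running maximum changes, which would be consistent with the observation that the switch did not improve the speed much.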
Old 2011-01-10, 20:42   #40
msft
 

Quote:
Originally Posted by Ken_g6 View Post
I don't believe I changed the rate of error checking. I changed it to check with floats instead of doubles; though that didn't seem to improve the speed much.
I understand. Thank you for everything.
Old 2011-01-11, 00:27   #41
Ken_g6
 

Quote:
Originally Posted by msft View Post
I am concerned about an index conflict with x[].
I cannot understand "wrapindex".
I'm still trying to figure this out. My latest idea was based on the fact that cuda_normalize2_kernel only seems to manipulate x[] in either its own region between x[0] and x[STRIDE] or at wrapindex. Based on that I figured I could have the thread that was supposed to work on the region containing wrapindex work on that other stuff first; then all the threads could go on and do the stuff from cuda_normalize3_kernel.

The only problem is that it doesn't work. So does anybody see anything wrong with this kernel? I checked the NOSMO parameter alone; that works, but any speed improvement is within the margin of error.

Code:
__global__ void cuda_normalize2_kernel(
        double *x,
        int    	N,
        double *g_limitbw,
        double *g_invlimit,
        double *g_carry,
	int 	wrapindex,
	double 	wrapfactor,
	double 	BIGWORD,
	int 	STRIDE,
	double 	*g_hlimit,
	int 	*g_flag,
	int 	NOSMO		// N Over Stride Minus One (N/STRIDE-1).  Divisions are expensive!
)
{
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        register int    j;
        register double xx, zz;
        register double carry = 0.0;
	int flag;
	double hlim, limbw;
	flag = 0;

	// First, let the thread whose section contains wrapindex do the last section's work.
	if((wrapindex < STRIDE && wrapindex >= 0 && threadID==(NOSMO)) ||
			(threadID+STRIDE+STRIDE > wrapindex && threadID+STRIDE <= wrapindex))
	{
		threadID=(NOSMO);
		carry=g_carry[threadID];
		if (carry)
		{
			double carry2 = 0.0;
			j = 0;
			if (wrapindex) carry2 = carry*wrapfactor;
			carry = -carry;
			while ((carry||carry2) && (j < STRIDE))
			{	if (wrapindex && !carry) {		// Skip already normalized words
					j = wrapindex;
				}
				xx = x[IDX(j)] + carry;
				hlim = g_hlimit[IDX(j)];
				if (wrapindex && j==wrapindex) {
					xx += carry2;
					carry2 = 0.0;
				}
				limbw = ((hlim+hlim)*BIGWORD)-BIGWORD;
				zz = (xx+limbw)-limbw;
				carry = zz*g_invlimit[IDX(j)];	// Compute the carry on next word
				xx = xx - zz;	// And the balanced remainder in current word
				x[IDX(j)] = xx;
				if((xx>hlim) || (xx<-hlim)) flag=1;	// Flag words outside the balanced range [-hlim, hlim]
				if (++j == N)
				{

					j = 0;
					if (wrapindex)
						carry2 = carry*wrapfactor;
					carry = -carry;
				}

			}
			if(flag==1)g_flag[threadID]=1;
		}
		// Do no more if this was the threadID it was assigned to.
		if(threadID == blockIdx.x * blockDim.x + threadIdx.x) return;
		// Otherwise reset to work on the originally planned section.
		threadID = blockIdx.x * blockDim.x + threadIdx.x;
		flag = 0;
		carry = 0.0;
	}
	if( threadID<(NOSMO))
	{
		carry=g_carry[threadID];
        	if (carry)
        	{
                	j =  threadID*STRIDE+STRIDE;
                	while ((carry) && (j < (threadID*STRIDE+STRIDE+STRIDE)))
                	{       
				hlim = g_hlimit[IDX(j)];
                        	xx = x[IDX(j)] + carry;
				limbw = ((hlim+hlim)*BIGWORD)-BIGWORD;
                        	zz = (xx+limbw)-limbw;
                        	carry = zz*g_invlimit[IDX(j)]; // Compute the carry on next word
                        	xx = xx - zz;      // And the balanced remainder in current word
                        	x[IDX(j)] = xx;
				if((xx>hlim) || (xx<-hlim)) flag=1;	// Flag words outside the balanced range [-hlim, hlim]
				++j;
                	}
        	}
		if(flag==1)g_flag[threadID]=1;
	}
}
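The section layout described above can be sketched on the host. Reading the loop bounds of the kernel, thread `threadID` propagates its carry into x[threadID*STRIDE+STRIDE] up to (but not including) x[threadID*STRIDE+2*STRIDE], and the last thread (N/STRIDE-1) wraps around to x[0] through x[STRIDE-1]. The helper below, whose name is hypothetical, returns the thread whose section contains a given index under that reading.

```c
#include <assert.h>

/* Which thread's carry section contains `index`?
 * Thread t covers [(t+1)*stride, (t+2)*stride); the last thread,
 * n/stride - 1, wraps around and covers [0, stride). */
int section_owner(int index, int stride, int n)
{
    assert(stride > 0 && n > 0 && n % stride == 0);
    assert(index >= 0 && index < n);

    if (index < stride)
        return n / stride - 1;   /* wrapped section, handled by the last thread */
    return index / stride - 1;   /* every other section belongs to the thread before it */
}
```

If this reading is right, the owner of wrapindex is the thread for which threadID*STRIDE+STRIDE <= wrapindex < threadID*STRIDE+2*STRIDE. The test in the first branch of the kernel adds STRIDE to threadID instead of multiplying, so that condition may be worth checking; this is an inference from the loop bounds and has not been confirmed in the thread.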
Old 2011-01-11, 02:13   #42
msft
 

We can learn from the code.

wrapindex depends on the value of k (in k*2^n+1) and on the FFT length.
Quote:
+ ./llrpi -q39*2^113549+1 -d
FFTLEN = 16384, fftm = 5677.764270, bpw = 6.930865, log2h= 5.285402, Bits per double = 28.547131
wrapindex = 16383, wrapfactor = 50.000000
+ ./llrpi -q39*2^113547+1 -d
FFTLEN = 16384, fftm = 5677.664270, bpw = 6.930743, log2h= 5.285402, Bits per double = 28.546887
wrapindex = 16383, wrapfactor = 50.000000
+ ./llrpi -q3*2^303093+1 -d
FFTLEN = 32768, fftm = 15154.779248, bpw = 9.249743, log2h= 1.584963, Bits per double = 30.084449
wrapindex = 32767, wrapfactor = 128.000000
+ ./llrpi -q27*2^672007+1 -d
FFTLEN = 65536, fftm = 33600.637744, bpw = 10.254101, log2h= 4.754888, Bits per double = 35.863089
wrapindex = 65535, wrapfactor = 160.000000
+ ./llrpi -q3*2^2145353+1 -d
FFTLEN = 262144, fftm = 107267.779248, bpw = 8.183882, log2h= 1.584963, Bits per double = 29.752727
wrapindex = 262143, wrapfactor = 64.000000
+ ./llrpi -q3*2^5082306+1 -d
FFTLEN = 524288, fftm = 254115.429248, bpw = 9.693734, log2h= 1.584963, Bits per double = 33.372431
wrapindex = 524287, wrapfactor = 256.000000
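As a reading aid for the output above: in every sample shown, wrapindex equals FFTLEN-1, log2h matches log2(k), and the printed bpw is consistent with (n + 1 + log2h) / FFTLEN. That relation is inferred from the printed numbers, not taken from the llrpi source, so the sketch below is an assumption. The function names are hypothetical.

```c
/* Bits per FFT word, as the printed values suggest:
 * n is the exponent in k*2^n+1, log2h is the printed log2(k). */
double bits_per_word(long n, double log2h, long fftlen)
{
    return ((double)n + 1.0 + log2h) / (double)fftlen;
}

/* In all the sample runs above, wrapindex is the last FFT word. */
int wrapindex_is_last_word(long wrapindex, long fftlen)
{
    return wrapindex == fftlen - 1;
}
```

For example, `bits_per_word(113549, 5.285402, 16384)` gives about 6.930865, matching the first run. How wrapfactor is derived is not visible from these numbers.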
Old 2011-01-11, 02:24   #43
msft
 

We can merge cuda_normalize2_kernel & cuda_normalize3_kernel.
Code:
__global__ void cuda_normalize2_kernel(
        double *x,
        int     N,
        double *g_limitbw,
        double *g_invlimit,
        double *g_carry,
        int     wrapindex,
        double  wrapfactor,
        double  BIGWORD,
        int     STRIDE,
        double  *g_hlimit,
        int     *g_flag
)
{
        const int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        register int    j;
        register double *px , xx, zz;
        register double carry = 0.0, carry2 = 0.0;
        int flag;
        double hlim, limbw;
        flag = 0;
        if( threadID==(N/STRIDE-1))
        {
                carry=g_carry[threadID];
                if (carry)
                {
                        j = 0;
                        px = x;
                        if (wrapindex) carry2 = carry*wrapfactor;
                        carry = -carry;
                        while ((carry||carry2) && (j < STRIDE))
                        {       if (wrapindex && !carry) {              // Skip already normalized words
                                        j = wrapindex;
                                        px = x + wrapindex;
                                }
                                xx = x[IDX(j)] + carry;
                                hlim = g_hlimit[IDX(j)];
                                if (wrapindex && j==wrapindex) {
                                        xx += carry2;
                                        carry2 = 0.0;
                                }
                                limbw = ((hlim+hlim)*BIGWORD)-BIGWORD;
                                zz = (xx+limbw)-limbw;
                                carry = zz*g_invlimit[IDX(j)];  // Compute the carry on next word
                                xx = xx - zz;   // And the balanced remainder in current word
                                px++;
                                x[IDX(j)] = xx;
                                if((xx>hlim) || (xx<-hlim)) flag=1;     // Flag words outside the balanced range [-hlim, hlim]
                                if (++j == N)
                                {

                                        j = 0;
                                        px = x;
                                        if (wrapindex)
                                                carry2 = carry*wrapfactor;
                                        carry = -carry;
                                }

                        }
                }
                if(flag==1)g_flag[threadID]=1;
        }
        else
        {
                carry=g_carry[threadID];
                if (carry)
                {
                        j =  threadID*STRIDE+STRIDE;
                        px = &x[threadID*STRIDE+STRIDE];
                        while ((carry) && (j < (threadID*STRIDE+STRIDE+STRIDE)))
                        {
                                hlim = g_hlimit[IDX(j)];
                                xx = x[IDX(j)] + carry;
                                limbw = ((hlim+hlim)*BIGWORD)-BIGWORD;
                                zz = (xx+limbw)-limbw;
                                carry = zz*g_invlimit[IDX(j)]; // Compute the carry on next word
                                xx = xx - zz;      // And the balanced remainder in current word
                                px++;
                                x[IDX(j)] = xx;
                                if((xx>hlim) || (xx<-hlim)) flag=1;     // Flag words outside the balanced range [-hlim, hlim]
                                ++j;
                        }
                }
                if(flag==1)g_flag[threadID]=1;
        }
}
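The carry step that both branches of the kernel share (the `limbw` / `zz` lines) can be tried on the host. Adding and then subtracting a large constant makes the floating-point hardware round xx to the nearest multiple of the word limit (2*hlim). That multiple becomes the carry and what is left over is the balanced remainder. The value of BIGWORD is not shown in this thread; 3*2^51 is assumed here because it is a common choice for this trick, and the limit is assumed to be a power of two. `split_word` and `carry_step` are hypothetical names.

```c
/* Assumed value: 3 * 2^51. The thread does not show what BIGWORD is set to. */
#define ASSUMED_BIGWORD 6755399441055744.0

typedef struct {
    double carry;      /* amount to add to the next word, in units of the limit */
    double remainder;  /* balanced value left in the current word */
} carry_step;

/* One normalization step for a word whose limit is 2*hlim (a power of two). */
carry_step split_word(double xx, double hlim, double bigword)
{
    carry_step out;
    double limbw = ((hlim + hlim) * bigword) - bigword;

    /* volatile keeps the compiler from simplifying (xx + limbw) - limbw to xx */
    volatile double shifted = xx + limbw;
    double zz = shifted - limbw;          /* xx rounded to a multiple of 2*hlim */

    out.carry = zz / (hlim + hlim);       /* the kernel multiplies by g_invlimit instead */
    out.remainder = xx - zz;
    return out;
}
```

With hlim = 128 (limit 256), `split_word(1000.0, 128.0, ASSUMED_BIGWORD)` yields a carry of 4 and a remainder of -24, since 1000 = 4*256 - 24. The trick relies on IEEE double arithmetic with round-to-nearest and would break under options such as `-ffast-math`.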
Old 2011-01-11, 05:13   #44
msft
 

This version merges cuda_normalize2_kernel and cuda_normalize3_kernel.
Attached Files
File Type: gz llrCUDA.0.14.tar.gz (95.2 KB, 100 views)

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.