mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2013-01-21, 01:14   #2135
Aramis Wyler
 
Aramis Wyler's Avatar
 
"Bill Staffen"
Jan 2013
Pittsburgh, PA, USA

42410 Posts
Default

I did go back and muck with the gpusieveprimes value, but this time in the range of 50000 to 90000 I could only budge it by a few tenths of a ghz day. Much less volatile than before. 70000 was still the hot spot for me in the 60/61m range.
Aramis Wyler is offline   Reply With Quote
Old 2013-01-21, 18:54   #2136
TheJudger
 
TheJudger's Avatar
 
"Oliver"
Mar 2005
Germany

45716 Posts
Default

Hi,

Quote:
Originally Posted by Aillas View Post
Hi,

could someone please make a version of mfaktc 0.20 for cuda 4.0?

Thanks a lot.
I guess that you have a good reason not upgrading your video driver?
Do you need a Windows or Linux executable?

On Windows it is not funny to switch CUDA toolkit version (at least for me...).
On Linux it is pretty easy:
  • install CUDA toolkit 4.0 to /usr/local/cuda (default path)
  • mv /usr/local/cuda /usr/local/cuda_4.0
  • install CUDA toolkit 4.1 to /usr/local/cuda (default path)
  • mv /usr/local/cuda /usr/local/cuda_4.1
  • ...

And than choose your toolkit version by just executing the following command:
Code:
ln -snf /usr/local/cuda_<version> /usr/local/cuda
I want something similar for Windows! But I'm afraid it is not that simple because of
  • registry
  • integration into MS Visual Studio (from Windows SDK)
  • my lack of Windows experience
  • ...

Oliver
TheJudger is offline   Reply With Quote
Old 2013-01-22, 00:25   #2137
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default

Wow, I wasn't aware George had made a GPU sieve until now. It looks awesome!

Now, I'd like to integrate it into ppsieve/tpsieve. But I'm not sure where to begin. I can't find where George's sieve outputs its "small" primes. Are they a list of 64-bit numbers (which would be best), a list of some-other-bit numbers, or a bitmap that you somehow get numbers out of?


Meanwhile, while perusing the code, I think I've found a speedup potential! In mod_p there's a mul.lo.s32 followed by a sub.s32. Ideally, they could be combined into a mad.lo.s32, except that the second instruction is a sub and not an add. If we combine them anyway, the problem then becomes how to negate r before getting to the add. I think that negating either (but not both of) p or pinv will do the trick. So let's call this new function mod_neg_p.

Edit: to spell out what I'm thinking of:
Code:
__device__ __inline static int mod_neg_p (int x, int p, int pinv)
{
//	int	q, r, a, b;

//	q = __mulhi (x, pinv);		// quotient = x * -inverse_of_p
//	a = x + q * p;			// x mod p (but may be too large by one p)
//	b = a - p;			// x mod p (the alternative return value)
//	asm("slct.s32.s32 %0, %1, %2, %3;" : "=r" (r) : "r" (b) , "r" (a) , "r" (b));

	int	r;
	asm ("mul.hi.s32 %0, %1, %2;\n\t"		//	r = __mulhi (x, pinv);
	     "mad.lo.s32 %1, %0, %3, %1;\n\t"		//	x = r * p + x;
	     "sub.s32 	%0, %1, %3;\n\t"		//	r = x - p;
	     "slct.s32.s32 %0, %0, %1, %0;"		//	r = (r >= 0) ? r : x
	     : "=r" (r), "+r" (x) : "r" (pinv), "r" (p));

#ifdef GWDEBUG
        pinv = -pinv;
	if (pinv != gen_pinv (p))
		printf ("p doesn't match pinv!! p = %d, pinv = %d\n", p, pinv);
	if (r < 0 || r >= p)
		printf ("x mod p out of range!! x = %d, p = %d, pinv = %d, r = %d\n", x, p, pinv, r);
#endif

	return r;
}
Modifying mod_const_p to use mod_neg_p looks pretty easy:

Code:
#define gen_neg_pinv(p)	(1-(0xFFFFFFFF / (p)))
// Inline to calculate x mod p where p is a constant

__device__ __inline static int mod_const_p (int x, int p)
{
	return mod_neg_p (x, p, gen_neg_pinv (p));
}
I'm sure there are other opportunities to use mod_neg_p in the code as well.

Last fiddled with by Ken_g6 on 2013-01-22 at 00:30
Ken_g6 is offline   Reply With Quote
Old 2013-01-22, 00:40   #2138
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

7,537 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
I can't find where George's sieve outputs its "small" primes. Are they a list of 64-bit numbers (which would be best), a list of some-other-bit numbers, or a bitmap that you somehow get numbers out of?
It does not output a list of 64-bit numbers. It outputs a bitmap that the TF kernels must "read" (convert set bits into the factor candidate). Look at the start of each TF kernel to see how the bitmap is read.
Prime95 is offline   Reply With Quote
Old 2013-01-22, 00:42   #2139
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

9,767 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
Wow, I wasn't aware George had made a GPU sieve until now. It looks awesome!
Have you been asleep for the last three quarters?
chalsall is offline   Reply With Quote
Old 2013-01-22, 00:49   #2140
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default

Quote:
Originally Posted by Prime95 View Post
It does not output a list of 64-bit numbers. It outputs a bitmap that the TF kernels must "read" (convert set bits into the factor candidate). Look at the start of each TF kernel to see how the bitmap is read.
I was afraid of that. This makes things awkward as I expect an exact number of primes to test in my sieves. (Proportional to the number of compute units.)

Quote:
Originally Posted by chalsall View Post
Have you been asleep for the last three quarters?
Three quarters? Nine months? Is there someplace his sieve is being discussed other than this thread? I don't get over here much. I knew about his trial factoring in CUDA, but I didn't expect a public-domain small-prime sieve to go with it!

I'd looked at the "World's second-dumbest CUDA program" thread, but it didn't have any conclusive results. And I rarely look at this thread - I was lucky to see it as soon as I did!
Ken_g6 is offline   Reply With Quote
Old 2013-01-22, 01:13   #2141
Chuck
 
Chuck's Avatar
 
May 2011
Orange Park, FL

3×5×59 Posts
Default

Quote:
Originally Posted by Aramis Wyler View Post
I did go back and muck with the gpusieveprimes value, but this time in the range of 50000 to 90000 I could only budge it by a few tenths of a ghz day. Much less volatile than before. 70000 was still the hot spot for me in the 60/61m range.
I also switched to 70000; it gives me an additional 1 GHz-d/day.
Chuck is offline   Reply With Quote
Old 2013-01-22, 01:41   #2142
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

7,537 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
Meanwhile, while perusing the code, I think I've found a speedup potential! In mod_p there's a mul.lo.s32 followed by a sub.s32. Ideally, they could be combined into a mad.lo.s32, except that the second instruction is a sub and not an add.
Good catch! I'll put it on my todo list for mmff / mfaktc. Unfortunately, I'm up to my eyeballs in more optimizations for prime95 FFTs (no, don't get excited). It should be possible to convert every mod_p call.
Prime95 is offline   Reply With Quote
Old 2013-01-22, 01:41   #2143
Dubslow
Basketry That Evening!
 
Dubslow's Avatar
 
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88

11100001101012 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
I'd looked at the "World's second-dumbest CUDA program" thread, but it didn't have any conclusive results. And I rarely look at this thread - I was lucky to see it as soon as I did!
Some months ago in that thread, B^2 had been continuously improving a CUDA SoE, with help from axn and rcv; AFAIK, George used that as a launching point for a mfaktc sieve.
Dubslow is offline   Reply With Quote
Old 2013-01-22, 01:50   #2144
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

7,537 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
I was afraid of that. This makes things awkward as I expect an exact number of primes to test in my sieves. (Proportional to the number of compute units.)
Mfaktc used to have the same requirement. It turns out that by processing a big enough chunk of the bitmap (mfaktc does 8KB or 16KB) and then spreading the set bits evenly over 256 threads, there is relatively little wastage of compute resources. In fact, it may be more efficient because memory I/O is reduced since each factor candidate is represented by 1 bit instead of 64 bits.

Last fiddled with by Prime95 on 2013-01-22 at 01:53
Prime95 is offline   Reply With Quote
Old 2013-01-22, 02:19   #2145
LaurV
Romulan Interpreter
 
LaurV's Avatar
 
Jun 2011
Thailand

72·197 Posts
Default

Quote:
Originally Posted by Ken_g6 View Post
Three quarters? Nine months? Is there someplace his sieve is being discussed other than this thread?
It all started with mmff thread, and factoring fermat and double mersenne. The sieve came to mfaktc much later.
LaurV is online now   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
mfakto: an OpenCL program for Mersenne prefactoring Bdot GPU Computing 1676 2021-06-30 21:23
The P-1 factoring CUDA program firejuggler GPU Computing 753 2020-12-12 18:07
gr-mfaktc: a CUDA program for generalized repunits prefactoring MrRepunit GPU Computing 32 2020-11-11 19:56
mfaktc 0.21 - CUDA runtime wrong keisentraut Software 2 2020-08-18 07:03
World's second-dumbest CUDA program fivemack Programming 112 2015-02-12 22:51

All times are UTC. The time now is 07:42.


Mon Aug 2 07:42:12 UTC 2021 up 10 days, 2:11, 0 users, load averages: 1.56, 1.40, 1.38

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.