#2135 |
|
"Bill Staffen"
Jan 2013
Pittsburgh, PA, USA
6508 Posts |
I did go back and muck with the gpusieveprimes value, but this time, in the range of 50000 to 90000, I could only budge throughput by a few tenths of a GHz-day. Much less volatile than before. 70000 was still the hot spot for me in the 60/61M range.
#2136 | |
|
"Oliver"
Mar 2005
Germany
11×101 Posts |
Hi,
Quote:
Do you need a Windows or Linux executable? On Windows it is no fun to switch CUDA toolkit versions (at least for me...). On Linux it is pretty easy:
And then choose your toolkit version by just executing the following command:
Code:
ln -snf /usr/local/cuda_<version> /usr/local/cuda
Oliver |
#2137 |
|
Jan 2005
Caught in a sieve
5·79 Posts |
Wow, I wasn't aware George had made a GPU sieve until now. It looks awesome!
Now, I'd like to integrate it into ppsieve/tpsieve. But I'm not sure where to begin. I can't find where George's sieve outputs its "small" primes. Are they a list of 64-bit numbers (which would be best), a list of some-other-bit numbers, or a bitmap that you somehow get numbers out of?

Meanwhile, while perusing the code, I think I've found a speedup potential! In mod_p there's a mul.lo.s32 followed by a sub.s32. Ideally, they could be combined into a mad.lo.s32, except that the second instruction is a sub and not an add. If we combine them anyway, the problem then becomes how to negate r before getting to the add. I think that negating either (but not both of) p or pinv will do the trick. So let's call this new function mod_neg_p.

Edit: to spell out what I'm thinking of:
Code:
__device__ __inline static int mod_neg_p (int x, int p, int pinv)
{
//  int q, r, a, b;
//  q = __mulhi (x, pinv);      // quotient = x * -inverse_of_p
//  a = x + q * p;              // x mod p (but may be too large by one p)
//  b = a - p;                  // x mod p (the alternative return value)
//  asm("slct.s32.s32 %0, %1, %2, %3;" : "=r" (r) : "r" (b) , "r" (a) , "r" (b));
    int r;
    asm ("mul.hi.s32 %0, %1, %2;\n\t"       // r = __mulhi (x, pinv);
         "mad.lo.s32 %1, %0, %3, %1;\n\t"   // x = r * p + x;
         "sub.s32 %0, %1, %3;\n\t"          // r = x - p;
         "slct.s32.s32 %0, %0, %1, %0;"     // r = (r >= 0) ? r : x
         : "=r" (r), "+r" (x) : "r" (pinv), "r" (p));
#ifdef GWDEBUG
    pinv = -pinv;
    if (pinv != gen_pinv (p))
        printf ("p doesn't match pinv!! p = %d, pinv = %d\n", p, pinv);
    if (r < 0 || r >= p)
        printf ("x mod p out of range!! x = %d, p = %d, pinv = %d, r = %d\n", x, p, pinv, r);
#endif
    return r;
}
Code:
#define gen_neg_pinv(p) (1-(0xFFFFFFFF / (p)))
// Inline to calculate x mod p where p is a constant
__device__ __inline static int mod_const_p (int x, int p)
{
    return mod_neg_p (x, p, gen_neg_pinv (p));
}
Last fiddled with by Ken_g6 on 2013-01-22 at 00:30 |
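To illustrate the multiply-then-subtract versus multiply-add idea from the post above, here is a minimal host-side C sketch. It is not the mfaktc code: it uses unsigned arithmetic, assumes a reciprocal of floor((2^32 - 1) / p), and negates p rather than pinv, because then x - q*p and x + q*(-p) are the same value modulo 2^32. The names make_pinv, mulhi_u32, mod_sub, mod_mad and count_mismatches are hypothetical. Whether negating pinv instead is safe depends on how the signed high multiply rounds, which this sketch does not check.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed reciprocal for this sketch: floor((2^32 - 1) / p). */
static uint32_t make_pinv(uint32_t p)
{
    assert(p != 0);
    return UINT32_MAX / p;
}

/* High 32 bits of the 64-bit product, like an unsigned mul.hi. */
static uint32_t mulhi_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Baseline: multiply, subtract, then one conditional correction. */
uint32_t mod_sub(uint32_t x, uint32_t p, uint32_t pinv)
{
    uint32_t q = mulhi_u32(x, pinv); /* x / p, exact or one too small */
    uint32_t a = x - q * p;          /* lies in [0, 2p) */
    return (a >= p) ? a - p : a;
}

/* Variant: the caller passes neg_p = 0 - p, so the multiply and the
   subtract collapse into a single multiply-add. */
uint32_t mod_mad(uint32_t x, uint32_t neg_p, uint32_t pinv)
{
    uint32_t q = mulhi_u32(x, pinv);
    uint32_t a = q * neg_p + x;      /* same value as x - q * p mod 2^32 */
    uint32_t b = a + neg_p;          /* a - p, wraps around when a < p */
    return (b < a) ? b : a;          /* b < a exactly when a >= p */
}

/* Count inputs in [0, limit) where either variant differs from x % p. */
uint32_t count_mismatches(uint32_t p, uint32_t limit)
{
    uint32_t pinv = make_pinv(p);
    uint32_t neg_p = 0u - p;
    uint32_t bad = 0;
    for (uint32_t x = 0; x < limit; x++) {
        uint32_t want = x % p;
        if (mod_sub(x, p, pinv) != want || mod_mad(x, neg_p, pinv) != want)
            bad++;
    }
    return bad;
}
```

Calling count_mismatches(70001, 1000000) is a quick way to compare both variants against the plain % operator. On a GPU the interesting part is only the line computing a, which is where a mad.lo would replace the mul.lo and sub pair.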
#2138 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts |
It does not output a list of 64-bit numbers. It outputs a bitmap that the TF kernels must "read" (convert set bits into the factor candidate). Look at the start of each TF kernel to see how the bitmap is read.
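As a rough picture of what "reading" such a bitmap means, here is a small host-side C sketch that turns set bits into k values. The layout is an assumption made for illustration only: bit i is taken to stand for k = k_base + i * stride, and the names bitmap_to_k, k_base, stride and out_cap are hypothetical. The real mapping is whatever the start of each TF kernel does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walk the bitmap word by word and write one k value per set bit.
   Assumed layout: bit i (counting from the low bit of word 0)
   represents k = k_base + i * stride.
   Returns the number of values written, at most out_cap. */
size_t bitmap_to_k(const uint32_t *bitmap, size_t nwords,
                   uint64_t k_base, uint64_t stride,
                   uint64_t *out, size_t out_cap)
{
    assert(bitmap != NULL && out != NULL);
    size_t n = 0;
    for (size_t w = 0; w < nwords; w++) {
        uint32_t bits = bitmap[w];
        while (bits != 0 && n < out_cap) {
            unsigned b = 0;
            while (((bits >> b) & 1u) == 0)  /* position of lowest set bit */
                b++;
            out[n++] = k_base + (uint64_t)(w * 32u + b) * stride;
            bits &= bits - 1;                /* clear that bit */
        }
    }
    return n;
}
```

With a bitmap of {0x5, 0x80000000}, k_base = 100 and stride = 10, the set bits are at positions 0, 2 and 63, so the function writes 100, 120 and 730.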
#2139 |
|
If I May
"Chris Halsall"
Sep 2002
Barbados
9,767 Posts |
#2140 | |
|
Jan 2005
Caught in a sieve
5×79 Posts |
Quote:
Three quarters? Nine months? Is there someplace his sieve is being discussed other than this thread? I don't get over here much. I knew about his trial factoring in CUDA, but I didn't expect a public-domain small-prime sieve to go with it! I'd looked at the "World's second-dumbest CUDA program" thread, but it didn't have any conclusive results. And I rarely look at this thread - I was lucky to see it as soon as I did! |
#2141 |
|
May 2011
Orange Park, FL
1101110101₂ Posts |
I also switched to 70000; it gives me an additional 1 GHz-d/day.
#2142 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts |
Good catch! I'll put it on my todo list for mmff / mfaktc. Unfortunately, I'm up to my eyeballs in more optimizations for prime95 FFTs (no, don't get excited). It should be possible to convert every mod_p call.
#2143 |
|
Basketry That Evening!
"Bunslow the Bold"
Jun 2011
40<A<43 -89<O<-88
7221₁₀ Posts |
Some months ago in that thread, B^2 had been continuously improving a CUDA SoE, with help from axn and rcv; AFAIK, George used that as a launching point for an mfaktc sieve.
#2144 |
|
P90 years forever!
Aug 2002
Yeehaw, FL
7,537 Posts |
Mfaktc used to have the same requirement. It turns out that by processing a big enough chunk of the bitmap (mfaktc does 8KB or 16KB) and then spreading the set bits evenly over 256 threads, there is relatively little wastage of compute resources. In fact, it may be more efficient because memory I/O is reduced since each factor candidate is represented by 1 bit instead of 64 bits.
Last fiddled with by Prime95 on 2013-01-22 at 01:53 |
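The "spread the set bits evenly over 256 threads" idea can be sketched on the host in C as follows. This is only one possible scheme, assumed for illustration: set bits are dealt out round-robin in bitmap order, so no thread gets more than one candidate more than any other. The names deal_set_bits, per_thread and bytes_as_u64_list are hypothetical and do not come from mfaktc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_THREADS 256u  /* thread count quoted in the post above */

/* Count how many set bits each thread would receive if the set bits
   are dealt out round-robin in bitmap order.
   Returns the total number of set bits. */
size_t deal_set_bits(const uint32_t *bitmap, size_t nwords,
                     size_t per_thread[NUM_THREADS])
{
    assert(bitmap != NULL && per_thread != NULL);
    for (size_t t = 0; t < NUM_THREADS; t++)
        per_thread[t] = 0;
    size_t rank = 0;  /* index of the current set bit among all set bits */
    for (size_t w = 0; w < nwords; w++) {
        /* bits &= bits - 1 clears the lowest set bit on each pass */
        for (uint32_t bits = bitmap[w]; bits != 0; bits &= bits - 1) {
            per_thread[rank % NUM_THREADS]++;
            rank++;
        }
    }
    return rank;
}

/* Bytes needed to pass the same candidates as explicit 64-bit values. */
size_t bytes_as_u64_list(size_t candidates)
{
    return candidates * sizeof(uint64_t);
}
```

On the memory side, an 8 KB chunk holds 65536 bits. If every bit were set, the same candidates written out as 64-bit numbers would take 512 KB, which is the 1 bit versus 64 bits ratio mentioned in the post.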
#2145 | |
|
Romulan Interpreter
Jun 2011
Thailand
22665₈ Posts |
Quote: