mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GPU Computing (https://www.mersenneforum.org/forumdisplay.php?f=92)
-   -   mfaktc: a CUDA program for Mersenne prefactoring (https://www.mersenneforum.org/showthread.php?t=12827)

Aramis Wyler 2013-01-21 01:14

I did go back and muck with the gpusieveprimes value, but this time in the range of 50000 to 90000 I could only budge it by a few tenths of a ghz day. Much less volatile than before. 70000 was still the hot spot for me in the 60/61m range.

TheJudger 2013-01-21 18:54

Hi,

[QUOTE=Aillas;325298]Hi,

could someone please make a version of mfaktc 0.20 for cuda 4.0?

Thanks a lot.[/QUOTE]

I [B]guess[/B] that you have a good reason not upgrading your video driver?
Do you need a Windows or Linux executable?

On Windows it is not funny to switch CUDA toolkit version (at least for me...).
On Linux it is pretty easy:[LIST][*]install CUDA toolkit 4.0 to /usr/local/cuda (default path)[*]mv /usr/local/cuda /usr/local/cuda_4.0[*]install CUDA toolkit 4.1 to /usr/local/cuda (default path)[*]mv /usr/local/cuda /usr/local/cuda_4.1[*]...[/LIST]
And than choose your toolkit version by just executing the following command: [CODE]ln -snf /usr/local/cuda_<version> /usr/local/cuda[/CODE]

I want something similar for Windows! But I'm afraid it is not that simple because of[LIST][*][B]registry[/B][*]integration into MS Visual Studio (from Windows SDK)[*]my lack of Windows experience[*]...[/LIST]
Oliver

Ken_g6 2013-01-22 00:25

Wow, I wasn't aware George had made a GPU sieve until now. It looks awesome! :w00t:

Now, I'd like to integrate it into ppsieve/tpsieve. But I'm not sure where to begin. I can't find where George's sieve outputs its "small" primes. Are they a list of 64-bit numbers (which would be best), a list of some-other-bit numbers, or a bitmap that you somehow get numbers out of?


Meanwhile, while perusing the code, I think I've found a speedup potential! In mod_p there's a mul.lo.s32 followed by a sub.s32. Ideally, they could be combined into a mad.lo.s32, except that the second instruction is a sub and not an add. If we combine them anyway, the problem then becomes how to negate r before getting to the add. I [i]think[/i] that negating either (but not both of) p or pinv will do the trick. So let's call this new function mod_neg_p.

Edit: to spell out what I'm thinking of:
[code]
__device__ __inline static int mod_neg_p (int x, int p, int pinv)
{
// int q, r, a, b;

// q = __mulhi (x, pinv); // quotient = x * -inverse_of_p
// a = x + q * p; // x mod p (but may be too large by one p)
// b = a - p; // x mod p (the alternative return value)
// asm("slct.s32.s32 %0, %1, %2, %3;" : "=r" (r) : "r" (b) , "r" (a) , "r" (b));

int r;
asm ("mul.hi.s32 %0, %1, %2;\n\t" // r = __mulhi (x, pinv);
"mad.lo.s32 %1, %0, %3, %1;\n\t" // x = r * p + x;
"sub.s32 %0, %1, %3;\n\t" // r = x - p;
"slct.s32.s32 %0, %0, %1, %0;" // r = (r >= 0) ? r : x
: "=r" (r), "+r" (x) : "r" (pinv), "r" (p));

#ifdef GWDEBUG
pinv = -pinv;
if (pinv != gen_pinv (p))
printf ("p doesn't match pinv!! p = %d, pinv = %d\n", p, pinv);
if (r < 0 || r >= p)
printf ("x mod p out of range!! x = %d, p = %d, pinv = %d, r = %d\n", x, p, pinv, r);
#endif

return r;
}
[/code]

Modifying mod_const_p to use mod_neg_p looks pretty easy:

[code]
#define gen_neg_pinv(p) (1-(0xFFFFFFFF / (p)))
// Inline to calculate x mod p where p is a constant

__device__ __inline static int mod_const_p (int x, int p)
{
return mod_neg_p (x, p, gen_neg_pinv (p));
}[/code]
I'm sure there are other opportunities to use mod_neg_p in the code as well.

Prime95 2013-01-22 00:40

[QUOTE=Ken_g6;325406]I can't find where George's sieve outputs its "small" primes. Are they a list of 64-bit numbers (which would be best), a list of some-other-bit numbers, or a bitmap that you somehow get numbers out of?[/QUOTE]

It does not output a list of 64-bit numbers. It outputs a bitmap that the TF kernels must "read" (convert set bits into the factor candidate). Look at the start of each TF kernel to see how the bitmap is read.

chalsall 2013-01-22 00:42

[QUOTE=Ken_g6;325406]Wow, I wasn't aware George had made a GPU sieve until now. It looks awesome! :w00t:[/QUOTE]

Have you been asleep for the last three quarters?

Ken_g6 2013-01-22 00:49

[QUOTE=Prime95;325409]It does not output a list of 64-bit numbers. It outputs a bitmap that the TF kernels must "read" (convert set bits into the factor candidate). Look at the start of each TF kernel to see how the bitmap is read.[/QUOTE]I was afraid of that. This makes things awkward as I expect an exact number of primes to test in my sieves. (Proportional to the number of compute units.)

[QUOTE=chalsall;325410]Have you been asleep for the last three quarters?[/QUOTE]Three quarters? Nine months? Is there someplace his sieve is being discussed other than this thread? I don't get over here much. I knew about his trial factoring in CUDA, but I didn't expect a public-domain small-prime sieve to go with it!

I'd looked at the "World's second-dumbest CUDA program" thread, but it didn't have any conclusive results. And I rarely look at this thread - I was lucky to see it as soon as I did!

Chuck 2013-01-22 01:13

[QUOTE=Aramis Wyler;325327]I did go back and muck with the gpusieveprimes value, but this time in the range of 50000 to 90000 I could only budge it by a few tenths of a ghz day. Much less volatile than before. 70000 was still the hot spot for me in the 60/61m range.[/QUOTE]

I also switched to 70000; it gives me an additional 1 GHz-d/day.

Prime95 2013-01-22 01:41

[QUOTE=Ken_g6;325406]
Meanwhile, while perusing the code, I think I've found a speedup potential! In mod_p there's a mul.lo.s32 followed by a sub.s32. Ideally, they could be combined into a mad.lo.s32, except that the second instruction is a sub and not an add. [/QUOTE]

Good catch! I'll put it on my todo list for mmff / mfaktc. Unfortunately, I'm up to my eyeballs in more optimizations for prime95 FFTs (no, don't get excited). It should be possible to convert every mod_p call.

Dubslow 2013-01-22 01:41

[QUOTE=Ken_g6;325411]
I'd looked at the "World's second-dumbest CUDA program" thread, but it didn't have any conclusive results. And I rarely look at this thread - I was lucky to see it as soon as I did![/QUOTE]

Some months ago in that thread, B^2 had been continuously improving a CUDA SoE, with help from axn and rcv; AFAIK, George used that as a launching point for a mfaktc sieve.

Prime95 2013-01-22 01:50

[QUOTE=Ken_g6;325411]I was afraid of that. This makes things awkward as I expect an exact number of primes to test in my sieves. (Proportional to the number of compute units.)[/QUOTE]

Mfaktc used to have the same requirement. It turns out that by processing a big enough chunk of the bitmap (mfaktc does 8KB or 16KB) and then spreading the set bits evenly over 256 threads, there is relatively little wastage of compute resources. In fact, it may be more efficient because memory I/O is reduced since each factor candidate is represented by 1 bit instead of 64 bits.

LaurV 2013-01-22 02:19

[QUOTE=Ken_g6;325411]Three quarters? Nine months? Is there someplace his sieve is being discussed other than this thread?[/QUOTE]
It all started with [URL="http://www.mersenneforum.org/showthread.php?t=17162"]mmff thread,[/URL] and factoring fermat and double mersenne. The sieve came to mfaktc much later.


All times are UTC. The time now is 23:15.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.