mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   Making a 64-bit fixed N Proth siever (https://www.mersenneforum.org/showthread.php?t=12101)

Ken_g6 2009-06-28 01:11

Making a 64-bit fixed N Proth siever
 
I recently got a new Core 2 machine with 64-bit Linux. I was sieving away with NewPGen the other day when it occurred to me that NewPGen probably isn't optimized for 64-bit Linux. So I started working on my own siever - when the hardware handles 64 bits, how hard could it be? :big grin:

The problems started early: I wrote some code I thought would work and created unit tests to check the functions I had so far. Eight of the nine are failing, but the first failure is surprising. See, I lifted some mulmod code from [URL="http://www.loria.fr/~kruppaal/factorcyc.20090612.c"]a program by some guy named Alex Kruppa[/URL]. (Alex, if you're reading this, I have several more questions I'd like to pester you with. :wink:) Anyway, it failed my unit test!

So, here's Alex's code:
[code]/* Multiply a and b, and reduce the product modulo m. Remainder is
   returned */
static inline unsigned long
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long r;
  __asm__
    ( "mulq %2\n\t"
      "divq %3"
      : "=&d" (r), "+%a" (a)
      : "rm" (b), "rm" (m)
      : "cc"
    );
  return r;
}[/code]

I tested this with the following line:[code]printf("271828182846 * 314159265359 mod 141421356237 == 127891859871...%lu\n", mulmod(271828182846ul, 314159265359ul, 141421356237ul));[/code]
If the test passes, the numbers on either side of "..." should be the same. I compiled with g++ 4.3.3; without optimization it works, but if I compile with any optimization switch, the result is:
[code]271828182846 * 314159265359 mod 141421356237 == 127891859871...95951954835[/code]
gcalctool and dc both say 127891859871 is correct.

Any ideas on how I can compile with optimization switches?

ldesnogu 2009-06-28 11:36

The program works fine with gcc 4.4.0 and fails with 3.4.6 on my machine.
If I prevent inlining from happening with 3.4.6, it works. Perhaps some assembly constraint problem?

akruppa 2009-06-28 12:16

Yes, the commutativity modifier "%" seems to be wrong. It's probably better to do it with two separate asm statements, to simplify the constraints.

[CODE]
/* Multiply a and b, and reduce the product modulo m. Remainder is
   returned */
static inline unsigned long
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long q, r, t1, t2;
  __asm__
    ( "mulq %3"
      : "=a" (t1), "=d" (t2)
      : "%0" (a), "rm" (b)
      : "cc");
  __asm__
    ( "divq %4"
      : "=a" (q), "=d" (r)
      : "0" (t1), "1" (t2), "rm" (m)
      : "cc"
    );
  return r;
}
[/CODE]
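As a sanity check, the asm can be compared against a portable reference built on GCC's unsigned __int128 (a sketch for unit testing, not part of the factorcyc program):

```c
/* Portable reference mulmod for unit testing: let the compiler do the
   128-bit product and the reduction. Assumes a GCC/Clang target that
   provides the unsigned __int128 extension. */
static inline unsigned long
mulmod_ref (unsigned long a, unsigned long b, unsigned long m)
{
  return (unsigned long) (((unsigned __int128) a * b) % m);
}
```

With the test values from this thread, mulmod_ref(271828182846ul, 314159265359ul, 141421356237ul) returns 127891859871, so any optimization-level discrepancy in the asm version shows up immediately.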

Alex

Ken_g6 2009-06-29 00:19

Thanks, Alex, that did the trick!

Since I posted this I discovered your REDC/Montgomery multiplication code. I didn't see a license on that source code, so I hope you don't mind if I use/modify it?

akruppa 2009-06-29 00:54

Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain.

Alex

ET_ 2009-06-29 11:23

[QUOTE=akruppa;179203]Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain.

Alex[/QUOTE]

Let us know where to look for! :geek:

Luigi

akruppa 2009-06-29 12:11

[url]http://www.loria.fr/~kruppaal/factorcyc.20090629.c[/url]

Alex

ET_ 2009-06-29 12:18

[QUOTE=akruppa;179237][url]http://www.loria.fr/~kruppaal/factorcyc.20090629.c[/url]

Alex[/QUOTE]

:bow:

Ken_g6 2009-07-01 20:01

Here's an update on my progress:

I'm focusing on K*2^N+/-1, no other bases or weird forms. At first, I thought I should compute 2^N mod P and then compute the invmod() of that. But when I profiled the slow application that resulted, invmod, not mulmod, was taking the most time!

This led me to understand a cryptic hint in NewPGen's help file:

"NewPGen is much faster if the base is 2. This is because division by 2 modulo a prime is easy (you shift it right if it is even, otherwise you add the prime then shift it). Division by other bases isn't so straightforward, however."

I wasn't sure why I would ever divide by 2 modulo P, until I realized that (2^N)^-1 mod P == (2^-1)^N mod P. That led to this optimized inverse-powmod code, derived from Alex's powmod_REDC; it only works for N >= 128 (N is the value confusingly named b, then t, inside the function):

[code]// Hybrid powmod, sidestepping several loops and possible mispredicts, and with no more than one longmod!
// The other thing about this is that when you cross 2^-1 mod N with Montgomery multiplication,
// 2^-64 * 2^64 (mod N) = 1 (mod N). So that kills one longmod.
// The other longmod is needed, 2^64 >> (b & 0x3F) (mod N).
static inline unsigned long
invpowmod_REDCh (const unsigned long b, const unsigned long N) {
  unsigned long r, t = b, Ns;

  /* We want R*Rs - N*Ns == 1, i.e. Ns = -N^{-1} (mod R) */
  // "s" is "a" in Montgomery representation, i.e. a * 2^64 (mod N)
  // First of all, s = 2^-64 (mod N) * 2^64 (mod N). The numbers cancel, leaving 1!
  unsigned long s = 1ul;

  // Get the last 6 bits of b==t. (Requires b >= 128.)
  unsigned int i = t & 0x3f;
  t >>= 6;

  // If there were low bits in b...
  // (And one wonders about splitting this up into a few different functions, as b is constant...)
  if (i) {
    // r = 2^-i * 2^64 (mod N), something the C compiler can do on its own!
    r = 1ul << (64-i);
    // If i is large, there's a good chance no mod is needed!
    if (r >= N) r %= N;
  } else {
    // r = 2^0 * 2^64 (mod N), which needs longmod. (Or could be done with 2^63 (mod N) * 2 (mod N) if desperate.)
    r = longmod(1ul, 0ul, N);
  }
  // Might as well calculate Ns in any latency overlap.
  Ns = -invmod2pow_ul (N); /* Ns = -N^{-1} % 2^64 */

  for (;;) {
    if (t & 1) {
      r = mulmod_REDC (r, s, N, Ns);
    }
    if ((t >>= 1) == 0) break;
    s = mulmod_REDC (s, s, N, Ns);
  }

  r = mulmod_REDC (r, 1UL, N, Ns); /* Should be only REDC(r, N, Ns) */

  return r;
}[/code]
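The identity itself can be checked with a plain square-and-multiply powmod (a sketch with my own helper name, using __int128 for the products; note that 2^-1 mod P is just (P+1)/2 for odd P):

```c
/* Plain binary powmod for checking: computes b^e mod p.
   Sketch only; unsigned __int128 keeps the squarings from overflowing. */
static unsigned long
powmod (unsigned long b, unsigned long e, unsigned long p)
{
  unsigned __int128 r = 1, s = b % p;
  while (e) {
    if (e & 1)
      r = (r * s) % p;
    s = (s * s) % p;
    e >>= 1;
  }
  return (unsigned long) r;
}
```

For any odd P, powmod(2, N, P) * powmod((P+1)/2, N, P) mod P comes out to 1, which is exactly the (2^N)^-1 == (2^-1)^N relation.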
With that code, my program runs at about 1/10 the speed of NewPGen, for N around 1 million and P around 3.75 trillion. I suspect part of this is because I'm testing all P's not divisible by 2 or 3 (1/3 of all numbers), which was a quick hack for testing. ln(3.75 trillion) is about 30, so if I could find and test only primes, my program would be about as fast as NewPGen.

So I have a couple of questions I hope somebody can answer. First, I can see how to (relatively) easily sieve a list of P's not divisible by anything less than ~120. But that would get me only 1/10 of all numbers, not 1/30. So can anyone suggest an easy and efficient way of finding only prime P's to test?
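For what it's worth, one standard approach (a sketch, not necessarily what NewPGen does) is a segmented sieve of Eratosthenes: compute the primes up to the square root of the range's upper bound once, then cross off their multiples window by window:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Mark the primality of [lo, lo+len) in is_prime[], given a table of
   small primes covering everything up to sqrt(lo + len). Sketch only:
   the small-prime table itself comes from an ordinary sieve run once
   at startup. Assumes sizeof(bool) == 1 for the memset. */
static void
sieve_segment (uint64_t lo, size_t len,
               const uint64_t *small_primes, size_t nprimes,
               bool *is_prime)
{
  memset (is_prime, 1, len);
  for (size_t i = 0; i < nprimes; i++) {
    uint64_t p = small_primes[i];
    uint64_t first = (lo + p - 1) / p * p;  /* first multiple of p >= lo */
    if (first < p * p)
      first = p * p;                        /* don't cross off p itself */
    for (uint64_t m = first; m < lo + len; m += p)
      is_prime[m - lo] = 0;
  }
}
```

Each surviving index is a prime P, so the siever would only call the powmod machinery on the roughly 1-in-30 candidates that matter.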

Second, I expected this 64-bit code to be [i]faster[/i] than NewPGen! (Even though NewPGen uses SSE2, I figured that was just the hard way of doing 64-bit math, because it wasn't written for 64-bit processors.) So does anybody see some other optimization I'm missing? Or maybe even something in Alex's mulmod_REDC code or something?

Ken_g6 2009-07-02 03:30

My profiler said mulmod_REDC was the main slow function. After studying that assembly code for probably too long, I found something! :smile:

The original assembly code starts with:[code] __asm__
( "mulq %2\n\t" /* rdx:rax = T */
"movq %%rax,%%rbx\n\t" /* rcx:rbx = T */
"movq %%rdx,%%rcx\n\t"
"imulq %4\n\t" /* rax = (T mod R) * N', rax = m */
"mulq %3\n\t" /* rdx:rax = m * N */[/code]

That imulq is a full 64->128 bit multiply, but rdx is discarded by the next mulq. So I tried turning it into the two-argument form, "imulq %4, %%rax\n\t". That worked, but for some reason didn't speed up the code at all. Maybe it's a decoding thing; I don't know. But I also noticed that since the two-argument form doesn't touch rdx, the imulq could go before the movq involving rdx. And [i]that[/i] produced a marked speedup, taking my short test from 5 seconds to 4 (according to gprof, anyway). It still won't beat NewPGen - yet - but I think it's better. And it should help you out too, Alex.

Edit: Technically, I don't see much real-time speedup, but those timings tend to vary a lot for me.

Here's the new code down to the second mulq:
[code] __asm__
( "mulq %2\n\t" // rdx:rax = T
"movq %%rax,%%rbx\n\t" // rcx:rbx = T
"imulq %4, %%rax\n\t" // rax = (T mod R) * N', rax = m
"movq %%rdx,%%rcx\n\t"
"mulq %3\n\t" // rdx:rax = m * N [/code]

akruppa 2009-07-02 13:21

I understand why it should be faster on a K8 or K10: integer muls (mul and imul) must execute in pipe 0. The first mul produces two macro-ops which go in pipes 0 and 1, the first mov goes in pipe 2, and the mov from rdx goes in pipe 0 again. That mov and the next mul have the same data dependency [SIZE="1"](EDIT: no they don't, the low word of the first mul should be available one cycle earlier... need to check)[/SIZE], so the mov delays the mul in pipe 0 by 1 cycle. I don't know the Core 2 well enough to tell why it should be faster there.

It's been a while since I wrote that code. I updated with a newer version of a mulredc macro which should be faster. Thanks for the heads-up!

New version [url]http://www.loria.fr/~kruppaal/factorcyc.20090702.c[/url]

Alex

