mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2009-06-28, 01:11   #1
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default Making a 64-bit fixed N Proth siever

I recently got a new Core 2 machine, with 64-bit Linux. I was sieving away with NewPGen the other day when it occurred to me that NewPGen probably isn't optimized for 64-bit Linux. So I started working on my own siever - when the hardware handles 64 bits, how hard could it be?

Well, the problems have started early. I wrote some code that I thought would work, and created some unit tests to check the functions I had so far. Well, 8 out of 9 aren't working; but the first failure is surprising. See, I lifted some mulmod code from a program by some guy named Alex Kruppa. (Alex, if you're reading this, I have several more questions I'd like to pester you with. ) Anyway, it failed my unit test!

So, here's Alex's code:
Code:
/* Multiply a and b, and reduce the product modulo m. Remainder is 
   returned */
static inline unsigned long 
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long r;
  __asm__
  ( "mulq %2\n\t"
    "divq %3"
    : "=&d" (r), "+%a" (a)
    : "rm" (b), "rm" (m)
    : "cc"
  );
  return r;
}
I tested this with the following line:
Code:
printf("271828182846 * 314159265359 mod 141421356237 == 127891859871...%lu\n", mulmod(271828182846ul, 314159265359ul, 141421356237ul));
If the test were to pass, the numbers on either side of "..." should be the same. I compiled with g++ 4.3.3, and if I compile straight, it works. But if I compile with any optimization switch, the result is:
Code:
271828182846 * 314159265359 mod 141421356237 == 127891859871...95951954835
gcalctool and dc both say 127891859871 is correct.

Any ideas on how I can compile with optimization switches?
Ken_g6 is offline   Reply With Quote
Old 2009-06-28, 11:36   #2
ldesnogu
 
ldesnogu's Avatar
 
Jan 2008
France

2×271 Posts
Default

The program works fine with gcc 4.4.0 and fails with 3.4.6 on my machine.
If I prevent inlining from happening with 3.4.6, it works. Perhaps some assembly constraint problem?
ldesnogu is offline   Reply With Quote
Old 2009-06-28, 12:16   #3
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

Yes, the commutativity modifier "%" seems to be wrong. It's probably better to do it with two separate asm statements, to simplify the constraints.

Code:
/* Multiply a and b, and reduce the product modulo m. Remainder is
   returned */
static inline unsigned long
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long q, r, t1, t2;
  __asm__
  ( "mulq %3\n\t"
    : "=a" (t1), "=d" (t2)
    : "%0" (a), "rm" (b)
    : "cc");
  __asm__
  ( "divq %4"
    : "=a" (q), "=d" (r)
    : "0" (t1), "1" (t2), "rm" (m)
    : "cc"
  );
  return r;
}
Alex
akruppa is offline   Reply With Quote
Old 2009-06-29, 00:19   #4
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5·79 Posts
Default

Thanks, Alex, that did the trick!

Since I posted this I discovered your REDC/Montgomery multiplication code. I didn't see a license on that source code, so I hope you don't mind if I use/modify it?
Ken_g6 is offline   Reply With Quote
Old 2009-06-29, 00:54   #5
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain.

Alex
akruppa is offline   Reply With Quote
Old 2009-06-29, 11:23   #6
ET_
Banned
 
ET_'s Avatar
 
"Luigi"
Aug 2002
Team Italia

112708 Posts
Default

Quote:
Originally Posted by akruppa View Post
Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain.

Alex
Let us know where to look for!

Luigi
ET_ is offline   Reply With Quote
Old 2009-06-29, 12:11   #7
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

http://www.loria.fr/~kruppaal/factorcyc.20090629.c

Alex

Last fiddled with by akruppa on 2009-06-29 at 12:12
akruppa is offline   Reply With Quote
Old 2009-06-29, 12:18   #8
ET_
Banned
 
ET_'s Avatar
 
"Luigi"
Aug 2002
Team Italia

23×599 Posts
Default

Quote:
Originally Posted by akruppa View Post
ET_ is offline   Reply With Quote
Old 2009-07-01, 20:01   #9
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

1100010112 Posts
Default

Here's an update on my progress:

I'm focusing on K*2^N+/-1, no other bases or weird forms. At first, I thought I should compute 2^N mod P and then compute the invmod() of that. But when I profiled the slow application that resulted, invmod, not mulmod, was taking the most time!

This led me to understand a cryptic hint in NewPGen's help file:

"NewPGen is much faster if the base is 2. This is because division by 2 modulo a prime is easy (you shift it right if it is even, otherwise you add the prime then shift it). Division by other bases isn't so straightforward, however."

I wasn't sure why I would ever divide by 2 modulo P, until I realized that (2^N)^-1 mod P == (2^-1)^N mod P. So that led to this optimized inverse-powmod code, derived from Alex's powmod_REDC, which only works for N (confusingly named b/t in the function) >= 128:

Code:
// Hybrid powmod, sidestepping several loops and possible mispredicts, and with no more than one longmod!
// The other thing about this is that when you cross 2^-1 mod N with Montgomery multiplication,
// 2^-64 * 2^64 (mod N) = 1 (mod N).  So that kills one longmod.
// The other longmod is needed, 2^64 >> (b & 0x3F) (mod N).
static inline unsigned long 
invpowmod_REDCh (const unsigned long b, const unsigned long N) {
	unsigned long r, t = b, Ns;

	/* We want R*Rs - N*Ns == 1, i.e. Ns = -N^{-1} (mod R) */
	// "s" is "a" in Montgomery representation, i.e.  a * 2^64 (mod N)
	// First of all, s = 2^-64 (mod N) * 2^64 (mod N).  The numbers cancel, leaving 1!
	unsigned long s = 1ul;

	// Get the last 6 bits of b==t.  (Requires b >= 128.)
	unsigned int i = t & 0x3f;
	t >>= 6;

	// If there were low bits in b...
	// (And one wonders about splitting this up into a few different functions, as b is constant...)
	if(i) {
		// r = 2^-i * 2^64 (mod N), something the C compiler can do on its own!
		r = 1ul << (64-i);
		// If i is large, there's a good chance no mod is needed!
		if(r >= N) r %= N;
	} else {
		// r = 2^0 * 2^64 (mod N), which needs longmod. (Or could be done with 2^63 (mod N) * 2 (mod N) if desperate).
		r = longmod(1ul, 0ul, N);
	}
	// Might as well calculate Ns in any latency overlap.
	Ns = -invmod2pow_ul (N); /* Ns = -N^{-1} % 2^64 */

	for(;;) {
		if (t & 1) {
			r = mulmod_REDC (r, s, N, Ns);
		}
		if((t >>= 1) == 0) break;
		s = mulmod_REDC (s, s, N, Ns);
	}

	r = mulmod_REDC (r, 1UL, N, Ns); /* Should be only REDC(r, N, Ns) */

	return r;
}
With that code, my program is running about 1/10 the speed of NewPGen, for N about 1 million, and P around 3.75 trillion. I infer that part of this is because I'm testing all P's that are not divisible by 2 or 3, 1/3 of all numbers, which was a quick hack for testing. ln(3.75 trillion) is about 30, so if I could find and test only primes, my program would be about as fast as NewPGen.

So I have a couple of questions I hope somebody can answer. First, I can see how to (relatively) easily sieve a list of P's not divisible by anything less than ~120. But that would get me only 1/10 of all numbers, not 1/30. So can anyone suggest an easy and efficient way of finding only prime P's to test?

Second, I expected this 64-bit code to be faster than NewPGen! (Even though NewPGen uses SSE2, I figured that was just the hard way of doing 64-bit math, because it wasn't written for 64-bit processors.) So does anybody see some other optimization I'm missing? Or maybe even something in Alex's mulmod_REDC code or something?
Ken_g6 is offline   Reply With Quote
Old 2009-07-02, 03:30   #10
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5·79 Posts
Default

My profiler said mulmod_REDC was the main slow function. After studying that assembly code for probably too long, I found something!

The original assembly code starts with:
Code:
  __asm__
    ( "mulq %2\n\t"            /* rdx:rax = T */
      "movq %%rax,%%rbx\n\t"   /* rcx:rbx = T */
      "movq %%rdx,%%rcx\n\t"
      "imulq %4\n\t"           /* rax = (T mod R) * N', rax = m */
      "mulq %3\n\t"            /* rdx:rax = m * N */
That imulq is a full 64->128 bit multiply, but rdx is discarded by the next mulq. So I tried turning it into the two-argument form, "imulq %4, %%rax\n\t". That worked, but for some reason didn't speed up the code at all. Maybe it's a decoding thing; I don't know. But I also noticed that since the two-argument form doesn't touch rdx, the imulq could go before the movq involving rdx. And that produced a marked speedup, taking my short test from 5 seconds to 4 (according to gprof, anyway). It still won't beat NewPGen - yet - but I think it's better. And it should help you out too, Alex.

Edit: Technically, I don't see much real-time speedup, but those timings tend to vary a lot for me.

Here's the new code down to the second mulq:
Code:
  __asm__
    ( "mulq %2\n\t"            // rdx:rax = T 
      "movq %%rax,%%rbx\n\t"   // rcx:rbx = T 
      "imulq %4, %%rax\n\t"    // rax = (T mod R) * N', rax = m 
      "movq %%rdx,%%rcx\n\t"
      "mulq %3\n\t"            // rdx:rax = m * N

Last fiddled with by Ken_g6 on 2009-07-02 at 03:34 Reason: gprof and time don't agree
Ken_g6 is offline   Reply With Quote
Old 2009-07-02, 13:21   #11
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

I understand why it should be faster on a K8 or K10: integers muls (mul and imul) must execute in pipe 0. The first mul produces two macro-ops which go in pipe 0 and 1, first mov goes in pipe 2, and the mov from rdx goes in pipe 0 again. That mov and the next mul have the same data dependency (EDIT: no they dont, the low word of the first mul should be available one cycle earlier... need to check), so the mov delays the mul in pipe 0 by 1 cycle. I don't know the Core 2 well enough to tell why it should be faster there.

It's been a while since I wrote that code. I updated with a newer version of a mulredc macro which should be faster. Thanks for the heads-up!

New version http://www.loria.fr/~kruppaal/factorcyc.20090702.c

Alex

Last fiddled with by akruppa on 2009-07-02 at 16:27
akruppa is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
fbncsieve - a new fixed n sieve rogue Software 38 2018-02-11 00:08
A siever for K (b, n, c fixed)? pepi37 Software 7 2015-07-10 04:42
Sieving k*2^n-1 With Fixed n c10ck3r Riesel Prime Search 14 2013-02-03 00:19
User interface bug fixed on LLR V3.8.4 Jean Penné Software 0 2011-01-22 16:47
KEP is reporting computer fixed KEP Twin Prime Search 3 2007-02-13 18:29

All times are UTC. The time now is 18:43.

Thu Jan 21 18:43:12 UTC 2021 up 49 days, 14:54, 1 user, load averages: 2.54, 2.24, 2.00

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.