20090628, 01:11  #1 
Jan 2005
Caught in a sieve
5×79 Posts 
Making a 64bit fixed N Proth siever
I recently got a new Core 2 machine, with 64bit Linux. I was sieving away with NewPGen the other day when it occurred to me that NewPGen probably isn't optimized for 64bit Linux. So I started working on my own siever  when the hardware handles 64 bits, how hard could it be?
Well, the problems have started early. I wrote some code that I thought would work, and created some unit tests to check the functions I had so far. Well, 8 out of 9 aren't working; but the first failure is surprising. See, I lifted some mulmod code from a program by some guy named Alex Kruppa. (Alex, if you're reading this, I have several more questions I'd like to pester you with. ) Anyway, it failed my unit test! So, here's Alex's code: Code:
/* Multiply a and b, and reduce the product modulo m. Remainder is returned */ static inline unsigned long mulmod (unsigned long a, const unsigned long b, const unsigned long m) { unsigned long r; __asm__ ( "mulq %2\n\t" "divq %3" : "=&d" (r), "+%a" (a) : "rm" (b), "rm" (m) : "cc" ); return r; } Code:
printf("271828182846 * 314159265359 mod 141421356237 == 127891859871...%lu\n", mulmod(271828182846ul, 314159265359ul, 141421356237ul)); Code:
271828182846 * 314159265359 mod 141421356237 == 127891859871...95951954835 Any ideas on how I can compile with optimization switches? 
20090628, 11:36  #2 
Jan 2008
France
1000100010_{2} Posts 
The program works fine with gcc 4.4.0 and fails with 3.4.6 on my machine.
If I prevent inlining from happening with 3.4.6, it works. Perhaps some assembly constraint problem? 
20090628, 12:16  #3 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
Yes, the commutativity modifier "%" seems to be wrong. It's probably better to do it with two separate asm statements, to simplify the constraints.
Code:
/* Multiply a and b, and reduce the product modulo m. Remainder is returned */ static inline unsigned long mulmod (unsigned long a, const unsigned long b, const unsigned long m) { unsigned long q, r, t1, t2; __asm__ ( "mulq %3\n\t" : "=a" (t1), "=d" (t2) : "%0" (a), "rm" (b) : "cc"); __asm__ ( "divq %4" : "=a" (q), "=d" (r) : "0" (t1), "1" (t2), "rm" (m) : "cc" ); return r; } 
20090629, 00:19  #4 
Jan 2005
Caught in a sieve
5·79 Posts 
Thanks, Alex, that did the trick!
Since I posted this I discovered your REDC/Montgomery multiplication code. I didn't see a license on that source code, so I hope you don't mind if I use/modify it? 
20090629, 00:54  #5 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain.
Alex 
20090629, 11:23  #6 
Banned
"Luigi"
Aug 2002
Team Italia
4,813 Posts 

20090629, 12:11  #7 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
Last fiddled with by akruppa on 20090629 at 12:12 
20090629, 12:18  #8  
Banned
"Luigi"
Aug 2002
Team Italia
12CD_{16} Posts 
Quote:


20090701, 20:01  #9 
Jan 2005
Caught in a sieve
5·79 Posts 
Here's an update on my progress:
I'm focusing on K*2^N+/1, no other bases or weird forms. At first, I thought I should compute 2^N mod P and then compute the invmod() of that. But when I profiled the slow application that resulted, invmod, not mulmod, was taking the most time! This led me to understand a cryptic hint in NewPGen's help file: "NewPGen is much faster if the base is 2. This is because division by 2 modulo a prime is easy (you shift it right if it is even, otherwise you add the prime then shift it). Division by other bases isn't so straightforward, however." I wasn't sure why I would ever divide by 2 modulo P, until I realized that (2^N)^1 mod P == (2^1)^N mod P. So that led to this optimized inversepowmod code, derived from Alex's powmod_REDC, which only works for N (confusingly named b/t in the function) >= 128: Code:
// Hybrid powmod, sidestepping several loops and possible mispredicts, and with no more than one longmod! // The other thing about this is that when you cross 2^1 mod N with Montgomery multiplication, // 2^64 * 2^64 (mod N) = 1 (mod N). So that kills one longmod. // The other longmod is needed, 2^64 >> (b & 0x3F) (mod N). static inline unsigned long invpowmod_REDCh (const unsigned long b, const unsigned long N) { unsigned long r, t = b, Ns; /* We want R*Rs  N*Ns == 1, i.e. Ns = N^{1} (mod R) */ // "s" is "a" in Montgomery representation, i.e. a * 2^64 (mod N) // First of all, s = 2^64 (mod N) * 2^64 (mod N). The numbers cancel, leaving 1! unsigned long s = 1ul; // Get the last 6 bits of b==t. (Requires b >= 128.) unsigned int i = t & 0x3f; t >>= 6; // If there were low bits in b... // (And one wonders about splitting this up into a few different functions, as b is constant...) if(i) { // r = 2^i * 2^64 (mod N), something the C compiler can do on its own! r = 1ul << (64i); // If i is large, there's a good chance no mod is needed! if(r >= N) r %= N; } else { // r = 2^0 * 2^64 (mod N), which needs longmod. (Or could be done with 2^63 (mod N) * 2 (mod N) if desperate). r = longmod(1ul, 0ul, N); } // Might as well calculate Ns in any latency overlap. Ns = invmod2pow_ul (N); /* Ns = N^{1} % 2^64 */ for(;;) { if (t & 1) { r = mulmod_REDC (r, s, N, Ns); } if((t >>= 1) == 0) break; s = mulmod_REDC (s, s, N, Ns); } r = mulmod_REDC (r, 1UL, N, Ns); /* Should be only REDC(r, N, Ns) */ return r; } So I have a couple of questions I hope somebody can answer. First, I can see how to (relatively) easily sieve a list of P's not divisible by anything less than ~120. But that would get me only 1/10 of all numbers, not 1/30. So can anyone suggest an easy and efficient way of finding only prime P's to test? Second, I expected this 64bit code to be faster than NewPGen! (Even though NewPGen uses SSE2, I figured that was just the hard way of doing 64bit math, because it wasn't written for 64bit processors.) So does anybody see some other optimization I'm missing? Or maybe even something in Alex's mulmod_REDC code or something? 
20090702, 03:30  #10 
Jan 2005
Caught in a sieve
5×79 Posts 
My profiler said mulmod_REDC was the main slow function. After studying that assembly code for probably too long, I found something!
The original assembly code starts with: Code:
__asm__ ( "mulq %2\n\t" /* rdx:rax = T */ "movq %%rax,%%rbx\n\t" /* rcx:rbx = T */ "movq %%rdx,%%rcx\n\t" "imulq %4\n\t" /* rax = (T mod R) * N', rax = m */ "mulq %3\n\t" /* rdx:rax = m * N */ Edit: Technically, I don't see much realtime speedup, but those timings tend to vary a lot for me. Here's the new code down to the second mulq: Code:
__asm__ ( "mulq %2\n\t" // rdx:rax = T "movq %%rax,%%rbx\n\t" // rcx:rbx = T "imulq %4, %%rax\n\t" // rax = (T mod R) * N', rax = m "movq %%rdx,%%rcx\n\t" "mulq %3\n\t" // rdx:rax = m * N Last fiddled with by Ken_g6 on 20090702 at 03:34 Reason: gprof and time don't agree 
20090702, 13:21  #11 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
I understand why it should be faster on a K8 or K10: integers muls (mul and imul) must execute in pipe 0. The first mul produces two macroops which go in pipe 0 and 1, first mov goes in pipe 2, and the mov from rdx goes in pipe 0 again. That mov and the next mul have the same data dependency (EDIT: no they dont, the low word of the first mul should be available one cycle earlier... need to check), so the mov delays the mul in pipe 0 by 1 cycle. I don't know the Core 2 well enough to tell why it should be faster there.
It's been a while since I wrote that code. I updated with a newer version of a mulredc macro which should be faster. Thanks for the headsup! New version http://www.loria.fr/~kruppaal/factorcyc.20090702.c Alex Last fiddled with by akruppa on 20090702 at 16:27 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
fbncsieve  a new fixed n sieve  rogue  Software  38  20180211 00:08 
A siever for K (b, n, c fixed)?  pepi37  Software  7  20150710 04:42 
Sieving k*2^n1 With Fixed n  c10ck3r  Riesel Prime Search  14  20130203 00:19 
User interface bug fixed on LLR V3.8.4  Jean PennĂ©  Software  0  20110122 16:47 
KEP is reporting computer fixed  KEP  Twin Prime Search  3  20070213 18:29 