Making a 64-bit fixed-N Proth siever
I recently got a new Core 2 machine, with 64-bit Linux. I was sieving away with NewPGen the other day when it occurred to me that NewPGen probably isn't optimized for 64-bit Linux. So I started working on my own siever - when the hardware handles 64 bits, how hard could it be? :big grin:
Well, the problems have started early. I wrote some code that I thought would work, and created some unit tests to check the functions I had so far. 8 out of 9 aren't working, but the first failure is surprising. See, I lifted some mulmod code from [URL="http://www.loria.fr/~kruppaal/factorcyc.20090612.c"]a program by some guy named Alex Kruppa[/URL]. (Alex, if you're reading this, I have several more questions I'd like to pester you with. :wink:) Anyway, it failed my unit test! So, here's Alex's code:
[code]
/* Multiply a and b, and reduce the product modulo m. Remainder is returned */
static inline unsigned long
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long r;
  __asm__ ( "mulq %2\n\t"
            "divq %3"
            : "=&d" (r), "+%a" (a)
            : "rm" (b), "rm" (m)
            : "cc" );
  return r;
}
[/code]
I tested this with the following line:
[code]
printf("271828182846 * 314159265359 mod 141421356237 == 127891859871...%lu\n",
       mulmod(271828182846ul, 314159265359ul, 141421356237ul));
[/code]
If the test passes, the numbers on either side of "..." should be the same. I compiled with g++ 4.3.3, and if I compile without optimization, it works. But if I compile with any optimization switch, the result is:
[code]
271828182846 * 314159265359 mod 141421356237 == 127891859871...95951954835
[/code]
gcalctool and dc both say 127891859871 is correct. Any ideas on how I can compile with optimization switches?
The program works fine with gcc 4.4.0 and fails with 3.4.6 on my machine.
If I prevent inlining from happening with 3.4.6, it works. Perhaps some assembly constraint problem?
Yes, the commutativity modifier "%" seems to be wrong. It's probably better to do it with two separate asm statements, to simplify the constraints.
[CODE]
/* Multiply a and b, and reduce the product modulo m. Remainder is returned */
static inline unsigned long
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long q, r, t1, t2;
  __asm__ ( "mulq %3\n\t"
            : "=a" (t1), "=d" (t2)
            : "%0" (a), "rm" (b)
            : "cc");
  __asm__ ( "divq %4"
            : "=a" (q), "=d" (r)
            : "0" (t1), "1" (t2), "rm" (m)
            : "cc" );
  return r;
}
[/CODE]
Alex
Thanks, Alex, that did the trick!
Since I posted this I discovered your REDC/Montgomery multiplication code. I didn't see a license on that source code, so I hope you don't mind if I use/modify it?
Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain.
Alex
[QUOTE=akruppa;179203]Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain.
Alex[/QUOTE] Let us know where to look! :geek: Luigi
[url]http://www.loria.fr/~kruppaal/factorcyc.20090629.c[/url]
Alex
[QUOTE=akruppa;179237][url]http://www.loria.fr/~kruppaal/factorcyc.20090629.c[/url]
Alex[/QUOTE] :bow:
Here's an update on my progress:
I'm focusing on K*2^N+/-1, no other bases or weird forms. At first, I thought I should compute 2^N mod P and then compute the invmod() of that. But when I profiled the slow application that resulted, invmod, not mulmod, was taking the most time! This led me to understand a cryptic hint in NewPGen's help file: "NewPGen is much faster if the base is 2. This is because division by 2 modulo a prime is easy (you shift it right if it is even, otherwise you add the prime then shift it). Division by other bases isn't so straightforward, however." I wasn't sure why I would ever divide by 2 modulo P, until I realized that (2^N)^-1 mod P == (2^-1)^N mod P. So that led to this optimized inverse-powmod code, derived from Alex's powmod_REDC, which only works for N (confusingly named b/t in the function) >= 128:
[code]
// Hybrid powmod, sidestepping several loops and possible mispredicts, and with no more than one longmod!
// The other thing about this is that when you cross 2^-1 mod N with Montgomery multiplication,
// 2^-64 * 2^64 (mod N) = 1 (mod N). So that kills one longmod.
// The other longmod is needed, 2^64 >> (b & 0x3F) (mod N).
static inline unsigned long invpowmod_REDCh (const unsigned long b, const unsigned long N) {
  unsigned long r, t = b, Ns;
  /* We want R*Rs - N*Ns == 1, i.e. Ns = -N^{-1} (mod R) */
  // "s" is "a" in Montgomery representation, i.e. a * 2^64 (mod N)
  // First of all, s = 2^-64 (mod N) * 2^64 (mod N). The numbers cancel, leaving 1!
  unsigned long s = 1ul;

  // Get the last 6 bits of b==t. (Requires b >= 128.)
  unsigned int i = t & 0x3f;
  t >>= 6;
  // If there were low bits in b...
  // (And one wonders about splitting this up into a few different functions, as b is constant...)
  if (i) {
    // r = 2^-i * 2^64 (mod N), something the C compiler can do on its own!
    r = 1ul << (64-i);
    // If i is large, there's a good chance no mod is needed!
    if (r >= N) r %= N;
  } else {
    // r = 2^0 * 2^64 (mod N), which needs longmod.
    // (Or could be done with 2^63 (mod N) * 2 (mod N) if desperate.)
    r = longmod(1ul, 0ul, N);
  }
  // Might as well calculate Ns in any latency overlap.
  Ns = -invmod2pow_ul (N); /* Ns = -N^{-1} % 2^64 */

  for (;;) {
    if (t & 1) {
      r = mulmod_REDC (r, s, N, Ns);
    }
    if ((t >>= 1) == 0) break;
    s = mulmod_REDC (s, s, N, Ns);
  }
  r = mulmod_REDC (r, 1UL, N, Ns); /* Should be only REDC(r, N, Ns) */
  return r;
}
[/code]
With that code, my program is running about 1/10 the speed of NewPGen, for N about 1 million, and P around 3.75 trillion. I infer that part of this is because I'm testing all P's that are not divisible by 2 or 3 (1/3 of all numbers), which was a quick hack for testing. ln(3.75 trillion) is about 30, so if I could find and test only primes, my program would be about as fast as NewPGen. So I have a couple of questions I hope somebody can answer. First, I can see how to (relatively) easily sieve a list of P's not divisible by anything less than ~120. But that would get me only 1/10 of all numbers, not 1/30. So can anyone suggest an easy and efficient way of finding only prime P's to test? Second, I expected this 64-bit code to be [i]faster[/i] than NewPGen! (Even though NewPGen uses SSE2, I figured that was just the hard way of doing 64-bit math, because it wasn't written for 64-bit processors.) So does anybody see some other optimization I'm missing? Or maybe even something in Alex's mulmod_REDC code or something?
My profiler said mulmod_REDC was the main slow function. After studying that assembly code for probably too long, I found something! :smile:
The original assembly code starts with:
[code]
  __asm__ ( "mulq %2\n\t"          /* rdx:rax = T */
            "movq %%rax,%%rbx\n\t" /* rcx:rbx = T */
            "movq %%rdx,%%rcx\n\t"
            "imulq %4\n\t"         /* rax = (T mod R) * N', rax = m */
            "mulq %3\n\t"          /* rdx:rax = m * N */
[/code]
That imulq is a full 64->128-bit multiply, but rdx is discarded by the next mulq. So I tried turning it into the two-argument form, "imulq %4, %%rax\n\t". That worked, but for some reason didn't speed up the code at all. Maybe it's a decoding thing; I don't know. But I also noticed that since the two-argument form doesn't touch rdx, the imulq could go before the movq involving rdx. And [i]that[/i] produced a marked speedup, taking my short test from 5 seconds to 4 (according to gprof, anyway). It still won't beat NewPGen - yet - but I think it's better. And it should help you out too, Alex.

Edit: Technically, I don't see much real-time speedup, but those timings tend to vary a lot for me. Here's the new code down to the second mulq:
[code]
  __asm__ ( "mulq %2\n\t"          // rdx:rax = T
            "movq %%rax,%%rbx\n\t" // rcx:rbx = T
            "imulq %4, %%rax\n\t"  // rax = (T mod R) * N', rax = m
            "movq %%rdx,%%rcx\n\t"
            "mulq %3\n\t"          // rdx:rax = m * N
[/code]
I understand why it should be faster on a K8 or K10: integer muls (mul and imul) must execute in pipe 0. The first mul produces two macro-ops, which go in pipes 0 and 1; the first mov goes in pipe 2, and the mov from rdx goes in pipe 0 again. That mov and the next mul have the same data dependency [SIZE="1"](EDIT: no they don't, the low word of the first mul should be available one cycle earlier... need to check)[/SIZE], so the mov delays the mul in pipe 0 by 1 cycle. I don't know the Core 2 well enough to tell why it should be faster there.
It's been a while since I wrote that code. I updated it with a newer version of the mulredc macro, which should be faster. Thanks for the heads-up! New version: [url]http://www.loria.fr/~kruppaal/factorcyc.20090702.c[/url] Alex