mersenneforum.org Making a 64-bit fixed N Proth siever

 2009-06-28, 01:11 #1 Ken_g6     Jan 2005 Caught in a sieve 5×79 Posts Making a 64-bit fixed N Proth siever
I recently got a new Core 2 machine with 64-bit Linux. I was sieving away with NewPGen the other day when it occurred to me that NewPGen probably isn't optimized for 64-bit Linux. So I started working on my own siever - when the hardware handles 64 bits, how hard could it be?

Well, the problems have started early. I wrote some code that I thought would work, and created some unit tests to check the functions I had so far. 8 out of 9 aren't working, and the first failure is surprising. See, I lifted some mulmod code from a program by some guy named Alex Kruppa. (Alex, if you're reading this, I have several more questions I'd like to pester you with.) Anyway, it failed my unit test! So, here's Alex's code:

Code:
/* Multiply a and b, and reduce the product modulo m. Remainder is returned */
static inline unsigned long
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long r;
  __asm__ ( "mulq %2\n\t"
            "divq %3"
            : "=&d" (r), "+%a" (a)
            : "rm" (b), "rm" (m)
            : "cc" );
  return r;
}
I tested this with the following line:

Code:
printf("271828182846 * 314159265359 mod 141421356237 == 127891859871...%lu\n",
       mulmod(271828182846ul, 314159265359ul, 141421356237ul));
If the test passes, the numbers on either side of "..." should be the same. I compiled with g++ 4.3.3, and with no optimization switches it works. But if I compile with any optimization switch, the result is:

Code:
271828182846 * 314159265359 mod 141421356237 == 127891859871...95951954835
gcalctool and dc both say 127891859871 is correct. Any ideas on how I can compile with optimization switches?
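[The thread's test values can be cross-checked with a portable version: on x86-64, GCC's unsigned __int128 compiles the 64x64->128-bit product down to a single mulq, and the compiler handles the reduction itself, with no asm constraints to get wrong. A minimal sketch, not code from the thread; mulmod_u128 is my name:]

```c
#include <stdint.h>

/* Portable reference mulmod: the 64x64->128 product and the reduction
 * are done in unsigned __int128, so the compiler picks the instructions.
 * Unlike a raw divq, this cannot fault when the quotient exceeds 64 bits. */
static inline uint64_t mulmod_u128(uint64_t a, uint64_t b, uint64_t m)
{
    return (uint64_t)(((unsigned __int128)a * b) % m);
}
```

[With the thread's operands this returns 127891859871, the value gcalctool and dc agree on.]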
 2009-06-28, 11:36 #2 ldesnogu     Jan 2008 France 2×271 Posts The program works fine with gcc 4.4.0 and fails with 3.4.6 on my machine. If I prevent inlining from happening with 3.4.6, it works. Perhaps some assembly constraint problem?
 2009-06-28, 12:16 #3 akruppa     "Nancy" Aug 2002 Alexandria 2,467 Posts
Yes, the commutativity modifier "%" seems to be wrong. It's probably better to do it with two separate asm statements, to simplify the constraints.

Code:
/* Multiply a and b, and reduce the product modulo m. Remainder is returned */
static inline unsigned long
mulmod (unsigned long a, const unsigned long b, const unsigned long m)
{
  unsigned long q, r, t1, t2;
  __asm__ ( "mulq %3"
            : "=a" (t1), "=d" (t2)
            : "%0" (a), "rm" (b)
            : "cc" );
  __asm__ ( "divq %4"
            : "=a" (q), "=d" (r)
            : "0" (t1), "1" (t2), "rm" (m)
            : "cc" );
  return r;
}
Alex
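[One way to gain confidence in the fixed version is to cross-check it against an unsigned __int128 reference over pseudo-random operands. A sketch (x86-64 only; check_mulmod and the xorshift generator are mine, the mulmod is Alex's fix from above; operands are reduced mod m first so the quotient fits in 64 bits and divq cannot fault):]

```c
#include <stdint.h>

/* Alex's fixed two-statement mulmod, as posted above (x86-64 only). */
static inline unsigned long mulmod(unsigned long a, const unsigned long b,
                                   const unsigned long m)
{
    unsigned long q, r, t1, t2;
    __asm__ ("mulq %3" : "=a" (t1), "=d" (t2) : "%0" (a), "rm" (b) : "cc");
    __asm__ ("divq %4" : "=a" (q), "=d" (r)
                       : "0" (t1), "1" (t2), "rm" (m) : "cc");
    (void)q;  /* quotient unused; only the remainder is wanted */
    return r;
}

/* Compare mulmod against a __int128 reference on xorshift64-generated
 * operands.  Reducing a and b below m keeps the divq quotient < m <= 2^64,
 * so the instruction cannot raise #DE.  Returns 1 if all checks pass. */
static int check_mulmod(unsigned iterations)
{
    uint64_t x = 0x9e3779b97f4a7c15ULL;          /* nonzero xorshift state */
    while (iterations--) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        uint64_t m = (x >> 1) | 1;               /* odd, nonzero modulus */
        uint64_t a = (x * 0x2545f4914f6cdd1dULL) % m;
        uint64_t b = (x ^ 0xdeadbeefULL) % m;
        if (mulmod(a, b, m) != (uint64_t)(((unsigned __int128)a * b) % m))
            return 0;
    }
    return 1;
}
```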
 2009-06-29, 00:19 #4 Ken_g6     Jan 2005 Caught in a sieve 5·79 Posts Thanks, Alex, that did the trick! Since I posted this I discovered your REDC/Montgomery multiplication code. I didn't see a license on that source code, so I hope you don't mind if I use/modify it?
 2009-06-29, 00:54 #5 akruppa     "Nancy" Aug 2002 Alexandria 2,467 Posts Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain. Alex
 2009-06-29, 11:23 #6 ET_   Banned   "Luigi" Aug 2002 Team Italia 112708 Posts
Quote:
 Originally Posted by akruppa Go ahead! I'll put up a copy tomorrow with fixed mulmod (though the trial division code does not use it any more) and a header that puts it in the public domain. Alex
Let us know where to look for it!

Luigi

 2009-06-29, 12:11 #7 akruppa     "Nancy" Aug 2002 Alexandria 2,467 Posts Last fiddled with by akruppa on 2009-06-29 at 12:12
 2009-06-29, 12:18 #8 ET_   Banned   "Luigi" Aug 2002 Team Italia 23×599 Posts
Quote:
 Originally Posted by akruppa

 2009-07-01, 20:01 #9 Ken_g6     Jan 2005 Caught in a sieve 1100010112 Posts
Here's an update on my progress: I'm focusing on K*2^N+/-1, no other bases or weird forms. At first, I thought I should compute 2^N mod P and then compute the invmod() of that. But when I profiled the slow application that resulted, invmod, not mulmod, was taking the most time! This led me to understand a cryptic hint in NewPGen's help file: "NewPGen is much faster if the base is 2. This is because division by 2 modulo a prime is easy (you shift it right if it is even, otherwise you add the prime then shift it). Division by other bases isn't so straightforward, however."

I wasn't sure why I would ever divide by 2 modulo P, until I realized that (2^N)^-1 mod P == (2^-1)^N mod P. That led to this optimized inverse-powmod code, derived from Alex's powmod_REDC, which only works for N (confusingly named b/t in the function) >= 128:

Code:
// Hybrid powmod, sidestepping several loops and possible mispredicts, and with no more than one longmod!
// The other thing about this is that when you cross 2^-1 mod N with Montgomery multiplication,
// 2^-64 * 2^64 (mod N) = 1 (mod N). So that kills one longmod.
// The other longmod is needed: 2^64 >> (b & 0x3F) (mod N).
static inline unsigned long invpowmod_REDCh (const unsigned long b, const unsigned long N)
{
  unsigned long r, t = b, Ns;
  /* We want R*Rs - N*Ns == 1, i.e. Ns = -N^{-1} (mod R) */
  // "s" is "a" in Montgomery representation, i.e. a * 2^64 (mod N)
  // First of all, s = 2^-64 (mod N) * 2^64 (mod N). The numbers cancel, leaving 1!
  unsigned long s = 1ul;

  // Get the last 6 bits of b==t. (Requires b >= 128.)
  unsigned int i = t & 0x3f;
  t >>= 6;

  // If there were low bits in b...
  // (And one wonders about splitting this up into a few different functions, as b is constant...)
  if (i) {
    // r = 2^-i * 2^64 (mod N), something the C compiler can do on its own!
    r = 1ul << (64-i);
    // If i is large, there's a good chance no mod is needed!
    if (r >= N) r %= N;
  } else {
    // r = 2^0 * 2^64 (mod N), which needs longmod.
    // (Or could be done with 2^63 (mod N) * 2 (mod N) if desperate.)
    r = longmod(1ul, 0ul, N);
  }

  // Might as well calculate Ns in any latency overlap.
  Ns = -invmod2pow_ul (N); /* Ns = -N^{-1} % 2^64 */

  for (;;) {
    if (t & 1)
      r = mulmod_REDC (r, s, N, Ns);
    if ((t >>= 1) == 0)
      break;
    s = mulmod_REDC (s, s, N, Ns);
  }
  r = mulmod_REDC (r, 1ul, N, Ns); /* Should be only REDC(r, N, Ns) */
  return r;
}
With that code, my program runs at about 1/10 the speed of NewPGen, for N around 1 million and P around 3.75 trillion. I infer that part of this is because I'm testing all P's not divisible by 2 or 3 - that's 1/3 of all numbers - which was a quick hack for testing. ln(3.75 trillion) is about 30, so if I could find and test only primes, my program would be about as fast as NewPGen.

So I have a couple of questions I hope somebody can answer. First, I can see how to (relatively) easily sieve a list of P's not divisible by anything less than ~120. But that would leave me with 1/10 of all numbers, not 1/30. Can anyone suggest an easy and efficient way of finding only prime P's to test? Second, I expected this 64-bit code to be faster than NewPGen! (Even though NewPGen uses SSE2, I figured that was just the hard way of doing 64-bit math, since it wasn't written for 64-bit processors.) Does anybody see some other optimization I'm missing - maybe even something in Alex's mulmod_REDC code?
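[The NewPGen hint quoted in this post - shift right if even, otherwise add the prime and shift - can be sketched as a standalone halving routine. This is my illustration, not code from the thread (halve_mod is my name); note that the sum x + p can wrap past 2^64, so the lost carry bit must be restored:]

```c
#include <stdint.h>

/* Halve x modulo an odd prime p:
 * if x is even, x/2 mod p is just a shift;
 * if x is odd, x + p is even (p is odd) and (x + p)/2 == x * 2^-1 (mod p). */
static inline uint64_t halve_mod(uint64_t x, uint64_t p)
{
    if ((x & 1) == 0)
        return x >> 1;
    uint64_t s = x + p;
    uint64_t carry = (s < x);            /* 1 iff x + p wrapped past 2^64 */
    return (s >> 1) | (carry << 63);     /* restore the lost top bit */
}
```

[Applying this N times to 1 computes (2^-1)^N == (2^N)^-1 mod p, the identity Ken uses above, though the Montgomery-ladder version in the post does far fewer operations.]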
 2009-07-02, 03:30 #10 Ken_g6     Jan 2005 Caught in a sieve 5·79 Posts
My profiler said mulmod_REDC was the main slow function. After studying that assembly code for probably too long, I found something! The original assembly code starts with:

Code:
  __asm__ ( "mulq %2\n\t"          /* rdx:rax = T */
            "movq %%rax,%%rbx\n\t" /* rcx:rbx = T */
            "movq %%rdx,%%rcx\n\t"
            "imulq %4\n\t"         /* rax = (T mod R) * N', rax = m */
            "mulq %3\n\t"          /* rdx:rax = m * N */
That imulq is a full 64->128 bit multiply, but its rdx result is discarded by the next mulq. So I tried turning it into the two-argument form, "imulq %4, %%rax\n\t". That worked, but for some reason didn't speed up the code at all. Maybe it's a decoding thing; I don't know. But I also noticed that since the two-argument form doesn't touch rdx, the imulq could go before the movq involving rdx. And that produced a marked speedup, taking my short test from 5 seconds to 4 (according to gprof, anyway). It still won't beat NewPGen - yet - but I think it's better. And it should help you out too, Alex.

Edit: Technically, I don't see much real-time speedup, but those timings tend to vary a lot for me. Here's the new code down to the second mulq:

Code:
  __asm__ ( "mulq %2\n\t"          // rdx:rax = T
            "movq %%rax,%%rbx\n\t" // rcx:rbx = T
            "imulq %4, %%rax\n\t"  // rax = (T mod R) * N', rax = m
            "movq %%rdx,%%rcx\n\t"
            "mulq %3\n\t"          // rdx:rax = m * N
Last fiddled with by Ken_g6 on 2009-07-02 at 03:34 Reason: gprof and time don't agree
 2009-07-02, 13:21 #11 akruppa     "Nancy" Aug 2002 Alexandria 2,467 Posts
I understand why it should be faster on a K8 or K10: integer muls (mul and imul) must execute in pipe 0. The first mul produces two macro-ops which go in pipes 0 and 1, the first mov goes in pipe 2, and the mov from rdx goes in pipe 0 again. That mov and the next mul have the same data dependency (EDIT: no they don't, the low word of the first mul should be available one cycle earlier... need to check), so the mov delays the mul in pipe 0 by 1 cycle. I don't know the Core 2 well enough to tell why it should be faster there.

It's been a while since I wrote that code. I updated it with a newer version of the mulredc macro, which should be faster. Thanks for the heads-up! New version: http://www.loria.fr/~kruppaal/factorcyc.20090702.c

Alex

Last fiddled with by akruppa on 2009-07-02 at 16:27
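[For readers following the asm in the last two posts: the mulredc being tuned computes a Montgomery reduction, a*b*2^-64 mod N. At the C level the same steps look like this - a sketch with __int128, names mine, assuming N odd and N < 2^63 so the 128-bit sum T + m*N cannot overflow; neg_inv64 is a standard Newton iteration for -N^{-1} mod 2^64, not taken from Alex's file:]

```c
#include <stdint.h>

/* Newton iteration for -N^{-1} mod 2^64 (N odd).  inv = N is correct to
 * 3 low bits (odd squares are 1 mod 8); each pass doubles the correct
 * bits: 3 -> 6 -> 12 -> 24 -> 48 -> 96 >= 64. */
static uint64_t neg_inv64(uint64_t N)
{
    uint64_t inv = N;
    for (int i = 0; i < 5; i++)
        inv *= 2 - N * inv;
    return (uint64_t)0 - inv;
}

/* C-level sketch of the mulredc asm: for a, b < N, returns
 * a * b * 2^-64 mod N.  Comments mirror the asm's steps. */
static inline uint64_t mulredc_sketch(uint64_t a, uint64_t b,
                                      uint64_t N, uint64_t Ninv)
{
    unsigned __int128 T = (unsigned __int128)a * b;   /* rdx:rax = T        */
    uint64_t m = (uint64_t)T * Ninv;                  /* m = (T mod R) * N' */
    unsigned __int128 mN = (unsigned __int128)m * N;  /* rdx:rax = m * N    */
    uint64_t r = (uint64_t)((T + mN) >> 64);          /* T + m*N == 0 mod R */
    if (r >= N) r -= N;                               /* conditional subtract */
    return r;
}
```

[The result r is in ordinary (non-Montgomery) form times 2^-64; multiplying by 2^64 mod N recovers a*b mod N, which is how the sketch can be checked.]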

