mersenneforum.org Calling hotshot coders: NFS code challenge

2005-07-27, 15:44   #23
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

2²×5×373 Posts

Quote:
 Originally Posted by R.D. Silverman I inserted your routine into my siever in place of my old routine and the siever stopped working. (infinite loop) There must be a bug somewhere. Perhaps my code somewhere calls it with a = 1. Does this cause an I.L. in your routine? Bob
I added if (a == 1) return 1;

as the first line, and now it works fine.

It does seem to be about 13 to 15% faster and makes the overall siever
about 1% faster.

Yes, I know.. no big deal. But every little bit helps.

2005-07-28, 00:33   #24
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

2²·5·373 Posts

Quote:
 Originally Posted by R.D. Silverman I added if (a == 1) return 1; as the first line, and now it works fine. It does seem to be about 13 to 15% faster and makes the overall siever about 1% faster. Yes, I know.. no big deal. But every little bit helps.
HI!!!

I deliberately posted this challenge, as opposed to a number of possible
others, because this is a relatively small, simple routine whose speedup would
have a good impact.

I am a bit surprised by the lack of takers.....

[appropriate thanks to those who have responded]

2005-07-28, 10:51   #25
akruppa

"Nancy"
Aug 2002
Alexandria

2,467 Posts

I tinkered with the code some more. The ps variable was redundant. There's a new variable t which carries the value of rem - divisor. This way the flags set by the SUB can be used immediately for the branch condition; the extra MOV does not add to the dependency chain.

I #ifdef'd out the division at the start of the function. If the a values are indeed uniformly distributed in [1, modulus-1], there's no reason to start off with code that performs well for large modulus/a ratios but poorly for the much more common small ratios.

I added a few more levels to the if() cascade. Toggling parity is done by NOT instead of XOR now: the opcode is shorter, and NOT does not change the carry flag that will be used for the conditional jump, so the compiler has one more operation to put between the CMP and the JG, which apparently makes it easier for the cpu to tell where the branch will go.

On my Pentium-3 the code now takes an average of 511 clocks with modulus=2147483647 and a chosen in an arithmetic progression from 1 to about modulus. The original code takes ~640.

I suspect a binary algorithm could perform better, but an old version I wrote some time ago is much slower than Bob's code. I think there are too many hard-to-predict branches in there. I may try streamlining that, "some time later"...

Alex

Edit: I tried on Pentium-3, Athlon, Opteron and UltraSparc4; this code appears to do reasonably well on these.

Edit 2: XOR didn't change carry either, the subtraction did. Got that mixed up.
Code:
int single_modinv (int a, int modulus)
{ /* start of single_modinv */
    int ps1, ps2, parity, dividend, divisor, rem, q, t;

#ifdef EXPECT_SMALL_A
    if (a == 1) return 1;
    q = modulus / a;
    rem = modulus - a * q;
    dividend = a;
    divisor = rem;
    ps1 = q;
    ps2 = 1;
    parity = ~0;
#else
    q = 1;
    rem = a;
    dividend = modulus;
    divisor = a;
    ps1 = 1;
    ps2 = 0;
    parity = 0;
#endif

    while (divisor > 1) {
        rem = dividend - divisor;
        t = rem - divisor;
        if (t >= 0) {
          q += ps1; rem = t; t -= divisor;
          if (t >= 0) {
            q += ps1; rem = t; t -= divisor;
            if (t >= 0) {
              q += ps1; rem = t; t -= divisor;
              if (t >= 0) {
                q += ps1; rem = t; t -= divisor;
                if (t >= 0) {
                  q += ps1; rem = t; t -= divisor;
                  if (t >= 0) {
                    q += ps1; rem = t; t -= divisor;
                    if (t >= 0) {
                      q += ps1; rem = t; t -= divisor;
                      if (t >= 0) {
                        q += ps1; rem = t;
                        if (rem >= divisor) {
                          q = dividend / divisor;
                          rem = dividend - q * divisor;
                          q *= ps1;
        }}}}}}}}}
        q += ps2;
        parity = ~parity;
        dividend = divisor;
        divisor = rem;
        ps2 = ps1;
        ps1 = q;
    }

    if (parity == 0)
        return (ps1);
    else
        return (modulus - ps1);
} /* end of single_modinv */
Last fiddled with by akruppa on 2005-07-28 at 12:18
2005-07-28, 12:28   #26
jasonp
Tribal Bullet

Oct 2004

5·709 Posts

Quote:
 Originally Posted by akruppa Code:  if (rem >= divisor) { q = dividend/divisor; rem = dividend - q * divisor; q *= ps1; }}}}}}}}}
gcc generates an idiv/imull/imull combination for this inner loop. If instead you wrote it as

q = dividend / divisor
rem = dividend % divisor
q *= ps1

gcc will use an idiv operation to compute both q and rem.

Has anyone compared this algorithm to expressing a modular inverse as an exponentiation? inv(a,p) = a^(p-2) mod p, and since p is a fixed modulus you can precompute the integer reciprocal of p, or use floating point.
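
The exponentiation route can be sketched in a few lines (a plain binary powering ladder, not the reciprocal-based variant; the function name is illustrative):

```c
/* Modular inverse via Fermat's little theorem: for prime p,
   a^(p-2) = a^(-1) (mod p).  64-bit intermediates keep the
   products from overflowing for p below 2^32. */
#include <stdint.h>

static uint32_t modinv_pow(uint32_t a, uint32_t p)
{
    uint64_t result = 1;
    uint64_t square = a;
    uint32_t e = p - 2;           /* exponent scanned right to left */

    while (e) {
        if (e & 1)
            result = (result * square) % p;
        square = (square * square) % p;
        e >>= 1;
    }
    return (uint32_t)result;
}
```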

Also, if branch mispredictions are a problem, what about shortening the chain of if's and using cmov instructions to execute the body of all if statements?

jasonp

PS: Code that implements a floating point modular inverse is appended. This was used in an older version of msieve, and is twice as fast as an expo operation that uses mul/div. round_constant contains two identical magic numbers, 3*2^52 for 53-bit IEEE floating point or 3*2^63 for 64-bit Intel floating point. I can also post some trivial code that decides at runtime which constants to use (basically you try to perform a rounding operation with both constants and see which one can produce the correct result).

Code:
static INLINE uint32 fast_invmod(uint32 a, uint32 p) {

	/* Compute mp_expo_1(a, p-2, p) using floating point.
	   'a' and 'p' must be less than ~25 bits in size if
	   using IEEE fast rounding, but can go up to ~31 bits
	   if Intel fast rounding is in use. Both of these
	   are far larger than MPQS will ever need.

	   The code avoids an integer remainder by multiplying
	   by the floating point reciprocal and (fast-)rounding to
	   the nearest integral floating point value. Within
	   the while() loop, the accumulated result is always
	   bounded by +-p, and is normalized afterwards.

	   The exponent is scanned right to left; at each step the
	   modular squaring and multiply are both performed, even
	   if the multiply result will be thrown away */

	uint32 exponent = p - 2;
	int32 result;
	double dp = (double)(p);
	double dsquare = (double)(a);
	double recip = 1.0 / dp;
	double round0 = round_constant[0];
	double round1 = round_constant[1];
	double dresult = 1.0;

	while (exponent) {
		double dsquare2 = dsquare * dsquare;
		double dresult2 = dresult * dsquare;

		double drsquare = dsquare2 * recip + round0 - round1;
		double drresult = dresult2 * recip + round0 - round1;

		dsquare2 -= dp * drsquare;
		dresult2 -= dp * drresult;

		dsquare = dsquare2;
		if (exponent & 1)
			dresult = dresult2;

		exponent >>= 1;
	}

	result = (int32)dresult;

	if (result < 0)
		return (uint32)(result + p);
	else
		return (uint32)result;
}

Last fiddled with by jasonp on 2005-07-28 at 12:38 Reason: messed up remembering the magic constants

2005-07-28, 12:31   #27
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

2²×5×373 Posts

Quote:
 Originally Posted by akruppa
Nice effort. I have tried a binary routine. So did A. Lenstra. They seem
slightly slower. The individual iterations are a little faster, but they do more
iterations.

I have been trying to design a hybrid. If an 'LSB' instruction existed, it
would be VERY fast (LSB = least significant bit = number of trailing zeros).
The slow part of a binary approach is that we don't know how much to
shift (right) at each iteration, so we must count the trailing zeros one
at a time. Or if there were an instruction that would shift a non-zero
integer right until bit 0 became 1 (and return the shift amount in a second
register), it would also be much faster.

Another instruction that would make things faster is a Most Significant Bit
instruction. We could use this to get a *fast* approximation to the partial
quotient q/r by looking at MSB(q) - MSB(r). Then we could decide to
do a division or a binary shift at each iteration.

With current opcodes, as far as I can determine, checking whether to do
a binary or Euclidean reduction at each iteration is too expensive on
current architectures.

Did anyone besides me ever work on the Thinking Machines CM5??
Each node had a custom arithmetic co-processor. There must have
been input from the NSA on its design because it had LSB, MSB, and
POPCOUNT instructions (among others) that greatly aid crypto algorithms.

2005-07-28, 12:33   #28
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

2²·5·373 Posts

Quote:
 Originally Posted by R.D. Silverman Nice effort.
BTW, anyone should feel free to take/use the code posted here.

Alex: If you want a binary routine, I did post AKL's.

2005-07-28, 12:41   #29
R.D. Silverman

"Bob Silverman"
Nov 2003
North of Boston

2²·5·373 Posts

Quote:
 Originally Posted by jasonp gcc generates an idiv/imull/imull combination for this inner loop. If instead you wrote it as q = dividend / divisor rem = dividend % divisor q *= ps1 gcc will use an idiv operation to compute both q and rem. Has anyone compared this algorithm to expressing a modular inverse as an exponentiation? inv(a,p) = a^(p-2) mod p, and since p is a fixed modulus you can precompute the integer reciprocal of p, or use floating point.
I implemented an inverse based on a^(p-2) a long time ago. It was
quite a bit slower than the Euclidean code I presented here.

Note that the Microsoft compiler is brain dead.

If I compute

a = b/c;
d = b % c;

The compiler generates TWO divisions, even though the first division
also leaves the remainder in the edx register!!!

I wrote a short __inline assembler routine to get around this when I want
both the quotient and remainder. It is slightly faster than

q = a/b
rem = a - b * q

How does your FP version compare with Alex's modification of my code?

2005-07-28, 12:43   #30
jasonp
Tribal Bullet

Oct 2004

5×709 Posts

Quote:
 Originally Posted by R.D. Silverman Did anyone besides me every work on the Thinking Machines CM5?? Each node had an custom arithmetic co-processor. There must have been input from the NSA on its design because it had LSB, MSB, and POPCOUNT instructions (among others) that greatly aid crypto algorithms.
Alpha has all three of those instructions; PowerPC has popcount and count leading zeros. x86 has BSF and BSR, but IIRC they can be quite slow.

jasonp

2005-07-28, 12:48   #31
akruppa

"Nancy"
Aug 2002
Alexandria

2,467 Posts

Quote:
 gcc generates an idiv/imull/imull combination for this inner loop. If instead you wote it as q = dividend / divisor rem = dividend % divisor q *= ps1 gcc will use an idiv operation to compute both q and rem.
I tried that at one point; gcc mysteriously still generated the div/mul combination. I tried an asm macro that uses the idiv to generate quotient and remainder, but oddly, it was no faster.

The results of some of my optimisation attempts were pretty counter-intuitive... (which is just a way of saying that I don't really understand how the cpu works). Maybe I should try again, the code changed a little in the meantime.

Quote:
 Has anyone compared this algorithm to expressing a modular inverse as an exponentiation? inv(a,p) = a^(p-2) mod p, and since p is a fixed modulus you can precompute the integer reciprocal of p, or use floating point.
Interesting idea, but I can't see it being faster. The powering ladder will need 30 modular squarings for a 31 bit p, a Montgomery multiplication takes 3 MUL and a MUL has a latency of 4 clocks on the Pentium 3, afaik. That is 360 clocks alone, add the other arithmetic (additions) and the multiplications for set bits in the exponent and the initialisation of the Montgomery arithmetic and I'd expect this to take well above 510 clocks.

I'll try the floating point code.

Quote:
 Also, if branch mispredictions are a problem, what about shortening the chain of if's and using cmov instructions to execute the body of all if statements?
I tried that as well, but gcc keeps turning them into a sequence of forward branches, even if I write them as

rem -= (rem >= divisor) ? divisor : 0;

etc. At one point I did manage to make gcc put a few cmov in there, but that code was no faster than with the jumps. Apparently the short forward jumps are pretty efficient.

Alex

2005-07-28, 13:31   #32
jasonp
Tribal Bullet

Oct 2004

3545₁₀ Posts

Quote:
 Originally Posted by akruppa Interesting idea, but I can't see it being faster. The powering ladder will need 30 modular squarings for a 31 bit p, a Montgomery multiplication takes 3 MUL and a MUL has a latency of 4 clocks on the Pentium 3, afaik. That is 360 clocks alone, add the other arithmetic (additions) and the multiplications for set bits in the exponent and the initialisation of the Montgomery arithmetic and I'd expect this to take well above 510 clocks. I'll try the floating point code.
For a single ~25 bit prime, random modular inverses using your code take 425 clocks (this is on a 1GHz athlon with gcc 3.4, -O3 -march=athlon -fomit-frame-pointer). Using / and % in the inner loop takes almost exactly the same amount of time as the version you posted, so I suppose the computation of rem gets overlapped with the scaling of q.

My floating point code takes 942 clocks. The latency can probably be reduced by using SSE2 integer conversion opcodes, unrolling by 2 to save some register moves, and overlapping several inverse operations. The advantage of this sort of code is that the peak floating point throughput of multiplications and divisions is much higher than the latency of one operation, if your code can be restructured to achieve it. Nonetheless, making it twice as fast is a pretty tall order.

jasonp

2005-07-28, 13:50   #33
akruppa

"Nancy"
Aug 2002
Alexandria

4643₈ Posts

I tried the idiv idea again. When I simply write q=a/b; r=a%b; gcc still produces the div/mul combination. When I use an asm() statement with div, gcc gets all confused about register allocation and does not put "divisor" in a register. This makes the opcodes for the t -= divisor statements longer, and the relevant code block becomes just a bit too large to align the target address for all the if() branches at a multiple of 16 - so it chooses the next multiple of 16, fills up the space with NOPs and throws in a JMP for good measure. The resulting code is quite a bit slower, ~560 clocks.

After tweaking things a bit, I got gcc to put divisor in a register again (unfortunately it simply ignores the "register" keyword). The resulting asm code does not look too bad, but is still no faster... as you said, the first mul probably gets interleaved with the second one, and throughput for an integer mul is 0.5/clock, afaik - not too bad.

I'll stop now - I'm supposed to be studying for a test...

Alex
