![]() |
![]() |
#23 | |
"Bob Silverman"
Nov 2003
North of Boston
22×5×373 Posts |
![]() Quote:
as the first line, and now it works fine. It does seem to be about 13 to 15% faster and makes the overall siever about 1% faster. Yes, I know.. no big deal. But every little bit helps. |
|
![]() |
![]() |
![]() |
#24 | |
"Bob Silverman"
Nov 2003
North of Boston
22·5·373 Posts |
![]() Quote:
I deliberately posted this challenge, as opposed to a number of possible others because this is a relatively small, simple routine whose speedup would have a good impact. I am a bit surprised by the lake of takers..... ![]() ![]() [appropriate thanks to those who have responded] ![]() ![]() |
|
![]() |
![]() |
![]() |
#25 |
"Nancy"
Aug 2002
Alexandria
2,467 Posts |
![]()
I tinkered with the code some more. The ps variable was redundant. There's a new variable t which carries the value of rem - divisor. This way the flags set by the SUB can be used immediately for the branch condition, the extra MOV does not add to the dependency chain. I #ifdef'd out the division at the start of the function, if the a values are indeed uniformly distributed in [1, modulus-1], there's no reason to start off with code that performs well or large modulus/a ratios but poorly for the much more common small ratios. I added a few more levels to the if() cascade. Toggling parity is done by NOT instead of XOR now, the opcode is shorter and NOT does not change the carry flag that will be used for the conditional jump, so the compiler has one more operation to put between the CMP and the JG which apparantly makes it easier for the cpu to tell where the branch will go.
On my Pentium-3 the code now takes an average of 511 clocks with modulus=2147483647 and a chosen in an arithmetic progression from 1 to about modulus. The original code takes ~640. I suspect a binary algorithm could perform better, but an old version I wrote some time ago is much slower than Bob's code. I think there are too many hard-to-predict brances in there. I may try streamlining that, "some time later"... Alex Edit: I tried on Pentium-3, Athlon, Opteron and UltraSparc4, this code appears to do reasonably well on these. Edit 2: XOR didn't change carry either, the subtraction did. Got that mixed up. Code:
int single_modinv (int a, int modulus) { /* start of single_modinv */ int ps1, ps2, parity, dividend, divisor, rem, q, t; #ifdef EXPECT_SMALL_A if (a == 1) return 1; q = modulus / a; rem = modulus - a * q; dividend = a; divisor = rem; ps1 = q; ps2 = 1; parity = ~0; #else q = 1; rem = a; dividend = modulus; divisor = a; ps1 = 1; ps2 = 0; parity = 0; #endif while (divisor > 1) { rem = dividend - divisor; t = rem - divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; if (rem >= divisor) { q = dividend/divisor; rem = dividend - q * divisor; q *= ps1; }}}}}}}}} q += ps2; parity = ~parity; dividend = divisor; divisor = rem; ps2 = ps1; ps1 = q; } if (parity == 0) return (ps1); else return (modulus - ps1); } /* end of single_modinv */ Last fiddled with by akruppa on 2005-07-28 at 12:18 |
![]() |
![]() |
![]() |
#26 | |
Tribal Bullet
Oct 2004
5·709 Posts |
![]() Quote:
q = dividend / divisor rem = dividend % divisor q *= ps1 gcc will use an idiv operation to compute both q and rem. Has anyone compared this algorithm to expressing a modular inverse as an exponentiation? inv(a,p) = a^(p-2) mod p, and since p is a fixed modulus you can precompute the integer reciprocal of p, or use floating point. Also, if branch mispredictions are a problem, what about shortening the chain of if's and using cmov instructions to execute the body of all if statements? jasonp PS: Code that implements a floating point modular inverse is appended. This was used in an older version of msieve, and is twice as fast as an expo operation that uses mul/div. round_constant contains two identical magic numbers, 3*2^52 for 53-bit IEEE floating point or 3*2^63 for 64-bit intel floating point. I can also post some trivial code that decides at runtime which constants to use (basically you try to perform a rounding operation with both constants and see which one can produce the correct result). Code:
static INLINE uint32 fast_invmod(uint32 a, uint32 p) { /* Compute mp_expo_1(a, p-2, p) using floating point. 'a' and 'p' must be less than ~25 bits in size if using IEEE fast rounding, but can go up to ~31 bits if Intel fast rounding is in use. Both of these are far larger than MPQS will ever need. The code avoids an integer remainder by multiplying by the floating point reciprocal and (fast-)rounding to the nearest integral floating point value. Within the while() loop, the accumulated result is always bounded by +-p, and is normalized afterwards. The exponent is scanned right to left; at each step the modular squaring and multiply are both performed, even if the multiply result will be thrown away */ uint32 exponent = p - 2; int32 result; double dp = (double)(p); double dsquare = (double)(a); double recip = 1.0 / dp; double round0 = round_constant[0]; double round1 = round_constant[1]; double dresult = 1.0; while (exponent) { double dsquare2 = dsquare * dsquare; double dresult2 = dresult * dsquare; double drsquare = dsquare2 * recip + round0 - round1; double drresult = dresult2 * recip + round0 - round1; dsquare2 -= dp * drsquare; dresult2 -= dp * drresult; dsquare = dsquare2; if (exponent & 1) dresult = dresult2; exponent >>= 1; } result = (int32)dresult; if (result < 0) return (uint32)(result + p); else return (uint32)result; } Last fiddled with by jasonp on 2005-07-28 at 12:38 Reason: messed up remembering the magic constants |
|
![]() |
![]() |
![]() |
#27 | |
"Bob Silverman"
Nov 2003
North of Boston
22×5×373 Posts |
![]() Quote:
slightly slower. The individual iterations are a little faster, but they do more iterations. I have been trying to design a hybrid. If a 'LSB' instruction existed, it would be VERY fast (LSB = least significant bit = number of trailing zeros) The slow part of a binary approach is that we don't know how much to shift (right) at each iteration, so we must count the trailing zeros, one at a time. Or if there were an instruction that would shift a non-zero integer right until bit 0 became 1, it would also be mch faster. (and return the shift amount in a second register) Another instruction that would make things faster is a Most Significant Bit instruction. We could use this to get a *fast* approximation to the partial quotient q/r by looking at MSB(q) - MSB(r). Then we could decide to do a division or a binary shift at each iteration. With current opcodes, as far as I can determine, the cost of checking whether to do a binary or Euclidean reduction at each iteration is too expensive with current architectures. Did anyone besides me every work on the Thinking Machines CM5?? Each node had an custom arithmetic co-processor. There must have been input from the NSA on its design because it had LSB, MSB, and POPCOUNT instructions (among others) that greatly aid crypto algorithms. |
|
![]() |
![]() |
![]() |
#28 | |
"Bob Silverman"
Nov 2003
North of Boston
22·5·373 Posts |
![]() Quote:
Alex: If you want a binary routine, I did post AKL's. |
|
![]() |
![]() |
![]() |
#29 | |
"Bob Silverman"
Nov 2003
North of Boston
22·5·373 Posts |
![]() Quote:
quite a bit slower than the Euclidean code I presented here. Note that the Microsoft compiler is ![]() ![]() If I compute a = b/c; d = b % c; The compiler generates TWO divisions, even though the first division also leaves the remainder in the edx register!!! I wrote a short __inline assembler routine to get around this when I want both the quotient and remainder. It is slightly faster than q = a/b rem = a - b * q How does your FP version compare with Alex's modification of my code? |
|
![]() |
![]() |
![]() |
#30 | |
Tribal Bullet
Oct 2004
5×709 Posts |
![]() Quote:
jasonp |
|
![]() |
![]() |
![]() |
#31 | |||
"Nancy"
Aug 2002
Alexandria
2,467 Posts |
![]() Quote:
The results of some of my optimisation attempts were pretty counter-intuitive... (which is just a way of saying that I don't really understand how the cpu works). Maybe I should try again, the code changed a little in the meantime. Quote:
I'll try the floating point code. Quote:
rem -= (rem >= divisor) ? divisor : 0; etc. At one point I did manage to make gcc put a few cmov in there, but that code was no faster than with the jumps. Apparantly the short forward jumps are pretty efficient. Alex |
|||
![]() |
![]() |
![]() |
#32 | |
Tribal Bullet
Oct 2004
354510 Posts |
![]() Quote:
My floating point code takes 942 clocks. The latency can probably be reduced by using SSE2 integer conversion opcodes, unrolling by 2 to save some register moves, and overlapping several inverse operations. The advantage of this sort of code is that the peak floating point throughput of multiplications and divisions is much higher than the latency of one operation, if your code can be restructured to achieve it. Nonetheless, making it twice as fast is a pretty tall order. jasonp |
|
![]() |
![]() |
![]() |
#33 |
"Nancy"
Aug 2002
Alexandria
46438 Posts |
![]()
I tried the idiv idea again. When I simply write q=a/b;r=a%b; gcc still produces the div/mul combination. When I use an asm() statement with div, gcc gets all cofused about register allocation and does not put "divisor" in a register. This makes the opcodes for the t-=divisor statements longer, and the relevant code block become just a bit too large to align the target address for all the if() branches at a multiple of 16 - so it chooses the next multiple of 16, fills up the space with NOPs and throws in a JMP for good measure. The resulting code is quite a bit slower, ~560 clocks.
After tweaking things a bit, I got gcc to put divisor in a register again (unfortunately it simply ignores the "register" keyword). The resulting asm code does not look too bad, but is still no faster... as you said, the first mul probably gets interleaved with the second one, and throughput for an integer mul is 0.5/clock, afaik - not too bad. I'll stop now - I'm supposed to be studying for a test... ![]() Alex |
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Calling airsquirrels | Prime95 | GPU Computing | 16 | 2015-09-29 18:06 |
Help from coders | ET_ | GPU Computing | 5 | 2014-01-26 13:58 |
Calling all 64-bit Linux sievers! | frmky | NFS@Home | 25 | 2013-10-16 15:58 |
IA-32 Assembly Coders, anyone? | xenon | Programming | 6 | 2005-06-02 13:26 |
Bob, I'm calling you out! | synergy | Miscellaneous Math | 17 | 2004-10-26 15:26 |