mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2005-07-26, 13:30   #12
xenon
 
Jul 2004

1000112 Posts
Default

Here are the results after implementing repeated subtraction.
You are right, it is much faster by using repeated subtraction.
The table below shows how effective when the number of trials is increased. The first column is how far to trial subtract (0 means use divide only). The second column is the timing in clocks/inverse for p=65521. The column is timing for p=2^31-1.
The correctness of my code is verified for the two cases of p.

Code:
0     386     486
1     326     415
2     293     375
3     282     363
4     275     352
5     268     346
6     265     342
7     264     340
8     261     340
9     258     331

Last fiddled with by xenon on 2005-07-26 at 13:33
xenon is offline   Reply With Quote
Old 2005-07-26, 13:49   #13
R.D. Silverman
 
R.D. Silverman's Avatar
 
"Bob Silverman"
Nov 2003
North of Boston

5·17·89 Posts
Thumbs up

Quote:
Originally Posted by tom11784
Does 70 cycles per exponent sound right? It seems a tad low to me, but that's what I get for an average when I run p=2^31-1 and a=1 to 1M writing just the inverses to a file.

<snip>

a is distributed *uniformly*. If p ~2^31 and a ~ 1M, note that the
very first mod reduction has a partial quotient of about 2000. This
*immediately* reduces the problem by about 3 digits...

Try running your benchmarks for (say) a = 1.1 x 10^9 to a = 1.1 x 10^9 +
1M

The values of a that you tried are much smaller than the average value
of a.

As this routine is actually used in NFS, p is typically (on average) about 22 to
24 bits, rather than 31. In fact, if someone can produce a faster optimized
routine that works for p < 2^25, this would have great value to me and
others.

This challenge is harder than people think at first glance. I suspect that
improvements will require algorithmic as well as coding improvements.
R.D. Silverman is offline   Reply With Quote
Old 2005-07-26, 13:53   #14
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

Here's what I could manage last night:

Code:
int
single_modinv_my (int a, int modulus)

{ /* start of single_modinv */

  unsigned int ps, ps1, ps2 = 1, parity, dividend, divisor, rem, q;

  q = modulus / a;
  rem = modulus - a * q;
  dividend = a;
  divisor = rem;
  ps1 = q;
  parity = 0;

  while (rem > 1)
  {
    q = ps1;
    rem = dividend - divisor;
    if (rem >= divisor)
    {
      q <<= 1;
      rem -= divisor;
      if (rem >= divisor)
      {
        q += ps1;
        rem -= divisor;
        if (rem >= divisor)
        {
          q += ps1;
          rem -= divisor;
          if (rem >= divisor)
          {
            q += ps1;
            rem -= divisor;
          }
          if (rem >= divisor)
          {
            q += ps1;
            rem -= divisor;
          }
          if (rem >= divisor)
          {
            q = dividend/divisor;
            rem = dividend - q * divisor;
            q *= ps1;
          }
        }
      }
    }
    ps = q + ps2;
    ps2 = ps1;
    ps1 = ps;
    dividend = divisor;
    divisor = rem;
    parity ^= 1;
  }

  if (parity == 0)
    return(modulus - ps1);
  else
    return (ps1);
} /* end of single_modinv */
The changes are relatively subtle but produce slightly better code - about 15% faster on a Pentium 3, compiled with gcc -O2 -march=pentium3 -fomit-frame-pointer.

Alex
akruppa is offline   Reply With Quote
Old 2005-07-26, 14:19   #15
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

Forgot to mention: my code expects a>1.

Alex
akruppa is offline   Reply With Quote
Old 2005-07-26, 14:31   #16
R.D. Silverman
 
R.D. Silverman's Avatar
 
"Bob Silverman"
Nov 2003
North of Boston

166158 Posts
Default

Quote:
Originally Posted by akruppa
Here's what I could manage last night:

Code:
int
single_modinv_my (int a, int modulus)

{ /* start of single_modinv */

  unsigned int ps, ps1, ps2 = 1, parity, dividend, divisor, rem, q;

  q = modulus / a;
  rem = modulus - a * q;
  dividend = a;
  divisor = rem;
  ps1 = q;
  parity = 0;

  while (rem > 1)
  {
    q = ps1;
    rem = dividend - divisor;
    if (rem >= divisor)
    {
      q <<= 1;
      rem -= divisor;
      if (rem >= divisor)
      {
        q += ps1;
        rem -= divisor;
        if (rem >= divisor)
        {
          q += ps1;
          rem -= divisor;
          if (rem >= divisor)
          {
            q += ps1;
            rem -= divisor;
          }
          if (rem >= divisor)
          {
            q += ps1;
            rem -= divisor;
          }
          if (rem >= divisor)
          {
            q = dividend/divisor;
            rem = dividend - q * divisor;
            q *= ps1;
          }
        }
      }
    }
    ps = q + ps2;
    ps2 = ps1;
    ps1 = ps;
    dividend = divisor;
    divisor = rem;
    parity ^= 1;
  }

  if (parity == 0)
    return(modulus - ps1);
  else
    return (ps1);
} /* end of single_modinv */
The changes are relatively subtle but produce slightly better code - about 15% faster on a Pentium 3, compiled with gcc -O2 -march=pentium3 -fomit-frame-pointer.

Alex

Ah. Rather than setting q at each subtraction, then computing q*ps1 + ps2
you accumulate ps1 at each subtraction. This saves a multiplication. cute.


Is flipping parity by xor faster than by subtraction? Both should take
just 1 cycle.

BTW, here is Lenstra's code:

int binary_inv(int n, int p)

{ /* start of binary_inv */
register int n1,n2, preg, m1, m2;

/* m1*n == n1 mod p, m2*n == n2 mod p */
/* n2 is odd, n1 is odd after initialization */

n1 = n;
n2 = p;
preg = p;
m1 = 1;
m2 = 0;

while (!(n1 & 1))
{
n1 >>= 1;
m1 += preg & -(m1 & 1);
// if (m1 & 1) m1 += preg;
m1 >>= 1;
}
if (n1 == 1) return m1;
while (n1 != n2)
{
if (n1 >n2)
{
n1 -= n2;
m1 -= m2;
if (m1 < 0) m1 += preg;
do
{
n1 >>= 1;
m1 += preg & -(m1 & 1);
// if (m1 & 1) m1 += preg;
m1 >>= 1;
} while (!(n1 & 1));
if (n1 == 1) return m1;
}
else
{
n2 -= n1;
m2 -= m1;
if (m2 < 0) m2 += preg;
do
{
n2 >>= 1;
m2 += preg & -(m2 & 1);
// if (m2 & 1) m2 += preg;
m2 >>= 1;
} while (!(n2 & 1));
if (n2 == 1) return m2;
}
}
return 0;
} /* end of binary_inv */
R.D. Silverman is offline   Reply With Quote
Old 2005-07-26, 14:38   #17
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

46438 Posts
Default

>Is flipping parity by xor faster than by subtraction? Both should take
>just 1 cycle.

They do, but the subtraction needs to take place in a register. On a register cripp^wchallenged architecture like x86 that means something else needs to be moved out first, so you get several loads and stores. The xor can operate right on memory.

Alex

/Godspeed, Discovery!
akruppa is offline   Reply With Quote
Old 2005-07-26, 14:40   #18
R.D. Silverman
 
R.D. Silverman's Avatar
 
"Bob Silverman"
Nov 2003
North of Boston

5×17×89 Posts
Thumbs up

Quote:
Originally Posted by R.D. Silverman

<snip> A. Lenstra code removed
In addition to Lenstra's effort, allow me to present a binary effort of
my own.

We now have 3 alternate routines.

Hack away..

/************************************************************************/
/* */
/* Fast modular inverse; uses extended binary method */
/* */
/************************************************************************/

new_modinv(input, modulus)
int input, modulus;

{ /* start of new_modinv */
int u,v,B,D;

if (!(modulus & 1)) {(void) printf("odd mod fail in inverse\n"); exit(0); }

u = input;
v = modulus;
B = 0;
D = 1;

while (1)
{
while (! (u & 1))
{
u = u >> 1;
if (! (B & 1)) B = B << 1;
else B = (B - input) << 1;
}

while (! (v & 1))
{
v = v << 1;
if (! (D & 1)) D = D << 1;
else D = (D - input) << 1;
}

if (u >= v)
{
u = u - v;
B = B - D;
}
else
{
v = v - u;
D = D - B;
}

if (u == 0) return(D);
}

} /* end of new_modinv */
R.D. Silverman is offline   Reply With Quote
Old 2005-07-27, 11:14   #19
xenon
 
Jul 2004

5·7 Posts
Default

Quote:
Originally Posted by R.D. Silverman
while (! (v & 1))
{
v = v << 1;
if (! (D & 1)) D = D << 1;
else D = (D - input) << 1;
}
This doesn't sound right. An infinite loop.

I've coded the Lenstra binary inverse in assembly. At the moment, it's not as fast as the first code (Euclidean).

I can post a timing comparison after I have coded all three alternate routines.
xenon is offline   Reply With Quote
Old 2005-07-27, 12:02   #20
R.D. Silverman
 
R.D. Silverman's Avatar
 
"Bob Silverman"
Nov 2003
North of Boston

11101100011012 Posts
Default

Quote:
Originally Posted by xenon
This doesn't sound right. An infinite loop.

I've coded the Lenstra binary inverse in assembly. At the moment, it's not as fast as the first code (Euclidean).

I can post a timing comparison after I have coded all three alternate routines.
The code to which you refer has not been extensively tested by me. I
cobbled it together some time ago, but have not used it. You may be correct,
but note that as you keep shifting, eventually you will get 0 in the LSB.
R.D. Silverman is offline   Reply With Quote
Old 2005-07-27, 13:03   #21
xenon
 
Jul 2004

5×7 Posts
Default

I've tried and the code doesn't work (at all).
while (! (v & 1)) means while (v is even).
v = v << 1; means v = v * 2.

It certainly locks up in the loop. I haven't figure out how the code works. But I tried changing << to >> and wrong result produced.
xenon is offline   Reply With Quote
Old 2005-07-27, 15:36   #22
R.D. Silverman
 
R.D. Silverman's Avatar
 
"Bob Silverman"
Nov 2003
North of Boston

5·17·89 Posts
Thumbs up

Quote:
Originally Posted by akruppa
Forgot to mention: my code expects a>1.

Alex
I inserted your routine into my siever in place of my old routine and
the siever stopped working. (infinite loop) There must be a bug somewhere.

Perhaps my code somewhere calls it with a = 1. Does this cause an I.L.
in your routine?

Bob
R.D. Silverman is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Calling airsquirrels Prime95 GPU Computing 16 2015-09-29 18:06
Help from coders ET_ GPU Computing 5 2014-01-26 13:58
Calling all 64-bit Linux sievers! frmky NFS@Home 25 2013-10-16 15:58
IA-32 Assembly Coders, anyone? xenon Programming 6 2005-06-02 13:26
Bob, I'm calling you out! synergy Miscellaneous Math 17 2004-10-26 15:26

All times are UTC. The time now is 04:11.


Fri Jul 7 04:11:28 UTC 2023 up 323 days, 1:40, 0 users, load averages: 1.58, 1.61, 1.41

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔