mersenneforum.org  

Go Back   mersenneforum.org > Factoring Projects > Factoring

Reply
 
Thread Tools
Old 2010-01-08, 13:32   #1
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

1D2416 Posts
Default NFS Optimization

I am taking another whack at improving my NFS siever code.

I'd like to start a discussion thread. Is anyone else interested?

I will start by posting some benchmark data for my current
factorization effort: 2^1161 + 1 (SNFS difficulty 234)
My code uses special-q on the integer side.

Compiler: Microsoft Visual Studio 2008

Polynomials: x-M, x^6 - x^3 + 1, M=2^129
LP bound 825million. Factor base SIZE = 2.9 million ideals for
each side. This translates to a bound of about 48.2 million.
Sieve region is 16K x 8K.

For special-q near 67 million I get the following data: (with my
best compilation effort (2.4GHz duo-core; both cores running)

Total time for one special q: 11.5 seconds

Primary integer side vector sieve: 2.6 seconds
Primary alg side vector sieve: 2.6 seconds
Primary integer side line sieve (small primes only) .26 seconds
Primary alg side line sieve (small primes only) .26 sec

Integer side resieve (for factoring norms) .52 sec
Alg side resieve (for factoring norms) .32 sec

Preparing sieve starting points:
Modular inverses: 1.15 sec
Lattice reductions : 1.56 sec

Scanning the sieve region for successful hits: .31 sec

Preparing sieve region boundaries (I use Pollard style) 1.03 sec

The rest is overhead, bookkeeping, splitting cofactors, etc.

I'd like to start by improving the mod inverse and latred routines.
I have worked on these extensively to no avail.

I will work on my siever (handling the buckets) later.
R.D. Silverman is offline   Reply With Quote
Old 2010-01-08, 14:14   #2
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

1101111101102 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
I am taking another whack at improving my NFS siever code.

I'd like to start a discussion thread. Is anyone else interested?
Sure, I might not be terribly useful to you from a code development standpoint but I'm interested in hearing about it. I can also help build/run benchmark tests if you like.
bsquared is offline   Reply With Quote
Old 2010-01-08, 15:26   #3
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Default

Quote:
Originally Posted by bsquared View Post
Sure, I might not be terribly useful to you from a code development standpoint but I'm interested in hearing about it. I can also help build/run benchmark tests if you like.
What is your environment? My low-level 64-bit arithmetic is written
in VC++ MASM, as opposed to gcc assemble syntax.

Right now, I am looking to improve the following:

Code:
/************************************************************************/
/*                                                                      */
/*   Routine to computer single precision modular inverses              */
/*                                                                      */
/************************************************************************/

int single_modinv_old (int a, int modulus)

{ /* start of single_modinv */

int ps1, ps2, parity, dividend, divisor, rem, q, t;
double t1;


if (TIME_STATS) t1 = get_time();
if (a == 1) return 1;
q = modulus / a;
rem = modulus - a * q;
dividend = a;
divisor = rem;
ps1 = q;
ps2 = 1;
parity = ~0;


while (divisor > 1)
   {
   rem = dividend - divisor;
   t = rem - divisor;
   if (t >= 0) {
     q += ps1;
     rem = t;
     t -= divisor;
     if (t >= 0) {
       q += ps1;
       rem = t;
       t -= divisor;
       if (t >= 0) {
         q += ps1;
         rem = t;
         t -= divisor;
         if (t >= 0) {
           q += ps1;
           rem = t;
           t -= divisor;
           if (t >= 0) {
             q += ps1;
             rem = t;
             t -= divisor;
             if (t >= 0) {
               q += ps1;
               rem = t;
               t -= divisor;
               if (t >= 0) {
                 q += ps1;
                 rem = t;
                 t -= divisor;
                 if (t >= 0) {
                   q += ps1;
                   rem = t;
                   if (rem >= divisor) {
                     q = dividend/divisor;
                     rem = dividend - q * divisor;
                     q *= ps1;
                   }}}}}}}}}
   q += ps2;
   parity = ~parity;
   dividend = divisor;
   divisor = rem;
   ps2 = ps1;
   ps1 = q;
   }
 
if (TIME_STATS) invtime += (get_time() - t1);
if (parity == 0) return (ps1);
else return (modulus - ps1);

} /* end of single_modinv */


/************************************************************************/
/*                                                                      */
/*   routine to compute two reduced lattice vectors                     */
/*  modified from AKL; which explains stylistic differences in code     */
/*                                                                      */
/************************************************************************/

reduce_lattice(special_q, root, v1, v2)
int special_q, root;
int *v1, *v2;

{   /* start; of reduce_lattice */
double stime;
long x1; 
long yy1 = 0;
long x2;
long y2 = 1;
long x3;
long y3; 
double s2;
double s3;
long q; 
long sx2;
long parity = 0;
//int v1a[2], v2a[2];

#if (FEWERDIVLATRED | NODIVLATRED)
double s2_2;
double s2_4;
double s2_8;
double s2_16;
#endif

if (TIME_STATS) stime = get_time();

x1 = special_q;
x2 = root;

s2 = x2 * (double) x2 + y2 * (double) y2;

for (;;)
   {
   x3 = x1;
   y3 = yy1;
   sx2 = (x2 << 4);
   while (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 4);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 3);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 2);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 1);
      }
   if (x3 > x2 )
      {
      x3 -= x2;
      y3 -= y2;
      }
   if (x3 > (x2 >> 1))
      {
      x3 -= x2;
      y3 -= y2;
      }
   if (x3 < 0)
      {
      x3 = -x3;
      y3 = -y3;
      parity ^= 1;
      }

   s3 = x3 * (double) x3 + y3 * (double) y3;
   if (s3 >= s2) break;
   s2 = s3;
   x1 = x2;
   yy1 = y2;
   x2 = x3;
   y2 = y3;
   parity ^= 1;
   }

#if (FEWERDIVLATRED | NODIVLATRED)
s3 = x2 * (double) x3 + y2 * (double) y3;
if (s3 < 0)
   {
   yy1 = 1;
   s3 = -s3;
   }
else
   {
   yy1 = 0;
   }
s2_2 = s2 + s2;
s2_4 = s2_2 + s2_2;
s2_8 = s2_4 + s2_4;
s2_16 = s2_8 + s2_8;
q = 0;
#ifdef NODIVLATRED
while (s3 > s2_16)
#else
again:
if (s3 > s2_16)
#endif
   {
   s3 -= s2_16;
   q += 16;
   }
if (s3 > s2_8)
   {
   s3 -= s2_8;
   q += 8;
   }
if (s3 > s2_4)
   {
   s3 -= s2_4;
   q += 4;
   }
if (s3 > s2_2)
   {
   s3 -= s2_2;
   q += 2;
   }
if (s3 > s2)
   {
   s3 -= s2;
   q ++;
   }
#ifndef NODIVLATRED
if (s3 > s2)
   {
   if (s3 < (s2_16+s2_16)) goto again;
   q += (long)mrint(s3/s2);
   }
else
#endif
if ((s3+s3) >= s2) q ++;
if (yy1) q = -q;
#else
q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 );
#endif

v2[0] = x2;
v2[1] = y2;
v1[0] = x3 - x2 * q;
v1[1] = y3 - y2 * q;

if (v2[1] < 0)      /* make 'b'  positive */
   {
   v2[0] = -v2[0];
   v2[1] = -v2[1];
   }

if (v1[1] < 0)      /* make 'b'  positive */
   {
   v1[0] = -v1[0];
   v1[1] = -v1[1];
   }

if (TIME_STATS) latred_time += (get_time() - stime);
//qtime = (get_time() - stime);

//time = get_time();
//newest_latred_exp(special_q, root, v1a, v2a);
//ttime = (get_time() - stime);

//printf("old, new = %lf %lf\n", qtime, ttime);
//printf("RATIO: %lf\n", qtime/ttime);
//tot1 += qtime;
//tot2 += ttime;

//cmp_latred(special_q, root, v1,v2, v1a, v2a);

return(parity);
}   /* end of reduce_lattice */

I have a different mod inverse routine that does NOT do the subtractions;
They don't seem to make much difference. I think that I may write
the entire routine in MASM.
R.D. Silverman is offline   Reply With Quote
Old 2010-01-08, 15:42   #4
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

746010 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
What is your environment? My low-level 64-bit arithmetic is written
in VC++ MASM, as opposed to gcc assemble syntax.

Right now, I am looking to improve the following:

Code:
/************************************************************************/
/*                                                                      */
/*   Routine to computer single precision modular inverses              */
/*                                                                      */
/************************************************************************/

int single_modinv_old (int a, int modulus)

{ /* start of single_modinv */

int ps1, ps2, parity, dividend, divisor, rem, q, t;
double t1;


if (TIME_STATS) t1 = get_time();
if (a == 1) return 1;
q = modulus / a;
rem = modulus - a * q;
dividend = a;
divisor = rem;
ps1 = q;
ps2 = 1;
parity = ~0;


while (divisor > 1)
   {
   rem = dividend - divisor;
   t = rem - divisor;
   if (t >= 0) {
     q += ps1;
     rem = t;
     t -= divisor;
     if (t >= 0) {
       q += ps1;
       rem = t;
       t -= divisor;
       if (t >= 0) {
         q += ps1;
         rem = t;
         t -= divisor;
         if (t >= 0) {
           q += ps1;
           rem = t;
           t -= divisor;
           if (t >= 0) {
             q += ps1;
             rem = t;
             t -= divisor;
             if (t >= 0) {
               q += ps1;
               rem = t;
               t -= divisor;
               if (t >= 0) {
                 q += ps1;
                 rem = t;
                 t -= divisor;
                 if (t >= 0) {
                   q += ps1;
                   rem = t;
                   if (rem >= divisor) {
                     q = dividend/divisor;
                     rem = dividend - q * divisor;
                     q *= ps1;
                   }}}}}}}}}
   q += ps2;
   parity = ~parity;
   dividend = divisor;
   divisor = rem;
   ps2 = ps1;
   ps1 = q;
   }
 
if (TIME_STATS) invtime += (get_time() - t1);
if (parity == 0) return (ps1);
else return (modulus - ps1);

} /* end of single_modinv */


/************************************************************************/
/*                                                                      */
/*   routine to compute two reduced lattice vectors                     */
/*  modified from AKL; which explains stylistic differences in code     */
/*                                                                      */
/************************************************************************/

reduce_lattice(special_q, root, v1, v2)
int special_q, root;
int *v1, *v2;

{   /* start; of reduce_lattice */
double stime;
long x1; 
long yy1 = 0;
long x2;
long y2 = 1;
long x3;
long y3; 
double s2;
double s3;
long q; 
long sx2;
long parity = 0;
//int v1a[2], v2a[2];

#if (FEWERDIVLATRED | NODIVLATRED)
double s2_2;
double s2_4;
double s2_8;
double s2_16;
#endif

if (TIME_STATS) stime = get_time();

x1 = special_q;
x2 = root;

s2 = x2 * (double) x2 + y2 * (double) y2;

for (;;)
   {
   x3 = x1;
   y3 = yy1;
   sx2 = (x2 << 4);
   while (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 4);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 3);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 2);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 1);
      }
   if (x3 > x2 )
      {
      x3 -= x2;
      y3 -= y2;
      }
   if (x3 > (x2 >> 1))
      {
      x3 -= x2;
      y3 -= y2;
      }
   if (x3 < 0)
      {
      x3 = -x3;
      y3 = -y3;
      parity ^= 1;
      }

   s3 = x3 * (double) x3 + y3 * (double) y3;
   if (s3 >= s2) break;
   s2 = s3;
   x1 = x2;
   yy1 = y2;
   x2 = x3;
   y2 = y3;
   parity ^= 1;
   }

#if (FEWERDIVLATRED | NODIVLATRED)
s3 = x2 * (double) x3 + y2 * (double) y3;
if (s3 < 0)
   {
   yy1 = 1;
   s3 = -s3;
   }
else
   {
   yy1 = 0;
   }
s2_2 = s2 + s2;
s2_4 = s2_2 + s2_2;
s2_8 = s2_4 + s2_4;
s2_16 = s2_8 + s2_8;
q = 0;
#ifdef NODIVLATRED
while (s3 > s2_16)
#else
again:
if (s3 > s2_16)
#endif
   {
   s3 -= s2_16;
   q += 16;
   }
if (s3 > s2_8)
   {
   s3 -= s2_8;
   q += 8;
   }
if (s3 > s2_4)
   {
   s3 -= s2_4;
   q += 4;
   }
if (s3 > s2_2)
   {
   s3 -= s2_2;
   q += 2;
   }
if (s3 > s2)
   {
   s3 -= s2;
   q ++;
   }
#ifndef NODIVLATRED
if (s3 > s2)
   {
   if (s3 < (s2_16+s2_16)) goto again;
   q += (long)mrint(s3/s2);
   }
else
#endif
if ((s3+s3) >= s2) q ++;
if (yy1) q = -q;
#else
q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 );
#endif

v2[0] = x2;
v2[1] = y2;
v1[0] = x3 - x2 * q;
v1[1] = y3 - y2 * q;

if (v2[1] < 0)      /* make 'b'  positive */
   {
   v2[0] = -v2[0];
   v2[1] = -v2[1];
   }

if (v1[1] < 0)      /* make 'b'  positive */
   {
   v1[0] = -v1[0];
   v1[1] = -v1[1];
   }

if (TIME_STATS) latred_time += (get_time() - stime);
//qtime = (get_time() - stime);

//time = get_time();
//newest_latred_exp(special_q, root, v1a, v2a);
//ttime = (get_time() - stime);

//printf("old, new = %lf %lf\n", qtime, ttime);
//printf("RATIO: %lf\n", qtime/ttime);
//tot1 += qtime;
//tot2 += ttime;

//cmp_latred(special_q, root, v1,v2, v1a, v2a);

return(parity);
}   /* end of reduce_lattice */

I have a different mod inverse routine that does NOT do the subtractions;
They don't seem to make much difference. I think that I may write
the entire routine in MASM.
These routines are responsible for a large chunk of the overall
run time. Here is where it happens:

To prepare the sieve we must compute, FOR EACH PRIME p IN THE FACTOR BASE(s) an expression of the form:

r = (a1 + b1 * R)/(a0 + b0 * R) mod p

then compute a reduced lattice from (p,r)

p can be as large as about 10^8. R can be as large as p, and the a,b
values come from the special q reduced lattice. Therefore, the numerator
and denominator are often larger than 32-bits (but always smaller than 64).
The denominator can be negative.

I compute this by:

numer = modmultadd(b1, R, a1, p)
temp_denom = modmultadd(b0, R, a0, p)
denom = single_modinv(temp_denom, p)
root = mod_mult(numer, denom, p)
lattice_reduce(p, r, vector1, vector2)

I gave the modinverse and lattice reduction routines already.
modmultadd is an assembler routine that computes (c*d + e) mod p
mod_mult does the same without the addition. I doubt whether I can
make these any faster. I will post code for all of this if requested.
R.D. Silverman is offline   Reply With Quote
Old 2010-01-08, 16:27   #5
bsquared
 
bsquared's Avatar
 
"Ben"
Feb 2007

67668 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
What is your environment? My low-level 64-bit arithmetic is written
in VC++ MASM, as opposed to gcc assemble syntax.
I have MSVC 2010 beta 2 on my home machine, and MSVC Express Edition 08 at work, which is what I used to build your previous code you sent me. I think either can deal with your assembler code.


Quote:
Originally Posted by R.D. Silverman View Post
Right now, I am looking to improve the following:

Code:
/************************************************************************/
/*                                                                      */
/*   Routine to computer single precision modular inverses              */
/*                                                                      */
/************************************************************************/

int single_modinv_old (int a, int modulus)

{ /* start of single_modinv */

int ps1, ps2, parity, dividend, divisor, rem, q, t;
double t1;


if (TIME_STATS) t1 = get_time();
if (a == 1) return 1;
q = modulus / a;
rem = modulus - a * q;
dividend = a;
divisor = rem;
ps1 = q;
ps2 = 1;
parity = ~0;


while (divisor > 1)
   {
   rem = dividend - divisor;
   t = rem - divisor;
   if (t >= 0) {
     q += ps1;
     rem = t;
     t -= divisor;
     if (t >= 0) {
       q += ps1;
       rem = t;
       t -= divisor;
       if (t >= 0) {
         q += ps1;
         rem = t;
         t -= divisor;
         if (t >= 0) {
           q += ps1;
           rem = t;
           t -= divisor;
           if (t >= 0) {
             q += ps1;
             rem = t;
             t -= divisor;
             if (t >= 0) {
               q += ps1;
               rem = t;
               t -= divisor;
               if (t >= 0) {
                 q += ps1;
                 rem = t;
                 t -= divisor;
                 if (t >= 0) {
                   q += ps1;
                   rem = t;
                   if (rem >= divisor) {
                     q = dividend/divisor;
                     rem = dividend - q * divisor;
                     q *= ps1;
                   }}}}}}}}}
   q += ps2;
   parity = ~parity;
   dividend = divisor;
   divisor = rem;
   ps2 = ps1;
   ps1 = q;
   }
 
if (TIME_STATS) invtime += (get_time() - t1);
if (parity == 0) return (ps1);
else return (modulus - ps1);

} /* end of single_modinv */


/************************************************************************/
/*                                                                      */
/*   routine to compute two reduced lattice vectors                     */
/*  modified from AKL; which explains stylistic differences in code     */
/*                                                                      */
/************************************************************************/

reduce_lattice(special_q, root, v1, v2)
int special_q, root;
int *v1, *v2;

{   /* start; of reduce_lattice */
double stime;
long x1; 
long yy1 = 0;
long x2;
long y2 = 1;
long x3;
long y3; 
double s2;
double s3;
long q; 
long sx2;
long parity = 0;
//int v1a[2], v2a[2];

#if (FEWERDIVLATRED | NODIVLATRED)
double s2_2;
double s2_4;
double s2_8;
double s2_16;
#endif

if (TIME_STATS) stime = get_time();

x1 = special_q;
x2 = root;

s2 = x2 * (double) x2 + y2 * (double) y2;

for (;;)
   {
   x3 = x1;
   y3 = yy1;
   sx2 = (x2 << 4);
   while (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 4);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 3);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 2);
      }
   sx2 >>= 1;
   if (x3 > sx2 )
      {
      x3 -= sx2;
      y3 -= (y2 << 1);
      }
   if (x3 > x2 )
      {
      x3 -= x2;
      y3 -= y2;
      }
   if (x3 > (x2 >> 1))
      {
      x3 -= x2;
      y3 -= y2;
      }
   if (x3 < 0)
      {
      x3 = -x3;
      y3 = -y3;
      parity ^= 1;
      }

   s3 = x3 * (double) x3 + y3 * (double) y3;
   if (s3 >= s2) break;
   s2 = s3;
   x1 = x2;
   yy1 = y2;
   x2 = x3;
   y2 = y3;
   parity ^= 1;
   }

#if (FEWERDIVLATRED | NODIVLATRED)
s3 = x2 * (double) x3 + y2 * (double) y3;
if (s3 < 0)
   {
   yy1 = 1;
   s3 = -s3;
   }
else
   {
   yy1 = 0;
   }
s2_2 = s2 + s2;
s2_4 = s2_2 + s2_2;
s2_8 = s2_4 + s2_4;
s2_16 = s2_8 + s2_8;
q = 0;
#ifdef NODIVLATRED
while (s3 > s2_16)
#else
again:
if (s3 > s2_16)
#endif
   {
   s3 -= s2_16;
   q += 16;
   }
if (s3 > s2_8)
   {
   s3 -= s2_8;
   q += 8;
   }
if (s3 > s2_4)
   {
   s3 -= s2_4;
   q += 4;
   }
if (s3 > s2_2)
   {
   s3 -= s2_2;
   q += 2;
   }
if (s3 > s2)
   {
   s3 -= s2;
   q ++;
   }
#ifndef NODIVLATRED
if (s3 > s2)
   {
   if (s3 < (s2_16+s2_16)) goto again;
   q += (long)mrint(s3/s2);
   }
else
#endif
if ((s3+s3) >= s2) q ++;
if (yy1) q = -q;
#else
q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 );
#endif

v2[0] = x2;
v2[1] = y2;
v1[0] = x3 - x2 * q;
v1[1] = y3 - y2 * q;

if (v2[1] < 0)      /* make 'b'  positive */
   {
   v2[0] = -v2[0];
   v2[1] = -v2[1];
   }

if (v1[1] < 0)      /* make 'b'  positive */
   {
   v1[0] = -v1[0];
   v1[1] = -v1[1];
   }

if (TIME_STATS) latred_time += (get_time() - stime);
//qtime = (get_time() - stime);

//time = get_time();
//newest_latred_exp(special_q, root, v1a, v2a);
//ttime = (get_time() - stime);

//printf("old, new = %lf %lf\n", qtime, ttime);
//printf("RATIO: %lf\n", qtime/ttime);
//tot1 += qtime;
//tot2 += ttime;

//cmp_latred(special_q, root, v1,v2, v1a, v2a);

return(parity);
}   /* end of reduce_lattice */

I have a different mod inverse routine that does NOT do the subtractions;
They don't seem to make much difference. I think that I may write
the entire routine in MASM.
I'm familiar with the modular inversion routine; I use it in YAFU as well although obviously there it is not really a bottleneck at all so I haven't tried to improve it. I have tried to beat optimizing compilers before with assembly code and I usually fail... but maybe cmov would help here if the compiler isn't using it.

Also, you could get rid of two comparison statements by using pre-processor #if/#else around TIME_STATS in the modular inversion, but with the obvious branch prediction and speculative execution it probably doesn't matter.

If I get time I'll look more at the reduce_lattice routine.
bsquared is offline   Reply With Quote
Old 2010-01-08, 22:00   #6
Wacky
 
Wacky's Avatar
 
Jun 2003
The Texas Hill Country

32×112 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
What is your environment? My low-level 64-bit arithmetic is written
in VC++ MASM, as opposed to gcc assemble syntax.
Bob,

Fortunately, in almost all cases, the assembler syntax does not affect the efficiency of the resulting code. We can, manually if necessary, translate between the dialects.

Richard
Wacky is offline   Reply With Quote
Old 2010-01-09, 01:00   #7
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

746010 Posts
Default

Quote:
Originally Posted by Wacky View Post
Bob,

Fortunately, in almost all cases, the assembler syntax does not affect the efficiency of the resulting code. We can, manually if necessary, translate between the dialects.

Richard
Note that we ALL would benefit from improving this code, since
everyone who has siever code could use any improvements.

The time intensive loop for preparing the sieve looks like the following:
v1, v2 are the reduced lattice vectors, root is the polynomial root mod p,

loop over p in factor base: {

numer = modmultadd(v2[1], root, v2[0], p);
denom = modmultadd(v1[1], root, v1[0], p);
inverse = mod_inverse(denom, p);
root = mod_mult(numer, inverse, p);
lattice_reduce(p, root, subv1, subv2);
}

I posted my code for the mod_inverse and lattice_reduce already.
Here is code for modmultadd:

Code:
/************************************************************************/
/*																		*/
/*	modmultadd(a,b,c,d);  returns (ab + c) mod d; a,b,c,d  32 bit		*/
/*	integers;  returns positive value									*/
/*																		*/
/************************************************************************/

modmultadd_asm(a,b,c,d)
int a,b,c,d;

{   /* start of modmultadd_asm */

_asm
	{
	mov eax, a			/* compute a*b  in edx:eax*/
	imul b
	
	mov ecx,c
	mov ebx,c
	sar ebx,31			/* sign extend, then add c*/
	add eax,ecx
	adc edx,ebx			/* carry bit to high order word*/

	idiv d				/* div by d, remainder in edx	*/
	cmp edx, 0			/* is it > 0?					*/
	jge SHORT $LABEL1	
	add edx, d			/* if < 0 add the divisor		*/
$LABEL1:
	mov eax , edx		/* return the remainder in eax	*/
	}

}   /* end of modmultadd_asm */
R.D. Silverman is offline   Reply With Quote
Old 2010-01-09, 01:11   #8
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
Note that we ALL would benefit from improving this code, since
everyone who has siever code could use any improvements.

The time intensive loop for preparing the sieve looks like the following:
v1, v2 are the reduced lattice vectors, root is the polynomial root mod p,

loop over p in factor base: {

numer = modmultadd(v2[1], root, v2[0], p);
denom = modmultadd(v1[1], root, v1[0], p);
inverse = mod_inverse(denom, p);
root = mod_mult(numer, inverse, p);
lattice_reduce(p, root, subv1, subv2);
}

I posted my code for the mod_inverse and lattice_reduce already.
Here is code for modmultadd:

Code:
/************************************************************************/
/*																		*/
/*	modmultadd(a,b,c,d);  returns (ab + c) mod d; a,b,c,d  32 bit		*/
/*	integers;  returns positive value									*/
/*																		*/
/************************************************************************/

modmultadd_asm(a,b,c,d)
int a,b,c,d;

{   /* start of modmultadd_asm */

_asm
	{
	mov eax, a			/* compute a*b  in edx:eax*/
	imul b
	
	mov ecx,c
	mov ebx,c
	sar ebx,31			/* sign extend, then add c*/
	add eax,ecx
	adc edx,ebx			/* carry bit to high order word*/

	idiv d				/* div by d, remainder in edx	*/
	cmp edx, 0			/* is it > 0?					*/
	jge SHORT $LABEL1	
	add edx, d			/* if < 0 add the divisor		*/
$LABEL1:
	mov eax , edx		/* return the remainder in eax	*/
	}

}   /* end of modmultadd_asm */

With respect to modinverse:

The time intensive part is the core computation of:

q = dividend/divisor;
rem = dividend - q*divisor

Note however, that the Intel/AMD processors return the quotient
automatically whenever a division is done. There should be no need
to do a mult and subt. We should be able to retrieve the remainder.

Suppose we inline the following:

Code:
__inline int divwithremainder(a,b, &quo)

_asm {
    mov eax, a
   div b
   mov edi, x
   mov DWORD PTR [edi], eax   //  get quotient
   mov  eax, edx     // return remainder
   }

Any guesses as to whether this will be faster than the mult & subt?
If we do

quo = a/b;
rem = a%b;

the division will be done twice. I don't know any compiler that is
smart enough to recognize that once the division is done, then the
remainder is already in a register and hence 'free' (except for a mov
instruction)

Comments?
R.D. Silverman is offline   Reply With Quote
Old 2010-01-09, 01:14   #9
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3×1,181 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
If we do

quo = a/b;
rem = a%b;

the division will be done twice. I don't know any compiler that is
smart enough to recognize that once the division is done, then the
remainder is already in a register and hence 'free' (except for a mov
instruction)
gcc has been able to do this for years on x86; but prudence dictates checking the generated asm every time.
jasonp is offline   Reply With Quote
Old 2010-01-09, 01:15   #10
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Default

Quote:
Originally Posted by R.D. Silverman View Post
With respect to modinverse:

The time intensive part is the core computation of:

q = dividend/divisor;
rem = dividend - q*divisor

Note however, that the Intel/AMD processors return the quotient
automatically whenever a division is done. There should be no need
to do a mult and subt. We should be able to retrieve the remainder.

Suppose we inline the following:

Code:
__inline int divwithremainder(a,b, &quo)

_asm {
    mov eax, a
   div b
   mov edi, x
   mov DWORD PTR [edi], eax   //  get quotient
   mov  eax, edx     // return remainder
   }

Any guesses as to whether this will be faster than the mult & subt?
If we do

quo = a/b;
rem = a%b;

the division will be done twice. I don't know any compiler that is
smart enough to recognize that once the division is done, then the
remainder is already in a register and hence 'free' (except for a mov
instruction)

Comments?
Note that mod_inverse and lat_reduce are too big to inline.
Any guesses as to how much inlining modmultadd might save?

(of course I will do the experiment, but that will take time; I'd like to
see some comments before trying out any new code)
R.D. Silverman is offline   Reply With Quote
Old 2010-01-09, 01:18   #11
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

3·1,181 Posts
Default

Also, some of the technical report here deals with modular inverses.
jasonp is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
gcc optimization/intrinsics R.D. Silverman Factoring 12 2015-09-15 08:51
Program optimization henryzz Programming 17 2012-09-26 13:21
Possible optimization for GIMPS alpertron Math 3 2012-08-13 16:08
Size optimization Sleepy Msieve 14 2011-10-20 10:27
ASM Optimization Cyclamen Persicum Hardware 4 2004-05-26 07:51

All times are UTC. The time now is 19:20.


Fri Sep 24 19:20:58 UTC 2021 up 63 days, 13:49, 1 user, load averages: 2.13, 2.00, 1.88

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.