mersenneforum.org NFS Optimization

2010-01-08, 13:32   #1
R.D. Silverman

Nov 2003

1D24₁₆ Posts

NFS Optimization

I am taking another whack at improving my NFS siever code. I'd like to start a discussion thread. Is anyone else interested?

I will start by posting some benchmark data for my current factorization effort: 2^1161 + 1 (SNFS difficulty 234). My code uses special-q on the integer side.

Compiler: Microsoft Visual Studio 2008
Polynomials: x - M, x^6 - x^3 + 1, M = 2^129
LP bound: 825 million
Factor base size: 2.9 million ideals on each side; this translates to a bound of about 48.2 million
Sieve region: 16K x 8K

For special-q near 67 million I get the following data with my best compilation effort (2.4 GHz dual-core; both cores running):

Total time for one special-q: 11.5 seconds
Primary integer side vector sieve: 2.6 sec
Primary alg side vector sieve: 2.6 sec
Primary integer side line sieve (small primes only): .26 sec
Primary alg side line sieve (small primes only): .26 sec
Integer side resieve (for factoring norms): .52 sec
Alg side resieve (for factoring norms): .32 sec
Preparing sieve starting points:
    Modular inverses: 1.15 sec
    Lattice reductions: 1.56 sec
Scanning the sieve region for successful hits: .31 sec
Preparing sieve region boundaries (I use Pollard style): 1.03 sec

The rest is overhead: bookkeeping, splitting cofactors, etc.

I'd like to start by improving the mod inverse and latred routines. I have worked on these extensively to no avail. I will work on my siever (handling the buckets) later.
2010-01-08, 14:14   #2
bsquared

"Ben"
Feb 2007

110111110110₂ Posts

Quote:
 Originally Posted by R.D. Silverman I am taking another whack at improving my NFS siever code. I'd like to start a discussion thread. Is anyone else interested?
Sure, I might not be terribly useful to you from a code development standpoint but I'm interested in hearing about it. I can also help build/run benchmark tests if you like.

2010-01-08, 15:26   #3
R.D. Silverman

Nov 2003

2²×5×373 Posts

Quote:
 Originally Posted by bsquared Sure, I might not be terribly useful to you from a code development standpoint but I'm interested in hearing about it. I can also help build/run benchmark tests if you like.
What is your environment? My low-level 64-bit arithmetic is written
in VC++ MASM, as opposed to gcc assembler syntax.

Right now, I am looking to improve the following:

Code:
/************************************************************************/
/*                                                                      */
/*   Routine to compute single precision modular inverses               */
/*                                                                      */
/************************************************************************/

int single_modinv_old (int a, int modulus)

{ /* start of single_modinv */

int ps1, ps2, parity, dividend, divisor, rem, q, t;
double t1;

if (TIME_STATS) t1 = get_time();
if (a == 1) return 1;
q = modulus / a;
rem = modulus - a * q;
dividend = a;
divisor = rem;
ps1 = q;
ps2 = 1;
parity = ~0;

while (divisor > 1)
{
rem = dividend - divisor;
t = rem - divisor;
if (t >= 0) {
q += ps1;
rem = t;
t -= divisor;
if (t >= 0) {
q += ps1;
rem = t;
t -= divisor;
if (t >= 0) {
q += ps1;
rem = t;
t -= divisor;
if (t >= 0) {
q += ps1;
rem = t;
t -= divisor;
if (t >= 0) {
q += ps1;
rem = t;
t -= divisor;
if (t >= 0) {
q += ps1;
rem = t;
t -= divisor;
if (t >= 0) {
q += ps1;
rem = t;
t -= divisor;
if (t >= 0) {
q += ps1;
rem = t;
if (rem >= divisor) {
q = dividend/divisor;
rem = dividend - q * divisor;
q *= ps1;
}}}}}}}}}
q += ps2;
parity = ~parity;
dividend = divisor;
divisor = rem;
ps2 = ps1;
ps1 = q;
}

if (TIME_STATS) invtime += (get_time() - t1);
if (parity == 0) return (ps1);
else return (modulus - ps1);

} /* end of single_modinv */

/************************************************************************/
/*                                                                      */
/*   routine to compute two reduced lattice vectors                     */
/*  modified from AKL; which explains stylistic differences in code     */
/*                                                                      */
/************************************************************************/

reduce_lattice(special_q, root, v1, v2)
int special_q, root;
int *v1, *v2;

{   /* start of reduce_lattice */
double stime;
long x1;
long yy1 = 0;
long x2;
long y2 = 1;
long x3;
long y3;
double s2;
double s3;
long q;
long sx2;
long parity = 0;
//int v1a[2], v2a[2];

#if (FEWERDIVLATRED | NODIVLATRED)
double s2_2;
double s2_4;
double s2_8;
double s2_16;
#endif

if (TIME_STATS) stime = get_time();

x1 = special_q;
x2 = root;

s2 = x2 * (double) x2 + y2 * (double) y2;

for (;;)
{
x3 = x1;
y3 = yy1;
sx2 = (x2 << 4);
while (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 4);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 3);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 2);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 1);
}
if (x3 > x2 )
{
x3 -= x2;
y3 -= y2;
}
if (x3 > (x2 >> 1))
{
x3 -= x2;
y3 -= y2;
}
if (x3 < 0)
{
x3 = -x3;
y3 = -y3;
parity ^= 1;
}

s3 = x3 * (double) x3 + y3 * (double) y3;
if (s3 >= s2) break;
s2 = s3;
x1 = x2;
yy1 = y2;
x2 = x3;
y2 = y3;
parity ^= 1;
}

#if (FEWERDIVLATRED | NODIVLATRED)
s3 = x2 * (double) x3 + y2 * (double) y3;
if (s3 < 0)
{
yy1 = 1;
s3 = -s3;
}
else
{
yy1 = 0;
}
s2_2 = s2 + s2;
s2_4 = s2_2 + s2_2;
s2_8 = s2_4 + s2_4;
s2_16 = s2_8 + s2_8;
q = 0;
#ifdef NODIVLATRED
while (s3 > s2_16)
#else
again:
if (s3 > s2_16)
#endif
{
s3 -= s2_16;
q += 16;
}
if (s3 > s2_8)
{
s3 -= s2_8;
q += 8;
}
if (s3 > s2_4)
{
s3 -= s2_4;
q += 4;
}
if (s3 > s2_2)
{
s3 -= s2_2;
q += 2;
}
if (s3 > s2)
{
s3 -= s2;
q ++;
}
#ifndef NODIVLATRED
if (s3 > s2)
{
if (s3 < (s2_16+s2_16)) goto again;
q += (long)mrint(s3/s2);
}
else
#endif
if ((s3+s3) >= s2) q ++;
if (yy1) q = -q;
#else
q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 );
#endif

v2[0] = x2;
v2[1] = y2;
v1[0] = x3 - x2 * q;
v1[1] = y3 - y2 * q;

if (v2[1] < 0)      /* make 'b'  positive */
{
v2[0] = -v2[0];
v2[1] = -v2[1];
}

if (v1[1] < 0)      /* make 'b'  positive */
{
v1[0] = -v1[0];
v1[1] = -v1[1];
}

if (TIME_STATS) latred_time += (get_time() - stime);
//qtime = (get_time() - stime);

//time = get_time();
//ttime = (get_time() - stime);

//printf("old, new = %lf %lf\n", qtime, ttime);
//printf("RATIO: %lf\n", qtime/ttime);
//tot1 += qtime;
//tot2 += ttime;

//cmp_latred(special_q, root, v1,v2, v1a, v2a);

return(parity);
}   /* end of reduce_lattice */

I have a different mod inverse routine that does NOT do the subtractions;
They don't seem to make much difference. I think that I may write
the entire routine in MASM.
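For anyone who wants a baseline to benchmark against: the unrolled routine above is computing a plain extended-Euclid inverse. A compact reference version (my sketch, not Bob's code; `single_modinv_ref` is a made-up name) would be:

```c
/* Plain extended-Euclid modular inverse: returns a^(-1) mod modulus.
   Assumes 0 < a < modulus and gcd(a, modulus) == 1.  This is the
   baseline computation that the unrolled routine above accelerates
   by replacing most divisions with repeated subtraction. */
int single_modinv_ref(int a, int modulus)
{
    int ps1 = 1, ps2 = 0;               /* continuant accumulators */
    int dividend = modulus, divisor = a;
    int parity = 0;                     /* tracks the sign of the result */

    while (divisor > 1) {
        int q   = dividend / divisor;
        int rem = dividend - q * divisor;
        int t   = ps2 + q * ps1;        /* next continuant */
        ps2 = ps1;
        ps1 = t;
        dividend = divisor;
        divisor  = rem;
        parity  ^= 1;
    }
    return parity ? modulus - ps1 : ps1;
}
```

Timing this against the unrolled version on the actual distribution of (a, p) pairs the siever produces would show how much the subtraction cascade is really buying.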

2010-01-08, 15:42   #4
R.D. Silverman

Nov 2003

7460₁₀ Posts

Quote:
 Originally Posted by R.D. Silverman What is your environment? My low-level 64-bit arithmetic is written in VC++ MASM, as opposed to gcc assembler syntax. Right now, I am looking to improve the following: [code snipped] I have a different mod inverse routine that does NOT do the subtractions; they don't seem to make much difference. I think that I may write the entire routine in MASM.
These routines are responsible for a large chunk of the overall
run time. Here is where it happens:

To prepare the sieve we must compute, FOR EACH PRIME p IN THE FACTOR BASE(s), an expression of the form:

r = (a1 + b1 * R)/(a0 + b0 * R) mod p

then compute a reduced lattice from (p,r)

p can be as large as about 10^8. R can be as large as p, and the a, b
values come from the special-q reduced lattice. Therefore, the numerator
and denominator are often larger than 32 bits (but always smaller than 64 bits).
The denominator can be negative.

I compute this by:

numer = modmultadd(b1, R, a1, p)
temp_denom = modmultadd(b0, R, a0, p)
inverse = single_modinv(temp_denom, p)
root = mod_mult(numer, inverse, p)
lattice_reduce(p, root, vector1, vector2)

I gave the modinverse and lattice reduction routines already.
modmultadd is an assembler routine that computes (c*d + e) mod p
mod_mult does the same without the addition. I doubt whether I can
make these any faster. I will post code for all of this if requested.
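In portable C the whole pipeline is short; here is a sketch using int64_t intermediates (since the products overflow 32 bits), with hypothetical names — this is not the MASM code, just the same arithmetic:

```c
#include <stdint.h>

/* (c*d + e) mod p with a 64-bit intermediate; result in [0, p).
   c*d + e may be negative, hence the sign fixup at the end. */
static int32_t modmultadd_c(int32_t c, int32_t d, int32_t e, int32_t p)
{
    int64_t t = (int64_t)c * d + e;
    int32_t r = (int32_t)(t % p);
    return r < 0 ? r + p : r;
}

/* Minimal extended-Euclid inverse (argument assumed coprime to p). */
static int32_t modinv32(int32_t a, int32_t p)
{
    int32_t ps1 = 1, ps2 = 0, dividend = p, divisor = a, parity = 0;
    while (divisor > 1) {
        int32_t q = dividend / divisor, rem = dividend - q * divisor;
        int32_t t = ps2 + q * ps1;
        ps2 = ps1; ps1 = t;
        dividend = divisor; divisor = rem;
        parity ^= 1;
    }
    return parity ? p - ps1 : ps1;
}

/* Root of the sub-lattice: r = (a1 + b1*R) / (a0 + b0*R) mod p. */
static int32_t sublattice_root(int32_t a1, int32_t b1,
                               int32_t a0, int32_t b0,
                               int32_t R, int32_t p)
{
    int32_t numer = modmultadd_c(b1, R, a1, p);
    int32_t denom = modmultadd_c(b0, R, a0, p);
    return (int32_t)(((int64_t)numer * modinv32(denom, p)) % p);
}
```

The assembler versions win by keeping everything in registers and using one idiv per reduction, but this C version is handy for checking results.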

2010-01-08, 16:27   #5
bsquared

"Ben"
Feb 2007

6766₈ Posts

Quote:
 Originally Posted by R.D. Silverman What is your environment? My low-level 64-bit arithmetic is written in VC++ MASM, as opposed to gcc assembler syntax.
I have MSVC 2010 beta 2 on my home machine, and MSVC Express Edition 08 at work, which is what I used to build the previous code you sent me. I think either can deal with your assembler code.

Quote:
 Originally Posted by R.D. Silverman Right now, I am looking to improve the following: [code snipped] I have a different mod inverse routine that does NOT do the subtractions; they don't seem to make much difference. I think that I may write the entire routine in MASM.
I'm familiar with the modular inversion routine; I use it in YAFU as well, although there it is not really a bottleneck, so I haven't tried to improve it. I have tried to beat optimizing compilers with assembly code before and I usually fail... but maybe cmov would help here if the compiler isn't using it.

Also, you could get rid of two comparisons by putting pre-processor #if/#else around TIME_STATS in the modular inversion, but with branch prediction and speculative execution it probably doesn't matter.
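On the cmov point: the trick is to write the conditional correction as a value select rather than control flow, which the compiler can then lower to cmov. A toy example (mine, not from the siever) for a modular add:

```c
#include <stdint.h>

/* (a + b) mod p for 0 <= a, b < p < 2^31, written branch-free:
   the correction is a value select, so the compiler is free to
   emit cmov (or an arithmetic mask) instead of a conditional jump. */
static uint32_t addmod_branchfree(uint32_t a, uint32_t b, uint32_t p)
{
    uint32_t s = a + b;              /* s < 2p, no overflow for p < 2^31 */
    return s - (s >= p ? p : 0);     /* conditional subtract, no branch */
}
```

Whether this wins depends on how predictable the original branch was; the quotient branches in the modinv cascade are data-dependent and badly predicted, which is exactly where cmov tends to pay off.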

If I get time I'll look more at the reduce_lattice routine.
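For reference while reading reduce_lattice: it is Gaussian (Lagrange) reduction of the lattice spanned by (special_q, 0) and (root, 1). A direct version without the shift cascades, sign normalization, or parity tracking (my sketch, hypothetical names) looks like:

```c
/* Gaussian (Lagrange) reduction of the 2-D lattice spanned by
   u = (p, 0) and v = (r, 1).  Each pass peels the nearest integer
   multiple of the shorter vector off the longer one; the shift
   cascade in reduce_lattice is an optimized form of the same
   quotient computation. */
static void gauss_reduce(long p, long r, long v1[2], long v2[2])
{
    long ux = p, uy = 0;
    long vx = r, vy = 1;

    for (;;) {
        double nu = (double)ux * ux + (double)uy * uy;
        double nv = (double)vx * vx + (double)vy * vy;
        if (nu < nv) {                  /* keep u the longer vector */
            long t;
            t = ux; ux = vx; vx = t;
            t = uy; uy = vy; vy = t;
            nv = nu;                    /* norm of the new v */
        }
        /* nearest integer to dot(u,v) / |v|^2 */
        double proj = (double)ux * vx + (double)uy * vy;
        long q = (long)(proj / nv + (proj < 0 ? -0.5 : 0.5));
        if (q == 0) break;              /* basis is reduced */
        ux -= q * vx;
        uy -= q * vy;
    }
    v1[0] = vx; v1[1] = vy;             /* shorter vector */
    v2[0] = ux; v2[1] = uy;
}
```

This makes the structure obvious; the question is whether any of Bob's division-avoiding variants beat the plain nearest-integer quotient on real inputs.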

2010-01-08, 22:00   #6
Wacky

Jun 2003
The Texas Hill Country

3²×11² Posts

Quote:
 Originally Posted by R.D. Silverman What is your environment? My low-level 64-bit arithmetic is written in VC++ MASM, as opposed to gcc assembler syntax.
Bob,

Fortunately, in almost all cases, the assembler syntax does not affect the efficiency of the resulting code. We can, manually if necessary, translate between the dialects.

Richard

2010-01-09, 01:00   #7
R.D. Silverman

Nov 2003

7460₁₀ Posts

Quote:
 Originally Posted by Wacky Bob, Fortunately, in almost all cases, the assembler syntax does not affect the efficiency of the resulting code. We can, manually if necessary, translate between the dialects. Richard
Note that we ALL would benefit from improving this code, since
everyone who has siever code could use any improvements.

The time intensive loop for preparing the sieve looks like the following:
v1, v2 are the reduced lattice vectors, root is the polynomial root mod p,

loop over p in factor base: {

numer = modmultadd(v2[1], root, v2[0], p);
denom = modmultadd(v1[1], root, v1[0], p);
inverse = mod_inverse(denom, p);
root = mod_mult(numer, inverse, p);
lattice_reduce(p, root, subv1, subv2);
}

I posted my code for the mod_inverse and lattice_reduce already.
Here is the code for modmultadd:

Code:
/************************************************************************/
/*                                                                      */
/*   modmultadd(a,b,c,d);  returns (ab + c) mod d;  a,b,c,d 32-bit      */
/*   integers;  returns positive value                                  */
/*                                                                      */
/************************************************************************/

modmultadd_asm(a, b, c, d)
int a, b, c, d;

{   /* start of modmultadd_asm */

_asm
{
mov eax, a          /* compute a*b in edx:eax */
imul b

mov ecx, c
mov ebx, c
sar ebx, 31         /* sign extend c, then add its low word */
add eax, ecx        /* and propagate the carry bit */
adc edx, ebx        /* to the high order word */

idiv d              /* divide by d, remainder in edx */
cmp edx, 0          /* is it >= 0? */
jge SHORT $LABEL1
add edx, d          /* if < 0 add the divisor */
$LABEL1:
mov eax, edx        /* return the remainder in eax */
}

}   /* end of modmultadd_asm */

2010-01-09, 01:11   #8
R.D. Silverman

Nov 2003

22×5×373 Posts

Quote:
 Originally Posted by R.D. Silverman Note that we ALL would benefit from improving this code, since everyone who has siever code could use any improvements. ... I posted my code for the mod_inverse and lattice_reduce already. Here is code for modmultadd: [code snipped]

With respect to modinverse:

The time intensive part is the core computation of:

q = dividend/divisor;
rem = dividend - q*divisor

Note, however, that the Intel/AMD processors produce the remainder
automatically whenever a division is done. There should be no need
to do a mult and subtract; we should be able to retrieve the remainder directly.

Suppose we inline the following:

Code:
__inline int divwithremainder(int a, int b, int *quo)
{
_asm {
mov eax, a
xor edx, edx                /* clear high word of the dividend */
div b
mov edi, quo
mov DWORD PTR [edi], eax    /* store the quotient */
mov eax, edx                /* return the remainder */
}
}

Any guesses as to whether this will be faster than the mult & subt?
If we do

quo = a/b;
rem = a%b;

the division will be done twice. I don't know of any compiler that is
smart enough to recognize that once the division is done, the
remainder is already in a register and hence 'free' (except for a mov
instruction).
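One portable way to sidestep the question is the C standard library's div(), which returns quotient and remainder from a single call; on x86 it maps onto the one idiv that produces both (quotient in eax, remainder in edx):

```c
#include <stdlib.h>

/* One division, both results: div() returns a div_t holding
   quotient and remainder, so no multiply-and-subtract is needed. */
static int divrem(int a, int b, int *quo)
{
    div_t d = div(a, b);
    *quo = d.quot;          /* quotient through the pointer */
    return d.rem;           /* remainder as the return value */
}
```

Either way, checking the generated asm is the only way to confirm whether the compiler actually fused the pair.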

2010-01-09, 01:14   #9
jasonp
Tribal Bullet

Oct 2004

3×1,181 Posts

Quote:
 Originally Posted by R.D. Silverman If we do quo = a/b; rem = a%b; the division will be done twice. I don't know any compiler that is smart enough to recognize that once the division is done, then the remainder is already in a register and hence 'free' (except for a mov instruction)
gcc has been able to do this for years on x86; but prudence dictates checking the generated asm every time.

2010-01-09, 01:15   #10
R.D. Silverman

Nov 2003

2²×5×373 Posts

Quote:
 Originally Posted by R.D. Silverman With respect to modinverse: The time intensive part is the core computation of: q = dividend/divisor; rem = dividend - q*divisor ... Suppose we inline the following: [code snipped] Any guesses as to whether this will be faster than the mult & subt? ... Comments?
Note that mod_inverse and lat_reduce are too big to inline.
Any guesses as to how much inlining modmultadd might save?

(of course I will do the experiment, but that will take time; I'd like to
see some comments before trying out any new code)

2010-01-09, 01:18   #11
jasonp
Tribal Bullet

Oct 2004

3·1,181 Posts

Also, some of the technical report here deals with modular inverses.

