2010-01-08, 13:32  #1 
Nov 2003
1D24_{16} Posts 
NFS Optimization
I am taking another whack at improving my NFS siever code.
I'd like to start a discussion thread. Is anyone else interested? I will start by posting some benchmark data for my current factorization effort: 2^1161 + 1 (SNFS difficulty 234). My code uses special-q on the integer side.

Compiler: Microsoft Visual Studio 2008
Polynomials: x - M and x^6 - x^3 + 1, with M = 2^129
LP bound: 825 million
Factor base size: 2.9 million ideals for each side, which translates to a bound of about 48.2 million
Sieve region: 16K x 8K

For special-q near 67 million I get the following data with my best compilation effort (2.4 GHz dual-core; both cores running):

Total time for one special-q: 11.5 seconds
Primary integer side vector sieve: 2.6 seconds
Primary alg side vector sieve: 2.6 seconds
Primary integer side line sieve (small primes only): 0.26 seconds
Primary alg side line sieve (small primes only): 0.26 seconds
Integer side resieve (for factoring norms): 0.52 seconds
Alg side resieve (for factoring norms): 0.32 seconds
Preparing sieve starting points:
   Modular inverses: 1.15 seconds
   Lattice reductions: 1.56 seconds
Scanning the sieve region for successful hits: 0.31 seconds
Preparing sieve region boundaries (I use Pollard style): 1.03 seconds

The rest is overhead: bookkeeping, splitting cofactors, etc. I'd like to start by improving the mod inverse and lattice reduction routines; I have worked on these extensively to no avail. I will work on my siever (handling the buckets) later. 
2010-01-08, 14:14  #2 
"Ben"
Feb 2007
110111110110_{2} Posts 
Sure. I might not be terribly useful to you from a code development standpoint, but I'm interested in hearing about it. I can also help build/run benchmark tests if you like.

2010-01-08, 15:26  #3  
Nov 2003
2^{2}×5×373 Posts 
Quote:
in VC++ MASM, as opposed to gcc assembler syntax. Right now, I am looking to improve the following: Code:
/************************************************************************/
/*                                                                      */
/*   Routine to compute single precision modular inverses               */
/*                                                                      */
/************************************************************************/

int single_modinv_old (int a, int modulus)
{   /* start of single_modinv */
int ps1, ps2, parity, dividend, divisor, rem, q, t;
double t1;

if (TIME_STATS) t1 = get_time();

if (a == 1) return 1;
q = modulus / a;
rem = modulus - a * q;
dividend = a;
divisor = rem;
ps1 = q;
ps2 = 1;
parity = ~0;

while (divisor > 1)
   {
   rem = dividend - divisor;
   t = rem - divisor;
   if (t >= 0) { q += ps1; rem = t; t -= divisor;
   if (t >= 0) { q += ps1; rem = t; t -= divisor;
   if (t >= 0) { q += ps1; rem = t; t -= divisor;
   if (t >= 0) { q += ps1; rem = t; t -= divisor;
   if (t >= 0) { q += ps1; rem = t; t -= divisor;
   if (t >= 0) { q += ps1; rem = t; t -= divisor;
   if (t >= 0) { q += ps1; rem = t; t -= divisor;
   if (t >= 0) { q += ps1; rem = t;
   if (rem >= divisor)
      {
      q = dividend/divisor;
      rem = dividend - q * divisor;
      q *= ps1;
      } }}}}}}}}
   q += ps2;
   parity = ~parity;
   dividend = divisor;
   divisor = rem;
   ps2 = ps1;
   ps1 = q;
   }

if (TIME_STATS) invtime += (get_time() - t1);

if (parity == 0) return (ps1);
else return (modulus - ps1);
}   /* end of single_modinv */

/************************************************************************/
/*                                                                      */
/*   routine to compute two reduced lattice vectors                     */
/*   modified from AKL; which explains stylistic differences in code    */
/*                                                                      */
/************************************************************************/

reduce_lattice(special_q, root, v1, v2)
int special_q, root;
int *v1, *v2;
{   /* start of reduce_lattice */
double stime;
long x1;
long yy1 = 0;
long x2;
long y2 = 1;
long x3;
long y3;
double s2;
double s3;
long q;
long sx2;
long parity = 0;
//int v1a[2], v2a[2];

#if (FEWERDIVLATRED || NODIVLATRED)
double s2_2;
double s2_4;
double s2_8;
double s2_16;
#endif

if (TIME_STATS) stime = get_time();

x1 = special_q;
x2 = root;
s2 = x2 * (double) x2 + y2 * (double) y2;

for (;;)
   {
   x3 = x1;
   y3 = yy1;
   sx2 = (x2 << 4);
   while (x3 > sx2) { x3 -= sx2; y3 -= (y2 << 4); }
   sx2 >>= 1;
   if (x3 > sx2) { x3 -= sx2; y3 -= (y2 << 3); }
   sx2 >>= 1;
   if (x3 > sx2) { x3 -= sx2; y3 -= (y2 << 2); }
   sx2 >>= 1;
   if (x3 > sx2) { x3 -= sx2; y3 -= (y2 << 1); }
   if (x3 > x2)  { x3 -= x2; y3 -= y2; }
   if (x3 > (x2 >> 1)) { x3 -= x2; y3 -= y2; }
   if (x3 < 0) { x3 = -x3; y3 = -y3; parity ^= 1; }
   s3 = x3 * (double) x3 + y3 * (double) y3;
   if (s3 >= s2) break;
   s2 = s3;
   x1 = x2;  yy1 = y2;
   x2 = x3;  y2 = y3;
   parity ^= 1;
   }

#if (FEWERDIVLATRED || NODIVLATRED)
s3 = x2 * (double) x3 + y2 * (double) y3;
if (s3 < 0) { yy1 = 1; s3 = -s3; }
else        { yy1 = 0; }
s2_2 = s2 + s2;
s2_4 = s2_2 + s2_2;
s2_8 = s2_4 + s2_4;
s2_16 = s2_8 + s2_8;
q = 0;
#ifdef NODIVLATRED
while (s3 > s2_16)
#else
again:
if (s3 > s2_16)
#endif
   { s3 -= s2_16; q += 16; }
if (s3 > s2_8) { s3 -= s2_8; q += 8; }
if (s3 > s2_4) { s3 -= s2_4; q += 4; }
if (s3 > s2_2) { s3 -= s2_2; q += 2; }
if (s3 > s2)   { s3 -= s2;   q++;   }
#ifndef NODIVLATRED
if (s3 > s2)
   {
   if (s3 < (s2_16 + s2_16)) goto again;
   q += (long) mrint(s3/s2);
   }
else
#endif
if ((s3 + s3) >= s2) q++;
if (yy1) q = -q;
#else
q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 );
#endif

v2[0] = x2;
v2[1] = y2;
v1[0] = x3 - x2 * q;
v1[1] = y3 - y2 * q;

if (v2[1] < 0)          /* make 'b' positive */
   { v2[0] = -v2[0]; v2[1] = -v2[1]; }
if (v1[1] < 0)          /* make 'b' positive */
   { v1[0] = -v1[0]; v1[1] = -v1[1]; }

if (TIME_STATS) latred_time += (get_time() - stime);

//qtime = (get_time() - stime);
//stime = get_time();
//newest_latred_exp(special_q, root, v1a, v2a);
//ttime = (get_time() - stime);
//printf("old, new = %lf %lf\n", qtime, ttime);
//printf("RATIO: %lf\n", qtime/ttime);
//tot1 += qtime;
//tot2 += ttime;
//cmp_latred(special_q, root, v1, v2, v1a, v2a);

return(parity);
}   /* end of reduce_lattice */
I have a different mod inverse routine that does NOT do the subtractions; they don't seem to make much difference. I think that I may write the entire routine in MASM. 
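Since the thread is about speeding up the modular inverse, here is a sketch of one classical alternative worth benchmarking against it: the binary ("right-shift") extended GCD, which replaces division with shifts and subtractions. This is my own illustration, not code from the thread; it assumes an odd modulus, which holds for the odd factor-base primes in question.

```c
#include <assert.h>

/* Division-free modular inverse via the binary extended GCD.
   Assumes gcd(a, modulus) == 1 and modulus is odd.            */
static int single_modinv_bin(int a, int modulus)
{
    int u = a, v = modulus;
    int x1 = 1, x2 = 0;

    while (u != 1 && v != 1) {
        while ((u & 1) == 0) {          /* u even: halve it         */
            u >>= 1;
            if (x1 & 1) x1 += modulus;  /* keep x1 halvable         */
            x1 >>= 1;
        }
        while ((v & 1) == 0) {          /* v even: halve it         */
            v >>= 1;
            if (x2 & 1) x2 += modulus;
            x2 >>= 1;
        }
        if (u >= v) { u -= v; x1 -= x2; }   /* subtract the smaller */
        else        { v -= u; x2 -= x1; }
    }
    if (u == 1)
        return ((x1 % modulus) + modulus) % modulus;
    return ((x2 % modulus) + modulus) % modulus;
}
```

Whether this beats the q-estimation loop above depends heavily on the operand sizes and the branch predictor, so it would need the same kind of timing instrumentation to judge.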

2010-01-08, 15:42  #4  
Nov 2003
7460_{10} Posts 
Quote:
run time. Here is where it happens. To prepare the sieve we must compute, FOR EACH PRIME p IN THE FACTOR BASE(s), an expression of the form

r = (a1 + b1 * R) / (a0 + b0 * R) mod p

and then compute a reduced lattice from (p, r). p can be as large as about 10^8. R can be as large as p, and the a, b values come from the special-q reduced lattice. Therefore, the numerator and denominator are often larger than 32 bits (but always smaller than 64). The denominator can be negative. I compute this by:

numer = modmultadd(b1, R, a1, p)
temp_denom = modmultadd(b0, R, a0, p)
denom = single_modinv(temp_denom, p)
root = mod_mult(numer, denom, p)
lattice_reduce(p, root, vector1, vector2)

I gave the mod inverse and lattice reduction routines already. modmultadd is an assembler routine that computes (c*d + e) mod p; mod_mult does the same without the addition. I doubt whether I can make these any faster. I will post code for all of this if requested. 
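To make the sizes concrete, the per-prime computation above can be written portably with 64-bit intermediates. This is my own sketch, not the poster's code: the names modmultadd64, inv_mod, and transform_root are mine, and inv_mod is a plain extended-Euclid stand-in for single_modinv.

```c
#include <stdint.h>

/* (b*R + a) mod p with the result forced into [0, p), even when
   b or a is negative; the 64-bit product replaces edx:eax.       */
static int32_t modmultadd64(int32_t b, int32_t R, int32_t a, int32_t p)
{
    int64_t t = ((int64_t)b * R + a) % p;
    return (int32_t)(t < 0 ? t + p : t);
}

/* Plain extended Euclid, standing in for the tuned inverse.      */
static int32_t inv_mod(int32_t a, int32_t p)
{
    int64_t t = 0, newt = 1, r = p, newr = a, q, tmp;
    while (newr != 0) {
        q = r / newr;
        tmp = t - q * newt;  t = newt;  newt = tmp;
        tmp = r - q * newr;  r = newr;  newr = tmp;
    }
    return (int32_t)(t < 0 ? t + p : t);
}

/* r = (a1 + b1*R) / (a0 + b0*R) mod p, as described in the post. */
static int32_t transform_root(int32_t a1, int32_t b1,
                              int32_t a0, int32_t b0,
                              int32_t R, int32_t p)
{
    int32_t numer = modmultadd64(b1, R, a1, p);
    int32_t denom = modmultadd64(b0, R, a0, p);
    int32_t inv   = inv_mod(denom, p);
    return (int32_t)((int64_t)numer * inv % p);
}
```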

2010-01-08, 16:27  #5  
"Ben"
Feb 2007
6766_{8} Posts 
Quote:
Quote:
Also, you could get rid of two comparisons by using preprocessor #if/#else around TIME_STATS in the modular inversion, although with branch prediction and speculative execution it probably doesn't matter. If I get time I'll look more at the reduce_lattice routine. 
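The suggestion above can be sketched as a pair of macros. TIME_STATS and get_time() follow the poster's code; the macro names are hypothetical ones of mine. When TIME_STATS is 0 or undefined, the calls vanish at compile time, so no run-time comparison remains at all.

```c
/* Compile-time switch for timing: with TIME_STATS off, both
   macros expand to nothing, so there is no branch to predict.   */
#if TIME_STATS
#define TIMER_START(t)      ((t) = get_time())
#define TIMER_STOP(acc, t)  ((acc) += get_time() - (t))
#else
#define TIMER_START(t)      ((void)0)
#define TIMER_STOP(acc, t)  ((void)0)
#endif
```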

2010-01-08, 22:00  #6  
Jun 2003
The Texas Hill Country
3^{2}×11^{2} Posts 
Quote:
Fortunately, in almost all cases, the assembler syntax does not affect the efficiency of the resulting code. We can, manually if necessary, translate between the dialects.

Richard 

2010-01-09, 01:00  #7  
Nov 2003
7460_{10} Posts 
Quote:
everyone who has siever code could use any improvements. The time-intensive loop for preparing the sieve looks like the following (v1, v2 are the reduced lattice vectors; root is the polynomial root mod p):

loop over p in factor base:
{
    numer = modmultadd(v2[1], root, v2[0], p);
    denom = modmultadd(v1[1], root, v1[0], p);
    inverse = mod_inverse(denom, p);
    root = mod_mult(numer, inverse, p);
    lattice_reduce(p, root, subv1, subv2);
}

I posted my code for the mod_inverse and lattice_reduce already. Here is code for modmultadd: Code:
/************************************************************************/
/*                                                                      */
/*   modmultadd(a,b,c,d);  returns (a*b + c) mod d;  a,b,c,d 32-bit     */
/*   integers; returns positive value                                   */
/*                                                                      */
/************************************************************************/

modmultadd_asm(a,b,c,d)
int a,b,c,d;
{   /* start of modmultadd_asm */

_asm
   {
   mov   eax, a          /* compute a*b in edx:eax       */
   imul  b
   mov   ecx, c
   mov   ebx, c
   sar   ebx, 31         /* sign extend, then add c      */
   add   eax, ecx
   adc   edx, ebx        /* carry bit to high order word */
   idiv  d               /* div by d, remainder in edx   */
   cmp   edx, 0          /* is it > 0?                   */
   jge   SHORT $LABEL1
   add   edx, d          /* if < 0 add the divisor       */
$LABEL1:
   mov   eax, edx        /* return the remainder in eax  */
   }
}   /* end of modmultadd_asm */
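For cross-checking the assembler against a reference, a portable C version can use a 64-bit intermediate in place of the edx:eax pair. This is my own sketch (the name modmultadd_c is mine), matching the stated contract: (a*b + c) mod d with a nonnegative result.

```c
#include <stdint.h>

/* Portable reference for modmultadd_asm: (a*b + c) mod d, with
   the result adjusted into [0, d) when the C remainder is
   negative, mirroring the jge/add fixup in the assembler.       */
static int modmultadd_c(int a, int b, int c, int d)
{
    int64_t r = ((int64_t)a * b + c) % d;
    return (int)(r < 0 ? r + d : r);
}
```

Running both over a range of random operands is a quick way to validate any future rewrite of the assembler routine.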

2010-01-09, 01:11  #8  
Nov 2003
2^{2}×5×373 Posts 
Quote:
With respect to mod inverse: the time-intensive part is the core computation of

q = dividend / divisor;
rem = dividend - q * divisor;

Note, however, that the Intel/AMD divide instruction produces both the quotient and the remainder. There should be no need to do a mult and subtract; we should be able to retrieve the remainder directly. Suppose we inline the following: Code:
__inline int divwithremainder(int a, int b, int *quo)
{
_asm
   {
   mov   eax, a
   cdq                          /* sign-extend eax into edx for idiv */
   idiv  b
   mov   edi, quo
   mov   DWORD PTR [edi], eax   /* store the quotient                */
   mov   eax, edx               /* return the remainder in eax       */
   }
}
Any guesses as to whether this will be faster than the mult & subtract? With

quo = a/b;
rem = a%b;

the division will be done twice. I don't know of any compiler that is smart enough to recognize that once the division is done, the remainder is already in a register and hence 'free' (except for a mov instruction). Comments? 
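For comparison, in portable C the same pairing can be written either as adjacent / and % (which modern compilers typically fuse into a single idiv) or with div() from <stdlib.h>, which is specified to return both results at once. This is my own sketch; div_with_rem and div_with_rem_std are illustrative names.

```c
#include <stdlib.h>

/* Adjacent quotient and remainder: compilers normally emit one
   idiv and reuse both results.                                  */
static int div_with_rem(int a, int b, int *quo)
{
    *quo = a / b;
    return a % b;
}

/* div() makes the pairing explicit in the standard library.     */
static int div_with_rem_std(int a, int b, int *quo)
{
    div_t r = div(a, b);    /* quotient and remainder together   */
    *quo = r.quot;
    return r.rem;
}
```

Either form sidesteps inline assembly entirely, at the cost of trusting the compiler; inspecting the generated code for the hot loop would settle which of the three variants wins.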

2010-01-09, 01:14  #9 
Tribal Bullet
Oct 2004
3×1,181 Posts 
gcc has been able to do this for years on x86, but prudence dictates checking the generated asm every time.

2010-01-09, 01:15  #10  
Nov 2003
2^{2}×5×373 Posts 
Quote:
Any guesses as to how much inlining modmultadd might save? (Of course I will do the experiment, but that will take time; I'd like to see some comments before trying out any new code.) 
