![]() |
![]() |
#1 |
"Bob Silverman"
Nov 2003
North of Boston
22×5×373 Posts |
![]()
I am taking another whack at improving my NFS siever code.
I'd like to start a discussion thread. Is anyone else interested? I will start by posting some benchmark data for my current factorization effort: 2^1161 + 1 (SNFS difficulty 234) My code uses special-q on the integer side. Compiler: Microsoft Visual Studio 2008 Polynomials: x-M, x^6 - x^3 + 1, M=2^129 LP bound 825million. Factor base SIZE = 2.9 million ideals for each side. This translates to a bound of about 48.2 million. Sieve region is 16K x 8K. For special-q near 67 million I get the following data: (with my best compilation effort (2.4GHz duo-core; both cores running) Total time for one special q: 11.5 seconds Primary integer side vector sieve: 2.6 seconds Primary alg side vector sieve: 2.6 seconds Primary integer side line sieve (small primes only) .26 seconds Primary alg side line sieve (small primes only) .26 sec Integer side resieve (for factoring norms) .52 sec Alg side resieve (for factoring norms) .32 sec Preparing sieve starting points: Modular inverses: 1.15 sec Lattice reductions : 1.56 sec Scanning the sieve region for successful hits: .31 sec Preparing sieve region boundaries (I use Pollard style) 1.03 sec The rest is overhead, bookkeeping, splitting cofactors, etc. I'd like to start by improving the mod inverse and latred routines. I have worked on these extensively to no avail. I will work on my siever (handling the buckets) later. |
![]() |
![]() |
![]() |
#2 |
"Ben"
Feb 2007
361910 Posts |
![]()
Sure, I might not be terribly useful to you from a code development standpoint but I'm interested in hearing about it. I can also help build/run benchmark tests if you like.
|
![]() |
![]() |
![]() |
#3 | |
"Bob Silverman"
Nov 2003
North of Boston
22×5×373 Posts |
![]() Quote:
in VC++ MASM, as opposed to gcc assemble syntax. Right now, I am looking to improve the following: Code:
/************************************************************************/ /* */ /* Routine to computer single precision modular inverses */ /* */ /************************************************************************/ int single_modinv_old (int a, int modulus) { /* start of single_modinv */ int ps1, ps2, parity, dividend, divisor, rem, q, t; double t1; if (TIME_STATS) t1 = get_time(); if (a == 1) return 1; q = modulus / a; rem = modulus - a * q; dividend = a; divisor = rem; ps1 = q; ps2 = 1; parity = ~0; while (divisor > 1) { rem = dividend - divisor; t = rem - divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; t -= divisor; if (t >= 0) { q += ps1; rem = t; if (rem >= divisor) { q = dividend/divisor; rem = dividend - q * divisor; q *= ps1; }}}}}}}}} q += ps2; parity = ~parity; dividend = divisor; divisor = rem; ps2 = ps1; ps1 = q; } if (TIME_STATS) invtime += (get_time() - t1); if (parity == 0) return (ps1); else return (modulus - ps1); } /* end of single_modinv */ /************************************************************************/ /* */ /* routine to compute two reduced lattice vectors */ /* modified from AKL; which explains stylistic differences in code */ /* */ /************************************************************************/ reduce_lattice(special_q, root, v1, v2) int special_q, root; int *v1, *v2; { /* start; of reduce_lattice */ double stime; long x1; long yy1 = 0; long x2; long y2 = 1; long x3; long y3; double s2; double s3; long q; long sx2; long parity = 0; //int v1a[2], v2a[2]; #if (FEWERDIVLATRED | NODIVLATRED) double s2_2; double s2_4; double s2_8; double s2_16; #endif if (TIME_STATS) stime = get_time(); x1 = special_q; x2 = root; s2 = x2 * (double) x2 + y2 * (double) y2; for (;;) { x3 = x1; y3 = yy1; sx2 = (x2 << 4); while (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 4); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 3); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 2); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 1); } if (x3 > x2 ) { x3 -= x2; y3 -= y2; } if (x3 > (x2 >> 1)) { x3 -= x2; y3 -= y2; } if (x3 < 0) { x3 = -x3; y3 = -y3; parity ^= 1; } s3 = x3 * (double) x3 + y3 * (double) y3; if (s3 >= s2) break; s2 = s3; x1 = x2; yy1 = y2; x2 = x3; y2 = y3; parity ^= 1; } #if (FEWERDIVLATRED | NODIVLATRED) s3 = x2 * (double) x3 + y2 * (double) y3; if (s3 < 0) { yy1 = 1; s3 = -s3; } else { yy1 = 0; } s2_2 = s2 + s2; s2_4 = s2_2 + s2_2; s2_8 = s2_4 + s2_4; s2_16 = s2_8 + s2_8; q = 0; #ifdef NODIVLATRED while (s3 > s2_16) #else again: if (s3 > s2_16) #endif { s3 -= s2_16; q += 16; } if (s3 > s2_8) { s3 -= s2_8; q += 8; } if (s3 > s2_4) { s3 -= s2_4; q += 4; } if (s3 > s2_2) { s3 -= s2_2; q += 2; } if (s3 > s2) { s3 -= s2; q ++; } #ifndef NODIVLATRED if (s3 > s2) { if (s3 < (s2_16+s2_16)) goto again; q += (long)mrint(s3/s2); } else #endif if ((s3+s3) >= s2) q ++; if (yy1) q = -q; #else q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 ); #endif v2[0] = x2; v2[1] = y2; v1[0] = x3 - x2 * q; v1[1] = y3 - y2 * q; if (v2[1] < 0) /* make 'b' positive */ { v2[0] = -v2[0]; v2[1] = -v2[1]; } if (v1[1] < 0) /* make 'b' positive */ { v1[0] = -v1[0]; v1[1] = -v1[1]; } if (TIME_STATS) latred_time += (get_time() - stime); //qtime = (get_time() - stime); //time = get_time(); //newest_latred_exp(special_q, root, v1a, v2a); //ttime = (get_time() - stime); //printf("old, new = %lf %lf\n", qtime, ttime); //printf("RATIO: %lf\n", qtime/ttime); //tot1 += qtime; //tot2 += ttime; //cmp_latred(special_q, root, v1,v2, v1a, v2a); return(parity); } /* end of reduce_lattice */ I have a different mod inverse routine that does NOT do the subtractions; They don't seem to make much difference. I think that I may write the entire routine in MASM. |
|
![]() |
![]() |
![]() |
#4 | |
"Bob Silverman"
Nov 2003
North of Boston
22×5×373 Posts |
![]() Quote:
run time. Here is where it happens: To prepare the sieve we must compute, FOR EACH PRIME p IN THE FACTOR BASE(s) an expression of the form: r = (a1 + b1 * R)/(a0 + b0 * R) mod p then compute a reduced lattice from (p,r) p can be as large as about 10^8. R can be as large as p, and the a,b values come from the special q reduced lattice. Therefore, the numerator and denominator are often larger than 32-bits (but always smaller than 64). The denominator can be negative. I compute this by: numer = modmultadd(b1, R, a1, p) temp_denom = modmultadd(b0, R, a0, p) denom = single_modinv(temp_denom, p) root = mod_mult(numer, denom, p) lattice_reduce(p, r, vector1, vector2) I gave the modinverse and lattice reduction routines already. modmultadd is an assembler routine that computes (c*d + e) mod p mod_mult does the same without the addition. I doubt whether I can make these any faster. I will post code for all of this if requested. |
|
![]() |
![]() |
![]() |
#5 | ||
"Ben"
Feb 2007
E2316 Posts |
![]() Quote:
Quote:
Also, you could get rid of two comparison statements by using pre-processor #if/#else around TIME_STATS in the modular inversion, but with the obvious branch prediction and speculative execution it probably doesn't matter. If I get time I'll look more at the reduce_lattice routine. |
||
![]() |
![]() |
![]() |
#6 | |
Jun 2003
The Texas Hill Country
100010000012 Posts |
![]() Quote:
Fortunately, in almost all cases, the assembler syntax does not affect the efficiency of the resulting code. We can, manually if necessary, translate between the dialects. Richard |
|
![]() |
![]() |
![]() |
#7 | |
"Bob Silverman"
Nov 2003
North of Boston
746010 Posts |
![]() Quote:
everyone who has siever code could use any improvements. The time intensive loop for preparing the sieve looks like the following: v1, v2 are the reduced lattice vectors, root is the polynomial root mod p, loop over p in factor base: { numer = modmultadd(v2[1], root, v2[0], p); denom = modmultadd(v1[1], root, v1[0], p); inverse = mod_inverse(denom, p); root = mod_mult(numer, inverse, p); lattice_reduce(p, root, subv1, subv2); } I posted my code for the mod_inverse and lattice_reduce already. Here is code for modmultadd: Code:
/************************************************************************/ /* */ /* modmultadd(a,b,c,d); returns (ab + c) mod d; a,b,c,d 32 bit */ /* integers; returns positive value */ /* */ /************************************************************************/ modmultadd_asm(a,b,c,d) int a,b,c,d; { /* start of modmultadd_asm */ _asm { mov eax, a /* compute a*b in edx:eax*/ imul b mov ecx,c mov ebx,c sar ebx,31 /* sign extend, then add c*/ add eax,ecx adc edx,ebx /* carry bit to high order word*/ idiv d /* div by d, remainder in edx */ cmp edx, 0 /* is it > 0? */ jge SHORT $LABEL1 add edx, d /* if < 0 add the divisor */ $LABEL1: mov eax , edx /* return the remainder in eax */ } } /* end of modmultadd_asm */ |
|
![]() |
![]() |
![]() |
#8 | |
"Bob Silverman"
Nov 2003
North of Boston
22·5·373 Posts |
![]() Quote:
With respect to modinverse: The time intensive part is the core computation of: q = dividend/divisor; rem = dividend - q*divisor Note however, that the Intel/AMD processors return the quotient automatically whenever a division is done. There should be no need to do a mult and subt. We should be able to retrieve the remainder. Suppose we inline the following: Code:
__inline int divwithremainder(a,b, &quo) _asm { mov eax, a div b mov edi, x mov DWORD PTR [edi], eax // get quotient mov eax, edx // return remainder } Any guesses as to whether this will be faster than the mult & subt? quo = a/b; rem = a%b; the division will be done twice. I don't know any compiler that is smart enough to recognize that once the division is done, then the remainder is already in a register and hence 'free' (except for a mov instruction) Comments? |
|
![]() |
![]() |
![]() |
#9 |
Tribal Bullet
Oct 2004
5×709 Posts |
![]()
gcc has been able to do this for years on x86; but prudence dictates checking the generated asm every time.
|
![]() |
![]() |
![]() |
#10 | |
"Bob Silverman"
Nov 2003
North of Boston
22×5×373 Posts |
![]() Quote:
Any guesses as to how much inlining modmultadd might save? (of course I will do the experiment, but that will take time; I'd like to see some comments before trying out any new code) |
|
![]() |
![]() |
![]() |
Thread Tools | |
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
gcc optimization/intrinsics | R.D. Silverman | Factoring | 12 | 2015-09-15 08:51 |
Program optimization | henryzz | Programming | 17 | 2012-09-26 13:21 |
Possible optimization for GIMPS | alpertron | Math | 3 | 2012-08-13 16:08 |
Size optimization | Sleepy | Msieve | 14 | 2011-10-20 10:27 |
ASM Optimization | Cyclamen Persicum | Hardware | 4 | 2004-05-26 07:51 |