 mersenneforum.org Calling hotshot coders: NFS code challenge
 Register FAQ Search Today's Posts Mark Forums Read  2005-07-25, 13:39 #1 R.D. Silverman   Nov 2003 22×5×373 Posts Calling hotshot coders: NFS code challenge George Woltman has made some nice improvements to my lattice siever. I have made some as well. To speed one part of the code, it is essential that we speed two routines in particular. The first is a single precision modular inverse that calculates 1/a mod p, for p prime. I include my best effort below. The second is a routine that does lattice basis reduction. I include a routine below (courtesy of Arjen Lenstra) Would anyone like to take a wack at speeding these up???? It would be very useful. Think of it as a challenge. /************************************************************************/ /* */ /* returns Multiplicative inverse of a mod modulus */ /* */ /************************************************************************/ single_modinv(a,modulus) int a,modulus; { /* start of single_modinv */ register unsigned int ps,ps1,ps2=1,parity,dividend,divisor,rem,q; double t1; q = modulus/a; rem = modulus - a*q; dividend = a; divisor = rem; ps1 = q; parity = 0; while (rem != 0) { q = 1; rem = dividend - divisor; if (rem >= divisor) { q = 2; rem -= divisor; if (rem >= divisor) { q = 3; rem -= divisor; if (rem >= divisor) { q = 4; rem -= divisor; if (rem >= divisor) { q = 5; rem -= divisor; } if (rem >= divisor) { q = 6; rem -= divisor; } if (rem >= divisor) { q = dividend/divisor; rem = dividend - q*divisor; } } } } ps = q*ps1 + ps2; ps2 = ps1; ps1 = ps; dividend = divisor; divisor = rem; parity = 1 - parity; } if (parity == 0) return(ps2); else return(modulus-ps2); } /* end of single_modinv */ /************************************************************************/ /* */ /* routine to compute two reduced lattice vectors */ /* borrowed from AKL; which explains stylistic differences in code */ /* */ /************************************************************************/ reduce_lattice(special_q, root, v1, v2) int special_q, root; int *v1, *v2; { /* start of reduce_lattice */ double stime; long x1; long yy1 = 0; long x2; long y2 = 1; long x3; long y3; double s2; double s3; long q; long sx2; long parity = 0; #if (FEWERDIVLATRED | NODIVLATRED) double s2_2; double s2_4; double s2_8; double s2_16; #endif stime = get_time(); x1 = special_q; x2 = root; s2 = x2 * (double) x2 + y2 * (double) y2; for (;;) { x3 = x1; y3 = yy1; sx2 = (x2 << 4); while (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 4); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 3); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 2); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 1); } if (x3 > x2 ) { x3 -= x2; y3 -= y2; } if (x3 > (x2 >> 1)) { x3 -= x2; y3 -= y2; } if (x3 < 0) { x3 = -x3; y3 = -y3; parity ^= 1; } s3 = x3 * (double) x3 + y3 * (double) y3; if (s3 >= s2) break; s2 = s3; x1 = x2; yy1 = y2; x2 = x3; y2 = y3; parity ^= 1; } #if (FEWERDIVLATRED | NODIVLATRED) s3 = x2 * (double) x3 + y2 * (double) y3; if (s3 < 0) { yy1 = 1; s3 = -s3; } else { yy1 = 0; } s2_2 = s2 + s2; s2_4 = s2_2 + s2_2; s2_8 = s2_4 + s2_4; s2_16 = s2_8 + s2_8; q = 0; #ifdef NODIVLATRED while (s3 > s2_16) #else again: if (s3 > s2_16) #endif { s3 -= s2_16; q += 16; } if (s3 > s2_8) { s3 -= s2_8; q += 8; } if (s3 > s2_4) { s3 -= s2_4; q += 4; } if (s3 > s2_2) { s3 -= s2_2; q += 2; } if (s3 > s2) { s3 -= s2; q ++; } #ifndef NODIVLATRED if (s3 > s2) { if (s3 < (s2_16+s2_16)) goto again; q += (long)mrint(s3/s2); } else #endif if ((s3+s3) >= s2) q ++; if (yy1) q = -q; #else q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 ); #endif v2 = x2; v2 = y2; v1 = x3 - x2 * q; v1 = y3 - y2 * q; if (v2 < 0) /* make 'b' positive */ { v2 = -v2; v2 = -v2; } if (v1 < 0) /* make 'b' positive */ { v1 = -v1; v1 = -v1; } latred_time += (get_time() - stime); return(parity); } /* end of reduce_lattice */   2005-07-25, 15:21   #2
R.D. Silverman

Nov 2003

164448 Posts Quote:
 Originally Posted by R.D. Silverman
Is anyone interested in this challenge????   2005-07-25, 17:05 #3 tom11784   Aug 2003 Upstate NY, USA 14616 Posts do you have examples of (a,modulus) pairs you plan to run through the first program? that would help when it comes to what should be done for loop structure and whether to break into cases, etc btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps Last fiddled with by tom11784 on 2005-07-25 at 17:12   2005-07-25, 17:20 #4 akruppa   "Nancy" Aug 2002 Alexandria 2,467 Posts Hi Bob, how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ? Alex   2005-07-25, 17:42   #5
R.D. Silverman

Nov 2003

11101001001002 Posts Quote:
 Originally Posted by akruppa Hi Bob, how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ? Alex
Assume p <= 2^31-1.

Note: faster routines of this type would be useful to you too...    2005-07-25, 17:45   #6
R.D. Silverman

Nov 2003

22·5·373 Posts Quote:
 Originally Posted by tom11784 do you have examples of (a,modulus) pairs you plan to run through the first program? that would help when it comes to what should be done for loop structure and whether to break into cases, etc btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps
NFS is the Number Field Sieve.

The routine should be as fast as possible AVERAGED over MANY different
a,p pairs. Modulus is <= 2^31-1 a is uniformly distributed in [1, p-1]   2005-07-26, 01:34 #7 xenon   Jul 2004 5·7 Posts Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject. If a is in [1, p-1], does that mean my code do not have to handle negative values of a? Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1? Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks. Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet.   2005-07-26, 11:20 #8 tom11784   Aug 2003 Upstate NY, USA 2×163 Posts I hadn't thought about that, but had noticed with the same loop unroll you can get rid of the need for the "parity" variable by adding a conditional return in the middle. In a bit I'm going to scribe this code as written above to my laptop to run some time trials and comapre against what I wrote from 2 years ago.   2005-07-26, 11:45   #9
R.D. Silverman

Nov 2003

22×5×373 Posts Quote:
 Originally Posted by xenon Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject. If a is in [1, p-1], does that mean my code do not have to handle negative values of a? Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1? Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks. Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet.
Hi,

you may assume that a is positive.
The repeated subtraction is MUCH faster than division.

(1) Branch misprediction does not happen often with these short jumps
(2) There is a high probability that the subtractions will occur. See
Knuth. The partial quotient is 1 something like 40% of the time, and
less than 5 about 80% of the time; division takes 37 clocks.

The level to which I take the subtractions has been carefully studied
and benchmarked. The last subtraction in the code gives a very marginal
speed improvement.

Unrolling might help.

Strangely, A. Lenstra has a similar routine, but his uses a variant of
Lehmer's binary method. The time for his and mine are almost exactly the
same. I am considering some kind of hybrid/combination of the two
methods; a binary approach that first computes p mod a, (to take
advantage of the time when a is somewhat small relative to p), then
combines binary shifts with partial subtractions. The binary method has
a faster iteration time, but does more iterations.

I strongly prefer C. I too could do it in assembler, but prefer to
use assembler only for "small" leaf routines. Of course, if someone can
double the speed (unlikely) by using assembler, I would be quite
grateful.   2005-07-26, 12:07 #10 xenon   Jul 2004 5×7 Posts Algorithm wise, I can't help much. This is the only algorithm I know. I can do it in assembler, that's the only thing I can help now. I did my first attempt, using Pentium III, timing using RDTSC. The timing below shows how many CPU cycles taken to do one inverse, on average. p=65521, averaging a from [1,65520]. Result = 405 clocks/inverse p=2^31-1, averaging a from [1,1 million]. Result = 510 clocks/inverse This is without repeated subtraction. I try with repeated subtraction now.   2005-07-26, 13:14 #11 tom11784   Aug 2003 Upstate NY, USA 14616 Posts Does 70 cycles per exponent sound right? It seems a tad low to me, but that's what I get for an average when I run p=2^31-1 and a=1 to 1M writing just the inverses to a file. It says it can do a=1 to 10M in less than 12 seconds on my 1.8GHz machine if I don't do any file or screen output, but I can't believe that (2.5 cycles per?) Here is the following list of times for p=2^31-1, with your main loop unrolled which costs a comparison, but allows for a nicer ps1,ps2 relation as a VS.net console project: No sort of confirmation - a=1 to 10M - 12 seconds Inverses only to a file - a=1 to 1M - 36 seconds "The inverse of a modulus modulus is inverse" to file - a=1 to 500K - 53 seconds Inverses only to the screen - a=1 to 100K - 27 seconds "The inverse of a modulus modulus is inverse" to screen - a=1 to 50K - 68 seconds All changes I made to remove the subtraction, etc. yielded a less than 1 second difference. Sorry I couldn't help more.   Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post Prime95 GPU Computing 16 2015-09-29 18:06 ET_ GPU Computing 5 2014-01-26 13:58 frmky NFS@Home 25 2013-10-16 15:58 xenon Programming 6 2005-06-02 13:26 synergy Miscellaneous Math 17 2004-10-26 15:26

All times are UTC. The time now is 18:38.

Sun Nov 28 18:38:55 UTC 2021 up 128 days, 13:07, 0 users, load averages: 0.89, 0.94, 1.02