20050725, 13:39  #1 
Nov 2003
7460_{10} Posts 
Calling hotshot coders: NFS code challenge
George Woltman has made some nice improvements to my lattice siever.
I have made some as well. To speed one part of the code, it is essential that we speed two routines in particular. The first is a single precision modular inverse that calculates 1/a mod p, for p prime. I include my best effort below. The second is a routine that does lattice basis reduction. I include a routine below (courtesy of Arjen Lenstra) Would anyone like to take a wack at speeding these up???? It would be very useful. Think of it as a challenge. /************************************************************************/ /* */ /* returns Multiplicative inverse of a mod modulus */ /* */ /************************************************************************/ single_modinv(a,modulus) int a,modulus; { /* start of single_modinv */ register unsigned int ps,ps1,ps2=1,parity,dividend,divisor,rem,q; double t1; q = modulus/a; rem = modulus  a*q; dividend = a; divisor = rem; ps1 = q; parity = 0; while (rem != 0) { q = 1; rem = dividend  divisor; if (rem >= divisor) { q = 2; rem = divisor; if (rem >= divisor) { q = 3; rem = divisor; if (rem >= divisor) { q = 4; rem = divisor; if (rem >= divisor) { q = 5; rem = divisor; } if (rem >= divisor) { q = 6; rem = divisor; } if (rem >= divisor) { q = dividend/divisor; rem = dividend  q*divisor; } } } } ps = q*ps1 + ps2; ps2 = ps1; ps1 = ps; dividend = divisor; divisor = rem; parity = 1  parity; } if (parity == 0) return(ps2); else return(modulusps2); } /* end of single_modinv */ /************************************************************************/ /* */ /* routine to compute two reduced lattice vectors */ /* borrowed from AKL; which explains stylistic differences in code */ /* */ /************************************************************************/ reduce_lattice(special_q, root, v1, v2) int special_q, root; int *v1, *v2; { /* start of reduce_lattice */ double stime; long x1; long yy1 = 0; long x2; long y2 = 1; long x3; long y3; double s2; double s3; long q; long sx2; long parity = 0; #if (FEWERDIVLATRED  NODIVLATRED) double s2_2; double s2_4; double s2_8; double s2_16; #endif stime = get_time(); x1 = special_q; x2 = root; s2 = x2 * (double) x2 + y2 * (double) y2; for (;;) { x3 = x1; y3 = yy1; sx2 = (x2 << 4); while (x3 > sx2 ) { x3 = sx2; y3 = (y2 << 4); } sx2 >>= 1; if (x3 > sx2 ) { x3 = sx2; y3 = (y2 << 3); } sx2 >>= 1; if (x3 > sx2 ) { x3 = sx2; y3 = (y2 << 2); } sx2 >>= 1; if (x3 > sx2 ) { x3 = sx2; y3 = (y2 << 1); } if (x3 > x2 ) { x3 = x2; y3 = y2; } if (x3 > (x2 >> 1)) { x3 = x2; y3 = y2; } if (x3 < 0) { x3 = x3; y3 = y3; parity ^= 1; } s3 = x3 * (double) x3 + y3 * (double) y3; if (s3 >= s2) break; s2 = s3; x1 = x2; yy1 = y2; x2 = x3; y2 = y3; parity ^= 1; } #if (FEWERDIVLATRED  NODIVLATRED) s3 = x2 * (double) x3 + y2 * (double) y3; if (s3 < 0) { yy1 = 1; s3 = s3; } else { yy1 = 0; } s2_2 = s2 + s2; s2_4 = s2_2 + s2_2; s2_8 = s2_4 + s2_4; s2_16 = s2_8 + s2_8; q = 0; #ifdef NODIVLATRED while (s3 > s2_16) #else again: if (s3 > s2_16) #endif { s3 = s2_16; q += 16; } if (s3 > s2_8) { s3 = s2_8; q += 8; } if (s3 > s2_4) { s3 = s2_4; q += 4; } if (s3 > s2_2) { s3 = s2_2; q += 2; } if (s3 > s2) { s3 = s2; q ++; } #ifndef NODIVLATRED if (s3 > s2) { if (s3 < (s2_16+s2_16)) goto again; q += (long)mrint(s3/s2); } else #endif if ((s3+s3) >= s2) q ++; if (yy1) q = q; #else q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 ); #endif v2[0] = x2; v2[1] = y2; v1[0] = x3  x2 * q; v1[1] = y3  y2 * q; if (v2[1] < 0) /* make 'b' positive */ { v2[0] = v2[0]; v2[1] = v2[1]; } if (v1[1] < 0) /* make 'b' positive */ { v1[0] = v1[0]; v1[1] = v1[1]; } latred_time += (get_time()  stime); return(parity); } /* end of reduce_lattice */ 
20050725, 15:21  #2  
Nov 2003
2^{2}×5×373 Posts 
Quote:


20050725, 17:05  #3 
Aug 2003
Upstate NY, USA
101000110_{2} Posts 
do you have examples of (a,modulus) pairs you plan to run through the first program?
that would help when it comes to what should be done for loop structure and whether to break into cases, etc btw  is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps Last fiddled with by tom11784 on 20050725 at 17:12 
20050725, 17:20  #4 
"Nancy"
Aug 2002
Alexandria
2,467 Posts 
Hi Bob,
how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^321 ? Alex 
20050725, 17:42  #5  
Nov 2003
2^{2}·5·373 Posts 
Quote:
Note: faster routines of this type would be useful to you too... 

20050725, 17:45  #6  
Nov 2003
2^{2}×5×373 Posts 
Quote:
The routine should be as fast as possible AVERAGED over MANY different a,p pairs. Modulus is <= 2^311 a is uniformly distributed in [1, p1] 

20050726, 01:34  #7 
Jul 2004
43_{8} Posts 
Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject.
If a is in [1, p1], does that mean my code do not have to handle negative values of a? Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1? Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks. Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet. 
20050726, 11:20  #8 
Aug 2003
Upstate NY, USA
2×163 Posts 
I hadn't thought about that, but had noticed with the same loop unroll you can get rid of the need for the "parity" variable by adding a conditional return in the middle.
In a bit I'm going to scribe this code as written above to my laptop to run some time trials and comapre against what I wrote from 2 years ago. 
20050726, 11:45  #9  
Nov 2003
1D24_{16} Posts 
Quote:
you may assume that a is positive. The repeated subtraction is MUCH faster than division. (1) Branch misprediction does not happen often with these short jumps (2) There is a high probability that the subtractions will occur. See Knuth. The partial quotient is 1 something like 40% of the time, and less than 5 about 80% of the time; division takes 37 clocks. The level to which I take the subtractions has been carefully studied and benchmarked. The last subtraction in the code gives a very marginal speed improvement. Unrolling might help. Strangely, A. Lenstra has a similar routine, but his uses a variant of Lehmer's binary method. The time for his and mine are almost exactly the same. I am considering some kind of hybrid/combination of the two methods; a binary approach that first computes p mod a, (to take advantage of the time when a is somewhat small relative to p), then combines binary shifts with partial subtractions. The binary method has a faster iteration time, but does more iterations. I strongly prefer C. I too could do it in assembler, but prefer to use assembler only for "small" leaf routines. Of course, if someone can double the speed (unlikely) by using assembler, I would be quite grateful. 

20050726, 12:07  #10 
Jul 2004
5×7 Posts 
Algorithm wise, I can't help much. This is the only algorithm I know.
I can do it in assembler, that's the only thing I can help now. I did my first attempt, using Pentium III, timing using RDTSC. The timing below shows how many CPU cycles taken to do one inverse, on average. p=65521, averaging a from [1,65520]. Result = 405 clocks/inverse p=2^311, averaging a from [1,1 million]. Result = 510 clocks/inverse This is without repeated subtraction. I try with repeated subtraction now. 
20050726, 13:14  #11 
Aug 2003
Upstate NY, USA
2·163 Posts 
Does 70 cycles per exponent sound right? It seems a tad low to me, but that's what I get for an average when I run p=2^311 and a=1 to 1M writing just the inverses to a file.
It says it can do a=1 to 10M in less than 12 seconds on my 1.8GHz machine if I don't do any file or screen output, but I can't believe that (2.5 cycles per?) Here is the following list of times for p=2^311, with your main loop unrolled which costs a comparison, but allows for a nicer ps1,ps2 relation as a VS.net console project: No sort of confirmation  a=1 to 10M  12 seconds Inverses only to a file  a=1 to 1M  36 seconds "The inverse of a modulus modulus is inverse" to file  a=1 to 500K  53 seconds Inverses only to the screen  a=1 to 100K  27 seconds "The inverse of a modulus modulus is inverse" to screen  a=1 to 50K  68 seconds All changes I made to remove the subtraction, etc. yielded a less than 1 second difference. Sorry I couldn't help more. 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Calling airsquirrels  Prime95  GPU Computing  16  20150929 18:06 
Help from coders  ET_  GPU Computing  5  20140126 13:58 
Calling all 64bit Linux sievers!  frmky  NFS@Home  25  20131016 15:58 
IA32 Assembly Coders, anyone?  xenon  Programming  6  20050602 13:26 
Bob, I'm calling you out!  synergy  Miscellaneous Math  17  20041026 15:26 