![]() |
Calling hotshot coders: NFS code challenge
George Woltman has made some nice improvements to my lattice siever.
I have made some as well. :bounce: To speed one part of the code, it is essential that we speed two routines in particular. The first is a single precision modular inverse that calculates 1/a mod p, for p prime. I include my best effort below. The second is a routine that does lattice basis reduction. I include a routine below (courtesy of Arjen Lenstra) Would anyone like to take a wack at speeding these up???? It would be very useful. Think of it as a challenge. :banana: /************************************************************************/ /* */ /* returns Multiplicative inverse of a mod modulus */ /* */ /************************************************************************/ single_modinv(a,modulus) int a,modulus; { /* start of single_modinv */ register unsigned int ps,ps1,ps2=1,parity,dividend,divisor,rem,q; double t1; q = modulus/a; rem = modulus - a*q; dividend = a; divisor = rem; ps1 = q; parity = 0; while (rem != 0) { q = 1; rem = dividend - divisor; if (rem >= divisor) { q = 2; rem -= divisor; if (rem >= divisor) { q = 3; rem -= divisor; if (rem >= divisor) { q = 4; rem -= divisor; if (rem >= divisor) { q = 5; rem -= divisor; } if (rem >= divisor) { q = 6; rem -= divisor; } if (rem >= divisor) { q = dividend/divisor; rem = dividend - q*divisor; } } } } ps = q*ps1 + ps2; ps2 = ps1; ps1 = ps; dividend = divisor; divisor = rem; parity = 1 - parity; } if (parity == 0) return(ps2); else return(modulus-ps2); } /* end of single_modinv */ /************************************************************************/ /* */ /* routine to compute two reduced lattice vectors */ /* borrowed from AKL; which explains stylistic differences in code */ /* */ /************************************************************************/ reduce_lattice(special_q, root, v1, v2) int special_q, root; int *v1, *v2; { /* start of reduce_lattice */ double stime; long x1; long yy1 = 0; long x2; long y2 = 1; long x3; long y3; double s2; double s3; long q; long sx2; long parity = 0; #if (FEWERDIVLATRED | NODIVLATRED) double s2_2; double s2_4; double s2_8; double s2_16; #endif stime = get_time(); x1 = special_q; x2 = root; s2 = x2 * (double) x2 + y2 * (double) y2; for (;;) { x3 = x1; y3 = yy1; sx2 = (x2 << 4); while (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 4); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 3); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 2); } sx2 >>= 1; if (x3 > sx2 ) { x3 -= sx2; y3 -= (y2 << 1); } if (x3 > x2 ) { x3 -= x2; y3 -= y2; } if (x3 > (x2 >> 1)) { x3 -= x2; y3 -= y2; } if (x3 < 0) { x3 = -x3; y3 = -y3; parity ^= 1; } s3 = x3 * (double) x3 + y3 * (double) y3; if (s3 >= s2) break; s2 = s3; x1 = x2; yy1 = y2; x2 = x3; y2 = y3; parity ^= 1; } #if (FEWERDIVLATRED | NODIVLATRED) s3 = x2 * (double) x3 + y2 * (double) y3; if (s3 < 0) { yy1 = 1; s3 = -s3; } else { yy1 = 0; } s2_2 = s2 + s2; s2_4 = s2_2 + s2_2; s2_8 = s2_4 + s2_4; s2_16 = s2_8 + s2_8; q = 0; #ifdef NODIVLATRED while (s3 > s2_16) #else again: if (s3 > s2_16) #endif { s3 -= s2_16; q += 16; } if (s3 > s2_8) { s3 -= s2_8; q += 8; } if (s3 > s2_4) { s3 -= s2_4; q += 4; } if (s3 > s2_2) { s3 -= s2_2; q += 2; } if (s3 > s2) { s3 -= s2; q ++; } #ifndef NODIVLATRED if (s3 > s2) { if (s3 < (s2_16+s2_16)) goto again; q += (long)mrint(s3/s2); } else #endif if ((s3+s3) >= s2) q ++; if (yy1) q = -q; #else q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 ); #endif v2[0] = x2; v2[1] = y2; v1[0] = x3 - x2 * q; v1[1] = y3 - y2 * q; if (v2[1] < 0) /* make 'b' positive */ { v2[0] = -v2[0]; v2[1] = -v2[1]; } if (v1[1] < 0) /* make 'b' positive */ { v1[0] = -v1[0]; v1[1] = -v1[1]; } latred_time += (get_time() - stime); return(parity); } /* end of reduce_lattice */ |
[QUOTE=R.D. Silverman]
<snip> [/QUOTE] Is anyone interested in this challenge???? |
do you have examples of (a,modulus) pairs you plan to run through the first program?
that would help when it comes to what should be done for loop structure and whether to break into cases, etc btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps |
Hi Bob,
how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ? Alex |
[QUOTE=akruppa]Hi Bob,
how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ? Alex[/QUOTE] Assume p <= 2^31-1. Note: faster routines of this type would be useful to you too... :smile: |
[QUOTE=tom11784]do you have examples of (a,modulus) pairs you plan to run through the first program?
that would help when it comes to what should be done for loop structure and whether to break into cases, etc btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps[/QUOTE] NFS is the Number Field Sieve. The routine should be as fast as possible AVERAGED over MANY different a,p pairs. Modulus is <= 2^31-1 a is uniformly distributed in [1, p-1] |
Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject.
If a is in [1, p-1], does that mean my code do not have to handle negative values of a? Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1? Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks. Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet. |
I hadn't thought about that, but had noticed with the same loop unroll you can get rid of the need for the "parity" variable by adding a conditional return in the middle.
In a bit I'm going to scribe this code as written above to my laptop to run some time trials and comapre against what I wrote from 2 years ago. |
[QUOTE=xenon]Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject.
If a is in [1, p-1], does that mean my code do not have to handle negative values of a? Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1? Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks. Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet.[/QUOTE] Hi, you may assume that a is positive. The repeated subtraction is MUCH faster than division. (1) Branch misprediction does not happen often with these short jumps (2) There is a high probability that the subtractions will occur. See Knuth. The partial quotient is 1 something like 40% of the time, and less than 5 about 80% of the time; division takes 37 clocks. The level to which I take the subtractions has been carefully studied and benchmarked. The last subtraction in the code gives a very marginal speed improvement. Unrolling might help. Strangely, A. Lenstra has a similar routine, but his uses a variant of Lehmer's binary method. The time for his and mine are almost exactly the same. I am considering some kind of hybrid/combination of the two methods; a binary approach that first computes p mod a, (to take advantage of the time when a is somewhat small relative to p), then combines binary shifts with partial subtractions. The binary method has a faster iteration time, but does more iterations. I strongly prefer C. I too could do it in assembler, but prefer to use assembler only for "small" leaf routines. Of course, if someone can double the speed (unlikely) by using assembler, I would be quite grateful. |
Algorithm wise, I can't help much. This is the only algorithm I know.
I can do it in assembler, that's the only thing I can help now. I did my first attempt, using Pentium III, timing using RDTSC. The timing below shows how many CPU cycles taken to do one inverse, on average. p=65521, averaging a from [1,65520]. Result = 405 clocks/inverse p=2^31-1, averaging a from [1,1 million]. Result = 510 clocks/inverse This is without repeated subtraction. I try with repeated subtraction now. |
Does 70 cycles per exponent sound right? It seems a tad low to me, but that's what I get for an average when I run p=2^31-1 and a=1 to 1M writing just the inverses to a file.
It says it can do a=1 to 10M in less than 12 seconds on my 1.8GHz machine if I don't do any file or screen output, but I can't believe that (2.5 cycles per?) Here is the following list of times for p=2^31-1, with your main loop unrolled which costs a comparison, but allows for a nicer ps1,ps2 relation as a VS.net console project: No sort of confirmation - a=1 to 10M - 12 seconds Inverses only to a file - a=1 to 1M - 36 seconds "The inverse of [I]a[/I] modulus [i]modulus[/i] is [i]inverse[/i]" to file - a=1 to 500K - 53 seconds Inverses only to the screen - a=1 to 100K - 27 seconds "The inverse of [I]a[/I] modulus [i]modulus[/i] is [i]inverse[/i]" to screen - a=1 to 50K - 68 seconds All changes I made to remove the subtraction, etc. yielded a less than 1 second difference. Sorry I couldn't help more. |
| All times are UTC. The time now is 04:11. |
Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.