mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   Calling hotshot coders: NFS code challenge (https://www.mersenneforum.org/showthread.php?t=4403)

R.D. Silverman 2005-07-25 13:39

Calling hotshot coders: NFS code challenge
 
George Woltman has made some nice improvements to my lattice siever.
I have made some as well. :bounce:

To speed one part of the code, it is essential that we speed two routines
in particular. The first is a single precision modular inverse that calculates
1/a mod p, for p prime. I include my best effort below.

The second is a routine that does lattice basis reduction. I include a
routine below (courtesy of Arjen Lenstra)

Would anyone like to take a wack at speeding these up???? It would be
very useful. Think of it as a challenge. :banana:

/************************************************************************/
/* */
/* returns Multiplicative inverse of a mod modulus */
/* */
/************************************************************************/

single_modinv(a,modulus)
int a,modulus;

{ /* start of single_modinv */

register unsigned int ps,ps1,ps2=1,parity,dividend,divisor,rem,q;
double t1;

q = modulus/a;
rem = modulus - a*q;
dividend = a;
divisor = rem;
ps1 = q;
parity = 0;

while (rem != 0)
{
q = 1;
rem = dividend - divisor;
if (rem >= divisor)
{
q = 2;
rem -= divisor;
if (rem >= divisor)
{
q = 3;
rem -= divisor;
if (rem >= divisor)
{
q = 4;
rem -= divisor;
if (rem >= divisor)
{
q = 5;
rem -= divisor;
}
if (rem >= divisor)
{
q = 6;
rem -= divisor;
}
if (rem >= divisor)
{
q = dividend/divisor;
rem = dividend - q*divisor;
}
}
}
}
ps = q*ps1 + ps2;
ps2 = ps1;
ps1 = ps;
dividend = divisor;
divisor = rem;
parity = 1 - parity;
}

if (parity == 0) return(ps2);
else return(modulus-ps2);

} /* end of single_modinv */



/************************************************************************/
/* */
/* routine to compute two reduced lattice vectors */
/* borrowed from AKL; which explains stylistic differences in code */
/* */
/************************************************************************/

reduce_lattice(special_q, root, v1, v2)
int special_q, root;
int *v1, *v2;

{ /* start of reduce_lattice */
double stime;
long x1;
long yy1 = 0;
long x2;
long y2 = 1;
long x3;
long y3;
double s2;
double s3;
long q;
long sx2;
long parity = 0;

#if (FEWERDIVLATRED | NODIVLATRED)
double s2_2;
double s2_4;
double s2_8;
double s2_16;
#endif

stime = get_time();

x1 = special_q;
x2 = root;

s2 = x2 * (double) x2 + y2 * (double) y2;

for (;;)
{
x3 = x1;
y3 = yy1;
sx2 = (x2 << 4);
while (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 4);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 3);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 2);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 1);
}
if (x3 > x2 )
{
x3 -= x2;
y3 -= y2;
}
if (x3 > (x2 >> 1))
{
x3 -= x2;
y3 -= y2;
}
if (x3 < 0)
{
x3 = -x3;
y3 = -y3;
parity ^= 1;
}

s3 = x3 * (double) x3 + y3 * (double) y3;
if (s3 >= s2) break;
s2 = s3;
x1 = x2;
yy1 = y2;
x2 = x3;
y2 = y3;
parity ^= 1;
}

#if (FEWERDIVLATRED | NODIVLATRED)
s3 = x2 * (double) x3 + y2 * (double) y3;
if (s3 < 0)
{
yy1 = 1;
s3 = -s3;
}
else
{
yy1 = 0;
}
s2_2 = s2 + s2;
s2_4 = s2_2 + s2_2;
s2_8 = s2_4 + s2_4;
s2_16 = s2_8 + s2_8;
q = 0;
#ifdef NODIVLATRED
while (s3 > s2_16)
#else
again:
if (s3 > s2_16)
#endif
{
s3 -= s2_16;
q += 16;
}
if (s3 > s2_8)
{
s3 -= s2_8;
q += 8;
}
if (s3 > s2_4)
{
s3 -= s2_4;
q += 4;
}
if (s3 > s2_2)
{
s3 -= s2_2;
q += 2;
}
if (s3 > s2)
{
s3 -= s2;
q ++;
}
#ifndef NODIVLATRED
if (s3 > s2)
{
if (s3 < (s2_16+s2_16)) goto again;
q += (long)mrint(s3/s2);
}
else
#endif
if ((s3+s3) >= s2) q ++;
if (yy1) q = -q;
#else
q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 );
#endif

v2[0] = x2;
v2[1] = y2;
v1[0] = x3 - x2 * q;
v1[1] = y3 - y2 * q;

if (v2[1] < 0) /* make 'b' positive */
{
v2[0] = -v2[0];
v2[1] = -v2[1];
}

if (v1[1] < 0) /* make 'b' positive */
{
v1[0] = -v1[0];
v1[1] = -v1[1];
}

latred_time += (get_time() - stime);

return(parity);
} /* end of reduce_lattice */

R.D. Silverman 2005-07-25 15:21

[QUOTE=R.D. Silverman]

<snip>


[/QUOTE]

Is anyone interested in this challenge????

tom11784 2005-07-25 17:05

do you have examples of (a,modulus) pairs you plan to run through the first program?

that would help when it comes to what should be done for loop structure and whether to break into cases, etc

btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps

akruppa 2005-07-25 17:20

Hi Bob,

how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ?

Alex

R.D. Silverman 2005-07-25 17:42

[QUOTE=akruppa]Hi Bob,

how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ?

Alex[/QUOTE]

Assume p <= 2^31-1.

Note: faster routines of this type would be useful to you too... :smile:

R.D. Silverman 2005-07-25 17:45

[QUOTE=tom11784]do you have examples of (a,modulus) pairs you plan to run through the first program?

that would help when it comes to what should be done for loop structure and whether to break into cases, etc

btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps[/QUOTE]

NFS is the Number Field Sieve.

The routine should be as fast as possible AVERAGED over MANY different
a,p pairs. Modulus is <= 2^31-1 a is uniformly distributed in [1, p-1]

xenon 2005-07-26 01:34

Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject.

If a is in [1, p-1], does that mean my code do not have to handle negative values of a?

Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1?

Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks.

Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet.

tom11784 2005-07-26 11:20

I hadn't thought about that, but had noticed with the same loop unroll you can get rid of the need for the "parity" variable by adding a conditional return in the middle.

In a bit I'm going to scribe this code as written above to my laptop to run some time trials and comapre against what I wrote from 2 years ago.

R.D. Silverman 2005-07-26 11:45

[QUOTE=xenon]Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject.

If a is in [1, p-1], does that mean my code do not have to handle negative values of a?

Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1?

Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks.

Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet.[/QUOTE]

Hi,

you may assume that a is positive.
The repeated subtraction is MUCH faster than division.

(1) Branch misprediction does not happen often with these short jumps
(2) There is a high probability that the subtractions will occur. See
Knuth. The partial quotient is 1 something like 40% of the time, and
less than 5 about 80% of the time; division takes 37 clocks.

The level to which I take the subtractions has been carefully studied
and benchmarked. The last subtraction in the code gives a very marginal
speed improvement.

Unrolling might help.

Strangely, A. Lenstra has a similar routine, but his uses a variant of
Lehmer's binary method. The time for his and mine are almost exactly the
same. I am considering some kind of hybrid/combination of the two
methods; a binary approach that first computes p mod a, (to take
advantage of the time when a is somewhat small relative to p), then
combines binary shifts with partial subtractions. The binary method has
a faster iteration time, but does more iterations.

I strongly prefer C. I too could do it in assembler, but prefer to
use assembler only for "small" leaf routines. Of course, if someone can
double the speed (unlikely) by using assembler, I would be quite
grateful.

xenon 2005-07-26 12:07

Algorithm wise, I can't help much. This is the only algorithm I know.
I can do it in assembler, that's the only thing I can help now.
I did my first attempt, using Pentium III, timing using RDTSC. The timing below shows how many CPU cycles taken to do one inverse, on average.

p=65521, averaging a from [1,65520]. Result = 405 clocks/inverse
p=2^31-1, averaging a from [1,1 million]. Result = 510 clocks/inverse

This is without repeated subtraction. I try with repeated subtraction now.

tom11784 2005-07-26 13:14

Does 70 cycles per exponent sound right? It seems a tad low to me, but that's what I get for an average when I run p=2^31-1 and a=1 to 1M writing just the inverses to a file.

It says it can do a=1 to 10M in less than 12 seconds on my 1.8GHz machine if I don't do any file or screen output, but I can't believe that (2.5 cycles per?)

Here is the following list of times for p=2^31-1, with your main loop unrolled which costs a comparison, but allows for a nicer ps1,ps2 relation as a VS.net console project:
No sort of confirmation - a=1 to 10M - 12 seconds
Inverses only to a file - a=1 to 1M - 36 seconds
"The inverse of [I]a[/I] modulus [i]modulus[/i] is [i]inverse[/i]" to file - a=1 to 500K - 53 seconds
Inverses only to the screen - a=1 to 100K - 27 seconds
"The inverse of [I]a[/I] modulus [i]modulus[/i] is [i]inverse[/i]" to screen - a=1 to 50K - 68 seconds

All changes I made to remove the subtraction, etc. yielded a less than 1 second difference.

Sorry I couldn't help more.


All times are UTC. The time now is 04:11.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.