mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2005-07-25, 13:39   #1
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Question Calling hotshot coders: NFS code challenge

George Woltman has made some nice improvements to my lattice siever.
I have made some as well.

To speed one part of the code, it is essential that we speed two routines
in particular. The first is a single precision modular inverse that calculates
1/a mod p, for p prime. I include my best effort below.

The second is a routine that does lattice basis reduction. I include a
routine below (courtesy of Arjen Lenstra)

Would anyone like to take a wack at speeding these up???? It would be
very useful. Think of it as a challenge.

/************************************************************************/
/* */
/* returns Multiplicative inverse of a mod modulus */
/* */
/************************************************************************/

single_modinv(a,modulus)
int a,modulus;

{ /* start of single_modinv */

register unsigned int ps,ps1,ps2=1,parity,dividend,divisor,rem,q;
double t1;

q = modulus/a;
rem = modulus - a*q;
dividend = a;
divisor = rem;
ps1 = q;
parity = 0;

while (rem != 0)
{
q = 1;
rem = dividend - divisor;
if (rem >= divisor)
{
q = 2;
rem -= divisor;
if (rem >= divisor)
{
q = 3;
rem -= divisor;
if (rem >= divisor)
{
q = 4;
rem -= divisor;
if (rem >= divisor)
{
q = 5;
rem -= divisor;
}
if (rem >= divisor)
{
q = 6;
rem -= divisor;
}
if (rem >= divisor)
{
q = dividend/divisor;
rem = dividend - q*divisor;
}
}
}
}
ps = q*ps1 + ps2;
ps2 = ps1;
ps1 = ps;
dividend = divisor;
divisor = rem;
parity = 1 - parity;
}

if (parity == 0) return(ps2);
else return(modulus-ps2);

} /* end of single_modinv */



/************************************************************************/
/* */
/* routine to compute two reduced lattice vectors */
/* borrowed from AKL; which explains stylistic differences in code */
/* */
/************************************************************************/

reduce_lattice(special_q, root, v1, v2)
int special_q, root;
int *v1, *v2;

{ /* start of reduce_lattice */
double stime;
long x1;
long yy1 = 0;
long x2;
long y2 = 1;
long x3;
long y3;
double s2;
double s3;
long q;
long sx2;
long parity = 0;

#if (FEWERDIVLATRED | NODIVLATRED)
double s2_2;
double s2_4;
double s2_8;
double s2_16;
#endif

stime = get_time();

x1 = special_q;
x2 = root;

s2 = x2 * (double) x2 + y2 * (double) y2;

for (;;)
{
x3 = x1;
y3 = yy1;
sx2 = (x2 << 4);
while (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 4);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 3);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 2);
}
sx2 >>= 1;
if (x3 > sx2 )
{
x3 -= sx2;
y3 -= (y2 << 1);
}
if (x3 > x2 )
{
x3 -= x2;
y3 -= y2;
}
if (x3 > (x2 >> 1))
{
x3 -= x2;
y3 -= y2;
}
if (x3 < 0)
{
x3 = -x3;
y3 = -y3;
parity ^= 1;
}

s3 = x3 * (double) x3 + y3 * (double) y3;
if (s3 >= s2) break;
s2 = s3;
x1 = x2;
yy1 = y2;
x2 = x3;
y2 = y3;
parity ^= 1;
}

#if (FEWERDIVLATRED | NODIVLATRED)
s3 = x2 * (double) x3 + y2 * (double) y3;
if (s3 < 0)
{
yy1 = 1;
s3 = -s3;
}
else
{
yy1 = 0;
}
s2_2 = s2 + s2;
s2_4 = s2_2 + s2_2;
s2_8 = s2_4 + s2_4;
s2_16 = s2_8 + s2_8;
q = 0;
#ifdef NODIVLATRED
while (s3 > s2_16)
#else
again:
if (s3 > s2_16)
#endif
{
s3 -= s2_16;
q += 16;
}
if (s3 > s2_8)
{
s3 -= s2_8;
q += 8;
}
if (s3 > s2_4)
{
s3 -= s2_4;
q += 4;
}
if (s3 > s2_2)
{
s3 -= s2_2;
q += 2;
}
if (s3 > s2)
{
s3 -= s2;
q ++;
}
#ifndef NODIVLATRED
if (s3 > s2)
{
if (s3 < (s2_16+s2_16)) goto again;
q += (long)mrint(s3/s2);
}
else
#endif
if ((s3+s3) >= s2) q ++;
if (yy1) q = -q;
#else
q = (long) mrint( (x2 * (double) x3 + y2 * (double) y3) / s2 );
#endif

v2[0] = x2;
v2[1] = y2;
v1[0] = x3 - x2 * q;
v1[1] = y3 - y2 * q;

if (v2[1] < 0) /* make 'b' positive */
{
v2[0] = -v2[0];
v2[1] = -v2[1];
}

if (v1[1] < 0) /* make 'b' positive */
{
v1[0] = -v1[0];
v1[1] = -v1[1];
}

latred_time += (get_time() - stime);

return(parity);
} /* end of reduce_lattice */
R.D. Silverman is offline   Reply With Quote
Old 2005-07-25, 15:21   #2
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

164448 Posts
Question

Quote:
Originally Posted by R.D. Silverman

<snip>

Is anyone interested in this challenge????
R.D. Silverman is offline   Reply With Quote
Old 2005-07-25, 17:05   #3
tom11784
 
tom11784's Avatar
 
Aug 2003
Upstate NY, USA

14616 Posts
Default

do you have examples of (a,modulus) pairs you plan to run through the first program?

that would help when it comes to what should be done for loop structure and whether to break into cases, etc

btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps

Last fiddled with by tom11784 on 2005-07-25 at 17:12
tom11784 is offline   Reply With Quote
Old 2005-07-25, 17:20   #4
akruppa
 
akruppa's Avatar
 
"Nancy"
Aug 2002
Alexandria

2,467 Posts
Default

Hi Bob,

how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ?

Alex
akruppa is offline   Reply With Quote
Old 2005-07-25, 17:42   #5
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

11101001001002 Posts
Thumbs up

Quote:
Originally Posted by akruppa
Hi Bob,

how large will the modulus for the modinv get? Less than 2^31 on 32 bit machines, do does it need to work for moduli between 2^31 and 2^32-1 ?

Alex
Assume p <= 2^31-1.

Note: faster routines of this type would be useful to you too...
R.D. Silverman is offline   Reply With Quote
Old 2005-07-25, 17:45   #6
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22·5·373 Posts
Default

Quote:
Originally Posted by tom11784
do you have examples of (a,modulus) pairs you plan to run through the first program?

that would help when it comes to what should be done for loop structure and whether to break into cases, etc

btw - is NFS a language? i have an old C++ version i wrote using recursion for a cs class if that helps
NFS is the Number Field Sieve.

The routine should be as fast as possible AVERAGED over MANY different
a,p pairs. Modulus is <= 2^31-1 a is uniformly distributed in [1, p-1]
R.D. Silverman is offline   Reply With Quote
Old 2005-07-26, 01:34   #7
xenon
 
Jul 2004

5·7 Posts
Default

Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject.

If a is in [1, p-1], does that mean my code do not have to handle negative values of a?

Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1?

Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks.

Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet.
xenon is offline   Reply With Quote
Old 2005-07-26, 11:20   #8
tom11784
 
tom11784's Avatar
 
Aug 2003
Upstate NY, USA

2×163 Posts
Default

I hadn't thought about that, but had noticed with the same loop unroll you can get rid of the need for the "parity" variable by adding a conditional return in the middle.

In a bit I'm going to scribe this code as written above to my laptop to run some time trials and comapre against what I wrote from 2 years ago.
tom11784 is offline   Reply With Quote
Old 2005-07-26, 11:45   #9
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Default

Quote:
Originally Posted by xenon
Yes, I am interested. Now, for this post, I talk about the multiplicative inverse only. This is just a quick post. I haven't analyze deeply into this subject.

If a is in [1, p-1], does that mean my code do not have to handle negative values of a?

Can I write in assembly? I saw a code that does multiplicative inverse in Prime95 source factor32.asm. Extended Euclidean algorithm is my choice, I cannot think of a better option. I see that from your code above, you stop when rem==0. Why not stop one iteration earlier, when rem==1?

Your code is trying to avoid division by doing repetitive subtraction. I feel that it may not help to speed up when it is coded in assembly. Branch misprediction can be very expensive, especially on Pentium 4. A single division instruction does not take too much time, 50 clocks according to Agner. Besides, it gives remainder at the same time. But a branch misprediction, I guess is more than 20 clocks.

Just a small idea that pops out, what if we unroll the loop by two, so that we don't have to exchange between ps1 and ps2? Well, I haven't try myself this idea yet.
Hi,

you may assume that a is positive.
The repeated subtraction is MUCH faster than division.

(1) Branch misprediction does not happen often with these short jumps
(2) There is a high probability that the subtractions will occur. See
Knuth. The partial quotient is 1 something like 40% of the time, and
less than 5 about 80% of the time; division takes 37 clocks.

The level to which I take the subtractions has been carefully studied
and benchmarked. The last subtraction in the code gives a very marginal
speed improvement.

Unrolling might help.

Strangely, A. Lenstra has a similar routine, but his uses a variant of
Lehmer's binary method. The time for his and mine are almost exactly the
same. I am considering some kind of hybrid/combination of the two
methods; a binary approach that first computes p mod a, (to take
advantage of the time when a is somewhat small relative to p), then
combines binary shifts with partial subtractions. The binary method has
a faster iteration time, but does more iterations.

I strongly prefer C. I too could do it in assembler, but prefer to
use assembler only for "small" leaf routines. Of course, if someone can
double the speed (unlikely) by using assembler, I would be quite
grateful.
R.D. Silverman is offline   Reply With Quote
Old 2005-07-26, 12:07   #10
xenon
 
Jul 2004

5×7 Posts
Default

Algorithm wise, I can't help much. This is the only algorithm I know.
I can do it in assembler, that's the only thing I can help now.
I did my first attempt, using Pentium III, timing using RDTSC. The timing below shows how many CPU cycles taken to do one inverse, on average.

p=65521, averaging a from [1,65520]. Result = 405 clocks/inverse
p=2^31-1, averaging a from [1,1 million]. Result = 510 clocks/inverse

This is without repeated subtraction. I try with repeated subtraction now.
xenon is offline   Reply With Quote
Old 2005-07-26, 13:14   #11
tom11784
 
tom11784's Avatar
 
Aug 2003
Upstate NY, USA

14616 Posts
Default

Does 70 cycles per exponent sound right? It seems a tad low to me, but that's what I get for an average when I run p=2^31-1 and a=1 to 1M writing just the inverses to a file.

It says it can do a=1 to 10M in less than 12 seconds on my 1.8GHz machine if I don't do any file or screen output, but I can't believe that (2.5 cycles per?)

Here is the following list of times for p=2^31-1, with your main loop unrolled which costs a comparison, but allows for a nicer ps1,ps2 relation as a VS.net console project:
No sort of confirmation - a=1 to 10M - 12 seconds
Inverses only to a file - a=1 to 1M - 36 seconds
"The inverse of a modulus modulus is inverse" to file - a=1 to 500K - 53 seconds
Inverses only to the screen - a=1 to 100K - 27 seconds
"The inverse of a modulus modulus is inverse" to screen - a=1 to 50K - 68 seconds

All changes I made to remove the subtraction, etc. yielded a less than 1 second difference.

Sorry I couldn't help more.
tom11784 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Calling airsquirrels Prime95 GPU Computing 16 2015-09-29 18:06
Help from coders ET_ GPU Computing 5 2014-01-26 13:58
Calling all 64-bit Linux sievers! frmky NFS@Home 25 2013-10-16 15:58
IA-32 Assembly Coders, anyone? xenon Programming 6 2005-06-02 13:26
Bob, I'm calling you out! synergy Miscellaneous Math 17 2004-10-26 15:26

All times are UTC. The time now is 18:38.


Sun Nov 28 18:38:55 UTC 2021 up 128 days, 13:07, 0 users, load averages: 0.89, 0.94, 1.02

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.