mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   Coding Challenges (https://www.mersenneforum.org/showthread.php?t=4460)

R.D. Silverman 2005-08-05 16:50

Coding Challenges
 
Hello everybody,

I want to extend my thanks to everyone who participated in my little
"challenge". Some nice speedups were obtained. Kudos to those who
helped :banana: . It is unfortunate that the fastest code produced had to
come from going to assembler; this clearly makes it less portable. (but thanks
to xenon for nice work and thanks to Alex for improvements to the C code
:bow: :bow: )

[Yes, we are all aware that going completely to assembler would yield
faster code]

I am a bit surprised that noone responded to the challenge to improve the
lattice reduction code. :question: It is not by coincidence that I presented
both routines because the algorithms are * almost* identical.

I reworked Arjen's conditional shift & subtract code for the lattice reduction
into something closer to the modinverse code. I give that code here:

/************************************************************************/
/* */
/* New routine to do lattice basis reduction via Euclidean algorithm */
/* */
/************************************************************************/

new_latred(q,r, v1, v2)
int q,r, v1[], v2[];

{ /* start of new_latred */
int a11, a12, a21, a22, t, rem;
double norm_row1, tmp_norm, dot_prod;
int accum1, accum2, t1,t2;
int tmp_a11, tmp_a12;

a11 = q;
a21 = r;
a12 = 0;
a22 = 1;

//printf("| %d %d |\n", a11, a12);
//printf("| %d %d |\n\n", a21, a22);

/* we take advantage of the fact that most partial quotients are small */
/* by doing the division by successive conditional subtraction. */

norm_row1 = (double)a11 * (double)a11 + (double)a12 * (double)a12;

while(1)
{
accum1 = a21;
accum2 = a22;
rem = a11 - a21;
if (rem >= a21)
{
accum1 += a21;
accum2 += a22;
rem -= a21;
if (rem >= a21)
{
accum1 += a21;
accum2 += a22;
rem -= a21;
if (rem >= a21)
{
accum1 += a21;
accum2 += a22;
rem -= a21;
if (rem >= a21)
{
accum1 += a21;
accum2 += a22;
rem -= a21;
if (rem >= a21)
{
accum1 += a21;
accum2 += a22;
rem -= a21;
if (rem >= a21)
{
t = a11/a21;
accum1 = t * a21;
accum2 = t * a22;
}
}
}
}
}
}
tmp_a11 = a11 - accum1; // update row 1
tmp_a12 = a12 - accum2;
tmp_norm = (double)tmp_a11 * (double)tmp_a11 + (double)tmp_a12 * (double)tmp_a12;
if (tmp_norm > norm_row1) break; // we are done

t1 = tmp_a11; // swap rows
t2 = tmp_a12;
a11 = a21;
a12 = a22;
a21 = t1;
a22 = t2;
norm_row1 = tmp_norm;

// printf("| %d %d |\n", a11, a12);
// printf("| %d %d |\n\n", a21, a22);
}

/* Need to do a single Gram-Schmidt row reduction if possible */

dot_prod = (double)a11 * a21 + (double)a12 * a22;
t = round_double(dot_prod/norm_row1);

//printf("G-S t value = %d\n",t);

v1[0] = a11 - t * a21;;
v1[1] = a12 - t * a22;
v2[0] = a21;
v2[1] = a22;

//printf("final result:\n");
//printf("| %d %d |\n", v1[0], v1[1]);
//printf("| %d %d |\n\n", v2[0], v2[1]);

return(0);

} /* end of new_latred */
--------------------------------------------------------------------

I then compared the two routines for 3 different values of p as r
varied over 10000 different, consecutive values. Here are the time
RATIOS for the two routines (old routine/new routine) . I give values
for 20 different r values for each prime.

r: 2.153262
r: 2.536551
r: 2.333333
r: 2.467522
r: 2.265208
r: 2.054390
r: 2.488469
r: 2.240045
r: 2.071930
r: 2.285150
r: 2.020040
r: 2.409091
r: 2.093275
r: 2.271066
r: 2.068036
r: 2.264728
r: 2.444172
r: 2.147155
r: 2.220281
-----------------
r: 1.811325
r: 2.209565
r: 1.966776
r: 2.392782
r: 1.994787
r: 1.534354
r: 2.500306
r: 2.003802
r: 2.159395
r: 1.934477
r: 2.259038
r: 1.589700
r: 2.106504
r: 1.920269
r: 1.893444
r: 2.049665
r: 1.839205
r: 1.827455
r: 1.873463
-----------------
r: 2.010581
r: 2.088218
r: 2.262241
r: 2.306903
r: 2.110598
r: 1.971844
r: 2.464969
r: 1.773949
r: 2.348313
r: 2.564947
r: 2.322762
r: 2.154452
r: 2.184591
r: 2.193857
r: 2.186851
r: 1.842859
r: 2.255534
r: 2.199179
r: 3.877670


===============================================

The new routine is more than twice as fast as the old one!!!!! :banana: :banana:

I see some additional improvements that might be made.
Rather than just proceeding with the final Gram-Schmidt reduction,
we can (perhaps?) avoid the final floating point division by checking
that t > 1/2 before the division simply by comparing the two norms.
i.e. 2*dot_prod > norm_row1. Whether this check gains anything
depends on how often t is non-zero.

We can also do some more subtracts in the inner loop before succumbing
to doing the division etc.

Would anyone like to take a crack at the improved version? :question: :question:

dsouza123 2005-08-05 18:42

For the lattice basis reduction:
What values of p and range of r ?
How can the results be verified ?

My reasoning behind modinv versus latred:
The mathematics behind modinv routine
was very understandable, familar, easily verifiable.
The code was understandable and short.

The code for lattice basis reduction was harder to follow,
and contained conditional compiler code
and was twice as long as the modinv.

You provided very helpful, detailed information
on the range of the parameters for modinv.

(I did redo the modiv routine in Delphi and MASM32,
but your using C/C++ and xenon handled the x86 Assembly.)

Is there any unoptimized code to look at ?

The 2X+ speed increase is already quite good.

Code in x86 assembly or C or both ?

R.D. Silverman 2005-08-05 19:14

[QUOTE=dsouza123]For the lattice basis reduction:
What values of p and range of r ?

<snip>


[/QUOTE]


For the lattice basis reduction:
What values of p and range of r ?


Single precision. Same range of p and r as in modinv.

How can the results be verified ?

The matrix determinant is p. It remains invariant.
Each reduction step is equivalent to multiplying by a
uni-modular marix, and hence the determinant does not change.
The coefficients in the matrix get smaller, until the min is reached,
then further "reductions" result in one row becoming very small
while the other grows big again. The condition on the L2 norm
in the code says when to stop.


My reasoning behind modinv versus latred:
The mathematics behind modinv routine
was very understandable, familar, easily verifiable.


But the math for the lattice reduction is the same: Euclidean Algorithm
[reduction follws with a Gram-Schmidt orthogonalization at the end
if it is useful]

The code was understandable and short.

Thank you for the compliment.

Code in x86 assembly or C or both ?

C is strongly preferred, but of course both are acceptable.

Note: I already tried the test I suggested that avoids the final
floating div at the expense of an extra branch. It does give a slight
speedup.

dsouza123 2005-08-05 20:43

The newer code is easier to follow.
I'll try in assembly,
have some implementation ideas but no algorithmic ones.

modinv used a, modulus with modulus the prime
new_latred has q , r
so is r for a and q for modulus ?

R.D. Silverman 2005-08-05 20:52

[QUOTE=dsouza123]The newer code is easier to follow.
I'll try in assembly,
have some implementation ideas but no algorithmic ones.

modinv used a, modulus with modulus the prime
new_latred has q , r
so is r for a and q for modulus ?[/QUOTE]

Sort of. p is a prime in both cases. a and r play similar roles
algorithmically, but their meaning is different. r stands for 'root'

It is the root of a polynomial (which itself is irrelevant to the
algorithm) taken mod p.

Arjen's old code basically tried to replace the division operation
that computes t at each iteration with a series of trial shifts and
subtracts... Better variable names would have aided the clarity......

xenon 2005-08-06 05:04

[QUOTE=R.D. Silverman]Sort of. p is a prime in both cases. a and r play similar roles algorithmically, but their meaning is different. r stands for 'root'
[/QUOTE]

Please give a few p values for benchmarking. Can you confirm that r is guaranteed to be in the interval [1, p-1]? Well, I see that your code uses "q" and we talk much of "p", I assume we mean the same thing.

R.D. Silverman 2005-08-06 12:52

[QUOTE=xenon]Please give a few p values for benchmarking. Can you confirm that r is guaranteed to be in the interval [1, p-1]? Well, I see that your code uses "q" and we talk much of "p", I assume we mean the same thing.[/QUOTE]

'q' arises from its standard usage in NFS literature as 'special-q' prime.
r is indeed in [1,p-1]. q is typically 15 to 27 bits; choose some at random.

BTW, I have already improved the code with:

dot_prod = (double)a11 * (double)a21 + (double)a12 * (double)a22;

if ((dot_prod + dot_prod) < norm_row1)
{
v1[0] = a11;
v1[1] = a12;
v2[0] = a21;
v2[1] = a22;
return(0);
}

t = round_double(dot_prod/norm_row1); etc.

The if test avoids quite a few divisions at low cost.

I tried adding a couple of more conditional subtracts in the main loop,
but saw little difference.

Thanks again for the help!!!!

xenon 2005-08-06 13:34

The code doesn't work with r=1. Perhaps you mean r in [2, p-1].

Before looking at timings, I want to get the result correct. While modular inverse has one correct answer, lattice reduction can have more than one. You didn't specify exactly what you want. I assume shorter vectors are better. But I found that your code doesn't produce the shortest vectors when r is small. I get shorter vector by computing t = nearest_int(dot_product/norm) in every iteration (this is slow).

About verifying the results, any other method besides computing determinant? Computing determinant isn't "strict" enough to catch all errors. i.e. if r changes, the determinant doesn't change.

dsouza123 2005-08-06 18:05

[QUOTE=R.D. Silverman]
I then compared the two routines for 3 different values of p as r
varied over 10000 different, consecutive values. Here are the time
RATIOS for the two routines (old routine/new routine) . I give values
for 20 different r values for each prime.
[/QUOTE]

What were the 3 different values of p ( aka q ) along with its starting r ?

If the same input for testing is used by everyone it wil be easier
to verify accuracy and compare the efficiency of improvements.

========================================================
A few results with somewhat random q and r ( each q is prime and r is not).
Are these results from Delphi conversion correct ? (a11 == q a21 == r)

| 134217689 0| a11 a12
| 92342327 1| a21 a22
No need for G-S t
| 92342327 1| v1[0] v1[1]
| 41875362 -92342328| v2[0] v2[1]

| 524309 0| a11 a12
| 10003 1| a21 a22
G-S t value = 2
| 1697 105| v1[0] v1[1]
| 4153 -52| v2[0] v2[1]

The MASM32 version gives completely different answers.

| 524309 0| a11 a12
| 10003 1| a21 a22
G-S t value = 0
| 759 -262| v1[0] v1[1]
| 179 629| v2[0] v2[1]

haven't added the if ((dot_prod + dot_prod) < norm_row1) code
to the MASM32 version yet.

R.D. Silverman 2005-08-06 20:38

[QUOTE=xenon]The code doesn't work with r=1. Perhaps you mean r in [2, p-1].

Before looking at timings, I want to get the result correct. While modular inverse has one correct answer, lattice reduction can have more than one. You didn't specify exactly what you want. I assume shorter vectors are better. But I found that your code doesn't produce the shortest vectors when r is small. I get shorter vector by computing t = nearest_int(dot_product/norm) in every iteration (this is slow).

About verifying the results, any other method besides computing determinant? Computing determinant isn't "strict" enough to catch all errors. i.e. if r changes, the determinant doesn't change.[/QUOTE]

If G-S is carried forward completely, it will always produce a basis
with the shortest possible vector. But this comes at the expense of
the other vector being quite large.

To determine if you have a correct final answer consider the original
and final row vectors (say) V1, V2 and V1F , V2F

if e * V1 + f * V2 = e1 * V1F + f1 * V2F has a solution (e1, f1) for
all (e,f), then the final basis is valid.

xenon 2005-08-07 00:38

[QUOTE=R.D. Silverman]But this comes at the expense of the other vector being quite large.[/QUOTE]
[CODE]
Example: q=1299721, r=30

| 30 1 | |1441 -43276 |
| 1 -43324 | | 30 1 |

option 1 option 2
[/CODE]
The second one is shorter. But your code produces the first one. Am I correct?


All times are UTC. The time now is 13:11.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.