mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mac.. (https://www.mersenneforum.org/showthread.php?t=4962)

Greenbank 2005-11-09 16:39

fmadd certainly exists for the G4.

Google for "MPCFPE32B". The 3.5MB PDF that's first in the list is the Programming Environments Manual for the G4.

[EDIT] I've also spotted jasong's msieve comments and fast mulmod for < 2^50 bit numbers. It's faster than my mulmod code so it'll fit in nicely (if it all works properly) for sieving p < 2^50.

fatphil 2005-11-10 10:03

[QUOTE=Greenbank]fmadd certainly exists for the G4.

Google for "MPCFPE32B". The 3.5MB PDF that's first in the list is the Programming Environments Manual for the G4.
[/QUOTE]

Thanks, I'll grab that.

I'll also ask around here, and see if there are any 'internal' documents that I can get my hands on too. (G4 only, alas)

ET_ 2005-11-10 11:34

[QUOTE=Greenbank]
I've also spotted jasong's msieve comments and fast mulmod for < 2^50 bit numbers. It's faster than my mulmod code so it'll fit in nicely (if it all works properly) for sieving p < 2^50.[/QUOTE]

Wasn't he jasonp? :rolleyes:

Luigi

Greenbank 2005-11-10 12:27

Indeed it was jasonp, my attribution was from memory.

On a related note, here's why G5's rock and G4's lead to madness (or at least hideous amounts of complication).

Looking at my 32-bit code for the core of the montgomery multiplication.

For a good reference see Handbook of Applied Cryptography Chapter 14.
Google for "HAC" and it is the second link. PDF's can be downloaded free for personal use (see copyright notice).

For calculating a*b (mod m) :-

It assumes:

a and b are numbers < 2^64 (i.e they fit in 2 32-bit words)
B = 2^32
n = 2 (since a and b have 2 words)
R= B^n = (2^32)^2 = 2^64
m' = (-m)^-1 (mod B) has been calculated
p = m (I tend to use these terms interchangably!)
x = a
y = bR (mod m) has been calculated.

The code is inside the (method == 1 ) code block of:

[url]http://www.greenbank.org/sob/macasm/montmul.c[/url]

20 muls, loads of adds to handle numerous carries. A nightmare but still quite quick.

Most importantly, no division!

Now, to do this on a G5 you need this:-

Since the G5 is 64-bit we can use B=2^64
Since both values (a and b) are 64-bit they only fit in one double-word so n=1.
R = B^n = B^1 = B = 2^64.
This also means we don't need the loop inside the algorithm.
Since we only need one iteration of the loop we can assume that A is 0 all the way until we set it with the result at the end simplifying it more.

Now we just need to do (untested):-

mulld r2,x,y // r2 = low_word(x*y)
mulld r3,r2,m' // r3 = x * y * m' (mod B) : mod implicit as we only use low double-word of result
mulhdu r4,x,y // r4 = high_word(x*y)
mulld r5,r3,p // r5 = low_word(u*m)
mulhdu r6,r3,p // r6 = high_word(u*m)
add r7,r4,r6 // Worry about carry? Not if p < 2^63. What if any of a,b,p are 64-bit?
addco r2,r2,r5 // r2 now useless, we just want the carry bit
addze r7,r7 // Add carry from low_word addition on to r7

So for 64 bit processors it comes down to 5 multiplies and 3 adds. The only other bit to do is the final subtraction if r7 > p :-

cmpld r7,p // (Logical/unsigned) Compare result to p
blt done // If it is less than the skip the subtract
subf r7,r7,p // r7 = r7 - p
done: // the end

64-bit * 64-bit (mod 64-bit) with 5 muls, 3 adds, 1 comparison and a conditional subtract. No divides.

Nice.

ewmayer 2005-11-10 19:00

[QUOTE=Greenbank]Now we just need to do (untested):-

mulld r2,x,y // r2 = low_word(x*y)
mulld r3,r2,m' // r3 = x * y * m' (mod B) : mod implicit as we only use low double-word of result
mulhdu r4,x,y // r4 = high_word(x*y)
mulld r5,r3,p // r5 = low_word(u*m)
mulhdu r6,r3,p // r6 = high_word(u*m)
add r7,r4,r6 // Worry about carry? Not if p < 2^63. What if any of a,b,p are 64-bit?
addco r2,r2,r5 // r2 now useless, we just want the carry bit
addze r7,r7 // Add carry from low_word addition on to r7

So for 64 bit processors it comes down to 5 multiplies and 3 adds.[/QUOTE]You should only need 4 MULs - I don't understand why you need both the low and high half of u*m. Here is a simple version of the code I use for 64-bit Montgomery modmul - q is the modulus, qinv*q == 1 mod 2^64, and x stores the datum one wants. This is taken from the guts of a left-to-right binary exponentiation loop, that's why the initial double-wide product is a squaring in my case. The (x > hi)? normalization is followed by a mod-q conditional doubling, depending on the current bit of the exponent of the modular powering - I've precomputed qhalf = q/2 for that:

[code]
SQR_LOHI(x,lo,hi); /* mulld, mulhdu */
lo *= qinv; /* mulld */
MUL_HIGH(q,lo,lo); /* mulhdu */
x = hi - lo;

if(x > hi)
{
x += q; /* had a borrow */
}

if({current_bit_set})
{
if(x > qhalf) /* Combines overflow-on-add and need-to-subtract-q-from-sum checks */
{
x = x + x;
x -= q;
}
else
x = x + x;
}
[/code]

Orgasmic Troll 2005-11-10 19:11

hey phil, I've got a G5 iMac at home, can you put it to work?

(I'm also working on a Dual 2.7 PowerMac with 4GB of RAM here at work, but I don't think I can appropriate cycles. The best part is the 30" cinema display and the 23" off to the side.. fricken sweeeet :grin: )

Greenbank 2005-11-10 19:51

[QUOTE=ewmayer]You should only need 4 MULs - I don't understand why you need both the low and high half of u*m.[/QUOTE]

OK, if a and b fit in one double-word then:-

m' = (-m)^-1 mod 2^64

u = x*y*m' mod 2^64

u = (x * y mod 2^64) * m' (mod 2^64)

So for calculating u you can get away with ignoring the high double-word.

However:-

A = (x*y + u*m ) / 2^64

does not imply you can do:-

high(x*y) + high(u*m)

You need the low half of u*m because if low(u*m)+low(x*y) > 2^64 then your answer will be off by one.

Let's see if can concoct an example with m=13. b=16 (2^4).

m' = (-13)^-1 (mod 16)
m' = 3^-1 (mod 16)

3*11 = 33 = 1 (mod 16)

m' = 11

Now pick a trivial a and b.

x=a=1
b=1

y = bR (mod m) = 1*16 (mod 13) = 3

u = x*y*m' mod 16

u = 1*3*11 mod 16 = 33 mod 16 = 1

A = (x*y + u*m ) / 16

A = (1*3 + 1*13) / 16

Without looking at the low word of (u*m) you're doing:-

A = (1*3)/16 + (1*13)/16 = 0 + 0 = 0

The correct way does:-

A = (1*3 + 1*13) / 16 = (3 + 13) / 16 = 16/16 = 1

Tomorrow I'll find an example for 50-bit values of a,b and m if you like.

ewmayer 2005-11-10 20:21

My code is adapted from some sieving code I got several years ago from Peter Montgomery himself: here is the relevant core operation sequence from his code comments (his only allows operands < 2^63, so is a few ALU ops cheaper):
[code]
t0 = UMULH(power2, power2);
if (t0 >= q) t0 -= q;
power2 = UMULH(q, MULQ(qinv64, MULQ(power2, power2))) - t0;
power2 = (q >> (bitnow & power2)) - (power2 >> bitnow);
[/code]
Again, just 4 MULs - 2 lower-half and 2 upper-half. My and Peter's codes have been extensively self-tested as far as the core modmul routines go, so I think we'd know if carries were getting dropped with O(1) probability.

xilman 2005-11-10 21:02

[QUOTE=ewmayer] so I think we'd know if carries were getting dropped with O(1) probability.[/QUOTE]Better than o(1) probability ...


Paul

Greenbank 2005-11-11 08:58

Heh, I'm willing to be proven wrong (and learn at the same time). Plus it means I could get rid of a MUL.

I'll find a 50-bit example and then we can trace through the code...

Greenbank 2005-11-11 11:10

Hmm, try as I might I can't find a situation where:

mullw(x*y)+mullw(u*m) < 2^64.

for m=882861709928411 and random a,b values. My test program has been running for 25 minutes and it checks 100,000 random values of a and b a second.

Which leads me to believe that it is assumed that the carry will always be present. However I can find a case for m=13 (will post that in a sec).

Here's an example where it is present:-

R=2^64
a=72185797819121
b=151612213027919
m=882861709928411
mprime=12466608506227884973

(a*b)%m
344317125598003

y=(b*2^64)%m
105419857829375

u=(a*y*mprime)%B
7194243690555063843

A=(a*y + u*m) / 2^64
= 6351529896101669193770756995022848 / 2^64 = 344317125598003

A_other = (x*y)/2^64 + (u*m)/2^64
= 412528981 + 344316713069021 = 344317125598002 (out by one)


All times are UTC. The time now is 22:51.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.