mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Software (https://www.mersenneforum.org/forumdisplay.php?f=10)
-   -   mac.. (https://www.mersenneforum.org/showthread.php?t=4962)

kuli 2005-11-09 00:27

mac..
 
Is there any software that allows me to search using mac os/x? or do I have to go advanced with installing some cryptic linux stuff, and run the program from there? In short: what is the easiest way for me to use my G4 powerbook to search for the mersenne primes? :help:

rogue 2005-11-09 00:38

[QUOTE=kuli]Is there any software that allows me to search using mac os/x? or do I have to go advanced with installing some cryptic linux stuff, and run the program from there? In short: what is the easiest way for me to use my G4 powerbook to search for the mersenne primes? :help:[/QUOTE]

I assume you have OS X. If so, then get Glucas and Mlucas sources. You can build them with the Terminal app (in the Applications/Utilities folder). Run the selftest to determine which will be faster, then use that one.

BTW, there are other projects that could give you more bang for the buck. I would recommend the PIES project. With your G4 you should be able to find a 100000 digit prime within a few days.

fatphil 2005-11-09 09:48

[QUOTE=rogue]BTW, there are other projects that could give you more bang for the buck. I would recommend the PIES project. With your G4 you should be able to find a 100000 digit prime within a few days.[/QUOTE]

Out by one error, alas. G5 is days, G4 is weeks.

rogue 2005-11-09 13:35

[QUOTE=fatphil]Out by one error, alas. G5 is days, G4 is weeks.[/QUOTE]

I don't have a G4, is it really that much slower than a G5? In either case, chances of finding a Mersenne in your lifetime on a G4 are fairly slim.

fatphil 2005-11-09 14:06

[QUOTE=rogue]I don't have a G4, is it really that much slower than a G5?[/QUOTE]

Yup. 32 bit processors are just toys. Ones with only 6 general purpose registers quadruply so. Crazed old loons like myself have been saying this for years.

I guess I should add the disclaimer that this is not the official opinion of Freescale "Launched by Motorola" Semiconductor... :-|

I believe that a G4 gently spanks the behind of a similarly clocked P3 when it comes to small FFTs used in prime number finding (as opposed to the large FFTs used in prime number searching). I suspect that G4s would make superb sieving machines though, if someone were to put the effort into tuning code for them. (Not enough instruction throughput to suffer from register starvation, unlike the _too fast_ G5, where 32 registers just isn't enough.)

Greenbank 2005-11-09 14:34

Hey Rogue,

Not sure how the cycle counts differ (does anyone have a link to cycle count stuff for the G4 and G5 chips? I've got both PEM's but I can't find the cycle count info) but:-

For multiplications you cut the number of muls in a 64bitx64bit down from 8 to 2 and remove all of the addition/carry stuff:-

Here's a 32-bit mul for 2 64-bit values:-

(Note that it isn't optimised as far as pipelining and it uses far too many registers. Plus I need to make it use addze for the carries. ;-)

[code]
/* Do mul (%0,%1)*(%2,%3) put result in r18,r19,r20,r21*/
/* r[0]=mullw(al,bl) */
"mullw r21,%1,%3\n\t"

/* r[1]=mulhw(al,bl) */
"mulhwu r2,%1,%3\n\t"
/* t[1]+=mullw( al, bh ) */
"mullw r3,%1,%2\n\t"
/* t[1]+=mullw( ah, bl ); */
"mullw r4,%0,%3\n\t"

/* t[2]+=mullw( ah, bh ); */
"mullw r5,%0,%2\n\t"
/* t[2]+=mulhw( al, bh ); */
"mulhwu r6,%1,%2\n\t"
/* t[2]+=mulhw( ah, bl ); */
"mulhwu r7,%0,%3\n\t"

/* t[3]+=mulhw( ah, bh ); */
"mulhwu r18,%0,%2\n\t"

"li r11, 0\n\t"
"li r12, 0\n\t"

"li r11, 0\n\t"
"li r12, 0\n\t"

/* r[1]=r2+r3+r4 */
"addco r20,r2,r3\n\t"
"addeo r11, r11, r12\n\t"
"addco r20,r20,r4\n\t"
"addeo r11, r11, r12\n\t"

/* overflow in r11 */
/* r[2]=r5+r11 + r6+r7 */
"addco r2,r11,r5\n\t"
"addeo r18,r18,r12\n\t"
"addco r11,r6,r7\n\t"
"addeo r18,r18,r12\n\t"
"addco r19,r2,r11\n\t"
"addeo r18,r18,r12\n\t"
[/code]

Now look at the one for 64-bit (multiply two 64-bit values (r2 and r3) and put the result in r4(high),r5(low)):-

[code]
mulld r5,r2,r3
mulhdu r4,r2,r3
[/code]

That's all that is needed. No carries, no messing around at all.

Looking at some of my code (see the links I posted on the SoB forum) the 64-bit versions are going to be about a fifth of the size and much faster.

The 32-bit mulmod routine works by dividing the top 32 bits of the value by the top 16 bits of the modulus (+1). It then multiplies this value by the modulus, shifts it up and subtracts it. This is repeated until the value is small enough that one final subtraction of the modulus may be necessary.

It effectively removes up to 16 bits (but usually only 14) each iteration.

Now move this to 64-bit. I can now divide the top 64-bits of the number by the top 32-bits of the modulus. Combine this with the fact that the subsequent multiplication is quicker, the subtraction is even simpler, etc.

Greenbank 2005-11-09 14:45

[QUOTE=fatphil]I suspect that G4s would make superb sieving machines though, if someone were to put the effort into tuning code for them. (Not enough instruction throughput to suffer from register starvation, unlike the _too fast_ G5, where 32 registers just isn't enough.)[/QUOTE]

That's where some of the code above comes from (work in progress).

I got the source for proth_sieve from Mikael Klasson and converted it to run against GMP (that slowed it down :-).

So far I've written mulmod, mulmod_montgomery, pow2mod in G4 assembly:

pow2mod: [url]http://www.greenbank.org/sob/macasm/asm_pow2mod.c[/url]

(The mulmod routine is hidden inside it. It's hideous to look at. I ran out of registers (crap register management by me) and had to store a loop counter and the initial link register in the 8 bytes for the output!)

Still got to do sqrmod(), powmod (non base 2), invert (a^-1 mod p) as I need this for mulmod_montgomery. Then gcd() and I think I might need jacobi/legendre too.

Speeds are between 2 and 5 times faster than GMP for most of these operations. Here are the figures for 2M iterations of 2^x (mod p) for various x values. (p was static somewhere up at 882T) for GMP's mpz_powm() and my assembly:

GMP: 2^27 (mod p) = 6.247 seconds
GMP: 2^15840 (mod p) = 10.785 seconds
GMP: 2^158401584015480 (mod p) = 33.634 seconds

ASM: 2^27 (mod p) = 0.213 seconds
ASM: 2^15840 (mod p) = 3.910 seconds
ASM: 2^158401584015840 (mod p) = 15.813 seconds.

About to order a new Quad PowerMac G5 (a Christmas present to myself) and then I'll rewrite it all for 64-bit G5 and watch it really fly.

fatphil 2005-11-09 15:23

[QUOTE=Greenbank]That's where some of the code above comes from (work in progress).
[/QUOTE]

Oh - does the G4 not have a muladd? My above statements were on that assumption. Without it, it'll be just another 32-bit arch, and one without much of the stuff (SSE2 on 2 doubles, single operand 32*32->64) that's found in intel kit, spit, spit.

Having said that, one thing that I wrote last night, a factorial+/-1 siever with G5 in mind, didn't actually assume anything G5 specific, and looks like it has the potential to soundly beat my x86-tuned code per clock tick.

Can I enroll you as a G4 tester, pretty please?

[QUOTE=Greenbank]About to order a new Quad PowerMac G5 (a Christmas present to myself) and then I'll rewrite it all for 64-bit G5 and watch it really fly.[/QUOTE]

Drool...

(must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.)

rogue 2005-11-09 15:33

[QUOTE=Greenbank]Hey Rogue,

Not sure how the cycle counts differ (does anyone have a link to cycle count stuff for the G4 and G5 chips?[/QUOTE]

Yes, send an e-mail to rogue (at) wi.rr.com and I'll send you the PDF your are looking for.

Greenbank 2005-11-09 16:01

[b]fatphil wrote:-
Oh - does the G4 not have a muladd? My above statements were on that assumption. Without it, it'll be just another 32-bit arch, and one without much of the stuff (SSE2 on 2 doubles, single operand 32*32->64) that's found in intel kit, spit, spit.[/b]

No muladd. No single operand 32*32->64. You've got mullw and mulhw[u] to get high/low 32-bits seperately..

G5 has (checks PEM)...no muladd either, but of course you can just use mulld to do 32x32->64 in one instruction.

[b]Can I enroll you as a G4 tester, pretty please?[/b]

Yup. It's only a Mac Mini 1.42GHz with 512MB RAM and I'd rather not push it too hard as I bought it to act as a quiet server for home. It hides in the cupboard next to the ADSL router.

I tried your fp mulmod but pure integer non-fp assembly is faster on the G4. Plus I want to make sure this code has as few limits as possible. The current code is good for p up to, and including, 63-bits (some of the routines assume the top bit is free to shift into).

There's lots of other work that Chuck and Joe_O over at the SoB and Riesel forums are doing. They're rewriting the client to allow compilation on any architecture with architecture specific optimisations if they are present (they are doing 64-bit x86 specific assembly with CMOV/SSE2/SSE3 improvements.) plus a failsafe long long C implementation that the compiler will optimise so that clients are available for any OS/architecture. The current proth_sieve is entirely 32-bit, there are no 64-bit optimisations so you can imagine the possible speed improvements.

I'm just about to sell two of my Athlon XP 2200+ boxes and replace them with a dual-core Athlon X2 4400+. Time to abandon 32-bit.

They've also removed lots of the page faults in the eratosthenes sieve (for next p and factors for SPH). Can't remember all of the details but the current C++ code in proth_sieve uses vector stuff which screws up the cache usage. It has gone from 250-500 page faults/sec down to 0-1 page faults/sec. I don't have this code but I'll be contributing my G4 assembly to the cause.

> Quad PowerMac G5

[b]Drool...[/b]

Indeed. 10GHz of 64-bit G5 lovelyness in one box. A snip at 2500 UKP.

[b](must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.)[/b]

:-) All my resources are devoted to SoB at the moment. Ideally the speed increases would allow people to combine the sieving with other projects (Riesel, PSP, 321) and still be sieving faster than before. Faster sieving means more efficient PRP-ing means faster prime finding means even faster sieveing. Lather, rinse and repeat...

fatphil 2005-11-09 16:16

[QUOTE=Greenbank][b]fatphil wrote:-
Oh - does the G4 not have a muladd? My above statements were on that assumption.[/b]

No muladd. No single operand 32*32->64. You've got mullw and mulhw[u] to get high/low 32-bits seperately..

G5 has (checks PEM)...no muladd either, but of course you can just use mulld to do 32x32->64 in one instruction.
[/QUOTE]

Sorry, I wasn't clear - I live in a FPU world (something I learnt from years of being a 21164 owner!), not an integer one, when it comes to sieving. I was curious about the G4's FPU muladd capability.

[QUOTE=Greenbank][b](must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.)[/b]

:-) All my resources are devoted to SoB at the moment. Ideally the speed increases would allow people to combine the sieving with other projects (Riesel, PSP, 321) and still be sieving faster than before. Faster sieving means more efficient PRP-ing means faster prime finding means even faster sieveing. Lather, rinse and repeat...[/QUOTE]

The utility of sieving is commonly misunderstood in both directions. Most people underestimate its usefulness, but some overestimate too. A vastly faster sieve means only fractionally faster prime finding. A factor of several in sieve speed maps onto a factor of a few percent in PRPing rate.

Greenbank 2005-11-09 16:39

fmadd certainly exists for the G4.

Google for "MPCFPE32B". The 3.5MB PDF that's first in the list is the Programming Environments Manual for the G4.

[EDIT] I've also spotted jasong's msieve comments and fast mulmod for < 2^50 bit numbers. It's faster than my mulmod code so it'll fit in nicely (if it all works properly) for sieving p < 2^50.

fatphil 2005-11-10 10:03

[QUOTE=Greenbank]fmadd certainly exists for the G4.

Google for "MPCFPE32B". The 3.5MB PDF that's first in the list is the Programming Environments Manual for the G4.
[/QUOTE]

Thanks, I'll grab that.

I'll also ask around here, and see if there are any 'internal' documents that I can get my hands on too. (G4 only, alas)

ET_ 2005-11-10 11:34

[QUOTE=Greenbank]
I've also spotted jasong's msieve comments and fast mulmod for < 2^50 bit numbers. It's faster than my mulmod code so it'll fit in nicely (if it all works properly) for sieving p < 2^50.[/QUOTE]

Wasn't he jasonp? :rolleyes:

Luigi

Greenbank 2005-11-10 12:27

Indeed it was jasonp, my attribution was from memory.

On a related note, here's why G5's rock and G4's lead to madness (or at least hideous amounts of complication).

Looking at my 32-bit code for the core of the montgomery multiplication.

For a good reference see Handbook of Applied Cryptography Chapter 14.
Google for "HAC" and it is the second link. PDF's can be downloaded free for personal use (see copyright notice).

For calculating a*b (mod m) :-

It assumes:

a and b are numbers < 2^64 (i.e they fit in 2 32-bit words)
B = 2^32
n = 2 (since a and b have 2 words)
R= B^n = (2^32)^2 = 2^64
m' = (-m)^-1 (mod B) has been calculated
p = m (I tend to use these terms interchangably!)
x = a
y = bR (mod m) has been calculated.

The code is inside the (method == 1 ) code block of:

[url]http://www.greenbank.org/sob/macasm/montmul.c[/url]

20 muls, loads of adds to handle numerous carries. A nightmare but still quite quick.

Most importantly, no division!

Now, to do this on a G5 you need this:-

Since the G5 is 64-bit we can use B=2^64
Since both values (a and b) are 64-bit they only fit in one double-word so n=1.
R = B^n = B^1 = B = 2^64.
This also means we don't need the loop inside the algorithm.
Since we only need one iteration of the loop we can assume that A is 0 all the way until we set it with the result at the end simplifying it more.

Now we just need to do (untested):-

mulld r2,x,y // r2 = low_word(x*y)
mulld r3,r2,m' // r3 = x * y * m' (mod B) : mod implicit as we only use low double-word of result
mulhdu r4,x,y // r4 = high_word(x*y)
mulld r5,r3,p // r5 = low_word(u*m)
mulhdu r6,r3,p // r6 = high_word(u*m)
add r7,r4,r6 // Worry about carry? Not if p < 2^63. What if any of a,b,p are 64-bit?
addco r2,r2,r5 // r2 now useless, we just want the carry bit
addze r7,r7 // Add carry from low_word addition on to r7

So for 64 bit processors it comes down to 5 multiplies and 3 adds. The only other bit to do is the final subtraction if r7 > p :-

cmpld r7,p // (Logical/unsigned) Compare result to p
blt done // If it is less than the skip the subtract
subf r7,r7,p // r7 = r7 - p
done: // the end

64-bit * 64-bit (mod 64-bit) with 5 muls, 3 adds, 1 comparison and a conditional subtract. No divides.

Nice.

ewmayer 2005-11-10 19:00

[QUOTE=Greenbank]Now we just need to do (untested):-

mulld r2,x,y // r2 = low_word(x*y)
mulld r3,r2,m' // r3 = x * y * m' (mod B) : mod implicit as we only use low double-word of result
mulhdu r4,x,y // r4 = high_word(x*y)
mulld r5,r3,p // r5 = low_word(u*m)
mulhdu r6,r3,p // r6 = high_word(u*m)
add r7,r4,r6 // Worry about carry? Not if p < 2^63. What if any of a,b,p are 64-bit?
addco r2,r2,r5 // r2 now useless, we just want the carry bit
addze r7,r7 // Add carry from low_word addition on to r7

So for 64 bit processors it comes down to 5 multiplies and 3 adds.[/QUOTE]You should only need 4 MULs - I don't understand why you need both the low and high half of u*m. Here is a simple version of the code I use for 64-bit Montgomery modmul - q is the modulus, qinv*q == 1 mod 2^64, and x stores the datum one wants. This is taken from the guts of a left-to-right binary exponentiation loop, that's why the initial double-wide product is a squaring in my case. The (x > hi)? normalization is followed by a mod-q conditional doubling, depending on the current bit of the exponent of the modular powering - I've precomputed qhalf = q/2 for that:

[code]
SQR_LOHI(x,lo,hi); /* mulld, mulhdu */
lo *= qinv; /* mulld */
MUL_HIGH(q,lo,lo); /* mulhdu */
x = hi - lo;

if(x > hi)
{
x += q; /* had a borrow */
}

if({current_bit_set})
{
if(x > qhalf) /* Combines overflow-on-add and need-to-subtract-q-from-sum checks */
{
x = x + x;
x -= q;
}
else
x = x + x;
}
[/code]

Orgasmic Troll 2005-11-10 19:11

hey phil, I've got a G5 iMac at home, can you put it to work?

(I'm also working on a Dual 2.7 PowerMac with 4GB of RAM here at work, but I don't think I can appropriate cycles. The best part is the 30" cinema display and the 23" off to the side.. fricken sweeeet :grin: )

Greenbank 2005-11-10 19:51

[QUOTE=ewmayer]You should only need 4 MULs - I don't understand why you need both the low and high half of u*m.[/QUOTE]

OK, if a and b fit in one double-word then:-

m' = (-m)^-1 mod 2^64

u = x*y*m' mod 2^64

u = (x * y mod 2^64) * m' (mod 2^64)

So for calculating u you can get away with ignoring the high double-word.

However:-

A = (x*y + u*m ) / 2^64

does not imply you can do:-

high(x*y) + high(u*m)

You need the low half of u*m because if low(u*m)+low(x*y) > 2^64 then your answer will be off by one.

Let's see if can concoct an example with m=13. b=16 (2^4).

m' = (-13)^-1 (mod 16)
m' = 3^-1 (mod 16)

3*11 = 33 = 1 (mod 16)

m' = 11

Now pick a trivial a and b.

x=a=1
b=1

y = bR (mod m) = 1*16 (mod 13) = 3

u = x*y*m' mod 16

u = 1*3*11 mod 16 = 33 mod 16 = 1

A = (x*y + u*m ) / 16

A = (1*3 + 1*13) / 16

Without looking at the low word of (u*m) you're doing:-

A = (1*3)/16 + (1*13)/16 = 0 + 0 = 0

The correct way does:-

A = (1*3 + 1*13) / 16 = (3 + 13) / 16 = 16/16 = 1

Tomorrow I'll find an example for 50-bit values of a,b and m if you like.

ewmayer 2005-11-10 20:21

My code is adapted from some sieving code I got several years ago from Peter Montgomery himself: here is the relevant core operation sequence from his code comments (his only allows operands < 2^63, so is a few ALU ops cheaper):
[code]
t0 = UMULH(power2, power2);
if (t0 >= q) t0 -= q;
power2 = UMULH(q, MULQ(qinv64, MULQ(power2, power2))) - t0;
power2 = (q >> (bitnow & power2)) - (power2 >> bitnow);
[/code]
Again, just 4 MULs - 2 lower-half and 2 upper-half. My and Peter's codes have been extensively self-tested as far as the core modmul routines go, so I think we'd know if carries were getting dropped with O(1) probability.

xilman 2005-11-10 21:02

[QUOTE=ewmayer] so I think we'd know if carries were getting dropped with O(1) probability.[/QUOTE]Better than o(1) probability ...


Paul

Greenbank 2005-11-11 08:58

Heh, I'm willing to be proven wrong (and learn at the same time). Plus it means I could get rid of a MUL.

I'll find a 50-bit example and then we can trace through the code...

Greenbank 2005-11-11 11:10

Hmm, try as I might I can't find a situation where:

mullw(x*y)+mullw(u*m) < 2^64.

for m=882861709928411 and random a,b values. My test program has been running for 25 minutes and it checks 100,000 random values of a and b a second.

Which leads me to believe that it is assumed that the carry will always be present. However I can find a case for m=13 (will post that in a sec).

Here's an example where it is present:-

R=2^64
a=72185797819121
b=151612213027919
m=882861709928411
mprime=12466608506227884973

(a*b)%m
344317125598003

y=(b*2^64)%m
105419857829375

u=(a*y*mprime)%B
7194243690555063843

A=(a*y + u*m) / 2^64
= 6351529896101669193770756995022848 / 2^64 = 344317125598003

A_other = (x*y)/2^64 + (u*m)/2^64
= 412528981 + 344316713069021 = 344317125598002 (out by one)

Greenbank 2005-11-11 11:16

OK. Here's one where the carry isn't present:-

a=8
b=10
m=13
B=R=16
n=1
mprime = (-m)^-1 (mod 16) = 11

x=a
y=bR (mod m)=160 mod 13 = 4

u=x*y*m' (mod B) = 8*4*11 mod 16 = 0

Ahhh....is this why?


All times are UTC. The time now is 22:51.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.