mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2005-11-09, 00:27   #1
kuli
 

217358 Posts
Exclamation mac..

Is there any software that allows me to search using mac os/x? or do I have to go advanced with installing some cryptic linux stuff, and run the program from there? In short: what is the easiest way for me to use my G4 powerbook to search for the mersenne primes?
  Reply With Quote
Old 2005-11-09, 00:38   #2
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22×7×227 Posts
Default

Quote:
Originally Posted by kuli
Is there any software that allows me to search using mac os/x? or do I have to go advanced with installing some cryptic linux stuff, and run the program from there? In short: what is the easiest way for me to use my G4 powerbook to search for the mersenne primes?
I assume you have OS X. If so, then get Glucas and Mlucas sources. You can build them with the Terminal app (in the Applications/Utilities folder). Run the selftest to determine which will be faster, then use that one.

BTW, there are other projects that could give you more bang for the buck. I would recommend the PIES project. With your G4 you should be able to find a 100000 digit prime within a few days.
rogue is offline   Reply With Quote
Old 2005-11-09, 09:48   #3
fatphil
 
fatphil's Avatar
 
May 2003

E716 Posts
Default

Quote:
Originally Posted by rogue
BTW, there are other projects that could give you more bang for the buck. I would recommend the PIES project. With your G4 you should be able to find a 100000 digit prime within a few days.
Out by one error, alas. G5 is days, G4 is weeks.
fatphil is offline   Reply With Quote
Old 2005-11-09, 13:35   #4
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22×7×227 Posts
Default

Quote:
Originally Posted by fatphil
Out by one error, alas. G5 is days, G4 is weeks.
I don't have a G4, is it really that much slower than a G5? In either case, chances of finding a Mersenne in your lifetime on a G4 are fairly slim.

Last fiddled with by rogue on 2005-11-09 at 13:36
rogue is offline   Reply With Quote
Old 2005-11-09, 14:06   #5
fatphil
 
fatphil's Avatar
 
May 2003

3×7×11 Posts
Default

Quote:
Originally Posted by rogue
I don't have a G4, is it really that much slower than a G5?
Yup. 32 bit processors are just toys. Ones with only 6 general purpose registers quadruply so. Crazed old loons like myself have been saying this for years.

I guess I should add the disclaimer that this is not the official opinion of Freescale "Launched by Motorola" Semiconductor... :-|

I believe that a G4 gently spanks the behind of a similarly clocked P3 when it comes to small FFTs used in prime number finding (as opposed to the large FFTs used in prime number searching). I suspect that G4s would make superb sieving machines though, if someone were to put the effort into tuning code for them. (Not enough instruction throughput to suffer from register starvation, unlike the _too fast_ G5, where 32 registers just isn't enough.)
fatphil is offline   Reply With Quote
Old 2005-11-09, 14:34   #6
Greenbank
 
Greenbank's Avatar
 
Jul 2005

2·193 Posts
Default

Hey Rogue,

Not sure how the cycle counts differ (does anyone have a link to cycle count stuff for the G4 and G5 chips? I've got both PEM's but I can't find the cycle count info) but:-

For multiplications you cut the number of muls in a 64bitx64bit down from 8 to 2 and remove all of the addition/carry stuff:-

Here's a 32-bit mul for 2 64-bit values:-

(Note that it isn't optimised as far as pipelining and it uses far too many registers. Plus I need to make it use addze for the carries. ;-)

Code:
                /* Do mul (%0,%1)*(%2,%3) put result in r18,r19,r20,r21*/
                /* r[0]=mullw(al,bl) */
                "mullw r21,%1,%3\n\t"

                /* r[1]=mulhw(al,bl) */
                "mulhwu r2,%1,%3\n\t"
                /* t[1]+=mullw( al, bh ) */
                "mullw r3,%1,%2\n\t"
                /* t[1]+=mullw( ah, bl ); */
                "mullw r4,%0,%3\n\t"

                /* t[2]+=mullw( ah, bh ); */
                "mullw r5,%0,%2\n\t"
                /* t[2]+=mulhw( al, bh ); */
                "mulhwu r6,%1,%2\n\t"
                /* t[2]+=mulhw( ah, bl ); */
                "mulhwu r7,%0,%3\n\t"

                /* t[3]+=mulhw( ah, bh ); */
                "mulhwu r18,%0,%2\n\t"

                "li r11, 0\n\t"
                "li r12, 0\n\t"

                "li r11, 0\n\t"
                "li r12, 0\n\t"

                /* r[1]=r2+r3+r4 */
                "addco r20,r2,r3\n\t"
                "addeo r11, r11, r12\n\t"
                "addco r20,r20,r4\n\t"
                "addeo r11, r11, r12\n\t"

                /* overflow in r11 */
                /* r[2]=r5+r11 + r6+r7 */
                "addco r2,r11,r5\n\t"
                "addeo r18,r18,r12\n\t"
                "addco r11,r6,r7\n\t"
                "addeo r18,r18,r12\n\t"
                "addco r19,r2,r11\n\t"
                "addeo r18,r18,r12\n\t"
Now look at the one for 64-bit (multiply two 64-bit values (r2 and r3) and put the result in r4(high),r5(low)):-

Code:
mulld r5,r2,r3
mulhdu r4,r2,r3
That's all that is needed. No carries, no messing around at all.

Looking at some of my code (see the links I posted on the SoB forum) the 64-bit versions are going to be about a fifth of the size and much faster.

The 32-bit mulmod routine works by dividing the top 32 bits of the value by the top 16 bits of the modulus (+1). It then multiplies this value by the modulus, shifts it up and subtracts it. This is repeated until the value is small enough that one final subtraction of the modulus may be necessary.

It effectively removes up to 16 bits (but usually only 14) each iteration.

Now move this to 64-bit. I can now divide the top 64-bits of the number by the top 32-bits of the modulus. Combine this with the fact that the subsequent multiplication is quicker, the subtraction is even simpler, etc.

Last fiddled with by Greenbank on 2005-11-09 at 14:46
Greenbank is offline   Reply With Quote
Old 2005-11-09, 14:45   #7
Greenbank
 
Greenbank's Avatar
 
Jul 2005

2·193 Posts
Default

Quote:
Originally Posted by fatphil
I suspect that G4s would make superb sieving machines though, if someone were to put the effort into tuning code for them. (Not enough instruction throughput to suffer from register starvation, unlike the _too fast_ G5, where 32 registers just isn't enough.)
That's where some of the code above comes from (work in progress).

I got the source for proth_sieve from Mikael Klasson and converted it to run against GMP (that slowed it down :-).

So far I've written mulmod, mulmod_montgomery, pow2mod in G4 assembly:

pow2mod: http://www.greenbank.org/sob/macasm/asm_pow2mod.c

(The mulmod routine is hidden inside it. It's hideous to look at. I ran out of registers (crap register management by me) and had to store a loop counter and the initial link register in the 8 bytes for the output!)

Still got to do sqrmod(), powmod (non base 2), invert (a^-1 mod p) as I need this for mulmod_montgomery. Then gcd() and I think I might need jacobi/legendre too.

Speeds are between 2 and 5 times faster than GMP for most of these operations. Here are the figures for 2M iterations of 2^x (mod p) for various x values. (p was static somewhere up at 882T) for GMP's mpz_powm() and my assembly:

GMP: 2^27 (mod p) = 6.247 seconds
GMP: 2^15840 (mod p) = 10.785 seconds
GMP: 2^158401584015480 (mod p) = 33.634 seconds

ASM: 2^27 (mod p) = 0.213 seconds
ASM: 2^15840 (mod p) = 3.910 seconds
ASM: 2^158401584015840 (mod p) = 15.813 seconds.

About to order a new Quad PowerMac G5 (a Christmas present to myself) and then I'll rewrite it all for 64-bit G5 and watch it really fly.
Greenbank is offline   Reply With Quote
Old 2005-11-09, 15:23   #8
fatphil
 
fatphil's Avatar
 
May 2003

3478 Posts
Default

Quote:
Originally Posted by Greenbank
That's where some of the code above comes from (work in progress).
Oh - does the G4 not have a muladd? My above statements were on that assumption. Without it, it'll be just another 32-bit arch, and one without much of the stuff (SSE2 on 2 doubles, single operand 32*32->64) that's found in intel kit, spit, spit.

Having said that, one thing that I wrote last night, a factorial+/-1 siever with G5 in mind, didn't actually assume anything G5 specific, and looks like it has the potential to soundly beat my x86-tuned code per clock tick.

Can I enroll you as a G4 tester, pretty please?

Quote:
Originally Posted by Greenbank
About to order a new Quad PowerMac G5 (a Christmas present to myself) and then I'll rewrite it all for 64-bit G5 and watch it really fly.
Drool...

(must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.)
fatphil is offline   Reply With Quote
Old 2005-11-09, 15:33   #9
rogue
 
rogue's Avatar
 
"Mark"
Apr 2003
Between here and the

22·7·227 Posts
Default

Quote:
Originally Posted by Greenbank
Hey Rogue,

Not sure how the cycle counts differ (does anyone have a link to cycle count stuff for the G4 and G5 chips?
Yes, send an e-mail to rogue (at) wi.rr.com and I'll send you the PDF your are looking for.
rogue is offline   Reply With Quote
Old 2005-11-09, 16:01   #10
Greenbank
 
Greenbank's Avatar
 
Jul 2005

2·193 Posts
Default

fatphil wrote:-
Oh - does the G4 not have a muladd? My above statements were on that assumption. Without it, it'll be just another 32-bit arch, and one without much of the stuff (SSE2 on 2 doubles, single operand 32*32->64) that's found in intel kit, spit, spit.


No muladd. No single operand 32*32->64. You've got mullw and mulhw[u] to get high/low 32-bits seperately..

G5 has (checks PEM)...no muladd either, but of course you can just use mulld to do 32x32->64 in one instruction.

Can I enroll you as a G4 tester, pretty please?

Yup. It's only a Mac Mini 1.42GHz with 512MB RAM and I'd rather not push it too hard as I bought it to act as a quiet server for home. It hides in the cupboard next to the ADSL router.

I tried your fp mulmod but pure integer non-fp assembly is faster on the G4. Plus I want to make sure this code has as few limits as possible. The current code is good for p up to, and including, 63-bits (some of the routines assume the top bit is free to shift into).

There's lots of other work that Chuck and Joe_O over at the SoB and Riesel forums are doing. They're rewriting the client to allow compilation on any architecture with architecture specific optimisations if they are present (they are doing 64-bit x86 specific assembly with CMOV/SSE2/SSE3 improvements.) plus a failsafe long long C implementation that the compiler will optimise so that clients are available for any OS/architecture. The current proth_sieve is entirely 32-bit, there are no 64-bit optimisations so you can imagine the possible speed improvements.

I'm just about to sell two of my Athlon XP 2200+ boxes and replace them with a dual-core Athlon X2 4400+. Time to abandon 32-bit.

They've also removed lots of the page faults in the eratosthenes sieve (for next p and factors for SPH). Can't remember all of the details but the current C++ code in proth_sieve uses vector stuff which screws up the cache usage. It has gone from 250-500 page faults/sec down to 0-1 page faults/sec. I don't have this code but I'll be contributing my G4 assembly to the cause.

> Quad PowerMac G5

Drool...

Indeed. 10GHz of 64-bit G5 lovelyness in one box. A snip at 2500 UKP.

(must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.)

:-) All my resources are devoted to SoB at the moment. Ideally the speed increases would allow people to combine the sieving with other projects (Riesel, PSP, 321) and still be sieving faster than before. Faster sieving means more efficient PRP-ing means faster prime finding means even faster sieveing. Lather, rinse and repeat...
Greenbank is offline   Reply With Quote
Old 2005-11-09, 16:16   #11
fatphil
 
fatphil's Avatar
 
May 2003

3×7×11 Posts
Default

Quote:
Originally Posted by Greenbank
fatphil wrote:-
Oh - does the G4 not have a muladd? My above statements were on that assumption.


No muladd. No single operand 32*32->64. You've got mullw and mulhw[u] to get high/low 32-bits seperately..

G5 has (checks PEM)...no muladd either, but of course you can just use mulld to do 32x32->64 in one instruction.
Sorry, I wasn't clear - I live in a FPU world (something I learnt from years of being a 21164 owner!), not an integer one, when it comes to sieving. I was curious about the G4's FPU muladd capability.

Quote:
Originally Posted by Greenbank
(must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.)

:-) All my resources are devoted to SoB at the moment. Ideally the speed increases would allow people to combine the sieving with other projects (Riesel, PSP, 321) and still be sieving faster than before. Faster sieving means more efficient PRP-ing means faster prime finding means even faster sieveing. Lather, rinse and repeat...
The utility of sieving is commonly misunderstood in both directions. Most people underestimate its usefulness, but some overestimate too. A vastly faster sieve means only fractionally faster prime finding. A factor of several in sieve speed maps onto a factor of a few percent in PRPing rate.
fatphil is offline   Reply With Quote
Reply



All times are UTC. The time now is 22:32.


Fri Aug 6 22:32:25 UTC 2021 up 14 days, 17:01, 1 user, load averages: 4.35, 3.60, 3.34

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.