![]() |
|
|
#1 |
|
217358 Posts |
Is there any software that allows me to search using mac os/x? or do I have to go advanced with installing some cryptic linux stuff, and run the program from there? In short: what is the easiest way for me to use my G4 powerbook to search for the mersenne primes?
|
|
|
|
#2 | |
|
"Mark"
Apr 2003
Between here and the
22×7×227 Posts |
Quote:
BTW, there are other projects that could give you more bang for the buck. I would recommend the PIES project. With your G4 you should be able to find a 100000 digit prime within a few days. |
|
|
|
|
|
|
#3 | |
|
May 2003
E716 Posts |
Quote:
|
|
|
|
|
|
|
#4 | |
|
"Mark"
Apr 2003
Between here and the
22×7×227 Posts |
Quote:
Last fiddled with by rogue on 2005-11-09 at 13:36 |
|
|
|
|
|
|
#5 | |
|
May 2003
3×7×11 Posts |
Quote:
I guess I should add the disclaimer that this is not the official opinion of Freescale "Launched by Motorola" Semiconductor... :-| I believe that a G4 gently spanks the behind of a similarly clocked P3 when it comes to small FFTs used in prime number finding (as opposed to the large FFTs used in prime number searching). I suspect that G4s would make superb sieving machines though, if someone were to put the effort into tuning code for them. (Not enough instruction throughput to suffer from register starvation, unlike the _too fast_ G5, where 32 registers just isn't enough.) |
|
|
|
|
|
|
#6 |
|
Jul 2005
2·193 Posts |
Hey Rogue,
Not sure how the cycle counts differ (does anyone have a link to cycle count stuff for the G4 and G5 chips? I've got both PEM's but I can't find the cycle count info) but:- For multiplications you cut the number of muls in a 64bitx64bit down from 8 to 2 and remove all of the addition/carry stuff:- Here's a 32-bit mul for 2 64-bit values:- (Note that it isn't optimised as far as pipelining and it uses far too many registers. Plus I need to make it use addze for the carries. ;-) Code:
/* Do mul (%0,%1)*(%2,%3) put result in r18,r19,r20,r21*/
/* r[0]=mullw(al,bl) */
"mullw r21,%1,%3\n\t"
/* r[1]=mulhw(al,bl) */
"mulhwu r2,%1,%3\n\t"
/* t[1]+=mullw( al, bh ) */
"mullw r3,%1,%2\n\t"
/* t[1]+=mullw( ah, bl ); */
"mullw r4,%0,%3\n\t"
/* t[2]+=mullw( ah, bh ); */
"mullw r5,%0,%2\n\t"
/* t[2]+=mulhw( al, bh ); */
"mulhwu r6,%1,%2\n\t"
/* t[2]+=mulhw( ah, bl ); */
"mulhwu r7,%0,%3\n\t"
/* t[3]+=mulhw( ah, bh ); */
"mulhwu r18,%0,%2\n\t"
"li r11, 0\n\t"
"li r12, 0\n\t"
"li r11, 0\n\t"
"li r12, 0\n\t"
/* r[1]=r2+r3+r4 */
"addco r20,r2,r3\n\t"
"addeo r11, r11, r12\n\t"
"addco r20,r20,r4\n\t"
"addeo r11, r11, r12\n\t"
/* overflow in r11 */
/* r[2]=r5+r11 + r6+r7 */
"addco r2,r11,r5\n\t"
"addeo r18,r18,r12\n\t"
"addco r11,r6,r7\n\t"
"addeo r18,r18,r12\n\t"
"addco r19,r2,r11\n\t"
"addeo r18,r18,r12\n\t"
Code:
mulld r5,r2,r3 mulhdu r4,r2,r3 Looking at some of my code (see the links I posted on the SoB forum) the 64-bit versions are going to be about a fifth of the size and much faster. The 32-bit mulmod routine works by dividing the top 32 bits of the value by the top 16 bits of the modulus (+1). It then multiplies this value by the modulus, shifts it up and subtracts it. This is repeated until the value is small enough that one final subtraction of the modulus may be necessary. It effectively removes up to 16 bits (but usually only 14) each iteration. Now move this to 64-bit. I can now divide the top 64-bits of the number by the top 32-bits of the modulus. Combine this with the fact that the subsequent multiplication is quicker, the subtraction is even simpler, etc. Last fiddled with by Greenbank on 2005-11-09 at 14:46 |
|
|
|
|
|
#7 | |
|
Jul 2005
2·193 Posts |
Quote:
I got the source for proth_sieve from Mikael Klasson and converted it to run against GMP (that slowed it down :-). So far I've written mulmod, mulmod_montgomery, pow2mod in G4 assembly: pow2mod: http://www.greenbank.org/sob/macasm/asm_pow2mod.c (The mulmod routine is hidden inside it. It's hideous to look at. I ran out of registers (crap register management by me) and had to store a loop counter and the initial link register in the 8 bytes for the output!) Still got to do sqrmod(), powmod (non base 2), invert (a^-1 mod p) as I need this for mulmod_montgomery. Then gcd() and I think I might need jacobi/legendre too. Speeds are between 2 and 5 times faster than GMP for most of these operations. Here are the figures for 2M iterations of 2^x (mod p) for various x values. (p was static somewhere up at 882T) for GMP's mpz_powm() and my assembly: GMP: 2^27 (mod p) = 6.247 seconds GMP: 2^15840 (mod p) = 10.785 seconds GMP: 2^158401584015480 (mod p) = 33.634 seconds ASM: 2^27 (mod p) = 0.213 seconds ASM: 2^15840 (mod p) = 3.910 seconds ASM: 2^158401584015840 (mod p) = 15.813 seconds. About to order a new Quad PowerMac G5 (a Christmas present to myself) and then I'll rewrite it all for 64-bit G5 and watch it really fly. |
|
|
|
|
|
|
#8 | ||
|
May 2003
3478 Posts |
Quote:
Having said that, one thing that I wrote last night, a factorial+/-1 siever with G5 in mind, didn't actually assume anything G5 specific, and looks like it has the potential to soundly beat my x86-tuned code per clock tick. Can I enroll you as a G4 tester, pretty please? Quote:
(must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.) |
||
|
|
|
|
|
#9 | |
|
"Mark"
Apr 2003
Between here and the
22·7·227 Posts |
Quote:
|
|
|
|
|
|
|
#10 |
|
Jul 2005
2·193 Posts |
fatphil wrote:-
Oh - does the G4 not have a muladd? My above statements were on that assumption. Without it, it'll be just another 32-bit arch, and one without much of the stuff (SSE2 on 2 doubles, single operand 32*32->64) that's found in intel kit, spit, spit. No muladd. No single operand 32*32->64. You've got mullw and mulhw[u] to get high/low 32-bits seperately.. G5 has (checks PEM)...no muladd either, but of course you can just use mulld to do 32x32->64 in one instruction. Can I enroll you as a G4 tester, pretty please? Yup. It's only a Mac Mini 1.42GHz with 512MB RAM and I'd rather not push it too hard as I bought it to act as a quiet server for home. It hides in the cupboard next to the ADSL router. I tried your fp mulmod but pure integer non-fp assembly is faster on the G4. Plus I want to make sure this code has as few limits as possible. The current code is good for p up to, and including, 63-bits (some of the routines assume the top bit is free to shift into). There's lots of other work that Chuck and Joe_O over at the SoB and Riesel forums are doing. They're rewriting the client to allow compilation on any architecture with architecture specific optimisations if they are present (they are doing 64-bit x86 specific assembly with CMOV/SSE2/SSE3 improvements.) plus a failsafe long long C implementation that the compiler will optimise so that clients are available for any OS/architecture. The current proth_sieve is entirely 32-bit, there are no 64-bit optimisations so you can imagine the possible speed improvements. I'm just about to sell two of my Athlon XP 2200+ boxes and replace them with a dual-core Athlon X2 4400+. Time to abandon 32-bit. They've also removed lots of the page faults in the eratosthenes sieve (for next p and factors for SPH). Can't remember all of the details but the current C++ code in proth_sieve uses vector stuff which screws up the cache usage. It has gone from 250-500 page faults/sec down to 0-1 page faults/sec. I don't have this code but I'll be contributing my G4 assembly to the cause. > Quad PowerMac G5 Drool... Indeed. 10GHz of 64-bit G5 lovelyness in one box. A snip at 2500 UKP. (must ... resist ... temptation ... to plug own ... prime-hunting project ... for which G5s are ... the fastest of all architectures at PRP tests. Shit, failed.) :-) All my resources are devoted to SoB at the moment. Ideally the speed increases would allow people to combine the sieving with other projects (Riesel, PSP, 321) and still be sieving faster than before. Faster sieving means more efficient PRP-ing means faster prime finding means even faster sieveing. Lather, rinse and repeat... |
|
|
|
|
|
#11 | ||
|
May 2003
3×7×11 Posts |
Quote:
Quote:
|
||
|
|
|