mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware

Reply
 
Thread Tools
Old 2003-03-01, 22:58   #1
nukemyrman
 
Mar 2003
Yucaipa, CA, USA

10002 Posts
Default Using Motorola 7410s to factor numbers or test for primality

Hi,
I was interested in factoring large Mersenne prime canidates using the Motorola 7410 AltiVec processor. I suspect they would not be too good running Lucas (prime testing), since the device is only 32 bits single precision. If I'm wrong, let me know. The core processor is 64 bit but the real power lies in the AltiVec vector ALUs, which are 32 bits, integer or floating point.
If there is a ANSI C or C++ version of a factoring program that uses the AltiVec hardware on the PPC 7410 I'd like to compile it and use it.
Any suggestions?
What is the best ANSI C program out there that factors large, (2^45,000,000) - 1, numbers for any processor.
Thank You, Nick
nukemyrman is offline   Reply With Quote
Old 2003-03-02, 02:22   #2
Xyzzy
 
Xyzzy's Avatar
 
"Mike"
Aug 2002

2×3,851 Posts
Default

I've wondered for a while if this paper had any significance for LL testing...

http://developer.apple.com/hardware/ve/pdf/g4fft.pdf
Xyzzy is offline   Reply With Quote
Old 2003-03-03, 01:53   #3
Paulie
 
Paulie's Avatar
 
Aug 2002

3378 Posts
Default

I posted at the Glucas site on the link Xyzzy a while back:

Glucas Sourceforge Tracker

On the topic of Altivec, at the bottom of the above link:

Quote:
OTOH, altivec instructions are not usable for us because they
aren't double precision capable, and we need that precision.

Sincerly,

Guillermo Ballester Valor.
Cheers! :)
Paulie is offline   Reply With Quote
Old 2003-03-03, 02:04   #4
Paulie
 
Paulie's Avatar
 
Aug 2002

223 Posts
Default

I just read my old link that I posed on Sourceforge, and they have this doc on there:

Octuple-precision floating-point on Apple G4

Abscract: We describe herein a G4 Velocity Engine (Altivec) implementation of "oct-precision," i.e. 256-bit floating-point operations. (We speak of 32-bit exponents and 224-bit mantissas.) We present performance benchmarks in comparison to an existing C++ library. The basic result is that Altivec-based oct-precision can run about 4x faster than a scalar implementation of the same precision; a 500 Mhz.. G4 can therefore perform at 5-10 Mocts (million oct-ops per second).

http://developer.apple.com/hardware/ve/pdf/oct3a.pdf

If Altivec isn't good for double precision, would it be worth going oct-precision? :D :D :D
Paulie is offline   Reply With Quote
Old 2003-03-03, 02:16   #5
QuintLeo
 
QuintLeo's Avatar
 
Oct 2002
Lost in the hills of Iowa

1110000002 Posts
Default

Isn't trial factoring an integer operation?

I'd think AltiVec code should handle THAT quite well - there are good reasons why the AltiVec G4s are *mosterously* faster than anything else on RC5 crunching.
QuintLeo is offline   Reply With Quote
Old 2003-03-03, 21:53   #6
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

9,791 Posts
Default

Quote:
Originally Posted by Paulie
I just read my old link that I posed on Sourceforge, and they have this doc on there:

Octuple-precision floating-point on Apple G4

Abscract: We describe herein a G4 Velocity Engine (Altivec) implementation of "oct-precision," i.e. 256-bit floating-point operations. (We speak of 32-bit exponents and 224-bit mantissas.) We present performance benchmarks in comparison to an existing C++ library. The basic result is that Altivec-based oct-precision can run about 4x faster than a scalar implementation of the same precision; a 500 Mhz.. G4 can therefore perform at 5-10 Mocts (million oct-ops per second).

http://developer.apple.com/hardware/ve/pdf/oct3a.pdf

If Altivec isn't good for double precision, would it be worth going oct-precision? :D :D :D
The oct-precision stuff is by Richard Crandall and Jason Papadopoulos (my fellow investigators in the compositeness proof of the 24th Fermat number), and I've traded some e-mails with them about it. The important thing to understand here is that the extra precision over the (hardware-supported) IEEE double type is all via software emulation, making it grindingly slow relative to arithmetic with doubles - do the simple math and you'll see that 5-10 "Mocts" (an ugly term, but that's unimportant for the present argument) on a 500MHZ CPU translates to 50-100 cycles per operation. Emulated extended-precision operations are for those applications where one needs precision (and/or range) more than speed, e.g. for high-precision initialization of data tables, numerical solution of exceptionally ill-conditioned matrix equations, and the like.

Quote:
Originally Posted by QuintLeo
Isn't trial factoring an integer operation?

I'd think AltiVec code should handle THAT quite well - there are good reasons why the AltiVec G4s are *mosterously* faster than anything else on RC5 crunching.
Like the LL test, the underlying arithmetic is pure-integer, but there is a variety of ways to effect it in practice, depending on what the hardware supports. Practical sieve-based trial factoring needs to go to a depth of at least 64 bits, so for simplicity let's focus on arithmetic with 64-bit integer operands. The key operation w.r.to factoring is modular multiplication of integers this size, i.e. we want x*y modulo q, where x, y and q are unsigned ints of 64 bits (or slightly larger). Note that the intermediate product (x*y, prior to modular reduction) has twice as many bits, i.e. has 128 bits or more. Only a few processors (e.g. Alpha, Itanium, MIPS) have hardware support for 128-bit products, and fewer of these can do it quickly. Lacking this type of high-end integer multiply support, this kind of functionality must be emulated, e.g. if one had hardware support for a 64-bit product of 32-bit inputs, one would form a 128-bit product of 64-bit ints x = a+b*2^32 and y = c+d*2^32 via 4 of these 32x32-->64-bit muls and then assemble the pieces. If there's good hardware support for these simpler operations (e.g. P3 and Athlon are OK at these), this can still run pretty fast.

Now let's look at PPC and AltiVec (which Klaus Kastens is currently helping me port my C-based TF code to). PPC (at least the more recent models, e.g. the 7450) is pretty good at 32-bit integer math - for instance, 32-bit integer multiply needs 2 cycles (pipelined). But, PPC needs two separate multiply operations to get a 64-bit product of 32-bit inputs; one to get the lower 32 bits of the result, one to get the upper 32. That means (assuming perfect pipelining) 4 cycles to get the 64-bit product; 4 of these need 16 cycles. Assembling the resulting pieces is also nontrivial, since 64-bit integer adds must also be emulated using 32-bit hardware operations, and the carry from the lower 32-bit sum into the upper half serializes the code, making it difficult to keep all of the multiple integer units of the CPU busy. But let's assume we can code things perfectly and actually get a 128-bit product in just 16 cycles. That is about as fast as we can do the same on a P3 or Alpha 21164 or MIPS R10000 (not sure if the R15000 is any better than the R10K at integer multiply, specifically the double-wide DMULTU operation), but is a factor of 8 slower than Alpha 21264, which needs just 2 pipelined cycles to get a 128-bit product.

Whether the AltiVec can help out is still an open question. On the one hand its multiply functionality is even more limited than that of the PPC core - it can only get the lower half of a 64-bit product of 32-bit ints. On the other hand (IIRC) it can do 4 of these at a time. OTOOH (I'm running out of hands here :)), I believe it can only do the 4-way multiply if we have 4 32-bit inputs all getting multiplied by the same 32-bit number, thus it may be difficult to take advantage of the 4-way SIMD capability for the purposes of factoring. I still hold out hope that it could help in some way, but even in my most wildly optimistic dreams we're talking perhaps a 2X speedup over the PPC core alone.
ewmayer is offline   Reply With Quote
Old 2003-03-04, 15:41   #7
Paulie
 
Paulie's Avatar
 
Aug 2002

223 Posts
Default

Quote:
Originally Posted by ewmayer
The important thing to understand here is that the extra precision over the (hardware-supported) IEEE double type is all via software emulation, making it grindingly slow relative to arithmetic with doubles - do the simple math and you'll see that 5-10 "Mocts" (an ugly term, but that's unimportant for the present argument) on a 500MHZ CPU translates to 50-100 cycles per operation. Emulated extended-precision operations are for those applications where one needs precision (and/or range) more than speed, e.g. for high-precision initialization of data tables, numerical solution of exceptionally ill-conditioned matrix equations, and the like.
Oh boy, per op, yikes! :)

Thanks so much for the reply, it's will go in my bag of linkage for future reference (since this question keeps poping up every once and a while).

I guess those of us that farm P4's but use PPC's will just have to deal with the fact that the G4's are optimized for media, not crunching. Maybe the 970 will be a boon, assuming a bunch of assumptions actually happen between IBM and Apple. :D
Paulie is offline   Reply With Quote
Old 2003-03-04, 16:08   #8
Xyzzy
 
Xyzzy's Avatar
 
"Mike"
Aug 2002

11110000101102 Posts
Default

Those of us with the G3 (Mine is a 750FX) are really up a creek...
Xyzzy is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
Conjectured Primality Test for Specific Class of Mersenne Numbers primus Miscellaneous Math 1 2014-10-12 09:25
Pretty Fast Primality Test (for numbers = 3 mod 4) tapion64 Miscellaneous Math 40 2014-04-20 05:43
Proof of Primality Test for Fermat Numbers princeps Math 15 2012-04-02 21:49
The fastest primality test for Fermat numbers. Arkadiusz Math 6 2011-04-05 19:39
A primality test for Fermat numbers faster than Pépin's test ? T.Rex Math 0 2004-10-26 21:37

All times are UTC. The time now is 07:06.

Thu Oct 22 07:06:59 UTC 2020 up 42 days, 4:17, 0 users, load averages: 1.50, 1.36, 1.27

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2020, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.