![]() |
|
|
#1 |
|
Jul 2004
5·7 Posts |
Any ideas on doing "64-bit integer multiplication into 128-bit" the fastest way? You may assume Pentium 4 without SSE3 is used.
The pmuludq xmm,xmm can do two "32bit x 32bit into 64bit", latency is 6 clocks only. Using grammar multiplication method, we need four multiplications. But 128-bit addition is not available. I am stuck with shifting left and right the xmm registers quite a lot and using paddq xmm,xmm. Also, pslldq shifts on byte granularity, that's 8, 16, 24, ... bits |
|
|
|
|
|
#2 |
|
Jan 2005
Caught in a sieve
39510 Posts |
I'm pretty good at standard IA-32 assembly, but the SSE2 stuff for P4 throws me (mainly because I don't have a P4). The best expert I know on this stuff is named Jay Berg. He rewrote the ECCp-109 client for the P4, while I did non-P4. Unfortunately, all his code seems to have disappeared. If you can find somebody who downloaded it, you might take a look at his code for 128x128=256 (it's named mulmod_p109.S or something similar) and see if you get any ideas.
I'm not sure I noticed this when I rewrote the code , but 64x64=128 can be done with three multiplies, not four. Following Robert Sedgewick's Algorithms, chapter 36, page 527-528:If 64-bit numbers p and q are split into their low and high words, pl, ql, ph, and qh, Then compute rl=pl*ql, rh=ph*qh, and rm=(pl+ph)*(ql+qh)-rl-rh. It might help at this point if pl+ph and ql+qh could fit in 32 bits somehow... Anyway, note that rm=(pl*ql+ph*ql+pl*qh+ph*qh)-(pl*ql)-(ph*qh), or (ph*ql+pl*qh). So then p*q = rl+rm>>32+rh>>64, where rl, rm, and rh are each 64 bits long. |
|
|
|
|
|
#3 |
|
Jul 2004
5·7 Posts |
Thanks for the reply.
How do you sum up the 3 parts (rl, rm, rh, after shifting) to get the 128 bit result? There is 64-bit add in SSE2, but no carry output. |
|
|
|
|
|
#4 |
|
Jan 2005
Caught in a sieve
5×79 Posts |
Beats me.
I'm just givin' you the algorithm. Remember, I have no clue about SSE2. You can just concatenate rl and rh, but you will need a 128-bit sum to add rm. I did it with non-sse2 code; an addl followed by several adcl's.For ECCp-109, everything was modulo a 109-bit number, so I could have split up the multiplicands into three 31-bit numbers and one 16-bit number (with several Athlon-friendly bit shifts), and then run that algorithm. You might try breaking your numbers up into 16-bit chunks stored in 32-bit words. Note that the multiplication algorithm works recursively; multiplying four 16-bit numbers takes nine multiplies. That's more than four, but could the extra space help avoid some bit shifts? If you're interested in my non-SSE2 code, it's here. I'm still looking for Jay's code, but I'm not optimistic that I have it. And Jay's web site has disappeared. However, I recently found the former ECCp-109 project leader, Chris Monico, posting in the factoring forum. So if you're really interested in getting this code, you might contact him.
|
|
|
|
|
|
#5 |
|
Jan 2005
Caught in a sieve
5×79 Posts |
I'd just like to add some notes about calculating rm. You're adding two 32-bit numbers twice before multiplying, so the result could be as big as 66 bits before subtraction. You can do rl and rh in SSE2, in parallel (I've been studying a bit), but I think rm will have to be mostly non-sse2. Worse, I thought rm would be limited to 64 bits in the end, but it looks like it could be as big as 2*(2^32-1)^2, which is 65 bits.
Let a = pl+ph, and b = ql+qh. Split them into al, ah, bl, and bh. This is like splitting p and q except ah and bh are either 0 or 1, nothing else. In the "grammar multiplication method", multiplying a and b looks something like this: Code:
[ah] [al]
* [bh] [bl]
----------------
[ bl*al ]
[bl]?ah
[ah*bh][al]?bh
And then you get to subtract rl and rh, and then add rm to them! Fun, huh? ![]() Now you see why I would have preferred 31-bit numbers - in that case you can ignore anything that would go into hw because it does fall out in subtraction. But not, apparently with 32-bit numbers. |
|
|
|
|
|
#6 |
|
∂2ω=0
Sep 2002
República de California
22×2,939 Posts |
Don't know if this will be of any help, but perhaps just
for purposes of debugging, here's a simple 64x64=>128-bit unsigned product macro in C - this assumes an unsigned 64-bit integer type (typedef'ed to "uint64" here) is supported by the compiler. Obvious optimizations would include: a) Use direct 32-bit register accesses to obviate the shifts that extract the low and high half of each 64-bit input; b) Replace sum < (either addend) carry check with carry bit check on architectures which support an explicit add-with-carry; c) Implement 3-MUL variant if hardware integer MUL is slow. Have fun, -Ernst Code:
/* Generic 128-bit product macros: represent the inputs as
x = a + b*2^32, y = c + d*2^32, and then do 4 MULs and a bunch of
adds-with-carry to get x*y = b*d*2^64 + (a*d + b*c)*2^32 + a*c .
Actual calling arguments are assumed to be 64-bit ints - user must
make sure this is true. Result written into lo, hi.
*/
#define MUL_LOHI(__x,__y,__lo,__hi) \
{ \
uint64 __a,__b,__c,__d,__ac,__ad,__bc; \
/*__a = (__x) & (uint64)0x00000000ffffffff;*/ \
__a = ((__x)<<32)>>32; \
__b = (__x)>>32; \
/*__c = (__y) & (uint64)0x00000000ffffffff;*/ \
__c = ((__y)<<32)>>32; \
__d = (__y)>>32; \
/* Calculate 4 subproducts in order in which they are first used */\
__ac = __a*__c; \
__bc = __b*__c; \
__hi = __b*__d; \
__ad = __a*__d; \
__lo = __ac; /* use __lo to store copy of __ac */ \
__ac += (__bc<<32); __hi += (__ac < __lo); \
__lo = __ac + (__ad<<32); __hi += (__lo < __ac); \
__hi += (__bc>>32) + (__ad>>32); \
}
|
|
|
|
|
|
#7 |
|
Jul 2004
5×7 Posts |
Thanks for the replies. I will be looking at every idea.
By the way, I have been inactive in programming for a while. |
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Help from coders | ET_ | GPU Computing | 5 | 2014-01-26 13:58 |
| performance improvement with assembly | bsquared | Software | 15 | 2010-09-28 19:00 |
| Calling hotshot coders: NFS code challenge | R.D. Silverman | Programming | 63 | 2006-10-09 05:48 |