mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2004-10-18, 01:42   #1
xenon
 
Jul 2004

5·7 Posts
Default IA-32 Assembly Coders, anyone?

Any ideas on doing "64-bit integer multiplication into 128-bit" the fastest way? You may assume Pentium 4 without SSE3 is used.

The pmuludq xmm,xmm can do two "32bit x 32bit into 64bit", latency is 6 clocks only. Using grammar multiplication method, we need four multiplications. But 128-bit addition is not available. I am stuck with shifting left and right the xmm registers quite a lot and using paddq xmm,xmm. Also, pslldq shifts on byte granularity, that's 8, 16, 24, ... bits
xenon is offline   Reply With Quote
Old 2005-01-12, 00:48   #2
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

39510 Posts
Default

I'm pretty good at standard IA-32 assembly, but the SSE2 stuff for P4 throws me (mainly because I don't have a P4). The best expert I know on this stuff is named Jay Berg. He rewrote the ECCp-109 client for the P4, while I did non-P4. Unfortunately, all his code seems to have disappeared. If you can find somebody who downloaded it, you might take a look at his code for 128x128=256 (it's named mulmod_p109.S or something similar) and see if you get any ideas.

I'm not sure I noticed this when I rewrote the code , but 64x64=128 can be done with three multiplies, not four. Following Robert Sedgewick's Algorithms, chapter 36, page 527-528:

If 64-bit numbers p and q are split into their low and high words, pl, ql, ph, and qh,
Then compute rl=pl*ql, rh=ph*qh, and rm=(pl+ph)*(ql+qh)-rl-rh. It might help at this point if pl+ph and ql+qh could fit in 32 bits somehow...
Anyway, note that rm=(pl*ql+ph*ql+pl*qh+ph*qh)-(pl*ql)-(ph*qh), or (ph*ql+pl*qh).
So then p*q = rl+rm>>32+rh>>64, where rl, rm, and rh are each 64 bits long.
Ken_g6 is offline   Reply With Quote
Old 2005-01-13, 03:17   #3
xenon
 
Jul 2004

5·7 Posts
Default

Thanks for the reply.
How do you sum up the 3 parts (rl, rm, rh, after shifting) to get the 128 bit result?
There is 64-bit add in SSE2, but no carry output.
xenon is offline   Reply With Quote
Old 2005-01-13, 17:54   #4
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default

Beats me. I'm just givin' you the algorithm. Remember, I have no clue about SSE2. You can just concatenate rl and rh, but you will need a 128-bit sum to add rm. I did it with non-sse2 code; an addl followed by several adcl's.

For ECCp-109, everything was modulo a 109-bit number, so I could have split up the multiplicands into three 31-bit numbers and one 16-bit number (with several Athlon-friendly bit shifts), and then run that algorithm. You might try breaking your numbers up into 16-bit chunks stored in 32-bit words. Note that the multiplication algorithm works recursively; multiplying four 16-bit numbers takes nine multiplies. That's more than four, but could the extra space help avoid some bit shifts?

If you're interested in my non-SSE2 code, it's here. I'm still looking for Jay's code, but I'm not optimistic that I have it. And Jay's web site has disappeared. However, I recently found the former ECCp-109 project leader, Chris Monico, posting in the factoring forum. So if you're really interested in getting this code, you might contact him.
Ken_g6 is offline   Reply With Quote
Old 2005-01-13, 21:46   #5
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default

I'd just like to add some notes about calculating rm. You're adding two 32-bit numbers twice before multiplying, so the result could be as big as 66 bits before subtraction. You can do rl and rh in SSE2, in parallel (I've been studying a bit), but I think rm will have to be mostly non-sse2. Worse, I thought rm would be limited to 64 bits in the end, but it looks like it could be as big as 2*(2^32-1)^2, which is 65 bits.

Let a = pl+ph, and b = ql+qh. Split them into al, ah, bl, and bh. This is like splitting p and q except ah and bh are either 0 or 1, nothing else.

In the "grammar multiplication method", multiplying a and b looks something like this:
Code:
       [ah] [al] 
     * [bh] [bl]
----------------
       [ bl*al ]
       [bl]?ah
[ah*bh][al]?bh
So start with al*bl. This bit you could do in SSE2. Set another higher word hw = ah & bh; if either one is 0, the word is 0. Then add bl<<32 if ah != 0, and al<<32 if bh != 0. I'd suggest cmov's for this, except the P4 apparently isn't great at cmov's, so try branches too. Don't forget to adc hw,0 after each add.

And then you get to subtract rl and rh, and then add rm to them! Fun, huh?

Now you see why I would have preferred 31-bit numbers - in that case you can ignore anything that would go into hw because it does fall out in subtraction. But not, apparently with 32-bit numbers.
Ken_g6 is offline   Reply With Quote
Old 2005-05-17, 18:14   #6
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

22×2,939 Posts
Default

Don't know if this will be of any help, but perhaps just
for purposes of debugging, here's a simple 64x64=>128-bit
unsigned product macro in C - this assumes an unsigned 64-bit
integer type (typedef'ed to "uint64" here) is supported by the
compiler. Obvious optimizations would include:

a) Use direct 32-bit register accesses to obviate the shifts
that extract the low and high half of each 64-bit input;

b) Replace sum < (either addend) carry check with carry bit
check on architectures which support an explicit add-with-carry;

c) Implement 3-MUL variant if hardware integer MUL is slow.

Have fun,
-Ernst

Code:
/* Generic 128-bit product macros: represent the inputs as
x = a + b*2^32, y = c + d*2^32, and then do 4 MULs and a bunch of
adds-with-carry to get x*y = b*d*2^64 + (a*d + b*c)*2^32 + a*c .
Actual calling arguments are assumed to be 64-bit ints - user must
make sure this is true. Result written into lo, hi.
*/
#define MUL_LOHI(__x,__y,__lo,__hi)                        \
{                                                          \
    uint64 __a,__b,__c,__d,__ac,__ad,__bc;                 \
    /*__a = (__x) & (uint64)0x00000000ffffffff;*/          \
    __a = ((__x)<<32)>>32;                                 \
    __b =  (__x)>>32;                                      \
    /*__c = (__y) & (uint64)0x00000000ffffffff;*/          \
    __c = ((__y)<<32)>>32;                                 \
    __d =  (__y)>>32;                                      \
    /* Calculate 4 subproducts in order in which they are first used */\
    __ac = __a*__c;                                        \
    __bc = __b*__c;                                        \
    __hi = __b*__d;                                        \
    __ad = __a*__d;                                        \
    __lo  =  __ac;  /* use __lo to store copy of __ac */   \
    __ac +=         (__bc<<32);    __hi += (__ac < __lo);  \
    __lo  =  __ac + (__ad<<32);    __hi += (__lo < __ac);  \
    __hi += (__bc>>32) + (__ad>>32);                       \
}
ewmayer is offline   Reply With Quote
Old 2005-06-02, 13:26   #7
xenon
 
Jul 2004

5×7 Posts
Default

Thanks for the replies. I will be looking at every idea.

By the way, I have been inactive in programming for a while.
xenon is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Help from coders ET_ GPU Computing 5 2014-01-26 13:58
performance improvement with assembly bsquared Software 15 2010-09-28 19:00
Calling hotshot coders: NFS code challenge R.D. Silverman Programming 63 2006-10-09 05:48

All times are UTC. The time now is 04:11.


Fri Jul 7 04:11:28 UTC 2023 up 323 days, 1:40, 0 users, load averages: 1.58, 1.61, 1.41

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔