mersenneforum.org Efficient modular reduction with SSE
 Register FAQ Search Today's Posts Mark Forums Read

 2008-03-26, 14:47 #1 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 2×7×461 Posts Efficient modular reduction with SSE The vague idea here is to do four parallel NTTs, say modulo the primes {51,54,60,63}*2^25+1. The operations inside the NTT are ring operations, so you can do them in a vectorised fashion; you have to do a CRT step before carry propagation. The primes are less than 2^31 so modular addition and subtraction are trivial. Transform length is 2^25, and I think you can have a digit size of 48 bits before worrying about round-off error (since the product of the primes is greater than 2^(96+25)), so that's exact arithmetic on 800Mbit integers. But is there a nice way to do AB mod p, with A, B, p 32-bit numbers, without having to do a full 64x64 multiply of the intermediate product AB by (2^94/p) using a total of nine pmuludq instructions and more shifts than the human brain or the permute unit can stand?
2008-03-26, 16:04   #2
fivemack
(loop (#_fork))

Feb 2006
Cambridge, England

193616 Posts

My current quad 64x64->128 multiply (input in four SSE2 registers, output in four SSE2 registers) takes 104 cycles on K8 and 117 on core2. This is an age, and in particular is significantly longer than using the integer multiplier four times. Has anyone got a notably better one?

The code is of a reasonably parallel shape, so I have no idea why it takes quite so long - there are 85 instructions, and lots of them can be executed in parallel.
Attached Files
 kitten4.cpp.txt (5.5 KB, 730 views)

2008-03-26, 18:51   #3
rogue

"Mark"
Apr 2003
Between here and the

24·421 Posts

Quote:
 Originally Posted by fivemack My current quad 64x64->128 multiply (input in four SSE2 registers, output in four SSE2 registers) takes 104 cycles on K8 and 117 on core2. This is an age, and in particular is significantly longer than using the integer multiplier four times. Has anyone got a notably better one? The code is of a reasonably parallel shape, so I have no idea why it takes quite so long - there are 85 instructions, and lots of them can be executed in parallel.
Talk to Geoff Reynolds (geoff on this forum). He has done a lot of optimizations for Intel and his work for the GFN search might be applicable here.

 2008-03-26, 19:28 #4 ewmayer ∂2ω=0     Sep 2002 República de California 11,743 Posts Having only had a cursory glance at Tom's sample code, a few comments: - In 64-bit mode, you'd need just *one* scalar MUL operation, which can complete every other cycle, so the SSE variant looks even worse there. - That's an awful lot of unpack.shuffle/repacking going on. If possible, consider rearranging the data, e.g. by doing 4 such wide-MULs in parallel, to make the SSE code look much more like the scalar analog, thus eliminating the data-shuffling. This will cause register spills, but one can just write data to their original array slots or to local storage as needed - trading a few register spills/fills for tens of in-place shuffles should be well worth it. - You're using intrinsics - In my experience that often leads to dismal performance. If you have access to Visual Studio under 32-bit Windows, I suggest trying some raw inline assembly in lieu of intrinsics - VS has a very simple inline ASM syntax relative to e.g. GCC. Or if youi're an old hand at GCC inline ASM, try that. - The Penryn version of Core2 will support some improved vectorized integer MUL instruction by way of SSE4.1 support, e.g. PMULDQ. Together with the above shuffle-less parallel wide-mul variant you should be able to get the per-128-bit-product pipelined throughput to something on the order of 10 cycles or better. Still worse than the 2 cycles in 64-bit mode, but quite competitive with the scalar version in 32-bit mode.
 2008-03-26, 19:56 #5 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 2·7·461 Posts I *am* doing four wide-MULs in parallel; I realise there's an awful lot of shuffling going on, but it's for getting things lined up correctly for 64-bit SSE2 arithmetic. PMULUDQ multiplies slots 0 and 2 of one register by slots 0 and 2 of another putting the result into slots 01 and 23 of the output, which is why I need the initial eight unpack instructions, and the rest of the shifting is needed to get carries not to fall off the top of registers. Ah, yes, if I work less generally (I only really need the top 64 bits of the product for the division) I can avoid the final shuffling about and just return sumzw231 and sumzw232. I checked that the output from gcc-4.1.2 looked pretty much like how I'd write the assembly language if I had enough brain to follow the register allocations; I'd rather be writing this in a nice macro assembler, but I don't know of one clever enough to let me write in A->B+C style and sort out the copying required for the annoying x86 destroy-one-input instructions. I don't think PMULDQ helps - signed multiplies are no use for multi-precision arithmetic. The code is anyway twice as fast as I thought it was because I'd forgotten that RDTSC takes perceptible time; looping a thousand times takes about 55k cycles.
 2008-03-26, 21:12 #6 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 2·7·461 Posts OK, SSE is obviously the wrong answer here; a simple implementation using standard 64-bit operations (and magic-reciprocals for division) takes under 25 cycles to do four 32x32%32->32 operations.
2008-03-26, 22:40   #7
ewmayer
2ω=0

Sep 2002
República de California

11,743 Posts

Quote:
 Originally Posted by fivemack OK, SSE is obviously the wrong answer here; a simple implementation using standard 64-bit operations (and magic-reciprocals for division) takes under 25 cycles to do four 32x32%32->32 operations.
In 64-bit mode, I agree - but in 32-bit mode, I'm thinking that maybe trying to do *every* part of the 128-bit product algorithm in SSE mode is the real problem: SSE mode is good for the 32x32-bit subproducts, but is a pain for reassembling them with the needed carries. OTOH, non-SSE unsigned-compares and adds should be efficient for doing the reassembly. I still think ~10 pipelined cycles is achievable, but need to work through the details of the above hybrid approach to prove or disprove that timing estimate.

Last fiddled with by ewmayer on 2008-03-26 at 22:42

2008-03-27, 13:34   #8
jasonp
Tribal Bullet

Oct 2004

3,547 Posts

Quote:
 Originally Posted by fivemack But is there a nice way to do AB mod p, with A, B, p 32-bit numbers, without having to do a full 64x64 multiply of the intermediate product AB by (2^94/p) using a total of nine pmuludq instructions and more shifts than the human brain or the permute unit can stand?
I've been adding a lot of these optimizations to the NTTs in GMP-ECM. See the latest SVN distribution, maybe something there can help.

 2008-03-27, 14:20 #9 fivemack (loop (#_fork))     Feb 2006 Cambridge, England 2×7×461 Posts Thanks for that pointer - I'd forgotten that ECM stage 2 is an application of large Fourier transforms. Indeed you're doing exactly what I thought of doing, but with the NTTs mod the different small primes sequential rather than concurrent, and the butterfly code looks very familiar. I need to read Jorg's book more carefully, I still don't quite understand in-place Fourier transforms so my fft spends most of its time in Code: left[u]=input[v++]; right[u++]=input[v++]; and finds itself doing an awful lot of mallocs.
2008-04-01, 14:01   #10
jasonp
Tribal Bullet

Oct 2004

1101110110112 Posts

Quote:
 Originally Posted by fivemack I need to read Jorg's book more carefully, I still don't quite understand in-place Fourier transforms so my fft spends most of its time in Code: left[u]=input[v++]; right[u++]=input[v++]; and finds itself doing an awful lot of mallocs.
The decimation-in-frequency version is just a matrix-vector product, where the matrix is a chain of factorized Fourier matrices. The decimation-in-time version is another matrix-vector product, where the matrix is the transpose of the chain of factorized Fourier matrices. You can use that formulation to understand the butterfly structure as well. Nussbaumer has a good (by now quite old) introductory book on the subject as well.

Last fiddled with by jasonp on 2008-04-01 at 14:04 Reason: fix errors

 2008-04-01, 19:49 #11 ewmayer ∂2ω=0     Sep 2002 República de California 2DDF16 Posts My Mlucas code does a forward DIF transform with radices [r1,r2,...,rn], followed by an inverse DIT transform with the same radices in reverse order. This allows for a nice transpose-and-bit-reversal-less in-place implementation. Lots of details - literally years' worth of coding and optimization work - buried in that brief description, however. Getting back to the original thread topic, here's a crude first-try sketch of an SSE-based 64x64->128-bit unsigned integer MUL. This is mostly SSE2 but a few of the new SSE4.2 instructions [e.g. pcmpgtq - for the life of me I can't understand why this and some other very basic quadword instructions were not implemented until 4.2], so the big caveat is that I have no actual hardware to try it out on - probably a couple glaring bugs in there. But gives at least a rough idea of what this might look like in SSE4.2 - fuse a couple of these together [for instance to allow for 2 64-bit middle-product parts to be processed side-by-side] and I still think 10 cycles per 128-bit result is reasonable. This uses the MSVC inline-asm syntax: Code: __asm [instruction] dest, src, [imm8] , i.e. the opposite order of the source and destination operands as in GCC. Comments and bug reports welcome. ;) Code: /* 64x64->128-bit unsigned integer MUL using SSE2 Inputs in adjacent 64-bit memlocs A = [a0,a1], B = [b0,b1], a0,a1,b0,b1 are 32-bit unsigned int, in memory we have a 128-bit chunk with a0 at the low address, assume 128-bit aligned: */ uint64 *ABinputs; /* Make sure this is 16-byte aligned */ ... [allocate local storage for ABinputs of at least 2 64-bit elements; init them] ... __asm mov eax, ABinputs /* Pointer to input-pair memloc */ __asm movaps xmm0,[eax] /* xmm[3:2:1:0] = [b1,b0,a1,a0] */ __asm pshufd xmm1,xmm0,0x10 /* Move xmm0[1:0] = [a1,a0] into xmm1[2:0] to get xmm1 = [--,a1,--,a0], slots 1 and 3 are don't-cares */ __asm pshufd xmm2,xmm0,0x32 /* Move xmm0[3:2] = [a0,a1] into xmm2[2:0] to get xmm2 = [--,b1,--,b0], slots 1 and 3 are don't-cares */ __asm pmuludq xmm2,xmm1 /* xmm2 = [a0*b0,a1*b1] */ __asm pshufd xmm0,xmm0,0x23 /* Move xmm0[2:3] = [b1,b0] into xmm0[2:0] to get xmm0 = [b1,xx,b0,yy], slots 1 and 3 are don't-cares */ __asm pmuludq xmm0,xmm1 /* xmm0 = [a0*b1,a1*b0] */ __asm pshufd xmm2,xmm2,0x39 /* Circular right-shift the contents of xmm2 by one 32-bit word: xmm[3:2:1:0] <- xmm[0:3:2:1] */ __asm movq xmm1,xmm0 /* xmm1.hi = 0; xmm1.lo = a0*b1 */ __asm psrldq xmm0,0x8 /* xmm0.hi = 0; xmm0.lo = a1*b0 */ __asm paddq xmm2,xmm1 /* Add a0*b1 to middle 64 bits of accumulator and check for carryout */ __asm pcmpgtq xmm1,xmm2 /* xmm1 = 0xff... if overflow occurred, 0 otherwise. This NEEDS SSE4.2! */ __asm paddq xmm2,xmm0 /* Add a1*b0 to middle 64 bits of accumulator and check for carryout */ __asm pcmpgtq xmm0,xmm2 __asm psubq xmm0,xmm1 /* Total carry, negated */ __asm pslldq xmm0,0x8 /* Left-shift carry to set up folding into high 32 bits of result, which are currently in xmm2[2] */ __asm psubq xmm2,xmm0 /* We know that the carryin cannot overflow the 32 high bits, since the full product must be < 2^128 */ __asm pshufd xmm2,xmm2,0x93 /* Circular left-shift the contents of xmm2 by one 32-bit word: xmm[3:2:1:0] <- xmm[2:1:0:3] */ /* Result is in xmm2 */ Edit: Indeed, there is at least one glaring bug in the above, namely that since pcmpgtq is a signed compare, each such instruction above must be preceded by an xorpd with 0x80...00 of each input operand (or copy thereof) in order to effect an unsigned-compare emulation as described in post #12 below. Last fiddled with by ewmayer on 2013-04-02 at 01:46 Reason: add note about pcmpgtq being signed compare

 Similar Threads Thread Thread Starter Forum Replies Last Post paulunderwood Computer Science & Computational Number Theory 5 2017-06-09 14:02 BenR Computer Science & Computational Number Theory 2 2016-03-27 00:37 hj47 Software 11 2009-01-29 00:45 Prime95 PrimeNet 3 2008-11-17 19:57 R.D. Silverman Factoring 2 2005-08-03 13:55

All times are UTC. The time now is 01:45.

Thu Sep 29 01:45:51 UTC 2022 up 41 days, 23:14, 0 users, load averages: 1.24, 1.21, 1.16