mersenneforum.org  

Computer Science & Computational Number Theory

2008-03-26, 14:47   #1
fivemack

Efficient modular reduction with SSE

The vague idea here is to do four parallel NTTs, say modulo the primes {51,54,60,63}*2^25+1. The operations inside the NTT are ring operations, so you can do them in a vectorised fashion; you have to do a CRT step before carry propagation. The primes are less than 2^31 so modular addition and subtraction are trivial.
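
For concreteness, a minimal SSE2 sketch of that "trivial" branchless modular add (a sketch, not from the thread), relying on the stated p < 2^31 so that a+b-p fits in a signed 32-bit lane:
Code:
#include <emmintrin.h>

/* Four modular adds at once, valid for p < 2^31: compute a + b - p,
   then add p back in exactly the lanes that went negative (the
   arithmetic right shift turns the sign bit into an all-ones mask). */
static __m128i add_mod_p(__m128i a, __m128i b, __m128i p)
{
    __m128i t = _mm_sub_epi32(_mm_add_epi32(a, b), p);
    __m128i m = _mm_srai_epi32(t, 31);
    return _mm_add_epi32(t, _mm_and_si128(m, p));
}
Subtraction works the same way: form a - b and add p back wherever the lane went negative.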

Transform length is 2^25, and I think you can have a digit size of 48 bits before worrying about round-off error (since the product of the primes is greater than 2^(96+25)), so that's exact arithmetic on 800Mbit integers.

But is there a nice way to do AB mod p, with A, B, p 32-bit numbers, without having to do a full 64x64 multiply of the intermediate product AB by (2^94/p) using a total of nine pmuludq instructions and more shifts than the human brain or the permute unit can stand?
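
For reference, one standard way to avoid a full-width multiply by 2^94/p is Montgomery multiplication, which needs only three 32x32->64 products per mulmod at the cost of keeping residues in Montgomery form. A scalar sketch (not from the thread; pinv is the precomputed value -p^(-1) mod 2^32):
Code:
#include <stdint.h>

/* Montgomery multiplication mod p < 2^31 with R = 2^32.
   Inputs and output are in Montgomery form (x*R mod p). */
static uint32_t montmul(uint32_t a, uint32_t b, uint32_t p, uint32_t pinv)
{
    uint64_t t = (uint64_t)a * b;              /* full product, < 2^62        */
    uint32_t m = (uint32_t)t * pinv;           /* m = -t/p mod 2^32           */
    uint64_t u = (t + (uint64_t)m * p) >> 32;  /* t + m*p divisible by 2^32   */
    return (uint32_t)(u >= p ? u - p : u);     /* u < 2p: one conditional sub */
}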

2008-03-26, 16:04   #2
fivemack

My current quad 64x64->128 multiply (input in four SSE2 registers, output in four SSE2 registers) takes 104 cycles on K8 and 117 on core2. This is an age, and in particular is significantly longer than using the integer multiplier four times. Has anyone got a notably better one?

The code is of a reasonably parallel shape, so I have no idea why it takes quite so long - there are 85 instructions, and lots of them can be executed in parallel.
Attached Files: kitten4.cpp.txt (5.5 KB)
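
For reference, the decomposition the quad routine is vectorising, written as a plain scalar sketch (not the attached code): split each operand into 32-bit halves, form four partial products, and reassemble with one explicit carry.
Code:
#include <stdint.h>

/* Portable 64x64 -> 128-bit unsigned multiply from four 32x32 -> 64
   partial products; the SSE2 code computes four of these side by side. */
static void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t ll = a0 * b0;           /* bits   0..63  */
    uint64_t lh = a0 * b1;           /* bits  32..95  */
    uint64_t hl = a1 * b0;           /* bits  32..95  */
    uint64_t hh = a1 * b1;           /* bits  64..127 */

    uint64_t mid = lh + (ll >> 32);  /* cannot wrap: (2^32-1)^2 + (2^32-1) < 2^64 */
    mid += hl;                       /* this add can wrap...                      */
    uint64_t carry = (mid < hl);     /* ...so recover the carry by comparison     */

    *lo = (mid << 32) | (uint32_t)ll;
    *hi = hh + (mid >> 32) + (carry << 32);
}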

2008-03-26, 18:51   #3
rogue

Quote:
Originally Posted by fivemack
My current quad 64x64->128 multiply (input in four SSE2 registers, output in four SSE2 registers) takes 104 cycles on K8 and 117 on core2. This is an age, and in particular is significantly longer than using the integer multiplier four times. Has anyone got a notably better one?

The code is of a reasonably parallel shape, so I have no idea why it takes quite so long - there are 85 instructions, and lots of them can be executed in parallel.
Talk to Geoff Reynolds (geoff on this forum). He has done a lot of optimizations for Intel and his work for the GFN search might be applicable here.

2008-03-26, 19:28   #4
ewmayer

Having only had a cursory glance at Tom's sample code, a few comments:

- In 64-bit mode, you'd need just *one* scalar MUL operation, which can complete every other cycle, so the SSE variant looks even worse there.

- That's an awful lot of unpacking/shuffling/repacking going on. If possible, consider rearranging the data, e.g. by doing 4 such wide-MULs in parallel, to make the SSE code look much more like the scalar analog, thus eliminating the data-shuffling (see the sketch after this list). This will cause register spills, but one can just write data back to the original array slots or to local storage as needed - trading a few register spills/fills for tens of in-place shuffles should be well worth it.

- You're using intrinsics - in my experience that often leads to dismal performance. If you have access to Visual Studio under 32-bit Windows, I suggest trying some raw inline assembly in lieu of intrinsics - VS has a very simple inline ASM syntax relative to e.g. GCC. Or if you're an old hand at GCC inline ASM, try that.

- The Penryn version of Core2 will add improved vectorized integer MUL instructions by way of SSE4.1, e.g. PMULDQ. Together with the above shuffle-less parallel wide-mul variant, you should be able to get the per-128-bit-product pipelined throughput to something on the order of 10 cycles or better. Still worse than the 2 cycles in 64-bit mode, but quite competitive with the scalar version in 32-bit mode.
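
A minimal sketch of the layout idea from the second bullet (names hypothetical, not Tom's code): store the 32-bit operand halves pre-spread into the even lanes, so the unpacking happens once when the data is written and each PMULUDQ issues with no shuffles at all.
Code:
#include <emmintrin.h>

/* Alo/Ahi/Blo/Bhi each hold one 32-bit operand half per 64-bit slot
   (already spread into even lanes), so the four 32x32 subproducts of
   two 64x64 MULs need no unpack/shuffle work here at all. */
static void subproducts_2way(const __m128i *Alo, const __m128i *Ahi,
                             const __m128i *Blo, const __m128i *Bhi,
                             __m128i *ll, __m128i *lh,
                             __m128i *hl, __m128i *hh)
{
    __m128i alo = _mm_load_si128(Alo), ahi = _mm_load_si128(Ahi);
    __m128i blo = _mm_load_si128(Blo), bhi = _mm_load_si128(Bhi);

    *ll = _mm_mul_epu32(alo, blo);  /* two a0*b0 products */
    *lh = _mm_mul_epu32(alo, bhi);  /* two a0*b1 products */
    *hl = _mm_mul_epu32(ahi, blo);  /* two a1*b0 products */
    *hh = _mm_mul_epu32(ahi, bhi);  /* two a1*b1 products */
}
Two such calls cover four wide MULs; the subproducts still need the carry reassembly, which is where the remaining shifts live.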

2008-03-26, 19:56   #5
fivemack

I *am* doing four wide-MULs in parallel; I realise there's an awful lot of shuffling going on, but it's all to get things lined up correctly for 64-bit SSE2 arithmetic. PMULUDQ multiplies slots 0 and 2 of one register by slots 0 and 2 of another, putting the results into slots 0-1 and 2-3 of the output, which is why I need the initial eight unpack instructions; the rest of the shifting is needed to stop carries falling off the top of registers.

Ah, yes, if I work less generally (I only really need the top 64 bits of the product for the division) I can avoid the final shuffling about and just return sumzw231 and sumzw232.

I checked that the output from gcc-4.1.2 looked pretty much like how I'd write the assembly language if I had enough brain to follow the register allocations; I'd rather be writing this in a nice macro assembler, but I don't know of one clever enough to let me write in A->B+C style and sort out the copying required for the annoying x86 destroy-one-input instructions.

I don't think PMULDQ helps - signed multiplies are no use for multi-precision arithmetic.

The code is anyway twice as fast as I thought it was because I'd forgotten that RDTSC takes perceptible time; looping a thousand times takes about 55k cycles.

2008-03-26, 21:12   #6
fivemack

OK, SSE is obviously the wrong answer here; a simple implementation using standard 64-bit operations (and magic-reciprocals for division) takes under 25 cycles to do four 32x32%32->32 operations.
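
A scalar sketch of the magic-reciprocal mulmod (a reconstruction of the idea, not the timed code; assumes a 64x64->128 multiply such as GCC's unsigned __int128): precompute m = floor(2^64/p) once per prime, then each reduction is two multiplies and a conditional subtract.
Code:
#include <stdint.h>

typedef unsigned __int128 uint128;  /* GCC/Clang extension */

/* a*b mod p for 32-bit a, b < p, with m = floor(2^64 / p) precomputed.
   Since a*b < 2^62, the quotient estimate q is at most one too small,
   so r < 2p and a single conditional subtract finishes the job. */
static uint32_t mulmod_magic(uint32_t a, uint32_t b, uint32_t p, uint64_t m)
{
    uint64_t x = (uint64_t)a * b;
    uint64_t q = (uint64_t)(((uint128)x * m) >> 64);  /* ~ floor(x/p) */
    uint64_t r = x - q * p;
    return (uint32_t)(r >= p ? r - p : r);
}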

2008-03-26, 22:40   #7
ewmayer

Quote:
Originally Posted by fivemack
OK, SSE is obviously the wrong answer here; a simple implementation using standard 64-bit operations (and magic-reciprocals for division) takes under 25 cycles to do four 32x32%32->32 operations.
In 64-bit mode, I agree - but in 32-bit mode, I'm thinking that maybe trying to do *every* part of the 128-bit product algorithm in SSE mode is the real problem: SSE mode is good for the 32x32-bit subproducts, but is a pain for reassembling them with the needed carries. OTOH, non-SSE unsigned-compares and adds should be efficient for doing the reassembly. I still think ~10 pipelined cycles is achievable, but need to work through the details of the above hybrid approach to prove or disprove that timing estimate.
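
The scalar reassembly idiom in question is just an unsigned compare per add (a sketch, not from the post): after an unsigned add wraps, the sum is smaller than either addend.
Code:
#include <stdint.h>

/* Recover the carry out of a 64-bit add with one compare; chaining
   these propagates carries through the reassembled 128-bit product. */
static uint64_t add_carry(uint64_t a, uint64_t b, uint64_t *carry)
{
    uint64_t s = a + b;
    *carry = (s < a);  /* 1 iff the add overflowed 64 bits */
    return s;
}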

2008-03-27, 13:34   #8
jasonp

Quote:
Originally Posted by fivemack
But is there a nice way to do AB mod p, with A, B, p 32-bit numbers, without having to do a full 64x64 multiply of the intermediate product AB by (2^94/p) using a total of nine pmuludq instructions and more shifts than the human brain or the permute unit can stand?
I've been adding a lot of these optimizations to the NTTs in GMP-ECM. See the latest SVN distribution; maybe something there can help.

2008-03-27, 14:20   #9
fivemack

Thanks for that pointer - I'd forgotten that ECM stage 2 is an application of large Fourier transforms. Indeed you're doing exactly what I thought of doing, but with the NTTs mod the different small primes run sequentially rather than concurrently, and the butterfly code looks very familiar.

I need to read Jörg's book more carefully; I still don't quite understand in-place Fourier transforms, so my FFT spends most of its time in
Code:
left[u]=input[v++]; right[u++]=input[v++];
and finds itself doing an awful lot of mallocs.
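
For comparison, a minimal in-place radix-2 decimation-in-frequency pass over Z/pZ (a sketch, with assumed names: w[k] holds omega^k for a primitive n-th root omega): the butterflies overwrite the data where it sits, so there are no left/right scratch arrays and no mallocs.
Code:
#include <stddef.h>
#include <stdint.h>

/* Placeholder modular multiply; a real NTT would use the reciprocal or
   Montgomery tricks discussed earlier in the thread. */
static uint32_t mulmod(uint32_t a, uint32_t b, uint32_t p)
{
    return (uint32_t)(((uint64_t)a * b) % p);
}

/* One DIF pass with butterfly span `half` (n/2 on the first pass, then
   n/4, ..., 1).  Running all the passes leaves the transform in
   bit-reversed order, which a DIT inverse can consume directly. */
static void dif_pass(uint32_t *x, size_t n, size_t half,
                     const uint32_t *w, uint32_t p)
{
    size_t stride = n / (2 * half);                  /* twiddle stride this pass */
    for (size_t blk = 0; blk < n; blk += 2 * half) {
        for (size_t j = 0; j < half; j++) {
            uint32_t a = x[blk + j];
            uint32_t b = x[blk + j + half];
            uint32_t s = a + b;
            if (s >= p) s -= p;                      /* a + b mod p */
            uint32_t d = a - b + ((a < b) ? p : 0);  /* a - b mod p */
            x[blk + j]        = s;
            x[blk + j + half] = mulmod(d, w[j * stride], p);
        }
    }
}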

2008-04-01, 14:01   #10
jasonp

Quote:
Originally Posted by fivemack
I need to read Jörg's book more carefully; I still don't quite understand in-place Fourier transforms, so my FFT spends most of its time in
Code:
left[u]=input[v++]; right[u++]=input[v++];
and finds itself doing an awful lot of mallocs.
The decimation-in-frequency version is just a matrix-vector product, where the matrix is a chain of factorized Fourier matrices. The decimation-in-time version is another matrix-vector product, where the matrix is the transpose of that chain. You can use this formulation to understand the butterfly structure as well. Nussbaumer's introductory book on the subject, by now quite old, is also a good reference.
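
In symbols, the factorization being described is the standard Cooley-Tukey one (notation assumed, not from the post). One decimation-in-time step of F_{2m} reads
Code:
F_{2m} = (F_2 \otimes I_m)
         \begin{pmatrix} I_m & \\ & \Omega_m \end{pmatrix}
         (I_2 \otimes F_m) \, \Pi_{2m},
\qquad
\Omega_m = \mathrm{diag}(1, \omega, \ldots, \omega^{m-1})
where Pi_{2m} is the even/odd sort permutation and omega is a primitive 2m-th root of unity. Since the DFT matrix is symmetric, transposing the chain reverses the factor order and yields the decimation-in-frequency version, with the permutation moving to the other end as the final bit-reversal.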

2008-04-01, 19:49   #11
ewmayer

My Mlucas code does a forward DIF transform with radices [r1,r2,...,rn], followed by an inverse DIT transform with the same radices in reverse order. This allows for a nice transpose-and-bit-reversal-less in-place implementation. Lots of details - literally years' worth of coding and optimization work - buried in that brief description, however.

Getting back to the original thread topic, here's a crude first-try sketch of an SSE-based 64x64->128-bit unsigned integer MUL. This is mostly SSE2, but it uses a few of the new SSE4.2 instructions [e.g. pcmpgtq - for the life of me I can't understand why this and some other very basic quadword instructions were not implemented until 4.2], so the big caveat is that I have no actual hardware to try it out on - there are probably a couple of glaring bugs in there. But it gives at least a rough idea of what this might look like in SSE4.2 - fuse a couple of these together [for instance to allow 2 64-bit middle-product parts to be processed side-by-side] and I still think 10 cycles per 128-bit result is reasonable. This uses the MSVC inline-asm syntax:
Code:
__asm [instruction] dest, src, [imm8]
, i.e. destination before source - the opposite of GCC's operand order. Comments and bug reports welcome. ;)
Code:
/* 64x64->128-bit unsigned integer MUL using SSE2

Inputs in adjacent 64-bit memlocs A = [a0,a1], B = [b0,b1], a0,a1,b0,b1 are 32-bit unsigned int, in memory we have a 128-bit chunk with a0 at the low address, assume 128-bit aligned:
*/
	uint64 *ABinputs;	/* Make sure this is 16-byte aligned */
	...
	[allocate local storage for ABinputs of at least 2 64-bit elements; init them]
	...
        __asm   mov     eax, ABinputs    /* Pointer to input-pair memloc */
        __asm   movaps  xmm0,[eax]       /* xmm[3:2:1:0] = [b1,b0,a1,a0] */
        __asm   pshufd  xmm1,xmm0,0x10   /* Move xmm0[1:0] = [a1,a0] into xmm1[2:0] to get xmm1 = [--,a1,--,a0], slots 1 and 3 are don't-cares */
        __asm   pshufd  xmm2,xmm0,0x32   /* Move xmm0[3:2] = [b1,b0] into xmm2[2:0] to get xmm2 = [--,b1,--,b0], slots 1 and 3 are don't-cares */
        __asm   pmuludq xmm2,xmm1        /* xmm2 = [a1*b1,a0*b0] */
        __asm   pshufd  xmm0,xmm0,0x23   /* Move xmm0[3:2] = [b1,b0] into slots 0 and 2 to get xmm0 = [--,b0,--,b1], slots 1 and 3 are don't-cares */
        __asm   pmuludq xmm0,xmm1        /* xmm0 = [a1*b0,a0*b1] */
        __asm   pshufd  xmm2,xmm2,0x39   /* Circular right-shift the contents of xmm2 by one 32-bit word: xmm[3:2:1:0] <- xmm[0:3:2:1] */
        __asm   movq    xmm1,xmm0        /* xmm1.hi = 0; xmm1.lo = a0*b1 */
        __asm   psrldq  xmm0,0x8         /* xmm0.hi = 0; xmm0.lo = a1*b0 */
        __asm   paddq   xmm2,xmm1        /* Add a0*b1 to middle 64 bits of accumulator and check for carryout */
        __asm   pcmpgtq xmm1,xmm2        /* xmm1 = 0xff... if overflow occurred, 0 otherwise. This NEEDS SSE4.2! */
        __asm   paddq   xmm2,xmm0        /* Add a1*b0 to middle 64 bits of accumulator and check for carryout */
        __asm   pcmpgtq xmm0,xmm2
        __asm   psubq   xmm0,xmm1        /* Total carry, negated */
        __asm   pslldq  xmm0,0x8         /* Left-shift carry to set up folding into high 32 bits of result, which are currently in xmm2[2] */
        __asm   psubq   xmm2,xmm0        /* We know that the carryin cannot overflow the 32 high bits, since the full product must be < 2^128 */
        __asm   pshufd  xmm2,xmm2,0x93   /* Circular left-shift the contents of xmm2 by one 32-bit word: xmm[3:2:1:0] <- xmm[2:1:0:3] */
        /* Result is in xmm2 */
Edit: Indeed, there is at least one glaring bug in the above, namely that since pcmpgtq is a signed compare, each such instruction above must be preceded by an xorpd with 0x80...00 of each input operand (or copy thereof) in order to effect an unsigned-compare emulation as described in post #12 below.
