#1
Dec 2008
Boycotting the Soapbox
2D0₁₆ Posts
I guess I got a little distracted...
Algorithm: Get a 12 bit-approx with rcpps then do one step w/quintic convergence. The whole thing has 21 instructions (no rcppd?!), so I think it should be possible to get it to complete in roughly 7 cycles (~3.5 cycles per double, i.e. 2x faster than the Intel MKL advertises) for in-cache arrays. As far as I can tell, the results match the ones provided by divpd. I think there is a way to compute accurate values for e^x-1 in around 8 cycles, so e.g. hyperbolic tangents should be computable in around 12.5 cycles/double. I think I can also get log(1+x) in under 16 cycles, so the inverse hyperbolic functions can probably be done in under 20 cycles. Code:
struct DIVIDE
{
// compute y/x
const __m128d operator() ( __m128d y, __m128d x ) const
{
__m128d t1, t2, t3; // three distinct temporaries - binding all three scratch outputs to one variable breaks the constraints
asm volatile(
"movaps %[r0],%[r3] \n"
"andps %[pdnotexp],%[r0] \n"// mantissa & sign
"orps %[pd1],%[r0] \n"// 1<=x<2
"andps %[pdexp],%[r3] \n"// exponent
"psubd %[pdexpbias],%[r3] \n"// normalize exponent
"movaps %[r0],%[r1] \n"// z
"cvtpd2ps %[r0],%[r0] \n"
"rcpps %[r0],%[r0] \n"// x (12-bit approx)
"cvtps2pd %[r0],%[r0] \n"
"mulpd %[r0],%[r1] \n"// z*x
"mulpd %[y],%[r0] \n"// y*x - more precision if done here!
"movaps %[pd1],%[r2] \n"
"subpd %[r1],%[r2] \n"// h=1-z*x
"movaps %[r2],%[r1] \n"// h
"mulpd %[r2],%[r2] \n"// h^2
"addpd %[r2],%[r1] \n"// h+h^2
"addpd %[pd1],%[r2] \n"// 1+h^2
"mulpd %[r0],%[r1] \n"// y*x*(h+h^2)
"mulpd %[r1],%[r2] \n"// y*x*(1+h^2)*(h+h^2)
"addpd %[r2],%[r0] \n"// y*x + y*x*(1+h^2)*(h+h^2) = y*x*(1+h+h^2+h^3+h^4)
"psubd %[r3],%[r0] \n"// subtract exponent (scale by 2^-e)
:
[r0]"+x"(x),
[r1]"=&x"(t1),
[r2]"=&x"(t2),
[r3]"=&x"(t3)
:
[y]"X"(y),
[pdexp]"X"(CONST::pdexp),
[pdnotexp]"X"(CONST::pdnotexp),
[pdexpbias]"X"(CONST::pdexpbias),
[pd1]"X"(CONST::pd1)
:
);
return x;
}
};
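For readers wondering where the quintic step comes from: the seed satisfies x = (1-h)/z with h = 1 - z*x, so the final polynomial gives y*x*(1+h+h^2+h^3+h^4) = (y/z)*(1-h^5). A minimal intrinsics sketch of the same polish step, leaving out the exponent-stripping trick (so it is only valid where x survives the round-trip through single precision, roughly 1e-38 < |x| < 1e38); function name is mine:

```c
#include <emmintrin.h>

/* Quintic-convergence divide y/x as in the post, minus the exponent
   manipulation.  h = 1 - x*r with r the ~12-bit rcpps seed;
   y*r*(1 + h + h^2 + h^3 + h^4) = (y/x)*(1 - h^5), and h^5 is
   ~2^-57 for the worst-case rcpps seed. */
static __m128d divide_quintic(__m128d y, __m128d x)
{
    const __m128d one = _mm_set1_pd(1.0);
    __m128d r  = _mm_cvtps_pd(_mm_rcp_ps(_mm_cvtpd_ps(x))); /* ~12-bit 1/x */
    __m128d h  = _mm_sub_pd(one, _mm_mul_pd(x, r));         /* h = 1 - x*r  */
    __m128d yr = _mm_mul_pd(y, r);                          /* y*r          */
    __m128d h2 = _mm_mul_pd(h, h);                          /* h^2          */
    __m128d p  = _mm_mul_pd(_mm_add_pd(one, h2),            /* (1+h^2) *    */
                            _mm_add_pd(h, h2));             /* (h+h^2)      */
    return _mm_add_pd(yr, _mm_mul_pd(yr, p)); /* y*r*(1+h+h^2+h^3+h^4) */
}
```

Note the same dataflow as the asm: the numerator is multiplied into the seed early, and the polynomial correction is added at the end.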
#2
"Ben"
Feb 2007
6674₈ Posts
I need some help understanding this.
I'm fine with single precision. I've successfully implemented multi-up division via rcpps and one stage of Newton-Raphson to get ~23 bits of precision. It runs approximately 4.5x faster than using "/" to divide a list of numbers by another list. My trouble comes when extending to double precision. The lack of an "rcppd" requires the cvtpd2ps/cvtps2pd round-trip as above, but everything gets all hosed up when I try it. I suspect the exponents/signs aren't being treated correctly. Maybe it's something as simple as setting the correct floating-point rounding mode, but I can't figure it out. The above code gets me close to understanding what needs to happen, but, for instance, what are [pdnotexp], [pdexpbias], and [pdexp], and where do they come from? Here is my code, which simply iterates through two vectors of random doubles, dividing each number from one list by one from the second: Code:
asm volatile(
    "xorl %%ecx, %%ecx             \n\t" /* init loop counter */
    "movaps (%4), %%xmm3           \n\t" /* bring in vector of 2's */
    "1:                            \n\t"
    "cmpl %%eax, %%ecx             \n\t" /* are we done? */
    "jge 2f                        \n\t"
    "movapd (%1, %%rcx, 8), %%xmm0 \n\t" /* bring in denominators */
    "movapd (%2, %%rcx, 8), %%xmm1 \n\t" /* bring in numerators */
    "cvtpd2ps %%xmm0, %%xmm0       \n\t" /* convert to single precision */
    "rcpps %%xmm0, %%xmm2          \n\t" /* approx 1/d */
    "cvtps2pd %%xmm2, %%xmm2       \n\t" /* convert back to double precision */
    "movapd %%xmm2, %%xmm4         \n\t" /* copy 1/d */
    "movapd %%xmm3, %%xmm5         \n\t" /* copy 2.0, begin NR stuff below */
    "mulpd %%xmm2, %%xmm0          \n\t" /* d * 1/d */
    "mulpd %%xmm0, %%xmm2          \n\t" /* d * 1/d * 1/d */
    "mulpd %%xmm5, %%xmm4          \n\t" /* 2.0 * 1/d */
    "subpd %%xmm2, %%xmm4          \n\t" /* 1/d = 2 * 1/d - d * 1/d^2 */
    "mulpd %%xmm4, %%xmm1          \n\t" /* n * 1/d */
    "movapd %%xmm1, (%3, %%rcx, 8) \n\t" /* move answers out */
    "addl $2, %%ecx                \n\t" /* next two divisions */
    "jmp 1b                        \n\t"
    "2:                            \n\t"
    :
    : "a"(numdiv), "r"(darray), "r"(narray), "r"(rarray), "r"(twoarray)
    : "rcx", "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "memory", "cc");
Instead of output like this (from control code using ordinary "/" operators):
Code:
0: 2.1128 * 13792587.0000 - 29141108.0000 = 0.0000
I get output like this:
Code:
0: -228535456420911741628038824726223177832726528.0000 * 13792587.0000 - 29141108.0000 = -3152095165270133574372109169366733124553529156960256.0000
#3
(loop (#_fork))
Feb 2006
Cambridge, England
1100100010111₂ Posts
You're doing mulpd %%xmm2, %%xmm0 where xmm0 contains the single-precision version of the denominators!
(I did have to step through this in gdb to see what was happening ...)

Last fiddled with by fivemack on 2012-01-05 at 08:36
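In intrinsics form, the corrected dataflow gives the single-precision detour its own register and leaves the original packed doubles untouched; a hedged sketch of one Newton-Raphson step (~23 bits, matching the single-iteration figure above), with a function name of my choosing:

```c
#include <emmintrin.h>

/* One NR step on an rcpps seed.  The single-precision detour lives
   entirely in r; d keeps the original packed doubles throughout,
   which is the fix fivemack points out. */
static __m128d div_nr1(__m128d n, __m128d d)
{
    __m128d two = _mm_set1_pd(2.0);
    __m128d r = _mm_cvtps_pd(_mm_rcp_ps(_mm_cvtpd_ps(d))); /* ~12-bit 1/d       */
    r = _mm_mul_pd(r, _mm_sub_pd(two, _mm_mul_pd(d, r)));  /* r*(2-d*r): ~23 bits */
    return _mm_mul_pd(n, r);                               /* n/d, approximately */
}
```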
#4
"Ben"
Feb 2007
110110111100₂ Posts
Perfect! Thank you, that did the trick. I need to learn how to operate gdb better.
Speedup also looks nice. Even when using 2 NR iterations to get ~48 bits of precision, it runs ~3.5x faster through the loop. Also looks promising as a substitute for integer division, as it runs ~1.8x faster than a for loop of integer divisions. Although I don't know what conversion penalties will do to that figure yet. Thanks again fivemack. - ben. |
#5
(loop (#_fork))
Feb 2006
Cambridge, England
14427₈ Posts
Quote:
Code:
disassemble main           (to find out the address of the start of the SSE2 code)
break *0x00000004000c839   (to set the breakpoint; you need the * otherwise it
                            thinks you're looking for a variable with a really
                            stupid name)
display/i $pc              (so that after each instruction it disassembles the
                            *next* one for me)
start
stepi                      (to run one single instruction)
print $xmm0
print $xmm2
stepi
print $xmm0
print $xmm2
until I figured out what was going on. If I were more competent I could have put 'display $xmm0' through 'display $xmm5' at the start and it would then print out the values of all the relevant XMM registers at each step.
#6
"Ben"
Feb 2007
DBC₁₆ Posts
Maybe not, but that sequence of commands displays more competence than I have. I've only ever really used gdb to run backtraces on segfaults and such. I'm vaguely hopeful that this multi-up division will someday make my squfof routine faster, by enabling more than one multiplier to be run in parallel.
#7
"Ben"
Feb 2007
2²·3·293 Posts
Quote:
#8
"Ben"
Feb 2007
DBC₁₆ Posts
Over lunch I played around with using this for integer division (pointwise division of each integer in two large random lists). Attached is code which does that, about 1.6x faster than a naive for loop of integer divisions.
I had to play around with a fudge factor to get this to work with all possible divisor sizes and ended up with +1/16 (followed by truncation). I didn't try to optimize register usage - it'd be nice to tweak it enough to use only the x86 SSE2 register set (i.e. xmm0 - xmm7) - although I think interleaving the inversion approximation of the high-half and low-half dwords might get a decent speedup (at the cost of using nearly all of the x86_64 SSE2 register set).

Last fiddled with by bsquared on 2012-01-05 at 19:34. Reason: 1/16, not 1/8...
#9
∂2ω=0
Sep 2002
República de California
19·613 Posts
Ben, I suggest replacing the core NR computation sequence with something like the following. This assumes the packed numerators and denominators are in xmm1 and xmm0 (as in your code snippet above), the constant 2.0 in PD form is in xmm7, and xmm6 contains a 64-bit-wide bitmask 0x8000... in PD form for negating a paired double - follow the code flow to see why this is useful:
Code:
"cvtpd2ps %%xmm0, %%xmm2 \n\t" /* convert d to sp, output in xmm2; keep original in xmm0 for negation */
"xorpd    %%xmm6, %%xmm0 \n\t" /* -d */
"rcpps    %%xmm2, %%xmm2 \n\t" /* ainv := approx 1/d to 11-12 bits of precision */
"cvtps2pd %%xmm2, %%xmm2 \n\t" /* convert approx-inverse back to double precision */
"movaps   %%xmm2, %%xmm3 \n\t" /* copy ainv - this is a PD but movaps saves an instruction byte */
"mulpd    %%xmm0, %%xmm2 \n\t" /* -d*ainv, overwrites ainv */
"addpd    %%xmm7, %%xmm2 \n\t" /* 2 - d*ainv, overwrites ainv */
"mulpd    %%xmm3, %%xmm2 \n\t" /* xmm2 = ainv*(2 - d*ainv) = 1/d accurate to ~23 bits */
"movaps   %%xmm2, %%xmm3 \n\t" /* do one more NR iteration by repeating previous 4 instructions: */
"mulpd    %%xmm0, %%xmm2 \n\t"
"addpd    %%xmm7, %%xmm2 \n\t"
"mulpd    %%xmm3, %%xmm2 \n\t" /* xmm2 = 1/d accurate to ~46 bits */
"mulpd    %%xmm2, %%xmm1 \n\t" /* overwrite numerator with desired quotient n/d */
A variant with the sign-mask and numerator taken from memory (pointers in rax and rbx) instead of registers:
Code:
"cvtpd2ps %%xmm0, %%xmm1 \n\t" /* convert d to sp, output in xmm1; keep original in xmm0 for negation */
"xorpd    (%%rax), %%xmm0 \n\t" /* -d */
"rcpps    %%xmm1, %%xmm1 \n\t" /* ainv := approx 1/d to 11-12 bits of precision */
"cvtps2pd %%xmm1, %%xmm1 \n\t" /* convert approx-inverse back to double precision */
"movaps   %%xmm1, %%xmm2 \n\t" /* copy ainv */
"mulpd    %%xmm0, %%xmm1 \n\t" /* -d*ainv, overwrites ainv */
"addpd    %%xmm7, %%xmm1 \n\t" /* 2 - d*ainv */
"mulpd    %%xmm2, %%xmm1 \n\t" /* xmm1 = ainv*(2 - d*ainv) = 1/d accurate to ~23 bits */
"movaps   %%xmm1, %%xmm2 \n\t" /* one more NR iteration: repeat previous 4 instructions */
"mulpd    %%xmm0, %%xmm1 \n\t"
"addpd    %%xmm7, %%xmm1 \n\t"
"mulpd    %%xmm2, %%xmm1 \n\t" /* xmm1 = 1/d accurate to ~46 bits */
"mulpd    (%%rbx), %%xmm1 \n\t" /* multiply by numerator to get desired quotient n/d */
With 4 inputs at a time (i.e. 4 numerators and 4 denominators), it is worth seeing whether packing the sp approximations of all 4 denominators into a single register (via a pair of cvtpd2ps and some high/low-half manipulations of the 2 respective results), then doing a single rcpps and one Newton iteration on that before back-converting to double for the final iteration, pays off.

Lastly, the above 2 snippets do 2 full iterations at quadratic convergence - one could consider doing the second iteration at cubic convergence, if something close to the maximal 53 bits of precision is desired. To estimate how many registers that might need, let's consider the respective algorithm in pseudocode form. The Newton-Raphson iteration for successively approximating x = 1/c is

x_{n+1} = x_n*(2 - c*x_n).
My implementation above keeps -c and the constant 2 in registers (drop the iteration subscript now) and proceeds as

Code:
z = x;   // copy inv
x *= -c; // -c*x
x += 2;  // 2 - c*x
x *= z;  // improved approximation to 1/c
To derive a third-order scheme, it helps (or at least it helps me) to recall where the above second-order scheme comes from. NR for roots of f(x) iterates as

x_{n+1} = x_n - f(x_n)/f'(x_n).

If we want the inverse of a constant c, the obvious target function to express this as a root-finding problem is f(x) := c - 1/x. Then f' = 1/x^2, yielding the iteration formula

x_{n+1} = x - (c*x^2 - x) = 2x - c*x^2 = x*(2 - c*x).

I tried applying Halley's cubic-convergence formula (Philos. Trans. Roy. Soc. London, 18 (1694), 136-145) to the same target function (all derivatives on the rhs are evaluated at the current iterate x_n):

x_{n+1} = x_n - (f*f')/[(f')^2 - f*f''/2]

Substituting f = c - 1/x, f' = 1/x^2, f'' = -2/x^3 gives

x_{n+1} = x - (c/x^2 - 1/x^3)/[1/x^4 + (c/x^3 - 1/x^4)] = x - (c/x^2 - 1/x^3)/(c/x^3) = x - (x - 1/c) = 1/c,

which is annoying because 1/c - the very inverse we are trying to approximate without use of explicit inversion - appears as the result of the iteration formula. I also tried a standard 4th-order method,

x_{n+1} = x_n - f*[(f')^2 - f*f''/2]/[(f')^3 - f*f'*f'' + f^2*f'''/6],

with the same result: cancellation of terms causes the rhs to reduce to the thing we seek, 1/c. Crandall gives the following quartic scheme for iterative inversion, but does not give its derivation (nor a 3rd-order analog):

x_{n+1} = x_n*(2 - c*x_n)*[1 + (1 - c*x_n)^2].

Perhaps a usable 3rd-order scheme requires a more-adroit choice of target function.

Last fiddled with by ewmayer on 2012-01-06 at 01:05
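For what it's worth, Crandall's quartic scheme yields to a direct error substitution: writing the current iterate as x = (1-e)/c gives c*x = 1-e, so (2 - c*x) = 1+e and (1 - c*x)^2 = e^2, and the product telescopes to x_{new} = (1/c)(1-e)(1+e)(1+e^2) = (1/c)(1 - e^4), i.e. genuinely quartic. A quick numeric check of the error exponent (constants arbitrary, function name mine):

```c
#include <math.h>

/* One step of Crandall's quartic reciprocal iteration
   x_{n+1} = x*(2 - c*x)*(1 + (1 - c*x)^2).
   Seeding with relative error e should leave error e^4. */
static double crandall_step(double c, double x)
{
    double h = 1.0 - c * x; /* h equals the seed's relative error e */
    return x * (2.0 - c * x) * (1.0 + h * h);
}
```

Seeding with x = (1 - 1e-3)/c leaves a relative error of about 1e-12 after one step, consistent with e^4.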
#10
"Ben"
Feb 2007
110110111100₂ Posts
Quote:
Yes, but this sort of thing is on the back burner for now. The impact of 2 rounds of NR vs. 0 rounds was less than 10%, so the benefit of 1 NR + cubic might be negligible compared to just doing a 3rd round of NR (if necessary). __HRB__ in the OP has code for 1 round of quintic convergence, but I don't know how that was derived. 4th-order Householder method, maybe?

Last fiddled with by bsquared on 2012-01-06 at 02:49
#11
"Ben"
Feb 2007
2²·3·293 Posts
Quote:
Code:
test 2x simd double precision division via inverse-approx and newton-raphson iteration:
time for 0x newton-raphson simd division = 22.0409, 0.00 % correct
time for 1x newton-raphson simd division = 22.4083, 21.47 % correct
time for 2x newton-raphson simd division = 24.0490, 100.00 % correct
time for (ewmayer) 2x newton-raphson simd division = 23.7880, 100.00 % correct
time for serial division = 67.3021, 100.00 % correct

test 2x simd integer division via inverse-approx and newton-raphson iteration:
time for 2x newton-raphson simd division = 29.4001, 100.00 % correct
time for interleaved 2x newton-raphson simd division = 26.2557, 100.00 % correct
time for (ewmayer) interleaved 2x newton-raphson simd division = 25.9254, 100.00 % correct
time for serial division = 40.0531, 100.00 % correct
I'll play around next with the single precision stuff, and with extending precision up to the full 53 bits. Hopefully sometime this weekend. Thanks again for your help! - ben.

Last fiddled with by bsquared on 2012-01-06 at 05:14. Reason: more info