mersenneforum.org And now for something completely the same....
 Register FAQ Search Today's Posts Mark Forums Read

 2005-08-07, 13:26 #1 R.D. Silverman     Nov 2003 22×5×373 Posts And now for something completely the same.... Hello Everyone, Here is a slightly different challenge. Improve the following code. It divides an unsigned multi-precision integer by and unsigned single precision integer. The numbers are in radix 2^30, i.e. int A[MPDIM] equals A = a_n 2^(30n) + a_{n-1} 2^(30(n-1)) + ....... the *length* is stored in A[0] and includes word 0, so single precision has A[0] = 2, double has A[0] = 3, etc. double precision means A[2] * 2^30 + A[1] triple means A[3] * 2^60 + A[2] 2^30 + A[1] etc. Note: most of the time is spent in divrem_asm(). Tricks like loop unrolling in div_single() don't matter much. SIZE(a) is a macro for a[0] /*******************************************************/ /* */ /* routine to do single precision divide */ /* */ /******************************************************/ small_div_single(a,b,result) unsigned int a[],b,result[]; { /* start of div_single */ register unsigned int i,rem,as,first; unsigned int x[2]; // note: for 1st div, rem starts at 0. can elim this div entirely if b > a[i] // do direct single/single division otherwise without calling // divrem. Useful if size is small. as = SIZE(a); first = a[as-1]; if (b > first) // no division needed! { result[as-1] = 0; // quotient = 0, rem is as assigned; rem = first; SIZE(result) = as-1; } else { result[as-1] = first/b; rem = rem - result[as-1] *b; SIZE(result) = as; } for (i=as-2; i>0; i--) { divrem_asm(rem,a[i],b,x); result[i] = x[0]; rem = x[1]; } return(rem); } /* end of div_single */ /********************************************************/ /* */ /* compute (a*2^30 + b)/c and (a*2^30 + b) % */ /* assembler version */ /* */ /*******************************************************/ __inline void divrem_asm(a,b,c,d) unsigned int a,b,c,d[2]; { /* start of divrem_asm */ /* We could use a double length register fromthe mmx instruction set, */ /* however, the emmx instruction must be executedto clean up the */ /* FP registers every time we use mmx. Emmx is a very lengthy */ /* instruction compared to what we are doing here. */ _asm { /* edx:eax = (a << 30) + tempb; */ mov eax,a shl eax,30 mov edx,a shr edx,2 and edx,0x3fffffff add eax,b adc edx,0 /* Now divide a * (2^30) which resides in the register pair: edx:eax */ /* eax = edx:eax / c */ /* edx = edx:eax % c */ mov ecx,c /* d[x] is a pointer (moved ahead here for a little pentium optimization*/ mov edi,d div ecx /* d[0] = ((a << 30) + b) / c; */ mov DWORD PTR[edi],eax /* d[1] = ((a << 30) + b) % c; */ mov DWORD PTR[edi]+4,edx } // end _asm } /* end of divrem_asm */
 2005-08-07, 17:52 #2 dsouza123     Sep 2002 29616 Posts ; put a,b in edx:eax as a:b a*(2^30) + b (both a,b 30 bit) ; convert from radix 2^30 to 2^32 resulting in an unsigned 64 bit mov eax, b mov edx, a shl eax, 2 shrd eax, edx, 2 shr edx, 2 Last fiddled with by dsouza123 on 2005-08-07 at 17:55
 2005-08-08, 03:12 #3 xenon   Jul 2004 5×7 Posts Again, some statistical distribution data is needed to optimize. Say something about the length of dividend. For the divisor, I'm thinking of doing a bit population count and avoid div instruction, if bit population is small. But let's see how's the distribution first. Also, is one divisor going to be used multiple times? shr edx,2 and edx,0x3fffffff The "and" is obviously redundant.
2005-08-08, 12:04   #4
R.D. Silverman

Nov 2003

11101001001002 Posts

Quote:
 Originally Posted by xenon Again, some statistical distribution data is needed to optimize. Say something about the length of dividend. For the divisor, I'm thinking of doing a bit population count and avoid div instruction, if bit population is small. But let's see how's the distribution first. Also, is one divisor going to be used multiple times? shr edx,2 and edx,0x3fffffff The "and" is obviously redundant.
Typically the dividend has 3 to 5 limbs. (i.e. triple to pentuple precision)
The divisors are almost always small, i.e. less than 16 bits. Once the
dividend reduces to double precision we have usually divided out the
small primes, leaving a double precision number that is either prime
or gets split with a version of QS optimized for 60 bit numbers.

I've been looking at ways to avoid the idiv as well. Perhaps a
floating point approach?

 2005-08-08, 13:25 #5 xenon   Jul 2004 2316 Posts I have no idea right now. I'll find a way later. Just a try to make the code shorter. I can reduce some overhead by coding the entire function in assembly. Code: unsigned int div_m1(unsigned int *a, unsigned int b, unsigned int *result) { register unsigned int rem, as; as = a[0] - 1; // as = index to most significant word if (b > a[as]) // no division needed { result[as] = 0; // quotient = 0, rem is as assigned rem = a[as]; result[0] = as; // shrink size by one word } else { result[0] = a[0]; // quotient size = dividend size _asm { mov ecx, as mov eax, a lea eax, [eax+ecx*4] // compute pointer eax = a[as] mov eax, [eax] xor edx, edx // for MSWord divide, 32bit dividend div b mov rem, edx mov edx, result lea edx, [edx+ecx*4] // compute pointer edx = result[as] mov [edx], eax } } for (as-=1; as>0; as--) // start from next most significant word { _asm { mov eax, rem mov edx, eax shl eax, 30 shr edx, 2 // edx:eax loaded with last remainder mov ecx, as mov ebx, a lea ebx, [ebx+ecx*4] // compute pointer ebx = a[as] add eax, [ebx] adc edx, 0 // edx:eax prepared div b mov rem, edx mov edx, result lea edx, [edx+ecx*4] // compute pointer edx = result[as] mov [edx], eax } } return rem; } This is actually not for speed. I didn't do anything to speed up. However I might have reduced some register movements.
 2005-08-10, 14:05 #6 xenon   Jul 2004 3510 Posts When you say radix 2^30, does that mean every limb is less than 2^30? If that is the case, add eax,b adc edx,0 can be changed to or eax,b And yes, my "shortened code" is a little faster than the original. For 3 limbs, 170 vs 185 clocks. I guess another 10 clocks can be saved by writing the whole function in assembly.
2005-08-12, 10:45   #7
xenon

Jul 2004

5·7 Posts

Currently my code doesn't check for the special case of single precision.
This is to squeeze out every bit of performance.

Constraints:
divisor: [1, 2^30-1]
For every dividend limb: [0, 2^30-1]
minimum number of limbs: 2 (Length=3)
maximum number of limbs: >1 billion, limited by memory

Code:
Pentium III timing

asm    previous*
3-limb    127      158
4-limb    172      209
5-limb    214      252
*previous refers to my "shortened" code.
Please note that this code is for Visual C++ 6 only. The assembly code is not suitable for inline (using _asm keyword) in C code because the way esp is used to access arguments. Anyway, creating a function in C and then using _asm will not perform as good as linking functions written in assembly. The compiler generates some code to set up ebp which is not needed.

I don't think additional improvements can be done without avoiding DIV instruction. Look at the timings, 43 clocks per limb, while DIV instruction takes 39 clocks. Now the word "reciprocal" plays in my mind...
Attached Files
 xdiv1_vc.zip (36.2 KB, 162 views)

2005-08-15, 12:01   #8
R.D. Silverman

Nov 2003

22×5×373 Posts

Quote:
 Originally Posted by xenon Now the word "reciprocal" plays in my mind...
I tried it already in FP. There are two problems.

(1) round-off error. Let A be the dividend and d be the divisor.
compute 1/d. This is accurate to +/- 1 ULP, ~ 2^-53 for the
Pentium. (i.e. without IEEE 80-bit arithmetic)

A partial dividend A can be as large as 2^60-1, so A*d is only
accurate to +/- 128. This, however can be corrected.

(2) Speed. My floating point code resembled:

r = 1/d
rem = 0

Then, for each limb a[i]

A = rem * RADIX + a[i]
quo = (int) (A * r)
rem = A - quo * d

This was almost 3 TIMES SLOWER than the all integer code under discussion.
And this is without the code needed to correct for round-off error.

I see no point in trying to optimize this approach.

An integer version would need to set:

r = 2^60/d

But now r is double-precision, so each loop iteration would involve
multiplying two double precision numbers ---> slow!

 2005-08-15, 12:57 #9 xenon   Jul 2004 2316 Posts Yes, I tried myself. The floating point approach can't get it faster. But that's not too slow either, just about 1.2x the time taken. Floating point multiplication isn't really slow, it's moving data in and out between integer and FPU that slow things down. Anyway, another idea. Is the dividend held constant and changing divisor? Or divisor held constant and changing dividend? Thinking about SIMD. So, are you able to make good use of my assembly code? Or am I just making rubbish?
2005-08-15, 16:54   #10
R.D. Silverman

Nov 2003

164448 Posts

Quote:
 Originally Posted by xenon Yes, I tried myself. The floating point approach can't get it faster. But that's not too slow either, just about 1.2x the time taken. Floating point multiplication isn't really slow, it's moving data in and out between integer and FPU that slow things down. Anyway, another idea. Is the dividend held constant and changing divisor? Or divisor held constant and changing dividend? Thinking about SIMD. So, are you able to make good use of my assembly code? Or am I just making rubbish?
I have one version that uses your assembly code. I keep it for
"experimental code". For code that I make public I want as little assembler
as possible.

 2005-08-17, 01:45 #11 Ken_g6     Jan 2005 Caught in a sieve 5×79 Posts If you have a constant dividend, there are certain tricks you can use to get the remainder faster, utilizing things like tables of remainder values. Tricks like this were used in the ECCP-109 code I worked on once. As for general division, there's an algorithm I've always wanted to try, but I don't think I ever have. First, place the single precision number the highest radix. Shift it down one bit until it's smaller than the divisor. Now subtract, put a bit in the result, and shift again. This will be terrible on P4, and I don't really have time to try it on any other platform, but at least it has no idiv's.

 Similar Threads Thread Thread Starter Forum Replies Last Post Unregistered Information & Answers 4 2013-04-10 07:09 aketilander Factoring 4 2012-08-08 18:09 WVU Mersenneer Factoring 58 2011-01-27 15:03 MarkJD Information & Answers 10 2010-08-19 17:31 philmoore Factoring 1 2003-03-31 23:49

All times are UTC. The time now is 00:29.

Mon Apr 19 00:29:00 UTC 2021 up 10 days, 19:09, 0 users, load averages: 1.94, 2.23, 2.45