20050807, 13:26  #1 
Nov 2003
2^{2}×5×373 Posts 
And now for something completely the same....
Hello Everyone,
Here is a slightly different challenge. Improve the following code. It divides an unsigned multiprecision integer by and unsigned single precision integer. The numbers are in radix 2^30, i.e. int A[MPDIM] equals A = a_n 2^(30n) + a_{n1} 2^(30(n1)) + ....... the *length* is stored in A[0] and includes word 0, so single precision has A[0] = 2, double has A[0] = 3, etc. double precision means A[2] * 2^30 + A[1] triple means A[3] * 2^60 + A[2] 2^30 + A[1] etc. Note: most of the time is spent in divrem_asm(). Tricks like loop unrolling in div_single() don't matter much. SIZE(a) is a macro for a[0] /*******************************************************/ /* */ /* routine to do single precision divide */ /* */ /******************************************************/ small_div_single(a,b,result) unsigned int a[],b,result[]; { /* start of div_single */ register unsigned int i,rem,as,first; unsigned int x[2]; // note: for 1st div, rem starts at 0. can elim this div entirely if b > a[i] // do direct single/single division otherwise without calling // divrem. Useful if size is small. as = SIZE(a); first = a[as1]; if (b > first) // no division needed! { result[as1] = 0; // quotient = 0, rem is as assigned; rem = first; SIZE(result) = as1; } else { result[as1] = first/b; rem = rem  result[as1] *b; SIZE(result) = as; } for (i=as2; i>0; i) { divrem_asm(rem,a[i],b,x); result[i] = x[0]; rem = x[1]; } return(rem); } /* end of div_single */ /********************************************************/ /* */ /* compute (a*2^30 + b)/c and (a*2^30 + b) % */ /* assembler version */ /* */ /*******************************************************/ __inline void divrem_asm(a,b,c,d) unsigned int a,b,c,d[2]; { /* start of divrem_asm */ /* We could use a double length register fromthe mmx instruction set, */ /* however, the emmx instruction must be executedto clean up the */ /* FP registers every time we use mmx. Emmx is a very lengthy */ /* instruction compared to what we are doing here. */ _asm { /* edx:eax = (a << 30) + tempb; */ mov eax,a shl eax,30 mov edx,a shr edx,2 and edx,0x3fffffff add eax,b adc edx,0 /* Now divide a * (2^30) which resides in the register pair: edx:eax */ /* eax = edx:eax / c */ /* edx = edx:eax % c */ mov ecx,c /* d[x] is a pointer (moved ahead here for a little pentium optimization*/ mov edi,d div ecx /* d[0] = ((a << 30) + b) / c; */ mov DWORD PTR[edi],eax /* d[1] = ((a << 30) + b) % c; */ mov DWORD PTR[edi]+4,edx } // end _asm } /* end of divrem_asm */ 
20050807, 17:52  #2 
Sep 2002
296_{16} Posts 
; put a,b in edx:eax as a:b a*(2^30) + b (both a,b 30 bit)
; convert from radix 2^30 to 2^32 resulting in an unsigned 64 bit mov eax, b mov edx, a shl eax, 2 shrd eax, edx, 2 shr edx, 2 Last fiddled with by dsouza123 on 20050807 at 17:55 
20050808, 03:12  #3 
Jul 2004
5×7 Posts 
Again, some statistical distribution data is needed to optimize.
Say something about the length of dividend. For the divisor, I'm thinking of doing a bit population count and avoid div instruction, if bit population is small. But let's see how's the distribution first. Also, is one divisor going to be used multiple times? shr edx,2 and edx,0x3fffffff The "and" is obviously redundant. 
20050808, 12:04  #4  
Nov 2003
1110100100100_{2} Posts 
Quote:
The divisors are almost always small, i.e. less than 16 bits. Once the dividend reduces to double precision we have usually divided out the small primes, leaving a double precision number that is either prime or gets split with a version of QS optimized for 60 bit numbers. I've been looking at ways to avoid the idiv as well. Perhaps a floating point approach? 

20050808, 13:25  #5 
Jul 2004
23_{16} Posts 
I have no idea right now. I'll find a way later.
Just a try to make the code shorter. I can reduce some overhead by coding the entire function in assembly. Code:
unsigned int div_m1(unsigned int *a, unsigned int b, unsigned int *result) { register unsigned int rem, as; as = a[0]  1; // as = index to most significant word if (b > a[as]) // no division needed { result[as] = 0; // quotient = 0, rem is as assigned rem = a[as]; result[0] = as; // shrink size by one word } else { result[0] = a[0]; // quotient size = dividend size _asm { mov ecx, as mov eax, a lea eax, [eax+ecx*4] // compute pointer eax = a[as] mov eax, [eax] xor edx, edx // for MSWord divide, 32bit dividend div b mov rem, edx mov edx, result lea edx, [edx+ecx*4] // compute pointer edx = result[as] mov [edx], eax } } for (as=1; as>0; as) // start from next most significant word { _asm { mov eax, rem mov edx, eax shl eax, 30 shr edx, 2 // edx:eax loaded with last remainder mov ecx, as mov ebx, a lea ebx, [ebx+ecx*4] // compute pointer ebx = a[as] add eax, [ebx] adc edx, 0 // edx:eax prepared div b mov rem, edx mov edx, result lea edx, [edx+ecx*4] // compute pointer edx = result[as] mov [edx], eax } } return rem; } 
20050810, 14:05  #6 
Jul 2004
35_{10} Posts 
When you say radix 2^30, does that mean every limb is less than 2^30?
If that is the case, add eax,b adc edx,0 can be changed to or eax,b And yes, my "shortened code" is a little faster than the original. For 3 limbs, 170 vs 185 clocks. I guess another 10 clocks can be saved by writing the whole function in assembly. 
20050812, 10:45  #7 
Jul 2004
5·7 Posts 
Currently my code doesn't check for the special case of single precision.
This is to squeeze out every bit of performance. Constraints: divisor: [1, 2^301] For every dividend limb: [0, 2^301] minimum number of limbs: 2 (Length=3) maximum number of limbs: >1 billion, limited by memory Code:
Pentium III timing asm previous* 3limb 127 158 4limb 172 209 5limb 214 252 *previous refers to my "shortened" code. I don't think additional improvements can be done without avoiding DIV instruction. Look at the timings, 43 clocks per limb, while DIV instruction takes 39 clocks. Now the word "reciprocal" plays in my mind... 
20050815, 12:01  #8  
Nov 2003
2^{2}×5×373 Posts 
Quote:
(1) roundoff error. Let A be the dividend and d be the divisor. compute 1/d. This is accurate to +/ 1 ULP, ~ 2^53 for the Pentium. (i.e. without IEEE 80bit arithmetic) A partial dividend A can be as large as 2^601, so A*d is only accurate to +/ 128. This, however can be corrected. (2) Speed. My floating point code resembled: r = 1/d rem = 0 Then, for each limb a[i] A = rem * RADIX + a[i] quo = (int) (A * r) rem = A  quo * d This was almost 3 TIMES SLOWER than the all integer code under discussion. And this is without the code needed to correct for roundoff error. I see no point in trying to optimize this approach. An integer version would need to set: r = 2^60/d But now r is doubleprecision, so each loop iteration would involve multiplying two double precision numbers > slow! 

20050815, 12:57  #9 
Jul 2004
23_{16} Posts 
Yes, I tried myself. The floating point approach can't get it faster. But that's not too slow either, just about 1.2x the time taken. Floating point multiplication isn't really slow, it's moving data in and out between integer and FPU that slow things down.
Anyway, another idea. Is the dividend held constant and changing divisor? Or divisor held constant and changing dividend? Thinking about SIMD. So, are you able to make good use of my assembly code? Or am I just making rubbish? 
20050815, 16:54  #10  
Nov 2003
16444_{8} Posts 
Quote:
"experimental code". For code that I make public I want as little assembler as possible. 

20050817, 01:45  #11 
Jan 2005
Caught in a sieve
5×79 Posts 
If you have a constant dividend, there are certain tricks you can use to get the remainder faster, utilizing things like tables of remainder values. Tricks like this were used in the ECCP109 code I worked on once.
As for general division, there's an algorithm I've always wanted to try, but I don't think I ever have. First, place the single precision number the highest radix. Shift it down one bit until it's smaller than the divisor. Now subtract, put a bit in the result, and shift again. This will be terrible on P4, and I don't really have time to try it on any other platform, but at least it has no idiv's. 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
I'm not sure if this is prime or my CPUCompletely lost.  Unregistered  Information & Answers  4  20130410 07:09 
Factored vs. Completely factored  aketilander  Factoring  4  20120808 18:09 
M4219 completely factored?  WVU Mersenneer  Factoring  58  20110127 15:03 
Prime 95 Reccomended  Completely Lost  MarkJD  Information & Answers  10  20100819 17:31 
M673 completely factored  philmoore  Factoring  1  20030331 23:49 