mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2005-08-07, 13:26   #1
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Thumbs up And now for something completely the same....

Hello Everyone,

Here is a slightly different challenge. Improve the following code.
It divides an unsigned multi-precision integer by and unsigned single
precision integer. The numbers are in radix 2^30, i.e.

int A[MPDIM] equals

A = a_n 2^(30n) + a_{n-1} 2^(30(n-1)) + .......

the *length* is stored in A[0] and includes word 0, so single precision has
A[0] = 2, double has A[0] = 3, etc.

double precision means A[2] * 2^30 + A[1]
triple means A[3] * 2^60 + A[2] 2^30 + A[1]

etc.

Note: most of the time is spent in divrem_asm(). Tricks like loop
unrolling in div_single() don't matter much.

SIZE(a) is a macro for a[0]


/*******************************************************/
/* */
/* routine to do single precision divide */
/* */
/******************************************************/

small_div_single(a,b,result)
unsigned int a[],b,result[];

{ /* start of div_single */
register unsigned int i,rem,as,first;
unsigned int x[2];

// note: for 1st div, rem starts at 0. can elim this div entirely if b > a[i]
// do direct single/single division otherwise without calling
// divrem. Useful if size is small.

as = SIZE(a);
first = a[as-1];

if (b > first) // no division needed!
{
result[as-1] = 0; // quotient = 0, rem is as assigned;
rem = first;
SIZE(result) = as-1;
}
else
{
result[as-1] = first/b;
rem = rem - result[as-1] *b;
SIZE(result) = as;
}

for (i=as-2; i>0; i--)
{
divrem_asm(rem,a[i],b,x);
result[i] = x[0];
rem = x[1];
}
return(rem);

} /* end of div_single */



/********************************************************/
/* */
/* compute (a*2^30 + b)/c and (a*2^30 + b) % */
/* assembler version */
/* */
/*******************************************************/


__inline void divrem_asm(a,b,c,d)
unsigned int a,b,c,d[2];

{ /* start of divrem_asm */

/* We could use a double length register fromthe mmx instruction set, */
/* however, the emmx instruction must be executedto clean up the */
/* FP registers every time we use mmx. Emmx is a very lengthy */
/* instruction compared to what we are doing here. */

_asm {

/* edx:eax = (a << 30) + tempb; */

mov eax,a
shl eax,30
mov edx,a
shr edx,2
and edx,0x3fffffff

add eax,b
adc edx,0

/* Now divide a * (2^30) which resides in the register pair: edx:eax */
/* eax = edx:eax / c */
/* edx = edx:eax % c */

mov ecx,c

/* d[x] is a pointer (moved ahead here for a little pentium optimization*/
mov edi,d
div ecx

/* d[0] = ((a << 30) + b) / c; */
mov DWORD PTR[edi],eax

/* d[1] = ((a << 30) + b) % c; */
mov DWORD PTR[edi]+4,edx

} // end _asm

} /* end of divrem_asm */
R.D. Silverman is offline   Reply With Quote
Old 2005-08-07, 17:52   #2
dsouza123
 
dsouza123's Avatar
 
Sep 2002

29616 Posts
Default

; put a,b in edx:eax as a:b a*(2^30) + b (both a,b 30 bit)
; convert from radix 2^30 to 2^32 resulting in an unsigned 64 bit

mov eax, b
mov edx, a

shl eax, 2
shrd eax, edx, 2
shr edx, 2

Last fiddled with by dsouza123 on 2005-08-07 at 17:55
dsouza123 is offline   Reply With Quote
Old 2005-08-08, 03:12   #3
xenon
 
Jul 2004

5×7 Posts
Default

Again, some statistical distribution data is needed to optimize.
Say something about the length of dividend.
For the divisor, I'm thinking of doing a bit population count and avoid div instruction, if bit population is small. But let's see how's the distribution first. Also, is one divisor going to be used multiple times?


shr edx,2
and edx,0x3fffffff
The "and" is obviously redundant.
xenon is offline   Reply With Quote
Old 2005-08-08, 12:04   #4
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

11101001001002 Posts
Thumbs up

Quote:
Originally Posted by xenon
Again, some statistical distribution data is needed to optimize.
Say something about the length of dividend.
For the divisor, I'm thinking of doing a bit population count and avoid div instruction, if bit population is small. But let's see how's the distribution first. Also, is one divisor going to be used multiple times?


shr edx,2
and edx,0x3fffffff
The "and" is obviously redundant.
Typically the dividend has 3 to 5 limbs. (i.e. triple to pentuple precision)
The divisors are almost always small, i.e. less than 16 bits. Once the
dividend reduces to double precision we have usually divided out the
small primes, leaving a double precision number that is either prime
or gets split with a version of QS optimized for 60 bit numbers.

I've been looking at ways to avoid the idiv as well. Perhaps a
floating point approach?
R.D. Silverman is offline   Reply With Quote
Old 2005-08-08, 13:25   #5
xenon
 
Jul 2004

2316 Posts
Default

I have no idea right now. I'll find a way later.

Just a try to make the code shorter. I can reduce some overhead by coding the entire function in assembly.
Code:
unsigned int div_m1(unsigned int *a, unsigned int b, unsigned int *result)
{
	register unsigned int rem, as;

	as = a[0] - 1;	// as = index to most significant word

	if (b > a[as])	// no division needed
	{
		result[as] = 0;	// quotient = 0, rem is as assigned
		rem = a[as];
		result[0] = as;	// shrink size by one word
	}
	else
	{
		result[0] = a[0];		// quotient size = dividend size
		_asm {
			mov ecx, as
			mov eax, a
			lea eax, [eax+ecx*4]	// compute pointer eax = a[as]
			mov eax, [eax]
			xor edx, edx		// for MSWord divide, 32bit dividend
			div b
			mov rem, edx
			mov edx, result
			lea edx, [edx+ecx*4]	// compute pointer edx = result[as]
			mov [edx], eax
		}
	}

	for (as-=1; as>0; as--)		// start from next most significant word
	{
		_asm {
			mov eax, rem
			mov edx, eax
			shl eax, 30
			shr edx, 2		// edx:eax loaded with last remainder
			mov ecx, as
			mov ebx, a
			lea ebx, [ebx+ecx*4]	// compute pointer ebx = a[as]
			add eax, [ebx]
			adc edx, 0		// edx:eax prepared

			div b
			mov rem, edx
			mov edx, result
			lea edx, [edx+ecx*4]	// compute pointer edx = result[as]
			mov [edx], eax
		}
	}

	return rem;
}
This is actually not for speed. I didn't do anything to speed up. However I might have reduced some register movements.
xenon is offline   Reply With Quote
Old 2005-08-10, 14:05   #6
xenon
 
Jul 2004

3510 Posts
Default

When you say radix 2^30, does that mean every limb is less than 2^30?

If that is the case,
add eax,b
adc edx,0
can be changed to
or eax,b

And yes, my "shortened code" is a little faster than the original. For 3 limbs, 170 vs 185 clocks. I guess another 10 clocks can be saved by writing the whole function in assembly.
xenon is offline   Reply With Quote
Old 2005-08-12, 10:45   #7
xenon
 
Jul 2004

5·7 Posts
Default

Currently my code doesn't check for the special case of single precision.
This is to squeeze out every bit of performance.

Constraints:
divisor: [1, 2^30-1]
For every dividend limb: [0, 2^30-1]
minimum number of limbs: 2 (Length=3)
maximum number of limbs: >1 billion, limited by memory


Code:
Pentium III timing

          asm    previous*
3-limb    127      158
4-limb    172      209
5-limb    214      252
*previous refers to my "shortened" code.
Please note that this code is for Visual C++ 6 only. The assembly code is not suitable for inline (using _asm keyword) in C code because the way esp is used to access arguments. Anyway, creating a function in C and then using _asm will not perform as good as linking functions written in assembly. The compiler generates some code to set up ebp which is not needed.

I don't think additional improvements can be done without avoiding DIV instruction. Look at the timings, 43 clocks per limb, while DIV instruction takes 39 clocks. Now the word "reciprocal" plays in my mind...
Attached Files
File Type: zip xdiv1_vc.zip (36.2 KB, 162 views)
xenon is offline   Reply With Quote
Old 2005-08-15, 12:01   #8
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

22×5×373 Posts
Default

Quote:
Originally Posted by xenon

Now the word "reciprocal" plays in my mind...

I tried it already in FP. There are two problems.

(1) round-off error. Let A be the dividend and d be the divisor.
compute 1/d. This is accurate to +/- 1 ULP, ~ 2^-53 for the
Pentium. (i.e. without IEEE 80-bit arithmetic)

A partial dividend A can be as large as 2^60-1, so A*d is only
accurate to +/- 128. This, however can be corrected.

(2) Speed. My floating point code resembled:

r = 1/d
rem = 0

Then, for each limb a[i]

A = rem * RADIX + a[i]
quo = (int) (A * r)
rem = A - quo * d

This was almost 3 TIMES SLOWER than the all integer code under discussion.
And this is without the code needed to correct for round-off error.

I see no point in trying to optimize this approach.

An integer version would need to set:

r = 2^60/d

But now r is double-precision, so each loop iteration would involve
multiplying two double precision numbers ---> slow!
R.D. Silverman is offline   Reply With Quote
Old 2005-08-15, 12:57   #9
xenon
 
Jul 2004

2316 Posts
Default

Yes, I tried myself. The floating point approach can't get it faster. But that's not too slow either, just about 1.2x the time taken. Floating point multiplication isn't really slow, it's moving data in and out between integer and FPU that slow things down.

Anyway, another idea. Is the dividend held constant and changing divisor? Or divisor held constant and changing dividend? Thinking about SIMD.

So, are you able to make good use of my assembly code? Or am I just making rubbish?
xenon is offline   Reply With Quote
Old 2005-08-15, 16:54   #10
R.D. Silverman
 
R.D. Silverman's Avatar
 
Nov 2003

164448 Posts
Default

Quote:
Originally Posted by xenon
Yes, I tried myself. The floating point approach can't get it faster. But that's not too slow either, just about 1.2x the time taken. Floating point multiplication isn't really slow, it's moving data in and out between integer and FPU that slow things down.

Anyway, another idea. Is the dividend held constant and changing divisor? Or divisor held constant and changing dividend? Thinking about SIMD.

So, are you able to make good use of my assembly code? Or am I just making rubbish?
I have one version that uses your assembly code. I keep it for
"experimental code". For code that I make public I want as little assembler
as possible.
R.D. Silverman is offline   Reply With Quote
Old 2005-08-17, 01:45   #11
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5×79 Posts
Default

If you have a constant dividend, there are certain tricks you can use to get the remainder faster, utilizing things like tables of remainder values. Tricks like this were used in the ECCP-109 code I worked on once.

As for general division, there's an algorithm I've always wanted to try, but I don't think I ever have. First, place the single precision number the highest radix. Shift it down one bit until it's smaller than the divisor. Now subtract, put a bit in the result, and shift again. This will be terrible on P4, and I don't really have time to try it on any other platform, but at least it has no idiv's.
Ken_g6 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
I'm not sure if this is prime or my CPU-Completely lost. Unregistered Information & Answers 4 2013-04-10 07:09
Factored vs. Completely factored aketilander Factoring 4 2012-08-08 18:09
M4219 completely factored? WVU Mersenneer Factoring 58 2011-01-27 15:03
Prime 95 Reccomended - Completely Lost MarkJD Information & Answers 10 2010-08-19 17:31
M673 completely factored philmoore Factoring 1 2003-03-31 23:49

All times are UTC. The time now is 00:29.

Mon Apr 19 00:29:00 UTC 2021 up 10 days, 19:09, 0 users, load averages: 1.94, 2.23, 2.45

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.