mersenneforum.org  

Go Back   mersenneforum.org > Extra Stuff > Programming

Reply
 
Thread Tools
Old 2005-08-02, 13:09   #45
xenon
 
Jul 2004

2316 Posts
Default

Speed improvement. Readable assembly code.
Still for Visual C++ compiler only. Please tell whether you have compilation problems.
Code:
   modulus      asm      C
   1048583      254     349
   2097169      266     365
   4194319      279     395
   8388617      293     401
  16777259      305     418
  33554467      318     436
Attached Files
File Type: zip xi_vc2.zip (22.5 KB, 253 views)
xenon is offline   Reply With Quote
Old 2005-08-02, 14:59   #46
R.D. Silverman
 
R.D. Silverman's Avatar
 
"Bob Silverman"
Nov 2003
North of Boston

11101100011012 Posts
Thumbs up

Quote:
Originally Posted by xenon
Speed improvement. Readable assembly code.
Still for Visual C++ compiler only. Please tell whether you have compilation problems.
Code:
   modulus      asm      C
   1048583      254     349
   2097169      266     365
   4194319      279     395
   8388617      293     401
  16777259      305     418
  33554467      318     436
Query 1:

It appears that the routine assumes a __fastcall. Correct? And that
the inputs are already in registers ecx,edx.

If the calling code is: int __fastcall eegcd(a,modulus) will VC++
automatically put the arguments into ecx & edx?


Query 2:

Done_Ans_1 seems quite a bit away from the jump instruction. Would it
help to move it closer?

???

Also, how much time is saved by __fastcall as opposed to (say) __cdecl?

I prefer not to rely upon arguments being in spcific registers.
With yourOK, I would like to change this to:

int single_modinv(int a, int modulus)

{
_asm {
mov ecx, a
mov edx, modulus
... your code
etc.
}
}
The routine then no longer needs to save a copy of modulus in [esp-4].
The code after Done_mebx can replace

mov eax, [esp-4]

with

mov eax, modulus

and does not need to save modulus at [esp-4] at the top.

We can also eliminate all the push/pops etc.
R.D. Silverman is offline   Reply With Quote
Old 2005-08-03, 00:16   #47
xenon
 
Jul 2004

5×7 Posts
Default

You may do anything to my code.

Yes, that is fastcall calling convention, arguments are put by compiler into registers.
According to Agner, both Microsoft and GNU use the same registers (ecx and edx) for fastcall. Therefore, now you can use my code just like library. It should always work as long as the same compiler is used.

ebp, ebx, esi, edi must be preserved. So, that's the reason push and pops are there.

fastcall is faster than other calls which put arguments on stack, but I could be wrong here. Actually I used stdcall before this and then found information about fastcall.

For Done_Ans_1, you can experiment yourself if you have MASM32. I've tried moving codes around and find this to be fast. Maybe I missed some arrangement.
xenon is offline   Reply With Quote
Old 2005-08-04, 12:40   #48
xenon
 
Jul 2004

5×7 Posts
Default

Got it run on Linux. I run the code on my other computer which is also a Pentium III. Now I see that maybe I don't know how to use Visual C++ to compile properly. This gcc can compile codes faster than VC++. Anyway, I tried both -O2 switch and -O3 switch on Linux. Nevertheless, hand-tuned code wins. Keep in mind that my assembly code is derived from akruppa's code.

Code:
   modulus    asm(O2)  C(O2)   asm(O3)  C(O3)
   1048583      260     320     258      279
   2097169      273     336     271      295
   4194319      287     351     284      310
   8388617      300     367     298      324
  16777259      313     383     311      339
  33554467      327     398     325      354
I was wrong about fastcall that uses registers ecx and edx. In VC++, it's true. But for gcc, the registers are eax, edx, ecx. I was thinking that it'll be easier to use the link method for asm-C interface. But now, I've learn gcc inline asm AT&T syntax. So I can do both method. Both linking method and inlining method requires rewriting every line of assembly code for peak performance. Actually, by using linking method, we can be a bit lazy and use a few mov at the beginning to get the input registers in the right position. So what do you choose? link or inline?

I assume we only need to take care Windows and Linux on x86, right?
For the file I attach, asm folder is meant for Windows (because I use MASM32 assembler, then convert using emxaout). xi2_gcc directory is meant for linux.
Attached Files
File Type: zip xi2_gcc.zip (5.8 KB, 223 views)
xenon is offline   Reply With Quote
Old 2005-08-04, 19:36   #49
Wacky
 
Wacky's Avatar
 
Jun 2003
The Texas Hill Country

32·112 Posts
Default

Quote:
Originally Posted by xenon
I assume we only need to take care Windows and Linux on x86, right?
It would certainly be nice to include FreeBSD, Slowlaris, and MacOS.

Is anything that you are doing particular to Linux (rather than gcc)?
Wacky is offline   Reply With Quote
Old 2005-08-04, 22:19   #50
TTn
 

3·5·72·11 Posts
Post

Quote:
modulus asm c
1048583 423 475
2097169 385 473
4194319 407 490
8388617 425 517
16777259 439 546
33554467 462 552
I did get a warning:

Quote:
--------------------Configuration: inverse3 - Win32 Release--------------------
Compiling...
inverse3.cpp
C:\Documents and Settings\MyAcount\Desktop\xenon\xi_vc2\benchmark\inverse3.cpp(14) : warning C4101: 'temp' : unreferenced local variable
Linking...

inverse3.exe - 0 error(s), 1 warning(s)
  Reply With Quote
Old 2005-08-05, 00:14   #51
xenon
 
Jul 2004

5×7 Posts
Default

Quote:
Originally Posted by Wacky
It would certainly be nice to include FreeBSD, Slowlaris, and MacOS.

Is anything that you are doing particular to Linux (rather than gcc)?
No doing anything particular to Linux, but asm code is for x86 only.

Quote:
Originally Posted by TTn
I did get a warning:
Yes, it is intended. I was too lazy to write #ifdef code there.
xenon is offline   Reply With Quote
Old 2005-08-18, 00:30   #52
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5·79 Posts
Default

Here's a theoretical (meaning I haven't tested it) fast LSB function. It does a binary search for the LSB! In theory you could do the same for the MSB (I think you'd just check if lowpart != 0, but that might be off by one). And the conditionals should produce CMOVs in the assembly.
Code:
int lsb(unsigned int n) {
	unsigned int lowpart;
	int lsb = 0;
	int bits;

	for(bits=sizeof(unsigned int)/2; bits != 0; bits >>= 1) {
		lowpart = (1<<bits)-1;
		lowpart = n & lowpart;
		n = (lowpart==0)?(n>>bits):n;
		lsb += (lowpart==0)?bits:0;
	}
	
}
I hope this helps.
Ken_g6 is offline   Reply With Quote
Old 2005-08-18, 11:25   #53
cjohnsto
 
Jun 2005

11112 Posts
Default

Here is a recursive version of LSB algorithm. I have tested it using a simple wrapper. It does not handle the input of 0 well (I consider this undefined). This may be better or worse than what you are currently doing for 32 bit numbers however it should scale to 64bit better.
I tried to make it easy to convert to ASM. Also mine probably requires more registers than yours.

Also maybe you can combine them to get both at once (if that helps you).

Code:
int LSB(unsigned int x)
{
	unsigned int a, b, c;
	
	a = x;
	b = 0;
	c = 16;
	do
	{
		a <<= c;
		if (a)
			b |= c;
		else
			a = x << b;
		c >>= 1;
	}	while (c);
	b = 31 - b;
	return b;
}

int MSB(unsigned int x)
{
	unsigned int a, b, c;
	
	a = x;
	b = 0;
	c = 16;
	do
	{
		a >>= c;
		if (a)
			b |= c;
		else
			a = x >> b;
		c >>= 1;
	}	while (c);
	return b;
}
cjohnsto is offline   Reply With Quote
Old 2005-08-18, 16:00   #54
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

5·79 Posts
Default

I knew I had a better MSB code lying around here somewhere. It turns out that my LSB algorithm is useless because only 1 in 2^n numbers has an LSB of n or greater. So this takes about twice as long as the naive method.

For MSB it's somewhat more useful, but there's another better method I found here. Pick any one of those methods for MSB.

For LSB, a similar table approach takes 1/4 the time of the naive approach. I couldn't get the compiler to inline my function, so I inlined it myself in my test:
Code:
	const char LsbTable256[] = 
	{
		0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
		4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
	};

	unsigned int i, n, lsb1;
	for(i=1; i < 1000000000; ++i) {
		n = i;
		lsb1 = 0;
		while((n&0xFF) == 0) {
			n >>= 8;
			lsb1 += 8;
		}
		lsb1 |= LsbTable256[n];
	}
As in cjohnsto's code, passing 0 is undefined. I haven't actually checked that this table is correct, but I think it is.

If a table's too much for you, here's an intermediate LSB method that's about 70% faster than naive:
Code:
	unsigned int i, n, lsb1;
	for(i=1; i < 1000000000; ++i) {
		n = i;
		lsb1 = 0;
		while((n&3) == 0) {
			n >>= 2;
			lsb1 += 2;
		}
		lsb1 |= ((~n)&1);
	}
Ken_g6 is offline   Reply With Quote
Old 2005-08-18, 18:04   #55
Ken_g6
 
Ken_g6's Avatar
 
Jan 2005
Caught in a sieve

6138 Posts
Default

Now that I look at that page again, there's another algorithm here that's about 15% faster than my table algorithm on my Athlon, and has no shifts so it's probably faster on P4. But it uses a modulus, so I don't know for sure whether it's better.
Ken_g6 is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Calling airsquirrels Prime95 GPU Computing 16 2015-09-29 18:06
Help from coders ET_ GPU Computing 5 2014-01-26 13:58
Calling all 64-bit Linux sievers! frmky NFS@Home 25 2013-10-16 15:58
IA-32 Assembly Coders, anyone? xenon Programming 6 2005-06-02 13:26
Bob, I'm calling you out! synergy Miscellaneous Math 17 2004-10-26 15:26

All times are UTC. The time now is 04:11.


Fri Jul 7 04:11:28 UTC 2023 up 323 days, 1:40, 0 users, load averages: 1.58, 1.61, 1.41

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔