![]() |
|
|
#45 |
|
Jul 2004
2316 Posts |
Speed improvement. Readable assembly code.
Still for Visual C++ compiler only. Please tell whether you have compilation problems. Code:
modulus asm C 1048583 254 349 2097169 266 365 4194319 279 395 8388617 293 401 16777259 305 418 33554467 318 436 |
|
|
|
|
|
#46 | |
|
"Bob Silverman"
Nov 2003
North of Boston
11101100011012 Posts |
Quote:
It appears that the routine assumes a __fastcall. Correct? And that the inputs are already in registers ecx,edx. If the calling code is: int __fastcall eegcd(a,modulus) will VC++ automatically put the arguments into ecx & edx? Query 2: Done_Ans_1 seems quite a bit away from the jump instruction. Would it help to move it closer? ??? Also, how much time is saved by __fastcall as opposed to (say) __cdecl? I prefer not to rely upon arguments being in spcific registers. With yourOK, I would like to change this to: int single_modinv(int a, int modulus) { _asm { mov ecx, a mov edx, modulus ... your code etc. } } The routine then no longer needs to save a copy of modulus in [esp-4]. The code after Done_mebx can replace mov eax, [esp-4] with mov eax, modulus and does not need to save modulus at [esp-4] at the top. We can also eliminate all the push/pops etc. |
|
|
|
|
|
|
#47 |
|
Jul 2004
5×7 Posts |
You may do anything to my code.
Yes, that is fastcall calling convention, arguments are put by compiler into registers. According to Agner, both Microsoft and GNU use the same registers (ecx and edx) for fastcall. Therefore, now you can use my code just like library. It should always work as long as the same compiler is used. ebp, ebx, esi, edi must be preserved. So, that's the reason push and pops are there. fastcall is faster than other calls which put arguments on stack, but I could be wrong here. Actually I used stdcall before this and then found information about fastcall. For Done_Ans_1, you can experiment yourself if you have MASM32. I've tried moving codes around and find this to be fast. Maybe I missed some arrangement. |
|
|
|
|
|
#48 |
|
Jul 2004
5×7 Posts |
Got it run on Linux. I run the code on my other computer which is also a Pentium III. Now I see that maybe I don't know how to use Visual C++ to compile properly. This gcc can compile codes faster than VC++. Anyway, I tried both -O2 switch and -O3 switch on Linux. Nevertheless, hand-tuned code wins. Keep in mind that my assembly code is derived from akruppa's code.
Code:
modulus asm(O2) C(O2) asm(O3) C(O3) 1048583 260 320 258 279 2097169 273 336 271 295 4194319 287 351 284 310 8388617 300 367 298 324 16777259 313 383 311 339 33554467 327 398 325 354 I assume we only need to take care Windows and Linux on x86, right? For the file I attach, asm folder is meant for Windows (because I use MASM32 assembler, then convert using emxaout). xi2_gcc directory is meant for linux. |
|
|
|
|
|
#49 | |
|
Jun 2003
The Texas Hill Country
32·112 Posts |
Quote:
Is anything that you are doing particular to Linux (rather than gcc)? |
|
|
|
|
|
|
#50 | ||
|
3·5·72·11 Posts |
Quote:
Quote:
|
||
|
|
|
#51 | ||
|
Jul 2004
5×7 Posts |
Quote:
Quote:
|
||
|
|
|
|
|
#52 |
|
Jan 2005
Caught in a sieve
5·79 Posts |
Here's a theoretical (meaning I haven't tested it) fast LSB function. It does a binary search for the LSB! In theory you could do the same for the MSB (I think you'd just check if lowpart != 0, but that might be off by one). And the conditionals should produce CMOVs in the assembly.
Code:
int lsb(unsigned int n) {
unsigned int lowpart;
int lsb = 0;
int bits;
for(bits=sizeof(unsigned int)/2; bits != 0; bits >>= 1) {
lowpart = (1<<bits)-1;
lowpart = n & lowpart;
n = (lowpart==0)?(n>>bits):n;
lsb += (lowpart==0)?bits:0;
}
}
|
|
|
|
|
|
#53 |
|
Jun 2005
11112 Posts |
Here is a recursive version of LSB algorithm. I have tested it using a simple wrapper. It does not handle the input of 0 well (I consider this undefined). This may be better or worse than what you are currently doing for 32 bit numbers however it should scale to 64bit better.
I tried to make it easy to convert to ASM. Also mine probably requires more registers than yours. Also maybe you can combine them to get both at once (if that helps you). Code:
int LSB(unsigned int x)
{
unsigned int a, b, c;
a = x;
b = 0;
c = 16;
do
{
a <<= c;
if (a)
b |= c;
else
a = x << b;
c >>= 1;
} while (c);
b = 31 - b;
return b;
}
int MSB(unsigned int x)
{
unsigned int a, b, c;
a = x;
b = 0;
c = 16;
do
{
a >>= c;
if (a)
b |= c;
else
a = x >> b;
c >>= 1;
} while (c);
return b;
}
|
|
|
|
|
|
#54 |
|
Jan 2005
Caught in a sieve
5·79 Posts |
I knew I had a better MSB code lying around here somewhere. It turns out that my LSB algorithm is useless because only 1 in 2^n numbers has an LSB of n or greater. So this takes about twice as long as the naive method.
For MSB it's somewhat more useful, but there's another better method I found here. Pick any one of those methods for MSB. For LSB, a similar table approach takes 1/4 the time of the naive approach. I couldn't get the compiler to inline my function, so I inlined it myself in my test: Code:
const char LsbTable256[] =
{
0, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
7, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
6, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
5, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0,
4, 0, 1, 0, 2, 0, 1, 0, 3, 0, 1, 0, 2, 0, 1, 0
};
unsigned int i, n, lsb1;
for(i=1; i < 1000000000; ++i) {
n = i;
lsb1 = 0;
while((n&0xFF) == 0) {
n >>= 8;
lsb1 += 8;
}
lsb1 |= LsbTable256[n];
}
If a table's too much for you, here's an intermediate LSB method that's about 70% faster than naive: Code:
unsigned int i, n, lsb1;
for(i=1; i < 1000000000; ++i) {
n = i;
lsb1 = 0;
while((n&3) == 0) {
n >>= 2;
lsb1 += 2;
}
lsb1 |= ((~n)&1);
}
|
|
|
|
|
|
#55 |
|
Jan 2005
Caught in a sieve
6138 Posts |
Now that I look at that page again, there's another algorithm here that's about 15% faster than my table algorithm on my Athlon, and has no shifts so it's probably faster on P4. But it uses a modulus, so I don't know for sure whether it's better.
|
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Calling airsquirrels | Prime95 | GPU Computing | 16 | 2015-09-29 18:06 |
| Help from coders | ET_ | GPU Computing | 5 | 2014-01-26 13:58 |
| Calling all 64-bit Linux sievers! | frmky | NFS@Home | 25 | 2013-10-16 15:58 |
| IA-32 Assembly Coders, anyone? | xenon | Programming | 6 | 2005-06-02 13:26 |
| Bob, I'm calling you out! | synergy | Miscellaneous Math | 17 | 2004-10-26 15:26 |