![]() |
|
|
#1 |
|
"Ben"
Feb 2007
DB916 Posts |
Is there an instruction in the x86 instruction set which will count the number of set bits in a byte, or word? I've looked through the AMD and Intel Architecture manuals a bit, and didn't see anything in the general purpose instructions. But I could have missed it, or maybe there is something in SSE(2,3,4) or MMX or something.
In liu of that, the best I could come up with considering pipelining and all is the following: Code:
//val is an array of bytes. count += (val[i] & 0x80) >> 7; count += (val[i] & 0x40) >> 6; count += (val[i] & 0x20) >> 5; count += (val[i] & 0x10) >> 4; count += (val[i] & 0x8) >> 3; count += (val[i] & 0x4) >> 2; count += (val[i] & 0x2) >> 1; count += (val[i] & 0x1); |
|
|
|
|
|
#2 |
|
"Ben"
Feb 2007
3×1,171 Posts |
If I cast the array of bytes into an unsigned short pointer, and unroll some more, I get a 4% improvement...
If I do the same for an unsigned int and unroll to 32 lines per loop iteration I get another 1%, so returns diminish as I would expect. A hardware instruction to just give me the number of set bits would far and away beat all of this, I would think, and be much cleaner. |
|
|
|
|
|
#3 | |
|
Jul 2003
So Cal
210610 Posts |
This post suggests that the fastest way is to use a lookup table for each byte.
Greg Quote:
|
|
|
|
|
|
|
#4 |
|
(loop (#_fork))
Feb 2006
Cambridge, England
72·131 Posts |
There is a bit-count instruction in AMD's Barcelona processor. This isn't very helpful unless you have one of those.
The paper from 1990 has measurements on a series of systems whose performance characteristics are rather different from contemporary hardware. I think a 256-byte table is probably still best for single bytes, but something like the HAKMEM recipe might well work better for 64-bit words. [you do X = (X & 0x55555555555) + ((X & 0xaaaaaaaaaaaaa)>>1) X = (X & 0x33333333333) + ((X & 0xccccccccccccccc)>>2) X = (X & 0x0f0f0f0f0f0f0f) + ((X & 0xf0f0f0f0f0f0f0f0)>>4) X = (X & 0x00ff00ff00ff) + ((X & 0xff00ff00ff00)>>8) X += X>>16; X += X>>32; X &= 0xff; where each step adds up the values in two adjacent 2^n-bit pockets of the register and writes the sum into a 2^{n+1}-bit pocket overwriting the original data. You can replace some of the lines by a multiplication by 0x10101010101010101... which is nice and fast on EMT64 hardware. |
|
|
|
|
|
#5 |
|
Apr 2003
Berlin, Germany
192 Posts |
I found a paper covering this topic, called "Beating the popcount":
http://www.icis.ntu.edu.sg/scs-ijit/91/91-1.pdf |
|
|
|
|
|
#6 |
|
"Ben"
Feb 2007
3×1,171 Posts |
No Barcelona for me, unfortunately
, but if it can sieve faster than the 3GHz Xeon Woodcrest I have access to I'll be impressed.Thank you all for the suggestions! I'll post results as I implement some of these. By the way, I should clarify my original improvement gains. The 5% was for the *overall* application (a sieve of erat.). The bit counting part was about 33% faster with the 32x loop unrolling. As of now, I can find and count all the primes up to 1e9 in 0.75 sec, on the above mentioned system. |
|
|
|
|
|
#7 |
|
"Ben"
Feb 2007
3×1,171 Posts |
Cool!
If i use the consistently fastest method from Greg's link (a table lookup), I get a 75% speedup over my 32x unrolled version! Here it is: Code:
//cast array of bytes into an unsigned int32 pointer
flagblock32 = (uint32 *)soe.line[j];
//for each word of flags in this sieve line:
for (i=0;i<numlinebytes/4;i++)
{
//cast again into a byte pointer
p = (unsigned char *)&flagblock32[i];
//lookup and sum the counts in each byte of the 32bit word
count += (inbits[p[0]] + inbits[p[1]] + inbits[p[2]] + inbits[p[3]]);
}
Last fiddled with by bsquared on 2007-10-18 at 14:00 Reason: added stuff at the end |
|
|
|
|
|
#8 |
|
"Ben"
Feb 2007
3×1,171 Posts |
This method:
Code:
//cast array of bytes into an unsigned int32 pointer
flagblock32 = (uint32 *)soe.line[j];
//for each word of flags in this sieve line:
for (i=0;i<numlinebytes/4;i++)
{
//look at each word
n = flagblock32[i];
//and do some magic
n = (n & 0x55555555) + ((n >> 1) & 0x55555555);
n = (n & 0x33333333) + ((n >> 2) & 0x33333333);
n = (n + (n >> 4)) & 0x0f0f0f0f;
n += n >> 8;
n += n >> 16;
it += (n & 0xff);
}
- ben. Last fiddled with by bsquared on 2007-10-18 at 14:32 |
|
|
|
![]() |
Similar Threads
|
||||
| Thread | Thread Starter | Forum | Replies | Last Post |
| Yafu instantly crashing after first instruction | luisda2994 | YAFU | 1 | 2016-12-09 00:30 |
| gnfs asm version sievers illegal instruction | EdH | Factoring | 32 | 2016-10-12 20:49 |
| Instruction for setup PRPNET server | pepi37 | Homework Help | 2 | 2016-04-24 21:18 |
| Larrabee instruction set announced | fivemack | Hardware | 0 | 2009-03-25 12:09 |
| Auto-instruction creating processor!! | PrimeCruncher | Hardware | 6 | 2004-05-01 11:53 |