mersenneforum.org

mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Programming (https://www.mersenneforum.org/forumdisplay.php?f=29)
-   -   Random bit in a word (https://www.mersenneforum.org/showthread.php?t=15310)

Prime95 2011-02-26 00:02

Random bit in a word
 
I need some code that efficiently (on Intel this means relatively short chunk of code with no branches that are hard to predict) returns the index of a randomly selected set bit in a 64-bit value.

Psuedo code:

int64 t;

cnt = POPCNT (t)
i = random value between 0 and cnt
while (i--) use BSF to find a set bit, then clear that set bit
return BSF(t)

POPCNT is an Intel assembly code instruction to count the number of set bits in a register
BSF is bit-scan-forward, it returns the index of the first set bit in a register


Clever bit-twiddling suggestions are welcome!

Batalov 2011-02-26 01:48

[QUOTE=Prime95;253762]BSF is bit-scan-forward, it returns the index of the first set bit in a register

Clever bit-twiddling suggestions are welcome![/QUOTE]
Shooting in the dark: init a 64k lookup table (for 16-bit words) once, and use it four times per BSF() call?

axn 2011-02-26 04:48

[QUOTE=Prime95;253762] use BSF to find a set bit, then clear that set bit
[/QUOTE]

If BSF scans from LSB to MSB, then "x & (x-1)" is a standard way to turn of the least set bit. Don't need bsf at all.

Prime95 2011-02-26 05:12

I was thinking a binary search for the random nth set bit as in:

int result = 0;
int64 t;

cnt = POPCNT (t)
i = random value between 0 and cnt

cnt32 = POPCNT (t & 0xFFFFFFFF)
t = t >> ((i > cnt32) ? 32 : 0)
result = result + ((i > cnt32) ? 32 : 0)
i = i - ((i > cnt32) ? cnt32 : 0)
repeat for 16, 8, 4, 2
return result

the compiler ought to be smart enough to generate conditional move instructions and thus the above is jump-less

only_human 2011-02-26 07:21

[URL="http://graphics.stanford.edu/~seander/bithacks.html#SelectPosFromMSBRank"]Select the bit position (from the most-significant bit) with the given count (rank)[/URL] might be close to what you want. It does its' own bit counting using a common fast method but retains the intermediate calculations and then uses those values to quickly locate the index of the r[SUP]th[/SUP] set bit. In the example it finds the position of the r[SUP]th[/SUP] set bit from the most significant end but it says that it can be adapted to find the bit from the least significant end.

[CODE]The following 64-bit code selects the position of the rth 1 bit when counting from the left. In other words if we start at the most significant bit and proceed to the right, counting the number of bits set to 1 until we reach the desired rank, r, then the position where we stop is returned. If the rank requested exceeds the count of bits set, then 64 is returned. The code may be modified for 32-bit or counting from the right.
uint64_t v; // Input value to find position with rank r.
unsigned int r; // Input: bit's desired rank [1-64].
unsigned int s; // Output: Resulting position of bit with rank r [1-64]
uint64_t a, b, c, d; // Intermediate temporaries for bit count.
unsigned int t; // Bit count temporary.

// Do a normal parallel bit count for a 64-bit integer,
// but store all intermediate steps.
// a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
a = v - ((v >> 1) & ~0UL/3);
// b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
// c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
c = (b + (b >> 4)) & ~0UL/0x11;
// d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
d = (c + (c >> 8)) & ~0UL/0x101;
t = (d >> 32) + (d >> 48);
// Now do branchless select!
s = 64;
// if (r > t) {s -= 32; r -= t;}
s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
t = (d >> (s - 16)) & 0xff;
// if (r > t) {s -= 16; r -= t;}
s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
t = (c >> (s - 8)) & 0xf;
// if (r > t) {s -= 8; r -= t;}
s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
t = (b >> (s - 4)) & 0x7;
// if (r > t) {s -= 4; r -= t;}
s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
t = (a >> (s - 2)) & 0x3;
// if (r > t) {s -= 2; r -= t;}
s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
t = (v >> (s - 1)) & 0x1;
// if (r > t) s--;
s -= ((t - r) & 256) >> 8;
s = 65 - s;
If branching is fast on your target CPU, consider uncommenting the if-statements and commenting the lines that follow them.
Juha Järvi sent this to me on November 21, 2009.[/CODE]

The link I provided doesn't jump to the anchored text properly in my browser. Here is another link anchored below the quoted section (scroll upward to quoted text) [url]http://graphics.stanford.edu/~seander/bithacks.html#ParityNaive[/url]

Prime95 2011-02-26 16:57

[QUOTE=only_human;253804][URL="http://graphics.stanford.edu/~seander/bithacks.html#SelectPosFromMSBRank"]Select the bit position (from the most-significant bit) with the given count (rank)[/URL] might be close to what you want.
[/QUOTE]

Perfect! Thanks.

ewmayer 2011-02-28 19:14

This is more in generic-C terms, but perhaps worth investigating:

1. Create a pair of 256-entry lookup tables: Table 1 is simply a bytewise lookup-table version of POPCNT: Take a byte and return #set bits in that byte;

Table 2 is a bit more complex - it is a lookup table which encodes the following: For any byte-valued integer B in [0,255], given a target set-bit index I in [0,POPCNT(B)], the table-lookup returns the bit-index of the set bit with respect to B. This is a bit trickier to set up, since different values of B have numbers of set bits ranging from 0 to 8, so one needs to consider whether the requested bit-index is legal for the byte-valued int in question. But assuming the caller will only ever request a bit-index value which is appropriate for the specific byte-value in question, you could use a set of 256 24-bit ints to encode this. Each 24-bit table entry is divided into eight 3-bit subfields containing up to eight bit indices for the byte-value corresponding to the table entry. The low 3 bits are the bit index of the first set bit in B, etc.

For greater efficiency it would probably be better to use 32-bit ints 4-bit index-offset subfields, with one bit in each subfield going unused. Then, declaring this table like so:

uint32 table2[256];

and after properly initializing it, the bit-index of a given set bit I (zero-offset, i.e.I=0 means the first set bit) in byte-value B is simply equal to

(table2[B] >> (4*I)) & 0xf

2. Loop over the 8 bytes in your 64-bit input int X, and do 8 Table-1 lookups to generate a sequence of partial sums based on the total #bits set in the accumulated bytes. This could be the basis for a fast binary search, but since are just 8 such partial the overhead for managing such a data structure might exceed any gains due to the better asymptotic opcount;

3. Given an input bit-index in [0, POPCNT(X)], use the partial sums you generated in step [2] to find which byte contains the bit in question, and then use Table 2 to give you the bit index within the target byte.

only_human 2011-02-28 21:58

I'm a bit rusty in programming Ernst, so I apologize if I miss the points or end up blathering a bit...

It is easy to get population counts for small fields of the entire 64 bit value

For single bit fields, POPCNT is the bit itself.

For two bit POPCNTs, you add even bit positions to odd bit positions.

Looking above, at the code snippet in my earlier message, it is just a case of shift, mask and add. The comments for the POPCNT variables are more explicit than the slightly optimized actual code.


v = 1 bit POPCNTs (64 of them)
a = 2 bit POPCNTs (32 of them)
b = 4 bit POPCNTs (16 of them)
c = 8 bit POPCNTs (8 of them)
d = 16 bit POPCNTs (4 of them)

Perhaps 'c' would be more efficient than 8 table lookups for step two that you describe.

The rest of the code snippet is a bit roughish; to avoid conditional tests and jumps, for comparisons it does subtracts and depends on certain bits of the result. To adjust 's' it depends on bit 8 of an integer being 0 or 1 after a subtraction . To adjust 'r' it depends on several bits of the second byte of a subtraction being all 0 or all 1.

In any case the fact that it works makes it so ugly that it is nice. The first 't' value looks wrong to me but I accept that I must be misunderstanding things a bit. After each test and adjustment, 't' is changed to depend on a different one of those POPCNT variables I mentioned above shifted by the partial result 's' and masked to the relevant bits for the next comparison.

My tools didn't compile the masking constants correctly and I was forced to replace ~0UL with ~0ui64
Ross

Prime95 2011-03-01 01:05

[QUOTE=only_human;254006]
It is easy to get population counts for small fields of the entire 64 bit value[/QUOTE]

I agree, the POPCNTs are fairly easy to calculate. However, Ernst's idea of a table lookup for the last 3 levels of the binary search is a good idea.

The code from Juha takes 11 operations per binary search level. We can replace 33 operations with a few operations and a lookup into a 1KB table.

BTW, I think I can reduce the 11 operations per binary search level down to a more reasonable 6 or 7.

only_human 2011-03-01 01:28

[QUOTE=Prime95;254025]I agree, the POPCNTs are fairly easy to calculate. However, Ernst's idea of a table lookup for the last 3 levels of the binary search is a good idea.

The code from Juha takes 11 operations per binary search level. We can replace 33 operations with a few operations and a lookup into a 1KB table.

BTW, I think I can reduce the 11 operations per binary search level down to a more reasonable 6 or 7.[/QUOTE]Agree that a table could be good in here somewhere and definitely don't intend to disparage Ernst's ideas.

I thought that more than one comparison could be done at a time by setting guard bits that would keep borrows from propagating too far but that is likely primitive thinking because this is probably perfect fodder for modern SSE type instructions. I would be curious to see your ideas on simplifying Juha's binary search if you do develop them.

An ugly part of the code in my opinion is that, as written, in C is that the masks used after the subtractions wouldn't work correctly on machines that have a signed binary default integer representation.

Prime95 2011-03-01 01:57

Here is my modified version without Ernst's suggestion to eliminate the last 3 binary search levels. This is quite a bit faster. Also, it shouldn't be hard to remove the multiplication operations.

The original code was 25.5% of program execution time. The new code is 20.5% of total execution time.

[CODE] // Adapted from code by Juha J‰rvi found on Stanford's bit hacks web page
uint64_t v = needs_work_mask; // Input value to find position with rank r.
unsigned int r; // Input: bit's desired rank [1-64].
unsigned int s; // Output: Resulting position of bit with rank r [1-64]
uint64_t a, b, c, d; // Intermediate temporaries for bit count.
unsigned int t; // Bit count temporary.

// Do a normal parallel bit count for a 64-bit integer,
// but store all intermediate steps.
// a = (v & 0x5555...) + ((v >> 1) & 0x5555...);
a = v - ((v >> 1) & ~0UL/3);
// b = (a & 0x3333...) + ((a >> 2) & 0x3333...);
b = (a & ~0UL/5) + ((a >> 2) & ~0UL/5);
// c = (b & 0x0f0f...) + ((b >> 4) & 0x0f0f...);
c = (b + (b >> 4)) & ~0UL/0x11;
// d = (c & 0x00ff...) + ((c >> 8) & 0x00ff...);
d = (c + (c >> 8)) & ~0UL/0x101;
#ifdef OLD
t = (d >> 32) + (d >> 48);
// Now do branchless select!
s = 64;
// if (r > t) {s -= 32; r -= t;}
s -= ((t - r) & 256) >> 3; r -= (t & ((t - r) >> 8));
t = (d >> (s - 16)) & 0xff;
// if (r > t) {s -= 16; r -= t;}
s -= ((t - r) & 256) >> 4; r -= (t & ((t - r) >> 8));
t = (c >> (s - 8)) & 0xf;
// if (r > t) {s -= 8; r -= t;}
s -= ((t - r) & 256) >> 5; r -= (t & ((t - r) >> 8));
t = (b >> (s - 4)) & 0x7;
// if (r > t) {s -= 4; r -= t;}
s -= ((t - r) & 256) >> 6; r -= (t & ((t - r) >> 8));
t = (a >> (s - 2)) & 0x3;
// if (r > t) {s -= 2; r -= t;}
s -= ((t - r) & 256) >> 7; r -= (t & ((t - r) >> 8));
t = (v >> (s - 1)) & 0x1;
// if (r > t) s--;
s -= ((t - r) & 256) >> 8;
//s = 65 - s;

//convert s to LSB 0-63 notation
s--;
#else
int pick;
// s = 0;
// if (r > popcnt32[s]) {s += 32; r -= popcnt32[s];}
t = (d + (d >> 16)) & 0xFF;
pick = (r > t);
s = pick << 5; r -= pick * t;
// if (r > popcnt16[s]) {s += 16; r -= popcnt16[s];}
t = (d >> s) & 0xff;
pick = (r > t);
s += pick << 4; r -= pick * t;
// if (r > popcnt8[s]) {s += 8; r -= popcnt8[s];}
t = (c >> s) & 0xf;
pick = (r > t);
s += pick << 3; r -= pick * t;
// if (r > popcnt4[s]) {s += 4; r -= popcnt4[s];}
t = (b >> s) & 0x7;
pick = (r > t);
s += pick << 2; r -= pick * t;
// if (r > popcnt2[s]) {s += 2; r -= popcnt2[s];}
t = (a >> s) & 0x3;
pick = (r > t);
s += pick << 1; r -= pick * t;
// if (r > popcnt1[s]) s++;
t = (v >> s) & 0x1;
pick = (r > t);
s += pick;
#endif
[/CODE]

ckdo 2011-03-01 07:59

[QUOTE=Prime95;253762]I need some code that efficiently (on Intel this means relatively short chunk of code with no branches that are hard to predict) returns the index of a randomly selected set bit in a 64-bit value.

[...]

Clever bit-twiddling suggestions are welcome![/QUOTE]

Ahem... :hello:

[code]
bitIndex = CountLeadingZeroes(RotateLeft(inputWord, randomShift)) + randomShift;
[/code]Clever enough? :geek:

jrk 2011-03-01 08:42

[QUOTE=ckdo;254059]Ahem... :hello:

[code]
bitIndex = CountLeadingZeroes(RotateLeft(inputWord, randomShift)) + randomShift;
[/code]Clever enough? :geek:[/QUOTE]

That will have bias.

Prime95 2011-03-01 14:56

[QUOTE=jrk;254061]That will have bias.[/QUOTE]

Agreed. I don't need a perfectly random selection, but this algorithm has way too much bias.

akruppa 2011-03-01 17:34

What is "random?" Is i supplied by the caller, or should i be a function of t, or should i be (more or less) uniformly at random in [0, popcnt(t)-1], but independent of the positions of the lit bits? Or something else?

Edit: if i is a supplied parameter, Jörg Arndt's [URL="http://www.jjj.de/fxt/#fxtbook"]book[/URL] has an algorithm in 1.10. I looks similar to the one only_human posted.

Prime95 2011-03-01 18:35

[QUOTE=akruppa;254106]What is "random?" Is i supplied by the caller, or should i be a function of t, or should i be (more or less) uniformly at random in [0, popcnt(t)-1], but independent of the positions of the lit bits?[/QUOTE]

i is (more or less) uniformly at random in [0, popcnt(t)-1].

only_human 2011-03-01 23:19

I've rewritten George's version of the branchless select portion of the code to have an explicit variable 'n' that might be useful for a table lookup.

I don't think I have messed it up but I have been looking at it too long.

I changed a mask from 0xf to 0xff at one location where I felt it was safe to do so.

As it stands, I believe it is less efficient than George's version. I think when I checked his version it took about 9 instructions for a particular popcnt section. This takes about 11 assembler instructions and to my eyeballs looks it might have more back to back data dependencies.

[CODE]
int pick;
unsigned char n;
//s = 0

// if (r > popcnt32[s]) {s += 32; r -= popcnt32[s];}
t = (d + (d >> 16)) & 0xff;
pick = (r > t);
n = pick; s = n << 5; r -= pick * t;
// n = [0,1], s = n * 32

// if (r > popcnt16[s]) {s += 16; r -= popcnt16[s];}
t = (d >> s) & 0xff;
pick = (r > t);
n = n + n + pick; s = n << 4; r -= pick * t;
// n = [0,3], s = n * 16

// if (r > popcnt8[s]) {s += 8; r -= popcnt8[s];}
t = (c >> s) & 0xff;
pick = (r > t);
n = n + n + pick; s = n << 3; r -= pick * t;
// n = [0,7], s = n * 8

// if (r > popcnt4[s]) {s += 4; r -= popcnt4[s];}
t = (b >> s) & 0x7;
pick = (r > t);
n = n + n + pick; s = n << 2; r -= pick * t;
// n = [0,15], s = n * 4

// if (r > popcnt2[s]) {s += 2; r -= popcnt2[s];}
t = (a >> s) & 0x3;
pick = (r > t);
n = n + n + pick; s = n << 1; r -= pick * t;
// n = [0,31], s = n * 2

// if (r > popcnt1[s]) s++;
t = (v >> s) & 0x1;
pick = (r > t); s += pick;
// s = [0,63]
[/CODE]

only_human 2011-03-03 04:04

These modified routines silently return a value of 0x3f if a rank higher than the number of 1's present is requested. The bit at the returned index may or may not be set too (which is obviously wrong for an index to a specific 1 bit). An error check should be done to avoid biases caused by this.

I just noticed this problem just now.

only_human 2011-03-03 20:09

For what it's worth I'll complete my thoughts here.

The way the routine works is at each interval level the routine checks if the lower half of an interval has enough set bits to find the r-th bit. If there are not enough bits present, it subtracts the bit count for the lower half interval from 'r' and adjusts the index to point into the higher half interval. The interval size in any case is cut in half and the above check is done again.

So eventually when the requested r-th bit is higher than the all the 1's present in all the lower halves of all the intervals the index finally ends up 0x3f. If the index is 0x3f, that bit is never actually checked (the last check is the lower bit of a 2 bit interval).

Instead of this:
[CODE]// if (r > popcnt1[s]) s++;
t = (v >> s) & 0x1;
pick = (r > t);
s += pick;[/CODE]

An error check could be like this
[CODE]// if (r > popcnt1[s]) s++;
t = (v >> s) & 0x1;
pick = (r > t);
s += pick;
[COLOR="Red"] if (( s == 0x3f )
if (((r - pick * t) > 1 ) || ~((v >> s) & 0x1) ) error code here[/COLOR]
[/CODE]
The routine in the book that akruppa mentioned has an error check earlier in its' corresponding routine "1.10 Index of the i-th set bit":
[QUOTE]1 static inline ulong ith_one_idx(ulong x, ulong i)
2 // Return index of the i-th set bit of x where 0 <= i < bit_count(x).
3 {
4 ulong x2 = x - ((x>>1) & 0x5555555555555555UL); // 0-2 in 2 bits
5 ulong x4 = ((x2>>2) & 0x3333333333333333UL) +
6 (x2 & 0x3333333333333333UL); // 0-4 in 4 bits
7 ulong x8 = ((x4>>4) + x4) & 0x0f0f0f0f0f0f0f0fUL; // 0-8 in 8 bits
8 ulong ct = (x8 * 0x0101010101010101UL) >> 56; // bit count
9
10 ++i;
11 if ( ct < i ) return ~0UL; // less than i bits set[/QUOTE]

robertk 2012-10-06 09:34

[QUOTE=Prime95;254032]Here is my modified version without Ernst's suggestion to eliminate the last 3 binary search levels. This is quite a bit faster. Also, it shouldn't be hard to remove the multiplication operations.

The original code was 25.5% of program execution time. The new code is 20.5% of total execution time.
[/QUOTE]

Your thread addresses a problem I also have: obtaining the position of a random set bit. But the code you provided here gets the position of a specific set bit. Would there not be possible optimizations if beforehand its not of interest which set bit is selected as long as there is no (or little) bias?

Thanks!


All times are UTC. The time now is 23:28.

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.