网上搜到的bit算法很好的材料 http://aggregate.org/MAGIC/

### Absolute Value of a Float

IEEE floating point uses an explicit sign bit, so the absolute value can be taken by a bitwise AND with the complement of the sign bit. For IA32 32-bit, the sign bit is an `int` value of 0x80000000, for IA32 64-bit, the sign bit is the `long long int` value 0x8000000000000000. Of course, if you prefer to use just `int` values, the IA32 64-bit sign bit is 0x80000000 at an `int` address offset of +1. For example:

### Alignment of Pointers

### Average of Integers

### Bit Reversal

### Comparison of Float Values

### Comparison to Mask Conversion

### Dual-Linked List with One Pointer Field

### Divide Rounding

### GPU Any

### Gray Code Conversion

### Integer Constant Multiply

### Integer Minimum or Maximum

### Integer Power

### Integer Selection

### Is Power of 2

### Leading Zero Count

### Least Significant 1 Bit

### Log2 of an Integer

### Next Largest Power of 2

### Most Significant 1 Bit

### Natural Data Type Precision Conversions

### Polynomials

### Population Count (Ones Count)

### Shift-and-Add Optimization

### Sign Extension

### Swap Values Without a Temporary

### SIMD Within A Register (SWAR) Operations

### Trailing Zero Count

## Other People's Magic

The following are generally good sources of magic algorithms:

double x;

/* make x = abs(x) */

*(((int *) &x) + 1) &= 0x7fffffff;

Alignment of pointers is a pretty common problem, and there are several relevant tricks, so, at the suggestion of Jean-Charles Meyrignac (who also provided an example of the upward alignment described below), I've added a little description here.

It is fairly obvious that the downward alignment of an address `a` to a multiple-of-`b` boundary, where `b` is a power of 2, is simply `(a & ~(b-1))`. Of course, `~(b-1)` is also `-b`, so `(a & -b)` also works (the difference is usually nothing; if `b` is a constant, most compilers will fold the first into the second form).

For upward alignment, we simply add `b-1`: `((a + (b-1)) & -b)`.

Of course, there are a few complications. First, languages like C, which allow pointer arithmetic, generally scale pointer offsets by the size of the target object -- which would keep our math from working. It used to be that casting a pointer as a

Secondly, aligning an address doesn't help unless you allocated a large enough chunk of memory so that you can treat your da`b`-aligned da`c` bytes, you would do something like: `free(a)`; you'll need to keep a copy of the original block address for that.

This is actually an extension of the "well known" fact that for binary integer values `x` and `y`, `(x+y)` equals `((x&y)+(x|y))` equals `((x^y)+2*(x&y))`.

Given two integer values `x` and `y`, the (floor of the) average normally would be computed by `(x+y)/2`; unfortunately, this can yield incorrect results due to overflow. A very sneaky alternative is to use `(x&y)+((x^y)/2)`. If we are aware of the potential non-portability due to the fact that C does not specify if shifts are signed, this can be simplified to `(x&y)+((x^y)>>1)`. In either case, the benefit is that this co

Reversing the bits in an integer `x` is somewhat painful, but here's a SWAR algorithm for a 32-bit value:

unsigned int

reverse(register unsigned int x)

{

x = (((x & 0xaaaaaaaa) >> 1) | ((x & 0x55555555) << 1));

x = (((x & 0xcccccccc) >> 2) | ((x & 0x33333333) << 2));

x = (((x & 0xf0f0f0f0) >> 4) | ((x & 0x0f0f0f0f) << 4));

x = (((x & 0xff00ff00) >> 8) | ((x & 0x00ff00ff) << 8));

return((x >> 16) | (x << 16));

}

It also is possible to re-write this algorithm to use 4 instead of 8 constants, thus saving some instruction bandwidth. On my 1.2GHz Athlon (a Thunderbird), the difference is too small to measure reliably. Here's the other version:

unsigned int

reverse(register unsigned int x)

{

register unsigned int y = 0x55555555;

x = (((x >> 1) & y) | ((x & y) << 1));

y = 0x33333333;

x = (((x >> 2) & y) | ((x & y) << 2));

y = 0x0f0f0f0f;

x = (((x >> 4) & y) | ((x & y) << 4));

y = 0x00ff00ff;

x = (((x >> 8) & y) | ((x & y) << 8));

return((x >> 16) | (x << 16));

}

IEEE floating point has a number of nice properties, including the ability to use 2's complement integer comparisons to compare floating point values, provided the native byte order is consistent between float and integer values. The on*AMD Athlon Processor x86 Co de Optimization Guide* gives a nice summary on Page 43. Here's a set of C routines that embody the same logic:

#define FasI(f) (*((int *) &(f)))

#define FasUI(f) (*((unsigned int *) &(f)))

#define lt0(f) (FasUI(f) > 0x80000000U)

#define le0(f) (FasI(f) <= 0)

#define gt0(f) (FasI(f) > 0)

#define ge0(f) (FasUI(f) <= 0x80000000U)

In many cases, it is useful to convert the result of a comparison, which is either 0 or some non-zero bit pattern, into either a "clean" 0 or -1 bit mask. If the messy **non-negative** integer value is `x`, the sanitized version is:

(((int) (-x)) >> (WORDBITS - 1))

To remove the constraint that the messy value be non-negative, use:

(((int) (x | -x)) >> (WORDBITS - 1))

Logically, this works because the shift by `(WORDBITS-1)` replicates the sign bit to create a mask -- be aware, however, that the C language does *not* require that shifts are signed even if their operands are signed, so there is a potential portability problem. Additionally, on`WORDBITS` would have the same effect, but many instruction sets have shifts that behave strangely when such shift distances are specified.

Of course, the opposite condition can be tested using:

(((int) ~(x | -x)) >> (WORDBITS - 1))

If you prefer the C-standard 0 or 1 comparison result, simply use the `unsigned` shift:

(((unsigned int) (x | -x)) >> (WORDBITS - 1))

The opposite condition can be obtained using:

(((unsigned int) ~(x | -x)) >> (WORDBITS - 1))

Normally, a dual-linked circular list would contain both previous and next pointer fields and the current position in the list would be identified by a single pointer. By using two current pointers, on

Unfortunately, using this trick in C is awkward because the XOR operation is not defined for pointers.

Joe Ibershoff, on`Ceiling(a/b)` is `(a+b-1)/b` and `RoundToNearest(a/b)` is `(a+(b/2))/b`. Of course, these tricks also work if divide is implemented in less obvious ways, such as shifts or shift-and-subtract sequences.

On`p_any(flag)` operation within a block can be accomplished by:

if (flag) sharedtemp = serial; /* maskable store */

__syncthreads(); /* optional with right block size */

p_any = (sharedtemp == (serial++));

We first publically announced this at SC08, and we're pretty sure we invented it. The trick is that NVIDIA's hardware seems to take constant time to resolve N threads storing into the same object, i.e., it picks a winner. This behaviour is not documented, but has been experimentally observed; this `p_any(flag)` will run on any of the CUDA hardware, and takes essentially the same time as the atomic any that was added to later CUDA hardware. There are actually quite a few useful operations that can be built using variations on this trick....

A Gray co`K`th Gray co`K^(K>>1)`.

The well-known algorithm for conversion from Gray to binary is a linear sequence of XORs that makes it seem each bit must be dealt with separately. Fortunately, that is equivalent to a parallel prefix XOR that can be computed using SWAR techniques in log time. For 32-bit Gray co

unsigned int

g2b(unsigned int gray)

{

gray ^= (gray >> 16);

gray ^= (gray >> 8);

gray ^= (gray >> 4);

gray ^= (gray >> 2);

gray ^= (gray >> 1);

return(gray);

}

Given an integer value `x` and an integer or floating point value `y`, the value of `x*y` can be computed efficiently using a sequence derived from the binary value of `x`. For example, if `x` is 5 (4 + 1):

y2 = y + y;

y4 = y2 + y2;

result = y + y4;

In the special case that `y` is an integer, this can be done with shifts:

y4 = (y << 2);

result = y + y4;

Given 2's complement integer values `x` and `y`, the minimum can be computed without any branches as `x+(((y-x)>>(WORDBITS-1))&(y-x))`. Logically, this works because the shift by `(WORDBITS-1)` replicates the sign bit to create a mask -- be aware, however, that the C language does *not* require that shifts are signed even if their operands are signed, so there is a potential portability problem. Additionally, on`WORDBITS` would have the same effect, but many instruction sets have shifts that behave strangely when such shift distances are specified.

Of course, maximum can be computed using the same trick: `x-(((x-y)>>(WORDBITS-1))&(x-y))`.

Actually, the Integer Selection coding trick is just as efficient in encoding minimum and maximum....

Given an integer value `x` and an integer or floating point value `y`, the value of `y` to the `x` power can be computed efficiently using a sequence derived from the binary value of `x`. For example, if `x` is 5 (4 + 1):

y2 = y * y;

y4 = y2 * y2;

result = y * y4;

A branchless, lookup-free, alternative to co`if (a<b) x=c; else x=d;` is `((((a-b) >> (WORDBITS-1)) & (c^d)) ^ d)`. This co

A non-negative binary integer value `x` is a power of 2 iff `(x&(x-1))` is 0 using 2's complement arithmetic.

Some machines have had single instructions that count the number of leading zero bits in an integer; such an operation can be an artifact of having floating point normalization hardware around. Clearly, floor of base 2 log of `x` is `(WORDBITS-lzc(x))`. In any case, this operation has found its way into quite a few algorithms, so it is useful to have an efficient implementation:

unsigned int

lzc(register unsigned int x)

{

x |= (x >> 1);

x |= (x >> 2);

x |= (x >> 4);

x |= (x >> 8);

x |= (x >> 16);

return(WORDBITS - ones(x));

}

This can be useful for extracting the lowest numbered element of a bit set. Given a 2's complement binary integer value `x`, `(x&-x)` is the least significant 1 bit. (This was pointed-out by Tom May.) The reason this works is that it is equivalent to `(x & ((~x) + 1))`; any trailing zero bits in `x` become on`~x`, adding 1 to that carries into the following bit, and AND with `x` yields on

Alternatively, since `(x&(x-1))` is actually `x` stripped of its least significant 1 bit, the least significant 1 bit is also `(x^(x&(x-1)))`.

Given a binary integer value `x`, the floor of the base 2 log of that number efficiently can be computed by the application of two variable-precision SWAR algorithms. The first "folds" the upper bits into the lower bits to construct a bit vector with the same most significant 1 as `x`, but all 1's below it. The second SWAR algorithm is population count, defined elsewhere in this document. However, we must consider the issue of what the log2(0) should be; the log of 0 is undefined, so how that value should be handled is unclear. The following co`LOG0UNDEFINED`, this co

unsigned int

floor_log2(register unsigned int x)

{

x |= (x >> 1);

x |= (x >> 2);

x |= (x >> 4);

x |= (x >> 8);

x |= (x >> 16);

#ifdef LOG0UNDEFINED

return(ones32(x) - 1);

#else

return(ones32(x >> 1));

#endif

}

Suppose instead that you want the ceiling of the base 2 log. The floor and ceiling are identical if `x` is a power of two; otherwise, the result is 1 too small. This can be corrected using the power of 2 test followed with the comparison-to-mask shift used in integer minimum/maximum. The result is:

unsigned int

log2(register unsigned int x)

{

register int y = (x & (x - 1));

y |= -y;

y >>= (WORDBITS - 1);

x |= (x >> 1);

x |= (x >> 2);

x |= (x >> 4);

x |= (x >> 8);

x |= (x >> 16);

#ifdef LOG0UNDEFINED

return(ones(x) - 1 - y);

#else

return(ones(x >> 1) - y);

#endif

}

Given a binary integer value `x`, the next largest power of 2 can be computed by a SWAR algorithm that recursively "folds" the upper bits into the lower bits. This process yields a bit vector with the same most significant 1 as `x`, but all 1's below it. Adding 1 to that value yields the next largest power of 2. For a 32-bit value:

unsigned int

nlpo2(register unsigned int x)

{

x |= (x >> 1);

x |= (x >> 2);

x |= (x >> 4);

x |= (x >> 8);

x |= (x >> 16);

return(x+1);

}

Given a binary integer value `x`, the most significant 1 bit (highest numbered element of a bit set) can be computed using a SWAR algorithm that recursively "folds" the upper bits into the lower bits. This process yields a bit vector with the same most significant 1 as `x`, but all 1's below it. Bitwise AND of the original value with the complement of the "folded" value shifted down by on

unsigned int

msb32(register unsigned int x)

{

x |= (x >> 1);

x |= (x >> 2);

x |= (x >> 4);

x |= (x >> 8);

x |= (x >> 16);

return(x & ~(x >> 1));

}

For integers used to represent natural da

For example, a 10-bit raw pixel value (e.g., from my Canon G1) called `x` can be extended to a 16-bit value by the expr`((x<<6)|(x>>4))`. This way, both the maximum and minimum values are reachable, with good linearity throughout the entire range.

It is fairly obvious, but `x0+x1*x+x2*x*x+x3*x*x*x+...` always can be rewritten as the usually faster equivalent `x0+x*(x1+x*(x2+x*(x3+x*(...))))`. There are various accuracy and other issues, but this sort of obvious transformation should not be overlooked.

The population count of a binary integer value `x` is the number of on

unsigned int

ones32(register unsigned int x)

{

/* 32-bit recursive reduction using SWAR...

but first step is mapping 2-bit values

into sum of 2 1-bit values in sneaky way

*/

x -= ((x >> 1) & 0x55555555);

x = (((x >> 2) & 0x33333333) + (x & 0x33333333));

x = (((x >> 4) + x) & 0x0f0f0f0f);

x += (x >> 8);

x += (x >> 16);

return(x & 0x0000003f);

}

It is worthwhile noting that the SWAR population count algorithm given above can be improved upon for the case of counting the population of multi-word bit sets. How? The last few steps in the reduction are using on

On`return((x * 0x01010101) >> 24);`. For the Athlon (which has a very fast integer multiply), I would have expected AMD's co

Rather obviously, if an integer multiply can be implemented by a shift-and-add sequence, then a shift-and-add sequence can be implemented by multiplying by the appropriate constant... with some speedup on processors like the AMD Athlon. Unfortunately, GCC seems to believe constant multiplies should *always* be converted into shift-and-add sequences, so there is a problem in using this optimization in C source co

Although many instruction sets provide single machine instructions that implement sign extension of 2's-complement integers, Jean-Charles Meyrignac suggested including a couple of tricks for sign extension. I've included them here because sign extension instructions generally work on

The most obvious method assumes that you have a signed right shift: to extend an `a`-bit number `x` to `b` bits, shift left by `b-a`, then signed shift that value right by `b-a` bits. A more interesting variation that doesn't use shifts (shifters are often a limited resource) basically does a 1-bit add to flip high bits: if `n` is 2 to the `a`, simply compute `(((x | -n) + (n/2)) ^ (n/2))`.

Given two binary integer values, `x` and `y`, the values can be exchanged without use of a temporary by:

x ^= y; /* x' = (x^y) */

y ^= x; /* y' = (y^(x^y)) = x */

x ^= y; /* x' = (x^y)^x = y */

Before we coined the name SWAR in Fall 1996, we already had defined a complete set of basic operations and described how they could be implemented with good efficiency. On February 4, 1997, we posted this fairly complete overview document and there also are slides from a seminar presentation I gave at Purdue. These methods were used in our SWARC compiler and were detailed in a number of our publications throughout the 1990s. We hadn't posted them on this page because they were so prominently published elsewhere.

However, much to our surprize, United States Patent 7039906, "Compiler for enabling multiple signed independent da

Given the Least Significant 1 Bit and Population Count (On

unsigned int

tzc(register int x)

{

return(ones((x & -x) - 1));

}

- HAKMEM

*a collection of mostly low-level tricks and otherwise interesting ideas, some quite old* - Bit Twiddling Hacks

*a page full of tricks very similar in spirit to those here* - AMD Athlon Processor x86 Co
de Optimization Guide

*an exceptionally complete coverage of Athlon-oriented magic at both the C and x86 assembly levels* - Intel Architecture Optimization Reference Manual

Intel Pentium 4 Processor Optimization Reference Manual

*Intel's equivalents to the AMD document above. very x86-oriented (including MMX, SSE, and SSE2)* - The PowerPC Compiler Writer's Guide
- Technical Site Index for azillionmonkeys

*lots of stuff, including hacks in x86 and C* - Reciprocal Multiplication, a tutorial

*a really nice piece of work by Doug Jones...* - Digital Signal Processing Tricks (from comp.dsp)
- various good algorithms
- Programming Pearls book on
line - Hacker's Delight is a book that seems to talk about many of the same issues discussed on this page

评论这张

<#--最新日志，群博日志-->
<#--推荐日志-->
<#--引用记录-->
<#--博主推荐-->
<#--随机阅读-->
<#--首页推荐-->
<#--历史上的今天-->
<#--被推荐日志-->
<#--上一篇，下一篇-->
<#-- 热度 -->
<#-- 网易新闻广告 -->
<#--右边模块结构-->
<#--评论模块结构-->
<#--引用模块结构-->
<#--博主发起的投票-->

## 评论