Tuesday Mar 11, 2014

A simple PRNG construction idiom

The academic literature is rife with algorithms for pseudo-random number generators (PRNGs). Typically, there's a trade-off between performance and the quality of the distribution. In most cases I need PRNGs to implement very lightweight Bernoulli trials for randomized stress tests, benchmarks, or scalable probabilistic counters. My budget is usually less that 100 cycles to generate a uniformly distributed value. Marsaglia's xor-shift PRNG is one of my personal favorites. If I need better quality I'll step up to Ziff's four tap or Mersenne twister.

One variation of Marsaglia has only one word of state, a 4G-1 period, and requires just 3 shifts and 3 XOR operations to generate a new value. 0 is an absorbing state that we avoid. See MarsagliaNext(), below. Ignoring 0, the trajectory or stream of values forms a cycle -- conceptually a ring. The initialization and seeding operation should act to place different threads at different positions on that ring. In a sense the ring is shared by all threads but we start the threads at different points. Unfortunately, we can sometimes find that 2 different threads are at about the same position at about the same time through simple bad luck, and thus generate similar streams of values. (Longer PRNG cycles reduce the odds of such scenarios, of course). Ideally, to avoid such inopportune and undesirable behavior, each thread would have its own private ring of values.

A simple approach that is tantamount to giving each thread its own ring -- trajectory of values -- is shown below in NextRandom(). Hash32() is a hash function, which we'll describe later. Note that we explicitly "color" the value passed into Hash32() with the address of the thread-local PRNG state. Recall that at any one time, such an address will be associated with at most one thread, and is thus temporally unique. This approach gives us additional independence over concurrently executing threads. It also makes NextRandom() self-seeding -- explicit initialization is not required.

The Hash32() hash function is the key to this particular PRNG, and its implementation directly embodies the trade-off between performance and the quality of the resulting distribution. Conceptually, you could think of Hash32() as representing a randomized cycle with a period of 4G. We might implement Hash32() via a strong cryptographic hash, such as SHA-3 or MD5. Those hash functions would work perfectly well, but tend to be high quality but somewhat expensive, so we'd prefer a cheaper hash. Critically, we want the hash function to exhibit a high degree of avalanche effect. (A good encryption function could also be used as a hash operator). Some cheaper candidates for the hash function include: MurmurHash ;CityHash ; FNV hash family; siphash; and Jenkins hash.

Doug Lea and Guy Steele invented some of the best candidates for our hash function: see Mix32() and Mix64() below. These are relatively cheap but do well on avalanche tests and strike a reasonable balance between quality and performance. Beware that they're obviously not cryptographically strong. Mix32() and Mix64() are related to mix functions found in java.util.SplittableRandom.

Update 2014-03-13 : Thanks to Paul Sandoz for pointing out Stafford's mix function.

static int32_t MarsagliaNext () {
static __thread int32_t rv = 0 ;
if (rv == 0) rv = GenerateSeed() ;
int32_t v = rv ;
v ^= v << 6 ;
v ^= uint32_t(v) >> 21 ;
v ^= v << 7 ;
rv = v ;
return v ;

static int NextRandom() {
// assumes a 32-bit environment : ILP32
// Instead of incrementing by 1 we could also
// increment by a large prime or by the MIX constant
// if we're using Mix32() as our hash function.
static __thread int rv = 0 ;
return Hash32 (int(&rv) + (++rv)) ;

// Mix32 and Mix64 are provided courtesy of Doug Lea and Guy Steele

static int32_t Mix32 (int32_t z) {
static const int32_t MIX = 0x9abe94e3;
// Another workable constant is 0xa0ca9b6b
z = (z ^ (uint32_t(z) >> 16)) * MIX;
z = (z ^ (uint32_t(z) >> 16)) * MIX;
return z ^ (uint32_t(z) >> 16);

static int64_t Mix64 (int64_t z) {
static const int64_t MIX = 0xdaba0b6eb09322e3LL;
z = (z ^ (uint64_t(z) >> 32)) * MIX;
z = (z ^ (uint64_t(z) >> 32)) * MIX;
return z ^ (uint64_t(z) >> 32);

MSPC 2014 submission deadline has been extended until April 4th

See the workshop web site for details.

Monday Mar 10, 2014

PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore lets say you willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point.

A good starting point might be Marsaglia's xor-shift pseudo-random number generator (PRNG), which can generate a new 32-bit value in [0,4G) in just a few cycles. Once you have that 32-bit value in hand you next need to reduce it to the [0,N) range. The most obvious approach is to use the "MOD" operator. The MOD (and DIV) instructions can be relatively expensive, however. Somewhat perversely, it's not uncommon to find that the MOD instruction is more expensive than generating the value via the Marsaglia xor-shift PRNG, and thus dominates performance of your nextInt(N) implementation. Furthermore, some processors have only one processing unit per core that can handle MOD instructions, so you might encounter both scaling and latency problems. If N happens to be a constant known at compile-time then your compiler might be sufficiently clever to apply strength reduction techniques, such as those described by Granlund and Montgomery in Division by Invariant Integers using Multiplication (PLDI 1994). Hacker's Delight (2E) is another good resource. Alternatively, you could explicitly apply the optimization yourself. But lets assume that N is a variable.

One approach to avoid the MOD instruction is to use a degenerate scaled or fixed-point representation. Say "v" is the value returned by the Marsgalia xor-shift PRNG. We can calculate and return (((v & 0x3FF) * N) >> 10). 0x3FF and 10 are related as 0x3FF is equal to ((2^10)-1), of course. Note that we traded MOD for multiplication, which is usually profitable. While this approach is tempting -- and might be viable in some circumstances -- it suffers from range overflow concerns and quantization (really additional discretization). Note that only 1024 possible values can be returned. We might change 0x3FF and 10 to 0x7FF and 11, respectively, allowing 2048 distinct values, but if we keep increasing those values the "interesting" bits will be truncated by overflow in the 32-bit multiplication and the distribution will suffer.

Another approach is to borrow an idea from rejection sampling. For discussion lets assume N is 10. Ignoring overflow, we can easily compute M where M is the least integer power of 2 greater than or equal to N. Since N is 10, M will be 16. See "Hacker's Delight" 2E, Figure 3-3, or the following article. Critically, given N, we can quickly compute M in constant time with no memory references and no conditional branches. Our nextInt(N) implementation can first compute M from N and then invoke the Marsaglia xor-shift PRNG (or PRNG of your choice) to obtain a 32-bit value "v". Next, we mask "v" with (M-1) yielding a value in [0,M). If that value is less than N then we're done and can return -- we accept the value. But if the value is greater than or equal to N then we reject the value and loop, calling the PRNG again and retrying until we finally find a value less than N. We're effectively running Bernoulli trails until the 1st success, and the probability P of success (N/M) will always be greater than 0.5. The number of trials (iterations) until the 1st success has a geometric distribution so the average running time -- iterations required -- is reasonable if you're willing to tolerate the non-deterministic nature of rejection sampling. If you're concerned about the worst-case you could fall back to the MOD instruction after some number of rejections within a nextInt() episode. One downside to this approach is that the branch to exit the loop might not be very well predicted.

Whether or not this approach is profitable depends on the cost of MOD relative to the additional work to compute M and the cost of the retries. Somewhat sadly, it appears useful on a range of systems.

(Beware that the low-order bits from a Marsaglia xor-shift PRNG may have a skewed distribution over short intervals, but you can remedy this with a shift or by using a slightly better but still cheap PRNG).


Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.


« March 2014 »