Monday Mar 10, 2014

PRNG optimizations

Say you wanted to generate a stream of uniformly distributed random integers in the range [0,N) where N > 0. Java's Random.nextInt(N) is an example of a convenient API that does just that. Furthermore lets say you willing to trade off the quality of the distribution in order to gain better performance -- lower latency for each call to nextInt(N). Since we care about speed, we'll assume the implementation can't use floating point.

A good starting point might be Marsaglia's xor-shift pseudo-random number generator (PRNG), which can generate a new 32-bit value in [0,4G) in just a few cycles. Once you have that 32-bit value in hand you next need to reduce it to the [0,N) range. The most obvious approach is to use the "MOD" operator. The MOD (and DIV) instructions can be relatively expensive, however. Somewhat perversely, it's not uncommon to find that the MOD instruction is more expensive than generating the value via the Marsaglia xor-shift PRNG, and thus dominates performance of your nextInt(N) implementation. Furthermore, some processors have only one processing unit per core that can handle MOD instructions, so you might encounter both scaling and latency problems. If N happens to be a constant known at compile-time then your compiler might be sufficiently clever to apply strength reduction techniques, such as those described by Granlund and Montgomery in Division by Invariant Integers using Multiplication (PLDI 1994). Hacker's Delight (2E) is another good resource. Alternatively, you could explicitly apply the optimization yourself. But lets assume that N is a variable.

One approach to avoid the MOD instruction is to use a degenerate scaled or fixed-point representation. Say "v" is the value returned by the Marsgalia xor-shift PRNG. We can calculate and return (((v & 0x3FF) * N) >> 10). 0x3FF and 10 are related as 0x3FF is equal to ((2^10)-1), of course. Note that we traded MOD for multiplication, which is usually profitable. While this approach is tempting -- and might be viable in some circumstances -- it suffers from range overflow concerns and quantization (really additional discretization). Note that only 1024 possible values can be returned. We might change 0x3FF and 10 to 0x7FF and 11, respectively, allowing 2048 distinct values, but if we keep increasing those values the "interesting" bits will be truncated by overflow in the 32-bit multiplication and the distribution will suffer.

Another approach is to borrow an idea from rejection sampling. For discussion lets assume N is 10. Ignoring overflow, we can easily compute M where M is the least integer power of 2 greater than or equal to N. Since N is 10, M will be 16. See "Hacker's Delight" 2E, Figure 3-3, or the following article. Critically, given N, we can quickly compute M in constant time with no memory references and no conditional branches. Our nextInt(N) implementation can first compute M from N and then invoke the Marsaglia xor-shift PRNG (or PRNG of your choice) to obtain a 32-bit value "v". Next, we mask "v" with (M-1) yielding a value in [0,M). If that value is less than N then we're done and can return -- we accept the value. But if the value is greater than or equal to N then we reject the value and loop, calling the PRNG again and retrying until we finally find a value less than N. We're effectively running Bernoulli trails until the 1st success, and the probability P of success (N/M) will always be greater than 0.5. The number of trials (iterations) until the 1st success has a geometric distribution so the average running time -- iterations required -- is reasonable if you're willing to tolerate the non-deterministic nature of rejection sampling. If you're concerned about the worst-case you could fall back to the MOD instruction after some number of rejections within a nextInt() episode. One downside to this approach is that the branch to exit the loop might not be very well predicted.

Whether or not this approach is profitable depends on the cost of MOD relative to the additional work to compute M and the cost of the retries. Somewhat sadly, it appears useful on a range of systems.

(Beware that the low-order bits from a Marsaglia xor-shift PRNG may have a skewed distribution over short intervals, but you can remedy this with a shift or by using a slightly better but still cheap PRNG).


Dave is a senior research scientist in the Scalable Synchronization Research Group within Oracle Labs : Google Scholar.


« March 2014 »