(Originally published on Medium)
In this post, we review aspects of our AlphaZero implementation that allowed us to dramatically improve the speed of game generation and training.
The task of implementing AlphaZero is daunting, not just because the algorithm itself is intricate, but also due to the massive resources the authors employed to do their research: 5000 TPUs were used over the course of many hours to train their algorithm, and that is presumably after a tremendous amount of time was spent determining the best parameters to allow it to train that quickly.
By choosing Connect Four as our first game, we hoped to make a solid implementation of AlphaZero while utilizing more modest resources. But soon after starting, we realized that even a simple game like Connect Four could require significant resources to train: in our initial implementation, training would have taken weeks on a single gpu-enabled computer.
Fortunately, we were able to make a number of improvements that made our training cycle time shrink from weeks to about a day. In this post I’ll go over some of our most impactful changes.
Before diving into some of the tweaks we made to reduce AZ training time, let’s describe our training cycle. Although the authors of AlphaZero used a continuous and asynchronous process to perform model training and updates, for our experiments we used the following three stage synchronous process, which we chose for its simplicity and debugability:
While (my model is not good enough):
Far and away, the biggest bottleneck of this process is game generation, which was taking more than an hour per cycle when we first got started. Because of this, minimizing game generation time became the focus of our attention.
Alpha Zero is very inference heavy during self-play. In fact, during one of our typcal game generation cycles, MCTS requires over 120 Million position evaluations. Depending on the size of your model, this can translate to siginificant GPU time.
In the original implementation of AlphaZero, the authors used an architecture where the bulk of computation was performed in 20 residual layers each with 256 filters. This amounts to a model in excess of 90 megabytes, which seemed overkill for Connect Four. Also, using a model of that size was impractical given our initially limited GPU resources.
Instead, we started with a very small model, using just 5 layers and 64 filters, just to see if we could make our implementation learn anything at all. As we continued to optimize our pipeline and improve our results, we were able to bump our model size to 20X128 while still maintaining a reasonable game generation speed on our hardware.
From the get-go, we knew that we would need more than one GPU in order to achieve the training cycle time that we were seeking, so we created software that allowed our Connect 4 game agent to perform remote inference to evaluate positions. This allowed us to scale GPU-heavy inference resources separately from game play resources, which need only CPU.
GPU resources are expensive, so we wanted to make sure that we were saturating them as much as possible during playouts. This turned out to be trickier than we imagined.
One of the first optimizations we put in place was to run many games on parallel threads from the same process. Perhaps the largest direct benefit of this, is that it allowed us to cache position evaluations, which could be shared amongst different threads. This cut the number of requests getting sent to our remote inference server by more than a factor of 2:
Caching was a huge win, but we still wanted to deal with the remaining uncached requests in an efficient manner. To minimize network latency and best leverage GPU parallelization, we combined inference requests from different worker threads into a bucket before sending them to our inference service. The downside to this is that if a bucket was not promptly filled, any calling thread would be stuck waiting until the bucket’s timeout expired. Under this scheme, choosing an appropriate inference bucket size and timeout value was very important.
We found that bucket fill rate varied throughout the course of a game generation batch, mostly because some games would finish sooner than others, leaving behind fewer and fewer threads to fill the bucket. This caused the final games of a batch to take a long time to complete, all while GPU utilization dwindled to zero. We needed a better way to keep our buckets filled.
To help with our unfilled bucket problem, we implemented Parallel MCTS, which was discussed in the AZ paper. Initially we had punted on this detail, as it seemed mostly important for competitive one-on-one game play, where parallel game play is not applicable. After running into the issues mentioned previously, we decided to give it a try.
The idea behind Parallel MCTS is to allow multiple threads to take on the work of accumulting tree statistics. While this sounds simple, the naiive approach suffers from a basic problem: if N threads all start at the same time and choose a path based on the current tree statistics, they will all choose exactly the same path, thus crippling MCTS’ exploration component.
To counteract this, AlphaZero uses the concept of Virtual Loss, an algorithm that temporarily adds a game loss to any node that is traversed during a simulation. A lock is used to prevent multiple threads from simultaneously modifying a node’s simulation and virtual loss statistics. After a node is visited and a virtual loss is applied, when the next thread visits the same node, it will be discouraged from following the same path. Once a thread reaches a terminal point and backs up its result, this virtual loss is removed, restoring the true statistics from the simulation.
With virtual loss in place, we were finally able to achieve >95% GPU utilization during most of our game generation cycle, which was a sign that we were approaching the real limits of our hardware setup.
Technically, virtual loss adds some degree of exploration to game playouts, as it forces move selection down paths that MCTS may not naturally be inclined to visit, but we never measured any detrimental (or beneficial) effect due to its use.
Though it was not necessary to use a model quite as large as that described in the AlphaZero paper, we saw better learning from larger models, and so wanted to use the biggest one possible. To help with this, we tried TensorRT, which is a technology created by Nvidia to optimize the performance of model inference.
It is easy to convert an existing Tensorflow/Keras model to TensorRT using just a few scripts. Unfortunately, at the time we were working on this, there was no released TensorRT remote serving component, so we wrote our own.
With TensorRT’s default configuration, we noticed a small increase in inference throughput (~11%). We were pleased by this modest improvement, but were hopeful to see an even larger performance increase by using TensorRT’s INT8 mode. INT8 mode required a bit more effort to get going, since when using INT8 you must first generate a calibration file to tell the inference engine what scale factors to apply to your layer activations when using 8-bit approximated math. This calibration is done by feeding a sample of your data into Nvidia’s calibration library.
Because we observed some variation in the quality of calibration runs, we would attempt calibration against 3 different sets of sample data, and then validate the resulting configuraton against hold-out data. Of the three calibration attempts, we chose the one with the lowest validation error.
Once our INT8 implementation was in place, we saw an almost 4X increase in inference throughput vs. stock libtensorflow, which allowed us to use larger models than would have otherwise been feasible.
One downside of using INT8 is that it can be lossy and imprecise in certain situations. While we didn’t observe serious precision issues during the early parts of training, as learning progressed we would observe the quality of inference start to degrade, particularly on our value output. This initially led us to use INT8 only during the very early stages of training.
Serendipitously, we were able to virtually eliminate our INT8 precision problem when we began experimenting with increasing the number of convolutional filters in our head networks, an idea we got from Leela Chess. Below is a chart of our value output’s mean average error with 32 filters in the value head, vs. the AZ default of 1:
We theorize that adding additional cardinality to these layers reduces the variance in the activations, which makes the model easier to accurately quantize. These days, we always perfom our game generation with INT8 enabled and see no ill effects even towards the end of AZ training.
By using all of these approaches, we were finally able to train a decent-sized model with high GPU utilization and good cycle time. It was initially looking like it would take weeks to perform a full train, but now we could train a decent model in less than a day. This was great, but it turned out we were just getting started — in the next article we’ll talk about how we tuned AlphaZero itself to get even better learning speed.
Part 6 is now out.