Issues with AlphaZero

The AlphaZero algorithm developed by David Silver, Julian Schrittwieser, Karen Simonyan and others at DeepMind has proved very successful in the board games Go, Chess, and Shogi. I personally found the algorithm very elegant upon first reading. However, when I implemented it on a different problem, it failed catastrophically until I implemented a few tricks and tuned the hyperparameters carefully. I still use the algorithm, but I think the failure modes I stumbled on reflect a (perhaps obvious and practically irrelevant) conceptual flaw in the algorithm.

Suppose we have a deterministic MDP where a state is a bit string and the two actions are to append either 0 or 1 to the end of the state. Suppose the initial state is the empty string $$\epsilon$$, and that for some $$n$$, there is one unknown sequence of bits (e.g. $$1^n$$) that terminates with a reward of 1, and all other n-bit sequences terminate with no reward. Suppose $$n = 2$$, so the agent need only append 1 twice to maximize the reward.

Now let’s consider what AlphaZero might do in this environment. First, it queries the neural network $$(P(\epsilon, 0), P(\epsilon, 1), V(\epsilon)) = f(\epsilon)$$ on the root node $$\epsilon$$ of the search tree. Suppose by bad luck $$P(\epsilon, 0)$$ is greater than $$P(\epsilon, 1)$$.

Now, for $$K$$ iterations, the algorithm chooses a new leaf node of the search tree to expand (i.e. to query the network on) by traversing the existing search tree starting from the root until reaching a leaf node. On each traversal, it selects actions according to the following formula:

$\arg\max_a Q(s, a) + c_{puct} P(s, a) \frac{\sqrt{\sum_b n_{v}(s, b)}}{1 + n_{v}(s, a)}$

where $$Q(s, a) = 0$$ when $$n_v(s, a) = 0$$. Since $$P(\epsilon, 0) > P(\epsilon, 1)$$, it expands 0 first. Since the rewards are all between 0 and 1, the network will return a value estimate $$V(0) > 0$$, which will become $$Q(\epsilon, 0)$$. This gives action 0 an even stronger grip on the root node, and depending on $$c_{puct}$$ and the ratio $$P(\epsilon, 0) / P(\epsilon, 1)$$, it may require many rollouts before it even tries taking action 1 at the root node.

Suppose that the number of expansions $$K$$ is too small relative to $$c_{puct}$$ and the ratio $$P(\epsilon, 0) / P(\epsilon, 1)$$, so that it never visits action 1 at the root node before commiting to action 0. This is still reasonable behavior so far, since it has simply gotten unlucky.

However, what the algorithm does next is not reasonable. It uses the visit counts $$n_v(\epsilon, 0) = K$$ and $$n_v(\epsilon, 1) = 0$$ to determine the distribution $$\pi(\epsilon) = (\pi(\epsilon, 0) = 1.0, \pi(\epsilon, 1) = 0.0)$$ the neural network policy ought to have omitted, and trains the neural network to emit $$\pi$$ instead of $$P$$. The algorithm reinforces the arbitrary decision it made in the first episode, and never stumbles on the winning state 11.

There are many ways to modify the algorithm to avoid the issue in this example (e.g. by boosting the scores of all untried actions, by adding a large amount of Dirichlet noise to $$P(s, \cdot)$$, by making $$K$$ and $$c_{puct}$$ larger, etc.), but I think this example illustrates a fundamental problem with the algorithm: reinforcing the visit counts only makes sense as $$K$$ goes to infinity. It is only justified by asymptotic arguments, and can result in pathological behavior in the case of finite $$K$$.

I think the main conceptual flaw of the algorithm is that it conflates the behavior of the exploration strategy with the information gathered by the exploration. One can imagine different exploration strategies with different tradeoffs, and one doesn’t want to blindly train the network to reinforce what the exploration strategy did; one wants to update it based on what the exploration strategy learned.

Neural networks, Occam’s razor, and heuristic search

Neural networks sacrifice Occam’s razor in exchange for the ability to efficiently find parameters that fit the training data. This may be a fantastic trade...

QuickSpec and the quest for good lemmas

Given a fixed set of proofs, it is natural to consider a lemma to be good to the extent it enables compression of the proofs. But how might we know if a lemm...

The “TerpreT problem” and the limits of SGD

Many AI researchers (myself included) are optimistic that stochastic gradient descent (SGD) can somehow be leveraged to help learn sophisticated computer pro...

The general intelligence hypothesis

A friend of mine recently proposed the General Intelligence Hypothesis (GIH), which is a restatement of an old idea, and which has both a weak and a strong v...

Search vs meta-search

Consider the following puzzle, from Martin Gardner’s My Best Mathematical and Logic Puzzles (1994):

Mathematics: our overlooked ability

Suppose you have a list of 25 characters, 13 of which are ‘a’, and no character ever appears twice in a row. What can you tell me about the list?

Intelligence as software engineering

Humans can learn to become competent at playing board games such as Chess and Go, but humans are phenomenally excellent at learning how to play such games qu...