## QuickSpec and the quest for good lemmas

Given a fixed set of proofs, it is natural to consider a lemma to be good to the extent it enables compression of the proofs. But how might we know if a lemm...

The AlphaZero algorithm developed by David Silver, Julian Schrittwieser, Karen Simonyan and others at DeepMind has proved very successful in the board games Go, Chess, and Shogi. I personally found the algorithm very elegant upon first reading. However, when I implemented it on a different problem, it failed catastrophically until I implemented a few tricks and tuned the hyperparameters carefully. I still use the algorithm, but I think the failure modes I stumbled on reflect a (perhaps obvious and practically irrelevant) conceptual flaw in the algorithm.

Suppose we have a deterministic MDP where a state is a bit string and the two actions are to append either 0 or 1 to the end of the state. Suppose the initial state is the empty string \( \epsilon \), and that for some \(n\), there is one unknown sequence of bits (e.g. \(1^n\)) that terminates with a reward of 1, and all other n-bit sequences terminate with no reward. Suppose \(n = 2\), so the agent need only append 1 twice to maximize the reward.

Now let’s consider what AlphaZero might do in this environment. First, it queries the neural network \( (P(\epsilon, 0), P(\epsilon, 1), V(\epsilon)) = f(\epsilon) \) on the root node \( \epsilon \) of the search tree. Suppose by bad luck \( P(\epsilon, 0) \) is greater than \( P(\epsilon, 1) \).

Now, for \(K\) iterations, the algorithm chooses a new leaf node of the search tree to expand (i.e. to query the network on) by traversing the existing search tree starting from the root until reaching a leaf node. On each traversal, it selects actions according to the following formula:

\[ \arg\max_a Q(s, a) + c_{puct} P(s, a) \frac{\sqrt{\sum_b n_{v}(s, b)}}{1 + n_{v}(s, a)} \]

where \( Q(s, a) = 0 \) when \( n_v(s, a) = 0 \). Since \( P(\epsilon, 0) > P(\epsilon, 1) \), it expands 0 first. Since the rewards are all between 0 and 1, the network will return a value estimate \( V(0) > 0 \), which will become \( Q(\epsilon, 0) \). This gives action 0 an even stronger grip on the root node, and depending on \( c_{puct} \) and the ratio \( P(\epsilon, 0) / P(\epsilon, 1) \), it may require many rollouts before it even tries taking action 1 at the root node.

Suppose that the number of expansions \( K \) is too small relative to \( c_{puct} \) and the ratio \( P(\epsilon, 0) / P(\epsilon, 1) \), so that it never visits action 1 at the root node before commiting to action 0. This is still reasonable behavior so far, since it has simply gotten unlucky.

However, what the algorithm does next is *not* reasonable. It
uses the visit counts \( n_v(\epsilon, 0) = K \) and \( n_v(\epsilon,
1) = 0 \) to determine the distribution \( \pi(\epsilon) =
(\pi(\epsilon, 0) = 1.0, \pi(\epsilon, 1) = 0.0) \) the neural network
policy ought to have omitted, and trains the neural network to emit \(
\pi \) instead of \( P \). The algorithm reinforces the arbitrary
decision it made in the first episode, and *never* stumbles on
the winning state 11.

There are many ways to modify the algorithm to avoid the issue in this example (e.g. by boosting the scores of all untried actions, by adding a large amount of Dirichlet noise to \( P(s, \cdot) \), by making \( K \) and \( c_{puct} \) larger, etc.), but I think this example illustrates a fundamental problem with the algorithm: reinforcing the visit counts only makes sense as \( K \) goes to infinity. It is only justified by asymptotic arguments, and can result in pathological behavior in the case of finite \( K \).

I think the main conceptual flaw of the algorithm is that
it conflates the behavior of the exploration strategy with the
information gathered by the exploration. One can imagine different
exploration strategies with different tradeoffs, and one doesn’t want
to blindly train the network to reinforce what the exploration
strategy *did*; one wants to update it based on what the
exploration strategy *learned*.

Given a fixed set of proofs, it is natural to consider a lemma to be good to the extent it enables compression of the proofs. But how might we know if a lemm...

Many AI researchers (myself included) are optimistic that stochastic gradient descent (SGD) can somehow be leveraged to help learn sophisticated computer pro...

A friend of mine recently proposed the General Intelligence Hypothesis (GIH), which is a restatement of an old idea, and which has both a weak and a strong v...

Consider the following puzzle, from Martin Gardner’s My Best Mathematical and Logic Puzzles (1994):

Suppose you have a list of 25 characters, 13 of which are ‘a’, and no character ever appears twice in a row. What can you tell me about the list?

Humans can learn to become competent at playing board games such as Chess and Go, but humans are phenomenally excellent at learning how to play such games qu...

The word “abstraction” is vague in most contexts within the AI community. Has AlphaGo learned abstractions? I don’t think the question is relevant. AlphaGo ...

The AlphaZero algorithm developed by David Silver, Julian Schrittwieser, Karen Simonyan and others at DeepMind has proved very successful in the board games ...