## Neural networks, Occam’s razor, and heuristic search

Neural networks sacrifice Occam’s razor in exchange for the ability to efficiently find parameters that fit the training data. This may be a fantastic trade...

Many AI researchers (myself included) are optimistic that stochastic gradient descent (SGD) can somehow be leveraged to help learn sophisticated computer programs. My colleagues and I recently showed that SGD over a graph neural network can learn a local search procedure for the boolean satisfiability problem, even though the function it is trying to learn is extremely sensitive to individual bits in the input.

However, there are many synthesis problems for which discrete search works well but for which SGD fails. Gaunt et al. have done us the great service of constructing a minimal example of a synthesis problem that exhibits an exponential number of inadequate local optima, for which SGD consistently gets stuck, even when applying every trick in the book. This minimal example is relatively unknown to the ML community, since it is buried in a dense paper with a forbidding amount of PL jargon. In this post I will explain the example for an ML audience. In honor of the probabilistic programming language, TerpreT, introduced in the paper, I refer to this example as “the TerpreT problem”. The code for this post can be found here.

Suppose we have the following program sketch:

Constants | `k : int` |

Inputs | `x_0 : bool` |

Parameters | `xs : array bool (k-1)` |

Outputs | `ys : array bool k` |

```
def foo(x_0):
xs = [x_0] ++ xs
ys = [xor(xs[i], xs[i+1 % k]) for i in range(k)]
return ys
```

In words, the program `foo`

is parameterized by \( k-1 \) booleans, and it connects an input boolean `x_0`

to the \(k-1\) parameters to form a circular chain of length \(k\). Then, it returns the \(k\) values resulting from `xor`

-ing adjacent elements of the chain.

If we observe the single input-output example where 0 maps to the array with every element 0, we can conclude that all the parameters of the program must be 0. Indeed, constraint-propagation can easily deduce this in linear time. However, we will see shortly that SGD cannot discover this fact.

First, we need to explain how to apply SGD to this problem at all. The first step is to *relax* the program, so that every boolean variable and parameter is represented by its *probability distribution*, and the program itself propagates probability distributions instead of specific boolean values. Then instead of observing that every element of `ys`

is 0, we want to minimize the cross-entropy loss between the computed *distribution* of each element of `ys`

with the (deterministic) distribution that puts all its mass on 0.

Next, we will use SGD to minimize this loss as a function of the *parameters* of the distribution for each element of `xs`

. We follow the convention in the paper of representing the parameters for the Bernoulli distribution of `x_i`

by two scalars \(s_i(0)\) and \(s_i(1)\), such that

\[ P(x_i = z) = \frac{e^{s_i(z)}}{e^{s_i(0)} + e^{s_i(1)}} \]

Clearly the loss function can be made arbitrarily small by making \(s_i(0)\) big relative to \(s_i(1)\) for all \(i\), and clearly this is the only way to make it arbitrarily small. However, when we run SGD on this problem, the loss function almost never goes to 0.

Here is example Tensorflow code (complete code is here):

```
# mu0 : T [1, 2]
# This is the marginal for the variable x0 which is observed to be 0.
mu0 = tf.constant([[1.0, 0.0]], dtype=tf.float32)
# ss : T [k-1, 2]
# These are the parameters we are trying to optimize for, that control the marginal distribution
# of the xs in the chain.
# Note: `SharedLogDirichletInitializer` is the initialization method used in the paper, and its details
# are not important here.
ss = tf.get_variable("ss", shape=[opts.k - 1, 2],
initializer=SharedLogDirichletInitializer(opts.alpha, opts.k - 1, 2))
# mus : T [k, 2]
# These are the parameters for the distributions of the full xs chain.
mus = tf.concat([mu0, tf.nn.softmax(ss)], axis=0)
# Returns the probability that `xor(xs[i], xs[i+1%k])` equals `0`
def soft_xor_p0(i):
j = tf.mod(i+1, opts.k)
return (mus[i, 0] * mus[j, 0]) + (mus[i, 1] * mus[j, 1])
# ys_eq_0 : T [k]
ys_eq_0 = tf.map_fn(soft_xor_p0, tf.range(opts.k), dtype=tf.float32)
# cross-entropy loss
loss = - tf.reduce_sum(tf.log(ys_eq_0))
update = tf.train.AdamOptimizer(learning_rate=opts.learning_rate).minimize(loss)
for epoch in range(opts.n_epochs):
_, loss = sess.run([tp.update, tp.loss])
sys.stdout.write("[%d] %.8f\n" % (epoch, loss))
```

When we run this code, we see that (depending on the size of \(k\)) it almost always converges to a suboptimal solution with non-zero loss.
To see what is happening, we can print out \(P(x_i = 0)\) for every `x_i`

in `xs`

at every step in the loop:

```
for epoch in range(opts.n_epochs):
_, mus = sess.run([tp.update, tp.mus])
for i in range(opts.k):
sys.stdout.write("%d " % (round(100 * mus[i, 0])))
sys.stdout.write("\n")
```

Each row of the following table shows the local optima found for various random seeds, with \(k\) = 16. A 100 indicates 100% probability that the corresponding parameter is set to 0 (see code snippet above for details), so for each row we want every element to be 100.

x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

100 | 100 | 100 | 100 | 100 | 100 | 100 | 50 | 0 | 0 | 50 | 100 | 100 | 100 | 100 |

50 | 0 | 0 | 0 | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

100 | 100 | 50 | 0 | 0 | 0 | 0 | 50 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |

100 | 100 | 50 | 0 | 0 | 0 | 0 | 50 | 100 | 100 | 50 | 0 | 50 | 100 | 100 |

The pattern is clear: any two `x_i`

at 50% create local optima with all elements between them set to 0 instead of 100.
The intuition for this phenemenon is simple: if you only look at your neighbors and they agree, it is best for you to agree with them; if they disagree, it is best for you to hedge your bets between them.

It may be clarifying to see how this emerges from the math. Let \(p_i \) stand for \( P(x_i = 0) \). For a given \(i\), the parameters \(s_i(0)\) and \(s_i(1)\) only contribute to two terms of the loss function:

\[ \log \left( p_{i-1} p_i + (1 - p_{i-1})(1 - p_i) \right) + \log \left( p_{i+1} p_i + (1 - p_{i+1})(1 - p_i) \right) \]

After a few steps of simplification, we can express the derivative of this quantity with respect to \( s_i(0) \) as follows:

\[ \frac{\partial \mathcal{L}}{\partial s_i(0)} = p_i (1 - p_i) \left( \frac{2 p_{i-1} - 1}{p_{i-1} p_i + (1 - p_{i-1})(1 - p_i)} + \frac{2 p_{i+1} - 1}{p_{i+1} p_i + (1 - p_{i+1})(1 - p_i)} \right) \]

When \( p_i = 0 \) and \( p_{i-1}, p_{i+1} < 1 \), then the partial derivative with respect to \( s_i(0) \) is 0. Similarly, when \( p_i = 0.5 \) and \( (p_{i-1}, p_{i+1}) \) equal (0.0, 1.0) or (1.0, 0.0), the partial derivative with respect to \( s_i(0) \) is also 0. To see that there are an exponential number of such local optima, notice that if we fix every other \( p_i \) to either 0 or 1 in any of the \( O(2^{k/2}) \) ways with atleast one \( 0.0 \), we can set the remaining \( p_j \) to construct a local optima.

The authors report trying many optimization heuristics, such as adding random Gaussian noise to the gradients, and adding an exponentially-decayed entropy bonus. However, as \( k \) starts to get large (\( \approx \)128), no amount of tricks or hyperparameter tuning helped: SGD never found the answer.

In contrast, a SAT solver can immediately deduce that all parameters must be 0 using unit propagation without any search.

Neural networks sacrifice Occam’s razor in exchange for the ability to efficiently find parameters that fit the training data. This may be a fantastic trade...

Given a fixed set of proofs, it is natural to consider a lemma to be good to the extent it enables compression of the proofs. But how might we know if a lemm...

Many AI researchers (myself included) are optimistic that stochastic gradient descent (SGD) can somehow be leveraged to help learn sophisticated computer pro...

A friend of mine recently proposed the General Intelligence Hypothesis (GIH), which is a restatement of an old idea, and which has both a weak and a strong v...

Consider the following puzzle, from Martin Gardner’s My Best Mathematical and Logic Puzzles (1994):

Suppose you have a list of 25 characters, 13 of which are ‘a’, and no character ever appears twice in a row. What can you tell me about the list?

Humans can learn to become competent at playing board games such as Chess and Go, but humans are phenomenally excellent at learning how to play such games qu...

The word “abstraction” is vague in most contexts within the AI community. Has AlphaGo learned abstractions? I don’t think the question is relevant. AlphaGo ...

The AlphaZero algorithm developed by David Silver, Julian Schrittwieser, Karen Simonyan and others at DeepMind has proved very successful in the board games ...