## Neural networks, Occam’s razor, and heuristic search

Neural networks sacrifice Occam’s razor in exchange for the ability to efficiently find parameters that fit the training data. This may be a fantastic trade...

The word “abstraction” is vague in most contexts within the AI community. Has AlphaGo learned abstractions? I don’t think the question is relevant. AlphaGo clearly represents some knowledge of the game in its neural weights, but its interface with the world is purely through selecting Go moves. It doesn’t need to represent any specific type of abstraction in any specific way to play at a superhuman level.

In contrast, I consider the fundamental challenge of mathematical
theorem proving to be that:
*sub-astronomic-sized proofs do not even exist unless appropriate abstractions are explicitly reified in the logic*.
Given a fixed set of abstractions, there is only a small set of theorems can feasibly be proved; worse, there are only a small number of theorems that can even feasibly be *stated*. The main challenge of mathematics is to develop the abstractions that permit short statements and proofs of the phenomena you want to establish.

Let’s make “abstraction” precise. There are three types of abstractions in mathematics:

- Definitional abstractions
- Declarative abstractions
- Procedural abstractions

A *definitional abstraction* is a new function that stands for a concept. For example, *prime number* is a definitional abstraction, since it corresponds to defining a function *prime* as follows:

\[ \text{definition prime}(n : \mathbb{N}) := n > 1 \wedge \forall m, \text{divides}(m, n) \to m = 1 \vee m = n \]

Note that the body of a definition can depend on other definitional
abstractions (e.g. *prime* depends on *divides*). Towers of definitions that
depend on each other allow exponential compression of theorem
statements. Even the
*statement* of the trivial theorem \( (0 : \mathbb{R}) = (0 :
\mathbb{R}) \) is unimaginably vast without definitional abstractions,
since without the abstraction of \( \mathbb{R} \) it would involve
e.g. equivalence classes of Cauchy sequences, which in turn require
more abstractions to state concisely, until eventually reaching the
primitives of set theory.

A *declarative abstraction* is a theorem that has been proved, for example:

\[ \forall (n : \mathbb{N}), n > 1 \wedge \neg \text{prime}(n) \to \exists m, m > 1 \wedge m < n \wedge \text{divides}(m, n) \]

The proofs of such theorems may use other theorems, and so towers of declarative abstractions that depend on each other allow exponential compression of proofs, on top of the compression acheived by definitional abstractions.

A *procedural* abstraction is a program (also called
a *tactic*) that lives outside the logic itself and that
automates some aspect of the proving process, in either a complete or
a heuristic way. An example tactic is one that decides if two
arithmetic expressions are equal modulo associativity and
commutativity by sorting both sides and comparing for syntactic
equality. Without such a procedural abstraction, one would need to
tediously rewrite with the commutativity and associativity lemmas a
number of times that scales superlinearly with the size of the terms.

Unlike the first two types of abstractions, procedural abstractions do
not affect which theorems have sub-astronomic sized proofs in the
underlying logic. Specifically, these procedures
simply *construct* the proofs with fewer meta-logical actions.
However, procedural abstractions can still dramatically reduce the
size of the proof *script*, i.e. the sequence of tactics that
must be invoked to construct the proof.

I think all approaches to using machine learning to help prove mathematical theorems that ignore this fundamental challenge will always be severely limited.

Neural networks sacrifice Occam’s razor in exchange for the ability to efficiently find parameters that fit the training data. This may be a fantastic trade...

Given a fixed set of proofs, it is natural to consider a lemma to be good to the extent it enables compression of the proofs. But how might we know if a lemm...

Many AI researchers (myself included) are optimistic that stochastic gradient descent (SGD) can somehow be leveraged to help learn sophisticated computer pro...

A friend of mine recently proposed the General Intelligence Hypothesis (GIH), which is a restatement of an old idea, and which has both a weak and a strong v...

Consider the following puzzle, from Martin Gardner’s My Best Mathematical and Logic Puzzles (1994):

Suppose you have a list of 25 characters, 13 of which are ‘a’, and no character ever appears twice in a row. What can you tell me about the list?

Humans can learn to become competent at playing board games such as Chess and Go, but humans are phenomenally excellent at learning how to play such games qu...

The word “abstraction” is vague in most contexts within the AI community. Has AlphaGo learned abstractions? I don’t think the question is relevant. AlphaGo ...

The AlphaZero algorithm developed by David Silver, Julian Schrittwieser, Karen Simonyan and others at DeepMind has proved very successful in the board games ...