Gabriel Poesia

Thinking Fast and Slow with Deep Learning and Tree Search (@ NeurIPS 2017)

Thomas Anthony, Zheng Tian, David Barber

Link

This paper introduced Expert Iteration: a simple idea that is widely applied today. Fundamentally, this is also the idea behind AlphaZero, and the two were concurrent work. The main idea is to start with a search algorithm that can be augmented with neural policy/value networks, like MCTS. The neural policy serves as the "fast policy": it gives a probability distribution over next moves in a single forward pass. In contrast, MCTS augmented with the neural policy is the "slow policy": it is guided by the fast policy at each step, but performs much more computation, looking ahead and expanding states, in order to determine an improved distribution over next moves (potentially correcting mistakes made by the fast policy, for example by looking ahead and realizing that a move wasn't as promising as the fast policy guessed).

In this setup, we can incrementally improve the fast policy by running the slow policy (e.g., playing games with it in a self-play setup) and then using the generated search trees as training data for the fast policy (i.e., the fast policy learns to approximate the slow policy). Improving the fast policy in turn improves the slow policy, which can then be used to play even better games, and so on.
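To make the loop concrete, here is a minimal Python sketch of the outer Expert Iteration loop. The components are hypothetical placeholders, not the paper's implementation: `fast_policy` stands in for a neural network, `slow_policy` for MCTS guided by that network, `self_play` for data generation with the slow policy, and `train` for the supervised step that fits the fast policy to the search targets.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One training example: a state and the improved move distribution from search."""
    state: object
    target: List[float]


def expert_iteration(
    fast_policy: object,
    # slow_policy: runs search (e.g., MCTS) guided by the fast policy on a state.
    slow_policy: Callable[[object, object], List[float]],
    # self_play: plays one game with the slow policy, recording search distributions as targets.
    self_play: Callable[[object, Callable], List[Example]],
    # train: supervised step -- fits the fast policy to approximate the slow policy's targets.
    train: Callable[[object, List[Example]], object],
    n_iterations: int = 10,
    games_per_iteration: int = 100,
) -> object:
    """Sketch of the Expert Iteration loop (components above are assumed, not from the paper)."""
    for _ in range(n_iterations):
        # 1. Generate data by running the slow policy (the "expert") in self-play.
        examples: List[Example] = []
        for _ in range(games_per_iteration):
            examples.extend(self_play(fast_policy, slow_policy))
        # 2. Imitation step: the fast policy (the "apprentice") learns to mimic the search.
        fast_policy = train(fast_policy, examples)
        # The improved fast policy now guides a stronger slow policy in the next iteration.
    return fast_policy
```

In the paper's terminology, the fast policy is the "apprentice" and the search procedure is the "expert"; each iteration interleaves expert data generation with apprentice imitation.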

The idea of Expert Iteration remains widely influential, and is the core idea behind many methods for learning to search in different contexts, including automated theorem proving. It's still a very useful framework to have in mind in a lot of my work.