Gabriel Poesia

Inferring and Executing Programs for Visual Reasoning (@ ICCV 2017)

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

Link

This paper appeals to an intuition I've had for a very long time: that some problems require more reasoning than others, and yet neural network-based solutions tend to perform a fixed amount of computation, which seems unnatural. This is not strictly true of sequence-to-sequence architectures, where the amount of computation depends on the sequence length. Yet, even then, a longer sequence is not necessarily harder than a shorter one.

While I think this is true of human thinking in general (even in, e.g., image classification), it is especially easy to see in Visual Question Answering, which is the problem this paper tackles. To a person, some questions are immediate (e.g. "What color is the cube?", when there's just one cube in the image), while others require more reasoning steps (e.g. "What object is to the left of the rightmost blue object?"). The model described in this paper captures this intuition by generating a program from the question, and then executing the program on the image.
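To make the contrast concrete, here is a toy, purely symbolic sketch (my own illustration, not the paper's representation) of how such a question could decompose into a program of composable steps. In the paper, each step is a neural module operating on image features; here, the "modules" just manipulate a list of objects, and the attribute names and values are made up.

```python
# Toy symbolic stand-in for the learned modules (illustrative only).
scene = [
    {"shape": "cube",   "color": "red",  "x": 1},
    {"shape": "sphere", "color": "blue", "x": 3},
    {"shape": "cube",   "color": "blue", "x": 5},
]

def filter_color(objs, color):
    return [o for o in objs if o["color"] == color]

def rightmost(objs):
    return max(objs, key=lambda o: o["x"])

def left_of(objs, anchor):
    return [o for o in objs if o["x"] < anchor["x"]]

# "What object is to the left of the rightmost blue object?"
anchor = rightmost(filter_color(scene, "blue"))
answer = left_of(scene, anchor)  # the red cube and the blue sphere

# A question like "What color is the cube?" (with a single cube) needs a
# shorter program: program length tracks the depth of reasoning required.
```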

Realizing this intuition on natural images is still far-fetched, so the authors use the CLEVR dataset, which is synthetic and comes with reference programs. The programs' syntax comes from a pre-defined grammar in CLEVR, but their semantics are learned end-to-end, with one neural module per function in the grammar. The question is thus first translated into a program, and the program is then executed by this learned engine. They show that the learned modules compose relatively well, which is expected, since composition is precisely the inductive bias being exploited. Moreover, they only need a relatively small number of ground-truth programs to bootstrap from; after that, they can fine-tune the program generator using REINFORCE, with the execution engine's answer correctness as the reward. The experimental section of this paper is quite extensive.
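As a rough picture of the execution engine, here is a minimal PyTorch sketch (my own simplification, not the authors' code): one small residual convolutional module per program function, chained in the order given by the predicted program. The real model also supports binary functions, so programs are trees rather than chains; all module names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NeuralModule(nn.Module):
    """One learned module per function in the CLEVR grammar."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):
        # Residual connection around the convolutions.
        return torch.relu(x + self.conv(x))

class ExecutionEngine(nn.Module):
    def __init__(self, functions, dim=128, num_answers=28):
        super().__init__()
        self.modules_by_name = nn.ModuleDict(
            {f: NeuralModule(dim) for f in functions}
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(dim, num_answers),  # CLEVR has a small closed answer set
        )

    def forward(self, image_features, program):
        h = image_features
        for fn in program:  # compose modules in program order
            h = self.modules_by_name[fn](h)
        return self.classifier(h)

# Illustrative usage: features from a pretrained CNN, program from the generator.
engine = ExecutionEngine(["filter_blue", "rightmost", "relate_left", "query_shape"])
feats = torch.randn(1, 128, 14, 14)
logits = engine(feats, ["filter_blue", "rightmost", "relate_left", "query_shape"])
```

Since the same module weights are reused wherever a function appears in a program, the composition of modules is what gives the model its generalization to novel programs.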

In general, I find this inductive bias of neural programs very appealing. Still, it is challenging to extend it to settings where a program grammar is not pre-defined, which would be necessary to apply the approach to natural images and contexts.