Learning to Infer Graphics Programs from Hand-Drawn Images

Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, Josh Tenenbaum (NeurIPS 2018)

[link]

This is a very nice paper. I think it really begins to make Josh Tenenbaum’s thought of representing concepts as programs very concrete and feasible (even if method-wise some improvements came in later papers). Besides the content itself, one thing I like a lot about papers is when their first page has an image that gives away what they do. Even if the NeurIPS template leaves such little space in the first page, as ~60% is covered by title + authors + abstract, they still made room for it.

All right, here comes the core idea, simple and neat. Suppose you have hand-drawn images. One way to extract “what” is being drawn, as opposed to just what pixels are on and off, is to try to find a small program that generates that image. This program will be written in a DSL that has primitive operations like “draw a circle at $(x, y)$ with radius $r$”, or draw an arrow from $(x_1, y_1)$ to $(x_2, y_2)$. Moreover, it can have loops and conditionals, that can capture, for instance, that the drawing is a grid of squares.

They decompose the problem in two steps. First, detect what are the primitives being drawn. That generates a (probably long) straight-line program that outputs something close to the hand-drawn image. How to do that? Well, you can train a CNN-based model. On what? Well, on images that you generate yourself, in a self-supervised way! So first you generate a random set of circles, squares, arrows, etc, then render the image, and then have the network guess what primitives were there. It can be trained to output one at a time, in a fixed order (e.g. from the left-most/top-most towards the right/bottom), and take as input one channel with the stuff that was already guessed. So it starts with an empty image, then after it guesses one circle this second channel will have that circle, and so on. This was apparently inspired on the Attend-Infer-Repeat paper, which I haven’t read.

Good, now we have a list of primitives. Now we have to find a small program that outputs those primitives. Now, they train a RL policy to explore the space of programs efficiently. To guide the exploration, they use an idea that’s new to me, called Levin Search. The idea is to spend less time on programs that you believe are less likely to work. More precisely, you spend time exactly proportional to how likely your model thinks a program with a given prefix will work. You shouldn’t be greedy since you can be wrong, but you shouldn’t spend as much time in parts that you judge are unlikely to give you a solution.

In summary, a fun paper to read. The structure is also unusual in that it intersperses experiments with the actual paper, which makes lots of sense in this case.

Gabriel Poesia
Computer Science PhD student