This now seems to be one of the classic papers in neuro-symbolic approaches to deep learning. The idea is simple, sound and nice. They demonstrate it in the domain of Visual Question Answering (VQA), but (unlike with many papers that make such claims) it's not hard to imagine it being applied elsewhere.
Here's the core idea. A module network is composed of a set of modules. Each module is a neural network that implements a function of a certain type. For instance, $find$ is a kind of module that takes an image and outputs an attention map. $describe$ is a module that takes an image and an attention map, and returns a label (intuitively, it labels what the attention map is pointing at, e.g. "dog" or "table"). There's a small number of (hand-designed) kinds of modules, and each kind can be instantiated for a word (e.g. $find[dog]$ is a module of the kind $find$ that is specialized to find dogs in an image, i.e. attend to them).

While the kinds of modules are hand-designed, there aren't too many of them, and their training is not specialized. Instead, for each training example, they first parse the question and assemble an architecture that answers it, based on simple heuristics. For example, the question "How many dogs are there?" might start with the module that takes an image and attends to the dogs (a module of kind $find$ instantiated to $dog$, which they name $find[dog]$), and pass its output to the module that takes an attention map and counts the objects in it (e.g. by looking at the number of disjoint regions receiving attention). The question "How many cats are there?" would reuse the counting module, but with a different instance of the $find$ module.

Then, for each training example, they assemble the neural network from the modules, run it forward, and do backprop, taking a gradient step for the parameters of all the modules that were used. Unused modules are of course not touched. So while each example takes more work, since it's not easy to run this in batches (each question might use a different architecture), each module doesn't need to be as big, as it only needs to learn one specific function. This is intuition - I haven't looked at their code, and it wasn't in the paper. A rough sketch of how I imagine the per-example assembly and training loop looks is below.
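Here is a minimal sketch of that loop, not the authors' implementation: the module classes, the toy `layout_for_question` parser, and all names are my own invention for illustration, and the real paper uses richer module kinds and an actual dependency parse rather than string hacks.

```python
import torch
import torch.nn as nn

# Hypothetical module *kinds*; instances of the same kind share architecture
# but have their own parameters (find[dog] vs find[cat]).

class FindModule(nn.Module):
    """Image features -> attention map over spatial positions."""
    def __init__(self, feat_dim):
        super().__init__()
        self.conv = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, image_feats):                    # (B, C, H, W)
        return torch.sigmoid(self.conv(image_feats))   # (B, 1, H, W)

class CountModule(nn.Module):
    """Attention map -> distribution over count answers."""
    def __init__(self, grid_size, num_answers):
        super().__init__()
        self.fc = nn.Linear(grid_size, num_answers)

    def forward(self, attention):                      # (B, 1, H, W)
        return self.fc(attention.flatten(1))           # (B, num_answers)

# One instance per (kind, word), created lazily the first time it's needed.
modules = {}
def get_module(kind, word, **kwargs):
    key = (kind, word)
    if key not in modules:
        modules[key] = {"find": FindModule, "count": CountModule}[kind](**kwargs)
    return modules[key]

def layout_for_question(question, feat_dim, grid_size, num_answers):
    """Toy stand-in for the heuristic parser:
    'How many Xs are there?' -> count(find[X])."""
    noun = question.split()[2].rstrip("s?")            # crude, for the sketch only
    find = get_module("find", noun, feat_dim=feat_dim)
    count = get_module("count", None, grid_size=grid_size, num_answers=num_answers)
    return lambda image_feats: count(find(image_feats))

def train_step(question, image_feats, answer_idx, num_answers=20):
    # Assemble a network for this question, run it forward, backprop.
    # Only the modules used in this layout receive gradients; an optimizer
    # step over their parameters would follow.
    B, C, H, W = image_feats.shape
    net = layout_for_question(question, C, H * W, num_answers)
    logits = net(image_feats)
    loss = nn.functional.cross_entropy(logits, answer_idx)
    loss.backward()
    return loss.item()
```

Note how "How many dogs are there?" and "How many cats are there?" would hit the same `CountModule` instance but different `FindModule` instances, which is exactly the parameter sharing pattern described above.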
The map $question \rightarrow architecture$ is deterministic in this paper, but there seems to be follow-up work that learns it jointly with training (I'm curious to see how they make that work: it's a discrete choice, and REINFORCE seems like it would be terribly slow for this). In the end, besides getting SOTA VQA results for its time, they observe that the concepts the modules learn do map to what they would expect (i.e. $find[dog]$ actually attends to dogs). This is easy to believe, given the idea - it sounds natural to observe compositionality and specialization of the modules when the training procedure has that very much baked in. For example, the $find$ module can't learn to depend on other specific modules, since it will be used alongside too many of them. Similarly, the $find[dog]$ instance of that module will mostly be called on images with dogs, so the most useful thing it can do is indeed learn how to attend to dogs.
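For context on why that discrete choice is awkward (this is just the generic score-function estimator, not necessarily what the follow-up work actually does): with a layout distribution $p_\theta(l \mid q)$ over architectures and a reward $R(l)$ such as answer accuracy, the REINFORCE gradient is

$$\nabla_\theta \, \mathbb{E}_{l \sim p_\theta(\cdot \mid q)}\!\left[ R(l) \right] = \mathbb{E}_{l \sim p_\theta(\cdot \mid q)}\!\left[ R(l)\, \nabla_\theta \log p_\theta(l \mid q) \right],$$

which in practice means sampling one layout per question and weighting its log-probability gradient by a noisy scalar reward - hence the high variance and slow credit assignment I'd worry about.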
In short, I like the idea, and am curious to read the follow-up work.