This is one of the first papers that attempted Visual Question Answering (VQA). I arrived at it because I was discussing with my advisor the idea of bootstrapping a VQA model from image captions by applying wh-movement (e.g. from "three bananas in a basket", you could rearrange the caption into "how many bananas are in the basket?"). This is less obviously necessary today, now that large-scale VQA datasets are available. But at the time, when the only available VQA dataset was tiny by today's standards, the authors worked around the data scarcity by doing exactly this. As is almost always the case, it is the constraints that make you think. On the other hand, we now have far better image captioning models, so it would be interesting to see whether that seemingly forgotten trick still has something to offer.
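To make the idea concrete, here is a toy, rule-based sketch of what such a caption-to-question rewrite could look like for counting questions. This is my own illustration, not the paper's pipeline; a real system would use a proper parser and cover many more question types, and all the rules below (the count-word list, the preposition split) are assumptions for the sake of the example.

```python
# Count words we know how to turn into "how many" questions (toy list).
NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}
PREPOSITIONS = {"in", "on", "at", "under", "near", "beside", "behind"}


def caption_to_count_question(caption: str):
    """Turn a caption like 'three bananas in a basket' into a
    (question, answer) pair such as
    ('how many bananas are in the basket?', 'three').

    Returns None when the caption doesn't start with a count word.
    """
    tokens = caption.lower().strip(" .").split()
    if not tokens or tokens[0] not in NUMBER_WORDS:
        return None
    answer, rest = tokens[0], tokens[1:]

    # Split the remainder at the first preposition:
    # "bananas in a basket" -> subject "bananas", tail "in a basket".
    for i, tok in enumerate(rest):
        if tok in PREPOSITIONS:
            subject = " ".join(rest[:i])
            tail_tokens = rest[i:]
            # "in a basket" reads better as "in the basket" in a question.
            if len(tail_tokens) > 1 and tail_tokens[1] in {"a", "an"}:
                tail_tokens[1] = "the"
            return f"how many {subject} are {' '.join(tail_tokens)}?", answer

    # No prepositional phrase: fall back to a bare "how many X are there?"
    return f"how many {' '.join(rest)} are there?", answer


if __name__ == "__main__":
    print(caption_to_count_question("three bananas in a basket"))
    # -> ('how many bananas are in the basket?', 'three')
```

Even a handful of such templates, applied to a large caption corpus, would give you question/answer pairs for free, which is the appeal of the trick.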