This paper studies the problem of interactive classification. Given an initial natural language query, at each step the system either makes a final prediction or asks a question to gain more information, receiving a natural language answer in return.
This setting is very much like games such as Akinator or 20Q. However, unlike those games, this work assumes unstructured natural language data, which is a fundamental difference. For me, their main contribution is the framing of the problem. They crowdsource two datasets: first, for each possible classification target $y_i$, they collect a set of possible initial queries $\{x_{ij}\}$ that users could plausibly use to find $y_i$. Second, they assume each classification target $y_i$ has a set of natural language tags, drawn from a set pre-defined by a domain expert. These tags are heuristically converted into questions (such that the expected answer is the tag).

Given this data, their method follows unsurprisingly from modern NLP engineering: RNN encoders for predicting the classification target given the query and the answers so far, picking questions by maximizing information gain, training a policy with REINFORCE to decide when to stop, and so on. And, of course, this data is much easier to collect than user interaction data, as it doesn't require a working system in the first place. It's a nice workaround for that problem: they instead collect data from which the interaction is easy to derive once you can model it.
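To make the question-selection step concrete, here is a minimal sketch of greedy information-gain selection, assuming the system maintains a belief distribution over targets and that answers are deterministic yes/no values derived from the tags. The names (`belief`, `has_tag`, `select_question`) are mine, not the paper's, and the actual method estimates answer probabilities with learned encoders rather than hard tag lookups:

```python
import math

def entropy(dist):
    """Shannon entropy (in nats) of a distribution given as {target: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def select_question(belief, questions, has_tag):
    """Pick the question with maximal expected information gain.

    belief:    {target: posterior probability} under the interaction so far
    questions: candidate questions, one per tag
    has_tag:   has_tag[target][question] is True if the target carries the
               tag, i.e. the expected answer to the question is "yes"

    All of this is a hypothetical simplification: the paper's model scores
    answers with neural encoders instead of a deterministic tag table.
    """
    prior_h = entropy(belief)
    best_q, best_gain = None, -1.0
    for q in questions:
        # Probability of a "yes" answer under the current belief.
        p_yes = sum(p for t, p in belief.items() if has_tag[t][q])
        gain = prior_h
        for answer, p_a in (("yes", p_yes), ("no", 1.0 - p_yes)):
            if p_a == 0:
                continue
            # Posterior after observing this answer (Bayes rule with
            # deterministic answer semantics), renormalized by p_a.
            post = {t: p / p_a for t, p in belief.items()
                    if has_tag[t][q] == (answer == "yes")}
            gain -= p_a * entropy(post)
        if gain > best_gain:
            best_q, best_gain = q, gain
    return best_q
```

The greedy choice is simply the question whose answer is expected to shrink the entropy of the belief the most, which is the information-gain criterion mentioned above; the learned components only change how the belief and answer probabilities are computed.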