This paper describes a quite interesting benchmark for language-guided robot interaction with a virtual world. Each task comes with a high-level goal (e.g. "Rinse off a mug and place it in the coffee maker") and a sequence of lower-level step-by-step instructions (e.g. "1. walk to the coffee maker on the right"). The agent is placed in a realistic virtual environment (AI2-THOR), observes the world from a first-person view, and acts by choosing from a discrete set of actions.
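To make the task structure concrete, here is a minimal sketch of what one episode might look like; the class names, field names, and action strings are hypothetical illustrations, not the benchmark's actual data format or action API.

```python
# Hypothetical sketch of an episode in the benchmark; names are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    goal: str                     # e.g. "Rinse off a mug and place it in the coffee maker"
    step_instructions: List[str]  # e.g. ["walk to the coffee maker on the right", ...]
    scene_id: str                 # which AI2-THOR scene the episode takes place in

# A discrete action vocabulary of roughly this flavor (illustrative, not exhaustive):
ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown",
           "Pickup", "Put", "Open", "Close", "ToggleOn", "ToggleOff"]
```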
A simple CNN-LSTM baseline learns something, completing ~3.6% of goals in seen environments, but not much: 3.6% is very low, and the model essentially fails completely in unseen environments. One can argue about whether the setup can actually lead to the development of meaningful models. For instance, in unseen environments the agent cannot plan ahead and would have to explore; during training, however, no exploration is necessary, since precise human-written instructions are provided.
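For concreteness, here is a rough PyTorch sketch of the kind of CNN-LSTM agent mentioned above: a CNN encodes each egocentric frame, an LSTM summarizes the tokenized instructions, a recurrent policy combines the two, and a linear head predicts the next discrete action. The layer sizes, wiring, and hyperparameters are my own illustrative assumptions, not the paper's exact baseline.

```python
import torch
import torch.nn as nn

class CnnLstmAgent(nn.Module):
    """Sketch of a CNN-LSTM instruction-following agent (assumed architecture)."""

    def __init__(self, vocab_size, num_actions, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Visual encoder over 3x224x224 egocentric frames.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Language encoder over the tokenized goal + step instructions.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Recurrent policy over time: visual features + language summary -> action logits.
        self.policy_lstm = nn.LSTM(64 + hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames, tokens):
        # frames: (B, T, 3, H, W) egocentric frames; tokens: (B, L) instruction token ids.
        b, t = frames.shape[:2]
        vis = self.cnn(frames.flatten(0, 1)).view(b, t, -1)        # (B, T, 64)
        _, (lang_h, _) = self.lang_lstm(self.embed(tokens))        # final hidden: (1, B, H)
        lang = lang_h[-1].unsqueeze(1).expand(-1, t, -1)           # broadcast to (B, T, H)
        out, _ = self.policy_lstm(torch.cat([vis, lang], dim=-1))  # (B, T, H)
        return self.action_head(out)                               # (B, T, num_actions)
```

Training such a model amounts to behavior cloning on the expert trajectories paired with the human instructions, which is exactly why no exploration is needed at training time, only at test time in unseen environments.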
Still, it seems like a rich environment in which current baselines are very far from doing anything useful. It will be interesting to see whether the benchmark picks up and actually becomes a serious target for grounded vision-and-language research.