This paper studies reinforcement learning in environments where both states and actions are given as unbounded natural language. The authors evaluate their architectures on text-based games. The architecture they propose is intuitive: it uses separate encoders for the state and the action, then takes the dot product of the two embeddings as the estimate of the $Q$-value of that state-action pair. Training on this objective pushes the encoders to align states with good actions in embedding space.
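To make the scoring step concrete, here is a minimal sketch of the dot-product $Q$ estimate. It is not the paper's implementation: `BagOfWordsEncoder`, `q_value`, `greedy_action`, and the embedding size are all hypothetical names and choices of mine, and the encoders are untrained random bag-of-words embeddings standing in for learned networks. It only illustrates how two separate encoders plus a dot product can rank a variable-size set of text actions.

```python
import random
import zlib

DIM = 16  # embedding size, chosen arbitrarily for illustration


class BagOfWordsEncoder:
    """Toy text encoder: sums a fixed random vector per word.

    Hypothetical stand-in for a learned encoder; nothing here is trained.
    """

    def __init__(self, seed, dim=DIM):
        self.seed = seed
        self.dim = dim

    def _word_vector(self, word):
        # Seed a private RNG from the word so each word maps to a fixed vector.
        rng = random.Random(zlib.crc32(word.encode("utf-8")) ^ self.seed)
        return [rng.gauss(0.0, 1.0) for _ in range(self.dim)]

    def encode(self, text):
        vec = [0.0] * self.dim
        for word in text.lower().split():
            for i, x in enumerate(self._word_vector(word)):
                vec[i] += x
        return vec


def q_value(state_vec, action_vec):
    """Estimate Q(s, a) as the dot product of the two embeddings."""
    return sum(s * a for s, a in zip(state_vec, action_vec))


def greedy_action(state_encoder, action_encoder, state, actions):
    """Score every candidate action against the state and pick the best."""
    s = state_encoder.encode(state)
    scores = [q_value(s, action_encoder.encode(a)) for a in actions]
    best = max(range(len(actions)), key=scores.__getitem__)
    return actions[best], scores


# Separate encoders (different seeds) for states and for actions.
state_enc = BagOfWordsEncoder(seed=1)
action_enc = BagOfWordsEncoder(seed=2)
actions = ["open the door", "go back to sleep", "pick up the lamp"]
best, scores = greedy_action(
    state_enc, action_enc, "You are in a dark room.", actions
)
print(best, scores)
```

Because the state is encoded once and each action is scored independently, the action set can change in size and content from step to step, which is what an unbounded natural-language action space requires.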
This is one of the few viable baselines I can use for an education-related environment I'm proposing. This space seems surprisingly underexplored, possibly because of the lack of interesting environments. Dialogue systems are a cousin of this setup, but a much messier one (data is expensive to collect, rewards are less well-defined, etc.).