This is an interesting critique of NLP systems, from someone who has worked a lot on language but from a very different perspective. Herb Clark, who wrote Using Language, makes the general point that when people use language, they constantly need to anchor pieces of their utterances. But the things those pieces anchor to are, most of the time, outside language itself. For example, an isolated sentence such as "Should I put it on?" can't make much concrete sense unless we anchor "I" to the specific person saying it, anchor "it" to an object, place the entire utterance at some specific time (e.g. the meaning is totally different if it's part of an ongoing conversation versus if it is preceded by "And then I asked him:"), possibly place it somewhere specific, resolve "on" to "on what", etc. None of that can come simply from analyzing the sentence in vacuo.
I believe that, for NLP systems, this means we cannot take language produced in settings where there are more ways to establish common ground, strip it of that setting, and hope to get much from it. However, much of what is produced on the Internet was not created under the assumption that those other channels are available. Of course I won't point to an object in the real world, write a text message saying "Is it this one?", and hope that my addressee will know what I'm talking about. Instead, when we know that text is the only shared communication medium, we try to make up for all the rest in the text itself. But it is true that the text might (and often will) refer to objects that exist outside the text, and assume they're common ground. That's an interesting challenge: how do you anchor those? Do we even need to?
GPT-3, for example, can clearly and easily "get" references we make to random real-world objects. If you ask what color apples are, even if apples were not part of the interaction so far, GPT-3 will most likely answer that they are red. "Apple", to us, refers to a physical object that we experience in all sorts of modalities: in words, vision, taste, touch, the sound of it crunching. For GPT-3, "apple" can at most refer to what people have written about how apples look, how they taste, and so on. However, if you're interacting with GPT-3 through text, which is the only way we currently have, then by construction there's no way to tell whether or how it is anchoring the word to real-world apples. If a person were generating GPT-3's output instead, they would also at most be able to write about apples, and what to write would likewise be learnable from what people have written elsewhere. How would you know that the person "actually" knows what an apple is?