This paper attempts to explain the capability for in-context learning in recent models, especially language models. In-context learning differs from the more traditional in-weights learning in that the target prediction cannot be inferred from the training data alone; it depends strictly on the context given before the prediction (like the text prompt for a large language model).
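To make the distinction concrete, here is a minimal sketch (my own illustration, not the authors' code; all function and parameter names are assumed) of an evaluation episode in which the class-to-label mapping is re-drawn every episode, so the query's label can only be read off the context, never memorized in the weights:

```python
# Minimal sketch (illustrative, not the paper's code) of an in-context learning
# episode: the class -> label mapping is re-sampled per episode, so the correct
# answer cannot be stored in the weights and must be read from the context.
import random

def make_episode(class_ids, num_classes_in_context=2, exemplars_per_class=2):
    """Return (context, query_class, target_label) for one episode."""
    chosen = random.sample(class_ids, num_classes_in_context)

    # Fresh, arbitrary labels for this episode only.
    labels = list(range(num_classes_in_context))
    random.shuffle(labels)
    label_of = dict(zip(chosen, labels))

    context = [(c, label_of[c]) for c in chosen for _ in range(exemplars_per_class)]
    random.shuffle(context)

    query_class = random.choice(chosen)
    return context, query_class, label_of[query_class]

# Example: classes 0..9 stand in for held-out character classes.
context, query, target = make_episode(class_ids=list(range(10)))
print(context, query, target)
```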
Here, the authors show that there are qualitative distributional properties of the training data that modulate whether a model trained on those data will be able to perform in-context learning. Notably, they find it important for the data to have:

- burstiness, i.e. entities appearing in clusters within a sequence rather than spread uniformly over time;
- a large number of classes, most of which occur rarely (a long-tailed, Zipfian-like class distribution);
- dynamic meanings, where the mapping from an entity to its label is not fixed across occurrences.
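As a rough illustration of the "many rare classes" property (a sketch with parameter values of my own choosing, not numbers from the paper), a Zipf-like prior over classes makes a handful of head classes frequent while leaving a long tail of classes that are seen only a few times during training:

```python
# Sketch of a long-tailed (Zipf-like) class prior: a few head classes dominate,
# most classes are rare. The exponent `alpha`, class count, and number of draws
# are assumed knobs, not values taken from the paper.
import numpy as np

def zipf_class_prior(num_classes, alpha=1.0):
    ranks = np.arange(1, num_classes + 1)
    weights = ranks ** (-alpha)
    return weights / weights.sum()

rng = np.random.default_rng(0)
p = zipf_class_prior(num_classes=1000)
draws = rng.choice(1000, size=20_000, p=p)
counts = np.bincount(draws, minlength=1000)
print("most common class seen", counts.max(), "times;",
      (counts <= 5).sum(), "classes seen 5 times or fewer")
```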
The authors are additionally unable to get in-context learning to work with LSTMs, only with Transformers. They therefore conclude that architecture matters alongside the training data distribution.
Their setup is quite clean: they use the Omniglot dataset to construct training sequences that neatly vary the properties above and that rely to varying degrees on in-context vs. in-weights learning. They draw various analogies between these properties and the distributions of natural data encountered in the wild, such as natural language.
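My loose reconstruction of that kind of sequence construction (the function names, labeling scheme, and knob values are my assumptions, not the authors' implementation): each training sequence is a short context of (character class, label) pairs plus a query, and a burstiness knob decides whether the query's class recurs in the context (rewarding in-context learning) or the context is unrelated to the query (so the label can only come from the weights).

```python
# Loose sketch (assumed names and knob values) of constructing training
# sequences that vary burstiness. `class_prior` could be the Zipf-like prior
# sketched earlier; Omniglot character classes are just ints here.
import numpy as np

rng = np.random.default_rng(0)

def make_training_sequence(num_classes, class_prior, seq_len=8, p_bursty=0.9):
    query_class = rng.choice(num_classes, p=class_prior)
    if rng.random() < p_bursty:
        # Bursty: the query's class (plus one distractor class) recurs in the
        # context, so attending to the context pays off.
        distractor = rng.choice(num_classes, p=class_prior)
        context_classes = np.array([query_class, distractor] * (seq_len // 2))
    else:
        # Non-bursty: context classes are unrelated to the query, so the label
        # must come from whatever is stored in the weights.
        context_classes = rng.choice(num_classes, size=seq_len, p=class_prior)
    rng.shuffle(context_classes)

    # A stand-in labeling scheme: labels are fixed per class (an in-weights
    # signal); in bursty sequences the same information also appears in context.
    labels = context_classes % 10
    return list(zip(context_classes.tolist(), labels.tolist())), int(query_class)

# Example usage with a uniform class prior; swap in a long-tailed prior to
# reproduce the "many rare classes" property as well.
prior = np.full(1600, 1 / 1600)
sequence, query = make_training_sequence(1600, prior)
print(sequence, query)
```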
The paper does make concrete predictions about how in-context learning could emerge in other scenarios if the training data distribution were changed, but unfortunately these are not explored yet. It is therefore unclear whether the knobs that clearly modulate the learning dynamics in their setup (sequences of Omniglot characters and labels) will be equally meaningful in other contexts. Until these hypotheses are tested in several other scenarios and the causal links are found to still hold, I consider these findings interesting but not yet a definitive explanation. For example, their inability to get in-context learning from LSTMs contradicts another paper whose authors did get LSTMs to perform in-context learning, illustrating how hard these results may be to extrapolate, especially the negative ones.