Word Learning as Bayesian Inference

Fei Xu, Joshua B. Tenenbaum (Psychological Review 2007)


This paper provides a Bayesian explanation for how people attribute meanings to words. In particular, it explains how we can learn a word from just a few positive examples, and why each further example is still informative. Previous explanations of word learning do not account for this.

The key idea is simple, as in the best papers. Let’s say a person is given examples $X$ of the meaning of a word (e.g. three apples, and the word “apple”). Suppose the person is trying to decide among hypotheses in a class $H$ (e.g. each hypothesis might be the set of things in the world to which the word applies). Then,

$$p(h | X) \propto p(X | h) p(h) \enspace .$$

$p(X | h)$ is the likelihood: the probability that, if $h$ is indeed the set of objects the word applies to, the examples in $X$ would be independently sampled from that set. This explains why more positive examples make us more confident: if the word “apple” also applied to other things (such as pears, or chairs), then it would be too much of a coincidence that three random samples of “apples” all happen to be this same kind of fruit. $p(h)$ is a prior over hypotheses, linked to how we usually use words. They use a simple hierarchical taxonomy of objects; words are most useful when they are neither too general (not very informative) nor too specific (not very frequent), so we start from a middle-ground prior and calibrate from there using the examples.
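To make the coincidence argument concrete, here is a minimal sketch in Python. It is not the paper's model: the toy world (4 apples, 4 pears, 8 chairs), the three nested hypotheses, and the uniform prior are all assumptions made up for illustration, and the likelihood assumes each example is drawn uniformly at random from the hypothesis's set, so $n$ consistent examples have probability $(1/|h|)^n$.

```python
from fractions import Fraction

# Hypothetical toy world: 4 apples, 4 pears, 8 chairs.
APPLES = {f"apple{i}" for i in range(4)}
PEARS = {f"pear{i}" for i in range(4)}
CHAIRS = {f"chair{i}" for i in range(8)}

# Each hypothesis is the set of objects the word might apply to,
# nested like a small taxonomy (illustrative, not from the paper).
HYPOTHESES = {
    "apples": APPLES,
    "fruit": APPLES | PEARS,
    "objects": APPLES | PEARS | CHAIRS,
}

# Assumption: a uniform prior, to isolate the effect of the likelihood.
PRIOR = {name: Fraction(1, len(HYPOTHESES)) for name in HYPOTHESES}


def likelihood(examples, extension):
    """p(X | h), assuming independent uniform draws from the set h."""
    if not all(x in extension for x in examples):
        return Fraction(0)  # h cannot have produced these examples
    return Fraction(1, len(extension)) ** len(examples)


def posterior(examples):
    """p(h | X) for every hypothesis, normalized to sum to 1."""
    unnormalized = {
        name: likelihood(examples, extension) * PRIOR[name]
        for name, extension in HYPOTHESES.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("no hypothesis is consistent with the examples")
    return {name: p / total for name, p in unnormalized.items()}


one = posterior(["apple0"])
three = posterior(["apple0", "apple1", "apple2"])
print(float(one["apples"]), float(three["apples"]))
```

In this toy setup, the posterior for "apples" goes from 4/7 after one example to 64/73 after three, even though every example is also consistent with "fruit" and "objects". The broader hypotheses are never ruled out; they just become increasingly implausible as explanations for why all the samples landed on apples.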

They run experiments with both children and adults, and the predictions seem to hold pretty well. I think the explanation is quite compelling, and even without knowing the literature on this, it’s clear that the competing classical explanations are insufficient. I wonder whether this result is still the go-to explanation.

Gabriel Poesia
Computer Science PhD student