Gabriel Poesia

Curriculum learning (@ ICML 2009)

Yoshua Bengio, Jérôme Louradour, Ronan Collobert, Jason Weston

Link

This is the seminal paper on curriculum learning in modern machine learning. It provides an interesting way to formally define a curriculum: a family of weighting functions on the training distribution whose entropy increases, converging to be equivalent to the target training distribution. They observe empirically that using a curriculum helps train models that generalize better in a variety of domains: SVMs in a toy classification domain with noisy, overlapping Gaussians, showing that a curriculum can provide a higher-density signal when the target distribution is noisy; language modeling (done with a non-recurrent neural net and a contrastive loss - back then that seemed to scale better!); and an image classification task with a perceptron.
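
To make the formal definition concrete, here is a sketch of it as I remember it from the paper (notation approximate, so treat the details as a paraphrase rather than a quote): z is a training example, P(z) the target training distribution, and λ ∈ [0, 1] indexes the progress of the curriculum.

```latex
% Sketch of the curriculum definition, paraphrased from Bengio et al. (2009).
% Reweight the target training distribution P(z) with weights 0 \le W_\lambda(z) \le 1:
Q_\lambda(z) \propto W_\lambda(z)\, P(z), \qquad \lambda \in [0, 1]

% The family (Q_\lambda) is a curriculum if, for all \epsilon > 0 and all z:
H(Q_\lambda) \le H(Q_{\lambda + \epsilon})   % entropy is non-decreasing in \lambda
W_{\lambda + \epsilon}(z) \ge W_\lambda(z)   % each example's weight only grows
Q_1(z) = P(z)                                % at \lambda = 1 we recover the full target distribution
```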

It's interesting that this definition of a curriculum does not necessarily match our intuition from human teaching. Human curricula are certainly not monotonic in entropy: one can have periods of focus on something very specific before moving on to the next topic, or to a broader picture that involves more topics at once. But the "catastrophic forgetting" that neural models exhibit when trained with SGD makes it a bad idea to shrink the support of the training distribution partway through training.

It's also interesting to understand curricula as a possible form of continuation method, as they connect to prior work in optimization where continuation methods help with non-convex problems by first smoothing out the loss landscape and then gradually recovering the original objective.
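
The continuation idea is easy to see in a toy setting. Below is a minimal sketch (my own illustration, not an experiment from the paper; the objective, smoothing schedule, and step sizes are arbitrary choices): we minimize a non-convex 1D function by first minimizing a heavily Gaussian-smoothed version of it, then gradually reducing the smoothing, warm-starting each stage from the previous solution.

```python
import numpy as np

def objective(x):
    # A non-convex 1D loss: many local minima riding on a global bowl.
    return np.sin(5 * x) + 0.5 * x ** 2

def smoothed_objective(x, sigma, n_samples=512):
    # Monte Carlo estimate of E[f(x + sigma * eps)], eps ~ N(0, 1).
    # Larger sigma gives a smoother landscape; sigma = 0 recovers the original loss.
    eps = np.random.default_rng(0).standard_normal(n_samples)
    return objective(x + sigma * eps).mean()

def continuation_minimize(x0=2.0, sigmas=(2.0, 1.0, 0.5, 0.1, 0.0),
                          steps=200, lr=0.05, h=1e-3):
    # Solve a sequence of progressively less-smoothed problems,
    # warm-starting each stage from the previous stage's solution.
    x = x0
    for sigma in sigmas:
        for _ in range(steps):
            # Central finite-difference gradient of the smoothed objective.
            grad = (smoothed_objective(x + h, sigma)
                    - smoothed_objective(x - h, sigma)) / (2 * h)
            x -= lr * grad
    return x

if __name__ == "__main__":
    x_star = continuation_minimize()
    print(f"x* = {x_star:.3f}, loss = {objective(x_star):.3f}")
```

With strong smoothing the oscillations average out and the landscape is nearly convex, so gradient descent first finds the right basin and then tracks it as the bumps reappear; a curriculum plays an analogous role by presenting an easier version of the training distribution before the full one.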