This is the seminal paper on curriculum learning in modern machine learning. It formally defines a curriculum as a family of weighting functions on the training distribution such that the entropy of the reweighted distribution increases over training and the reweighted distribution converges to the target training distribution. Empirically, the authors observe that training with a curriculum yields models that generalize better in a variety of domains: SVMs on a toy classification task with noisy, overlapping Gaussians, which shows that a curriculum can provide a denser learning signal when the target distribution is noisy; language modeling (done with a non-recurrent neural net and a contrastive loss - back then that seemed to scale better!); and an image classification task with a perceptron.
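To make the definition concrete, here is a minimal sketch of one possible curriculum on a toy discrete training set. Everything in it is an illustrative assumption rather than the paper's setup: the uniform target distribution `p`, the per-example `difficulty` scores, the hard-threshold weighting, and the function names. The reweighted distribution is Q_lam(z) proportional to W_lam(z) * P(z), with W_lam(z) = 1 for examples whose difficulty is at most lam and 0 otherwise.

```python
import numpy as np

def curriculum_distribution(p, difficulty, lam):
    """Return Q_lam(z), proportional to W_lam(z) * P(z).

    Hypothetical hard-threshold weighting: W_lam(z) = 1 if the example's
    difficulty is <= lam, else 0. W_lam is non-decreasing in lam, and
    W_1 = 1 everywhere when all difficulties are <= 1.
    """
    w = (difficulty <= lam).astype(float)
    q = w * p
    return q / q.sum()

def entropy(q):
    """Shannon entropy (in nats) of a discrete distribution."""
    nz = q[q > 0]
    return float(-(nz * np.log(nz)).sum())

# Toy setup (assumed): 8 examples, uniform target distribution,
# difficulty scores evenly spaced in (0, 1].
p = np.full(8, 1 / 8)
difficulty = np.arange(1, 9) / 8

lams = [0.25, 0.5, 0.75, 1.0]
entropies = [entropy(curriculum_distribution(p, difficulty, lam)) for lam in lams]
print(entropies)
```

In this toy case the support grows from 2 to 8 examples as `lam` goes from 0.25 to 1, so the entropy increases at every step and the final distribution equals `p`. With a non-uniform `p` or a soft weighting, the entropy condition would need to be checked rather than assumed.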
It's interesting that this definition of a curriculum does not necessarily match our intuition from human teaching. Human curricula are certainly not monotonic in entropy: one can have periods of focus on something very specific before going on to the next topic, or to a broader picture that might involve more topics. But the "catastrophic forgetting" that neural models show when trained with SGD makes it a bad idea to reduce the support of training examples during training.
It's also interesting to understand curricula as a possible form of continuation method. The authors cite other works in optimization in which a curriculum-like schedule provably helps non-convex optimization by smoothing out the loss landscape.
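The continuation idea can be illustrated with a small sketch on a one-dimensional non-convex function. The objective `f`, the Gaussian smoothing, the sigma schedule, and the step sizes are all assumptions chosen for illustration and are not taken from the paper. For f(x) = x^2 + 2 sin(5x), averaging over Gaussian noise of standard deviation sigma gives the closed form x^2 + sigma^2 + 2 exp(-12.5 sigma^2) sin(5x), so the smoothed gradient can be written exactly.

```python
import math

def f(x):
    """Non-convex toy objective (assumed for illustration)."""
    return x**2 + 2.0 * math.sin(5.0 * x)

def smoothed_grad(x, sigma):
    """Gradient of E[f(x + eps)] with eps ~ N(0, sigma^2), in closed form."""
    return 2.0 * x + 10.0 * math.exp(-12.5 * sigma**2) * math.cos(5.0 * x)

def gradient_descent(x, sigma, steps, lr=0.01):
    """Plain gradient descent on the sigma-smoothed objective."""
    for _ in range(steps):
        x -= lr * smoothed_grad(x, sigma)
    return x

x0 = 3.0

# Baseline: descend directly on the unsmoothed objective (sigma = 0).
x_plain = gradient_descent(x0, sigma=0.0, steps=2500)

# Continuation: start heavily smoothed, then reduce the smoothing in
# stages, warm-starting each stage from the previous solution.
x_cont = x0
for sigma in [1.0, 0.5, 0.25, 0.1, 0.0]:
    x_cont = gradient_descent(x_cont, sigma, steps=500)

print(x_plain, f(x_plain))
print(x_cont, f(x_cont))
```

With these particular settings, direct descent stays in a local minimum near the starting point, while the staged schedule ends in a much lower basin near the origin. The analogy to a curriculum is that the heavily smoothed early stages play the role of the "easy" version of the problem.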