Language Models are Few-Shot Learners

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei (arXiv 2020)

[link]

This is the paper describing the GPT-3 model, which has dominated Twitter and academic Slack channels since it was released. A lot has been said about it from many sides, so here are my quick takes.

A brief, incomplete and Probably Approximately Incorrect history of Machine Learning

First, where this fits in my mental history of ML. A few decades ago, we had neither much data nor much compute power. Therefore, data-driven algorithms (or "machine learning") needed to be computationally super efficient and also somewhat rigid, since they had to learn from scratch from very little data. Mostly, this meant linear models: efficient, with a very strong inductive bias.

In the 90’s and 2000’s, we had more compute power and supervised datasets on the order of tens of thousands of data points, so models such as SVMs, ensembles of decision trees, and even the first convolutional neural networks, such as LeNet on MNIST, showed up. But we still couldn’t scale neural networks very far from a compute standpoint, and the other models, such as SVMs, weren’t really benefiting from much more data than we already had. It was common knowledge that, with enough data, the gap between linear models and the fanciest models of the time wasn’t that big.

State of the world: fancy ML is mostly useless.

In 2009, Fei-Fei Li and colleagues released ImageNet, a supervised image classification dataset with millions of annotated images. In 2012, AlexNet, the first successful deep learning-based solution submitted to the ImageNet annual challenge, caught everyone’s attention by beating the runner-up by a considerable margin. Before AlexNet, none of the winning submissions used deep learning. After AlexNet, all of them did.

State of the world: with big supervised datasets, deep learning wins.

While deep learning made it possible to develop architectures that could effectively learn from millions of images, instead of plateauing at a few thousand like SVMs did, it started to become apparent that this paradigm was only useful when you had tons of annotated data, and that such data is very often unavailable. Annotating data is very expensive, mostly because you need to pay people to write the annotations that models learn from. We had ImageNet, but no other dataset at its scale.

So people realized that models trained on ImageNet weren’t only learning how to do image classification: they were learning features of images that could be generally useful for other image-related tasks as well. To work around the lack of annotated data, the machine learning cookbook thus shifted to incorporate transfer learning. First, you take your large-scale neural network and train it on ImageNet. Then, you throw away the very last layer of the network, which maps the learned image features to ImageNet classes, and plug in an untrained part that uses those features for some other task you care about (e.g. detecting cats). Now you have a network with millions of parameters trained on ImageNet, which already does something useful, plus some parameters that were just randomly initialized. You can then use a much smaller amount of data to fine-tune the network for your own task. The network first learned general features from ImageNet, and then learned how to specialize that general knowledge to a single task.
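The recipe above can be sketched in a few lines of numpy. This is a toy illustration, not code from the paper: the "backbone" is a single random matrix standing in for a pre-trained network, and the shapes, learning rate, and two-class task are all made up for the example. The point is just that fine-tuning only updates a small new head on top of frozen, reusable features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this matrix was learned by pre-training on ImageNet:
# it maps raw inputs (toy 512-dim "images") to 64 generic features.
pretrained_backbone = rng.standard_normal((512, 64)) / np.sqrt(512)

def features(x):
    # Frozen backbone: we reuse everything learned during pre-training.
    return np.maximum(x @ pretrained_backbone, 0.0)  # ReLU

# Throw away the old ImageNet classification head and plug in a fresh,
# randomly initialized one for the new task (say, cat vs. not-cat).
head = rng.standard_normal((64, 2)) * 0.01

# Fine-tuning only touches the small new head, so a tiny dataset suffices.
x_small = rng.standard_normal((8, 512))
y_small = rng.integers(0, 2, size=8)

lr = 0.1
for _ in range(1000):
    feats = features(x_small)
    logits = feats @ head
    logits -= logits.max(axis=1, keepdims=True)    # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    grad = probs.copy()
    grad[np.arange(len(y_small)), y_small] -= 1.0  # d(cross-entropy)/d(logits)
    head -= lr * feats.T @ grad / len(y_small)

train_acc = ((features(x_small) @ head).argmax(axis=1) == y_small).mean()
```

Because only the 64×2 head is trained (a convex problem, in fact), eight labeled examples are enough to fit the new task, which is the whole appeal of the paradigm.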

State of the world: with big supervised datasets to pre-train and small datasets to fine-tune, deep learning works.

This turned out to work well, but it had two crucial limitations: (1) it was still expensive to build such large datasets for things that were not images, so the same paradigm couldn’t be directly applied to, say, text and audio, and (2) ImageNet has a limited size, so at some point we’d reach models that could still benefit from even more data. In short, the cost of annotation was again the bottleneck.

Then, in NLP, people found a way out. The Transformer architecture came about, and with it, self-supervised learning objectives started to catch on. Given a large body of text, for instance, you can train a model to predict the next word in a sequence (a.k.a. language modeling). This requires no explicit human annotations, just lots of text, which is available in practically infinite (and growing) supply on the Web. Now we could train extra-large models such as BERT & co. on equally huge amounts of unstructured text, which gives us rich contextual representations for each word in a sentence. These representations have since been used with much success in downstream tasks.
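To see why this needs no annotation at all, here is a minimal sketch of how next-word training pairs fall out of raw text for free. The whitespace "tokenizer" and fixed context window are simplifications for illustration; real models like GPT use subword tokenization and much longer contexts.

```python
def next_word_examples(text, context_size=3):
    """Turn raw text into (context, next_word) training pairs.

    The "labels" come for free from the text itself: every position's
    target is simply the word that follows it. No human annotation needed.
    """
    words = text.split()  # toy whitespace tokenizer; real models use subwords
    examples = []
    for i in range(context_size, len(words)):
        examples.append((words[i - context_size:i], words[i]))
    return examples

pairs = next_word_examples("the cat sat on the mat and purred")
# first pair: (['the', 'cat', 'sat'], 'on')
```

Any string of text yields training signal this way, which is what lets the dataset grow to Web scale.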

State of the world: with small datasets to fine-tune, deep learning works: we can pre-train on huge unsupervised data.

However, to apply BERT to a task, you still need data to fine-tune on, which is easier than collecting an entire dataset to train a model from scratch, but still not trivial. This is different from humans - we do have generic representations of words and objects somewhere in our heads, but it usually doesn’t take more than a few examples for someone to pick up a simple task.

Here is where GPT-3 comes in. Language modeling is AI-complete, since we can express most things we can imagine with language. So what happens if we just grow GPT-2, OpenAI’s previous state-of-the-art language model? I mean, not grow it by a little, not like a 1-year-old kid that grows 20cm in a year and we say it’s a lot - I’m talking about growing it by something like 150x. This is basically GPT-3. Technically, there’s not much new, despite the scale and the obvious new challenges it brings. GPT-3 is so big that it’s practically impossible to fine-tune it - it would be too expensive. But the idea is that you shouldn’t need to: as the title of the paper suggests, GPT-3 can pick up lots of tasks with just a small prompt and a few examples (a.k.a. few-shot learning), or sometimes just a natural language description of the task, with no examples at all.
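Concretely, the few-shot "learning" happens entirely inside the prompt: you concatenate a task description, a handful of solved examples, and your unsolved query, then let the model continue the text. The exact format below is illustrative, not taken from the paper, and no model weights are updated at any point.

```python
def few_shot_prompt(task_description, examples, query):
    """Assemble a GPT-3-style few-shot prompt: a task description,
    a handful of solved examples, and an unsolved query.
    The model's weights never change; the "learning" is in-context."""
    lines = [task_description, ""]
    for source, target in examples:
        lines.append(f"Q: {source}")
        lines.append(f"A: {target}")
    lines.append(f"Q: {query}")
    lines.append("A:")   # the model is asked to continue from here
    return "\n".join(lines)

prompt = few_shot_prompt(
    "Translate English to French.",
    [("cheese", "fromage"), ("cat", "chat")],
    "dog",
)
```

Feeding this string to the language model and reading off its continuation is the entire "training" procedure for a new task - no gradient updates, no fine-tuning dataset.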

Pursued state of the world: pre-train an extra-huge model that can learn tasks from just a few examples.

The paper tests GPT-3 on tons of classical NLP tasks: from question answering (on TriviaQA) to machine translation to natural language inference. The concept is very interesting, for sure. As for the hype, it’s clearly too much. The paper itself shows that GPT-3 rarely performs better than specialized models on most of the tasks. People playing with it on Twitter quickly find broken examples. The impressive feat is having a single model that can do all of these tasks quite reasonably. On the other hand, it was also trained on so much more data than any specialized model, and language is such a flexible representation, that this isn’t a crazy outcome to imagine given such a gigantic language model.

Also, about the productization of GPT-3: I don’t think having a company whose secret sauce is GPT-3 behind the scenes is a good idea. It’s just not an advantage if everyone else can also sign up for the API. But as OpenAI has repeatedly stated, this is still very far from the end - I’m sure they’ll keep working on better models instead of capitalizing on GPT-3. The API might just be a way of seeing what people want to do with it in the wild.

Gabriel Poesia
Computer Science PhD student