Language Models are Few-Shot Learners

Original paper · Brown et al., 2020

I originally posted a review of this paper back in November 2022, but as I've become more accustomed to the field and learnt more of the history of language modelling, I felt it was time to go through this paper again and provide a better summary. Besides, given the impact this paper has had on the field, it definitely warrants another read.

This truly was a seminal paper for large language models, sprouting from the seeds of GPT-2 and the Scaling Laws paper. OpenAI kept hammering on the need to remove fine-tuning and instead move towards meta-learning or in-context learning: developing a broad set of skills at training time and then rapidly adapting to a given task at inference, using nothing but the prompt. They approach this problem by really pushing the scaling laws with a 175B parameter model trained on 300B tokens. To compare this size to some of the other models around this time: BERT - 340M, Transformer-XL - 250M, XLNet - 110M, GPT-2 - 1.5B. GPT-3 really set off the new wave of LLMs of extraordinary size and capability that we see today.
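To make the zero-shot / one-shot / few-shot distinction concrete, here is a minimal sketch of how an in-context prompt is assembled as plain text with no gradient updates. The English-to-French pairs are modelled on the paper's own example; the `build_prompt` helper and its formatting are my illustration, not anything from the paper.

```python
def build_prompt(task_description: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble an in-context prompt: task description, k solved examples, then the query."""
    lines = [task_description]
    for source, target in examples:   # k = 0 gives zero-shot, k = 1 one-shot, larger k few-shot
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")       # the model is expected to continue the text from here
    return "\n".join(lines)

examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
print(build_prompt("Translate English to French:", examples, "peppermint"))
```

The point is that "learning" the task happens entirely at inference time, by conditioning on the demonstrations in the context window rather than by updating any weights.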

Training, Model and Data

While this was a foundational paper for LLMs, there was relatively little innovation (at least revealed to us) when it comes to architecture, training and data. Everything is very similar to GPT-2, with the most notable exception being the alternation of dense and locally banded sparse attention layers in the transformer (see the sketch after this paragraph). I'm not familiar with these architectures so maybe I'll have to cover them soon. Besides this, the authors trained a family of models spanning three orders of magnitude (125M to 175B parameters) in order to validate the scaling laws. The context window was increased to 2048 tokens. Training data was curated from the Common Crawl dataset mixed with higher-quality corpora such as Wikipedia. Unfiltered Common Crawl is 45TB of compressed text, which was filtered down to 570GB using a logistic regression quality classifier and fuzzy deduplication. Finally, the dataset was searched for documents that overlap with the benchmark test sets.
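On the attention layers: the paper only states that dense and locally banded sparse patterns alternate, similar to the Sparse Transformer, without giving details. Below is a minimal sketch, assuming "locally banded" means each position attends only to a fixed window of recent tokens; the `window` parameter and the mask construction are my assumptions, not OpenAI's implementation.

```python
import numpy as np

def dense_causal_mask(n: int) -> np.ndarray:
    """Dense causal attention: every position attends to all earlier positions and itself."""
    return np.tril(np.ones((n, n), dtype=bool))

def banded_causal_mask(n: int, window: int) -> np.ndarray:
    """Locally banded causal attention: each position attends only to the `window` most recent positions."""
    mask = dense_causal_mask(n)
    for i in range(n):
        mask[i, : max(0, i - window + 1)] = False  # zero out everything older than the window
    return mask

print(dense_causal_mask(6).astype(int))
print(banded_causal_mask(6, window=3).astype(int))
```

The attraction of the banded pattern is that the number of attended positions per token is constant rather than growing with sequence length, which cuts the cost of attention when alternated with dense layers.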

Final thoughts

The presented model is able to perform in-context learning without any fine-tuning; its zero-shot, one-shot and few-shot results match or even outperform state-of-the-art fine-tuned models on several benchmarks, moving the field closer to a general language model.