Language Models are Unsupervised Multitask Learners
Original paper · Radford et al., 2019
GPT-2! I'm not sure why I originally missed this paper, it must have fallen through the cracks while I was head-deep in Jay Alammar's work. Anyway, as always, the GPT papers are a great read, so let's get into it.
History
Machine learning systems excel as narrow experts. Collecting a dataset of labeled examples, training a system to imitate these behaviors and then evaluating it on independent and identically distributed (IID) held-out examples is the primary way of creating these systems, and this approach has led to increasingly capable agents. However, these agents are brittle and prone to failure when presented with even slight distribution shifts. The authors suspect that in order to create more general systems - ones that can perform without the need for downstream labeled training datasets - models need to be trained in a multitask fashion. Datasets / benchmarks such as GLUE and SQuAD have been proposed to begin studying this; I'll be looking at these in the future. To excel in a multitask training setting we need to be smarter: from a meta-learning perspective, each (dataset, objective) pair is a single training example, and current ML systems need hundreds of thousands of examples to induce functions that generalize well. This is not feasible - we can't expect to curate anywhere near that many (dataset, objective) pairs. Instead, the authors propose a model that can perform downstream tasks in a zero-shot setting, without any parameter or architecture modification.
Approach
The authors touch on something very interesting regarding language modelling and unsupervised multitask learning. Historically, task conditioning has been implemented at the architectural level, e.g. with task-specific encoders / decoders. But language in itself is a flexible tool, and it can be used to specify the task, the input and the output all as a single sequence of symbols. Language models can hence learn to perform tasks without being explicitly told what the expected output should be: since the supervised objective is the same as the unsupervised one, just evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also a global minimum of the supervised objective. The authors speculate that language models with sufficient capacity, trained on large corpora of web text, learn to infer and perform NLP tasks in order to better predict the sequences they see. This train of thought is exciting - it reminds us of the knowledge, intent and emotion that can be derived from text. To think that a model, without task-conditioned architectures or task-specific supervised data, learns to perform sentiment analysis as a means to better predict the next word in a sentence is amazing, but I think the simplicity of it is what makes it so beautiful.
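To make "specifying a task as a sequence of symbols" concrete, here's a minimal sketch of zero-shot prompting. This is my own illustration rather than anything from the paper: it assumes the Hugging Face transformers port of GPT-2 and borrows the "TL;DR:" prompt trick the authors use for zero-shot summarization.

```python
# Zero-shot summarization: the task, the input and the (to-be-generated) output
# all live in one text sequence - no task-specific head, no fine-tuning.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = (
    "The city council approved the new transit plan after months of debate, "
    "citing reduced congestion and lower emissions as the deciding factors.\n"
    "TL;DR:"
)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output_ids = model.generate(
    input_ids,
    max_new_tokens=30,
    do_sample=True,
    top_k=2,  # the paper samples with top-k = 2 for its TL;DR: summaries
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated continuation, not the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True))
```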
Dataset
Obviously, the approach demonstrated earlier requires a large and diverse dataset that doesn't make any assumptions about the tasks to be performed. Common Crawl is the most expansive source of unfiltered text, but it has huge quality issues. Instead, the authors built their own scraper that collects outbound links from Reddit posts with at least 3 karma as a rough human quality filter, resulting in the 40GB WebText dataset. Unfortunately this dataset was not released to the public.
Tokenization
There are three main routes to choose from when it comes to tokenization: word-based, character-based and sub-word tokenization. In the word-based setting we end up with a large vocabulary where each token carries the full meaning of a word. Character tokenization gives us a small vocabulary, but each character in itself carries little meaning. Sub-word tokenization is a compromise between the two, and Byte Pair Encoding (BPE) is an example of such a sub-word tokenizer: the most frequent pair of symbols is repeatedly merged into a new vocabulary entry until a pre-determined vocabulary size is reached, so common words end up as single tokens while rare words are split into sub-word pieces. GPT-2 runs BPE directly on bytes, which keeps the base vocabulary at 256 symbols and means no input can ever be out-of-vocabulary.
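Since the post only mentions BPE in passing, here's a toy sketch of the merge procedure. It assumes character-level symbols and whitespace pre-tokenization for readability; GPT-2's actual tokenizer works on raw bytes and adds rules that prevent merges across character categories.

```python
# Toy BPE training: repeatedly merge the most frequent adjacent pair of symbols.
from collections import Counter

def train_bpe(words, num_merges):
    """Learn `num_merges` merge rules from a list of pre-tokenized words."""
    vocab = Counter(tuple(word) for word in words)  # word (as symbol tuple) -> frequency
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair becomes a new token
        merges.append(best)
        # Re-tokenize every word with the new merge applied.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(train_bpe(["low", "low", "lower", "newest", "newest", "widest"], num_merges=5))
# learns merges like ('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ...
```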
Results
The results from the paper are impressive, but the details don't engage me, so check out the paper yourself if you're interested. Overall, GPT-2, the name of the largest model with 1.5B parameters, improves the state of the art on 7 out of 8 tested language modelling datasets in a zero-shot setting.
Memorization vs Generalization
An interesting section that discusses the risk of overlap, due to near-duplication, between test and training data. The authors created Bloom filters containing 8-grams of their training data to test this. They draw the conclusion that data overlap provides a small but consistent improvement to the reported results. This is a problem that needs to be accounted for when working with increasingly large datasets, and it is especially prevalent in the language modelling field.
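For intuition, here's a rough sketch of that overlap check: build a Bloom filter over the training set's 8-grams, then probe it with every 8-gram of a test document. The filter size, hash construction and whitespace tokenization below are my own assumptions, not the authors' settings.

```python
# Approximate 8-gram overlap between a test document and a training corpus
# using a small from-scratch Bloom filter (probabilistic set membership).
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 24, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive `num_hashes` bit positions from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def ngrams(text, n=8):
    # Normalize to lower-cased, space-delimited tokens before taking n-grams.
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

train_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
test_doc = "a quick brown fox jumps over the lazy dog near the river today"

bloom = BloomFilter()
for doc in train_docs:
    for gram in ngrams(doc):
        bloom.add(gram)

test_grams = ngrams(test_doc)
overlap = sum(gram in bloom for gram in test_grams) / max(len(test_grams), 1)
print(f"{overlap:.0%} of the test document's 8-grams also appear in the training data")
```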
Conclusion
This work demonstrated the ability of large LMs to perform a wide range of tasks without the need for explicit supervision. While it performs well on some tasks, there are still areas where it fails to produce usable results and is at best random. However, this paper opens the door to future research in unsupervised task learning.