Universal Language Model Fine-tuning for Text Classification
Original paper · Howard & Ruder, 2018
Returning after a long hiatus: I'm currently undertaking my Master's thesis at Karolinska Institute, so I haven't really had the time or energy to read outside of everything I'm already reading daily. Anyway... I missed this paper on my initial pass through the language modelling field, but I feel it is worth a quick review given its impact on subsequent seminal models such as GPT and BERT. ULMFiT introduced a novel method of pre-training and fine-tuning language models, enabling robust inductive transfer learning for any NLP task, akin to fine-tuning ImageNet models in computer vision.
Inductive transfer learning
The authors recognize the strength and impact that transfer learning has had on the computer vision field; applied CV models are almost always fine-tuned from a backbone pre-trained on ImageNet or MS-COCO. In NLP, inductive transfer learning has yet to see the same success, with most main-task models still being trained from scratch, albeit with pre-trained embeddings used as fixed parameters. The authors argue that LM fine-tuning is entirely feasible and has only been held back by the community's lack of an effective fine-tuning methodology.
ULMFiT
As the name suggests, ULMFiT presents a universal method for language model pre-training and fine-tuning, in the sense that it 1) works across tasks varying in document size, number and label type; 2) uses a single architecture and training process; 3) requires no custom feature engineering; and 4) does not require additional in-domain documents or labels. The method consists of three steps:
General-domain LM pretraining
The LM is trained on an ImageNet-like corpus that captures general properties of language. Here the authors use WikiText-103, consisting of 28,595 preprocessed Wikipedia articles (roughly 103 million words). This stage is expensive, but only needs to be performed once.
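The paper's LM is the AWD-LSTM of Merity et al.; purely to make stage one concrete, below is a minimal PyTorch sketch of a stripped-down LSTM language model trained with next-token prediction. The layer sizes, the optimizer and the absence of AWD-style dropout regularization are all simplifications of mine, not the paper's setup.

```python
import torch
import torch.nn as nn

class WordLM(nn.Module):
    """A bare-bones LSTM language model: embedding -> LSTM -> tied decoder."""
    def __init__(self, vocab_size, emb_dim=400, n_layers=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The real AWD-LSTM uses 1150 hidden units plus heavy dropout;
        # here the hidden size simply matches emb_dim so weights can be tied.
        self.lstm = nn.LSTM(emb_dim, emb_dim, num_layers=n_layers, batch_first=True)
        self.decoder = nn.Linear(emb_dim, vocab_size)
        self.decoder.weight = self.embed.weight  # weight tying

    def forward(self, tokens):                   # tokens: (batch, seq_len)
        out, _ = self.lstm(self.embed(tokens))
        return self.decoder(out)                 # (batch, seq_len, vocab_size)

def pretrain(model, batches, epochs=1, lr=3e-4):
    """Plain next-token prediction on a general-domain corpus (e.g. WikiText-103)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in batches:                     # y is x shifted one token ahead
            logits = model(x)
            loss = loss_fn(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
```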
Target task LM fine-tuning
The authors recognize that no general-domain corpus can capture all possible target tasks. Hence, they introduce a second, much cheaper step: fine-tuning the LM on the data of the target task (no labels are needed at this stage). Because the model only has to adapt to the idiosyncrasies of the target data, this step converges quickly even on small datasets.
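For this stage the paper introduces discriminative fine-tuning, where each layer gets its own learning rate (divided by 2.6 for every layer closer to the input), and slanted triangular learning rates, a short linear warm-up followed by a long linear decay. Below is a sketch of both using the paper's default hyperparameters; the usage comment and the assumption of a `lm.layers` attribute are my own simplifications.

```python
import torch

def slanted_triangular_lr(t, T, lr_max=0.01, cut_frac=0.1, ratio=32):
    """Slanted triangular schedule: increase the LR linearly for the first
    cut_frac of the T steps, then decay it linearly back down."""
    cut = int(T * cut_frac)
    if t < cut:
        p = t / cut
    else:
        p = 1 - (t - cut) / (cut * (1 / cut_frac - 1))
    return lr_max * (1 + p * (ratio - 1)) / ratio

def discriminative_param_groups(layers, base_lr=0.01, decay=2.6):
    """One optimizer parameter group per layer, dividing the learning rate
    by `decay` for each layer closer to the input (eta^{l-1} = eta^l / 2.6)."""
    groups = []
    for depth, layer in enumerate(reversed(list(layers))):  # top layer first
        groups.append({"params": list(layer.parameters()),
                       "lr": base_lr / (decay ** depth)})
    return groups

# Usage sketch, assuming `lm.layers` is an iterable of the LM's layer modules:
# opt = torch.optim.SGD(discriminative_param_groups(lm.layers), lr=0.01)
# at step t, multiply every group's base rate by
# slanted_triangular_lr(t, T) / 0.01 before calling opt.step().
```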
Target task classifier fine-tuning
Finally, the pre-trained LM is augmented with two linear blocks that form the classifier; these are the only layers whose parameters are learnt from scratch.
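In the paper, the first linear block takes a concat-pooled representation of the LM's final hidden states (last state, max pool and mean pool), both blocks use batch normalization and dropout with a ReLU in between, and fine-tuning proceeds with gradual unfreezing, one layer group per epoch from the top down. A hedged PyTorch sketch; the head hidden size and dropout values are placeholders rather than the paper's exact settings:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Two linear blocks on top of the LM's hidden states, ULMFiT-style."""
    def __init__(self, lm_hidden, n_classes, head_hidden=50, p_drop=0.1):
        super().__init__()
        self.blocks = nn.Sequential(
            nn.BatchNorm1d(3 * lm_hidden), nn.Dropout(p_drop),
            nn.Linear(3 * lm_hidden, head_hidden), nn.ReLU(),
            nn.BatchNorm1d(head_hidden), nn.Dropout(p_drop),
            nn.Linear(head_hidden, n_classes),   # softmax is applied by the loss
        )

    def forward(self, hidden_states):            # (batch, seq_len, lm_hidden)
        pooled = torch.cat([hidden_states[:, -1],               # last state
                            hidden_states.max(dim=1).values,    # max pool
                            hidden_states.mean(dim=1)], dim=1)  # mean pool
        return self.blocks(pooled)

def gradually_unfreeze(layer_groups, epoch):
    """Unfreeze one more layer group per epoch, starting from the top."""
    for depth, group in enumerate(reversed(list(layer_groups))):
        for p in group.parameters():
            p.requires_grad = depth <= epoch
```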
Analysis
The authors present an analysis of each contribution and find a number of interesting points that have truly carried over into later NLP work. They show the importance of general-domain LM pretraining, with the gains being largest when the available labeled dataset is small or medium-sized. Additionally, the authors show the importance of in-domain LM fine-tuning, pointing to the fact that target tasks often lie outside the distribution of the general-domain corpus.
Final thoughts
It is really cool to see the first introduction of this kind of universal pretraining / fine-tuning method that has since been so widely adopted in the field. Sure, some of the specific fine-tuning methods are no longer employed, but the overarching approach is very familiar. These guys really set the standard for modern language models.