RoBERTa: A Robustly Optimized BERT Pretraining Approach
Original paper ยท Liu et al 2019
A lot of papers surrounding BERT and it's pretraining objective were released in 2019. This paper is especially alluring as it is a replication study of BERT pretraining that carefully tries to measure the impact of certain hyperparameters ultimately re-establishing that BERT's masked language model training objective is competitive with other recently proposed objectives such as XLNet and XLM. Funny enough, the authors of XLNet actually punched back following this publication, introducing a version of XLNet that again outperformed RoBERTA, but more on that in a later blog post. Anyway, RoBERTa is a training recipe for BERT-like models, motivated by the fact that the authors find BERT to be significantly undertrained. Given the improved training recipe the authors find that RoBERTa can match or even exceed the performance of all post-BERT methods.
Before we dive into the details of RoBERTa, remember that this is a replication study. The authors begin by carefully controlling for different parameters in BERT's setup, evaluating their influence with the ultimate goal of reaching an improved training recipe. Let's freshen up our memory on BERT
Refresher on BERT
BERT is a pre-training approach which takes a concatenation of two segments as input. These segments consist of natural sentences, commonly multiple, and are seperated with a special [SEP] token. BERT trains on these sentences with two objectives in mind. First, BERT is tasked with predicting masked tokens in the input sentence. Tokens are randomly sampled for masking. Secondly, BERT used the two sentences to predict whether they follow each other or not in the original text. Positive and negative examples are sampled equally. As for the architecture BERT uses the ubiquitous Transformer architecture. The model is optimized with Adam and trained for 1 million updates, with mini-batches containing 256 sequences and maximum token length of 512. Finally, BERT was trained on a combination of BookCorpus plus English Wikipedia for a total of 16GB of uncompressed text.
Replication study
Now that we've covered the previous BERT setup, let's jump in to the replication study. The authors reimplement original BERT and make no evident changes to the hyperparameters or architecture. They do however drastically increase the pre-training dataset from 16GB to 160GB of uncompressed english text. This extensive increase is analogous to other studies at the time (GPT and XLNet).
The authors explore and quantify different parts of the BERT training procedure finding a couple of important modifications that either boost efficiency or performance. Switching from a static mask to a dynamic masking scheme skips the need to duplicate the training data with different masks and slightly boost downstream performance. This is particularly beneficial as the training data has increased 10 fold. Next, omitting the Next Sentence Prediction loss matches baseline performance and the authors propose instead to use a Full-Sentence input where sentences are sampled contiguously from one document until the total length is 512 tokens. They also propose the use of much larger mini-batch sizes, both for efficiency (parallelization of larger batches is much easier) and performance. Finally they move away from standard BPE encoding to byte-level BPE which was introduced in GPT-2.
RoBERTa
All of the procedural modifications noted in the last section are combined and evaluated. This configuration is named RoBERTa and it is evaluated while controlled for both data size and training passes. Results indicate that even when trained on the same dataset as BERT, RoBERTa provides significant improvement. This performance is improved further by training on the combined 160GB dataset mentioned earlier with the same number of training passes indicating the importance of a wide and diverse dataset. Finally the authors train RoBERTa for much longer and continue to see performance improvements all the way from 100k - 500k steps and note that even at 500k have yet to notice overfitting. All in all I find that the authors proved the competence of BERTs pre-training approach given small modifications in addition to once again highlighting the impact that data availability and training time has.