Cramming: Training A Language Model On A Single GPU In One Day

Original paper · Geiping et al., 2022

A lot of the NLP papers I cover on this blog take a scaling approach to improving performance, in line with recent trends in research. Interestingly, this paper takes the opposite approach and asks: how far can we get with a single GPU in just one day? As a point of comparison, the authors use the original BERT paper and its later derivatives, with the overarching goal of competing with BERT's downstream ability, particularly on the GLUE benchmark.

Dataset

Instead of limiting themselves to the original dataset used in BERT, the authors use a recent dump of Wikipedia combined with the English BookCorpus, opening the door to improvements in data curation and quality. They choose a WordPiece tokenizer with a vocabulary size of 2^15 = 32768, after finding that anything smaller resulted in worse performance while larger vocabularies were not reliably better. The raw tokenized data is then sampled into random sequences of length 128.
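To make the preprocessing concrete, here is a minimal sketch using the HuggingFace tokenizers library; the file name corpus.txt, the choice of special tokens, and the contiguous (rather than randomly sampled) packing are my own simplifications, not the authors' exact pipeline.

```python
from tokenizers import BertWordPieceTokenizer

# Train a WordPiece tokenizer with a 2^15 = 32768 token vocabulary.
# "corpus.txt" is a placeholder for the Wikipedia + BookCorpus text.
tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=32768,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

# Pack the tokenized corpus into sequences of 128 tokens
# (shown contiguously here for simplicity).
ids = []
with open("corpus.txt") as f:
    for line in f:
        ids.extend(tokenizer.encode(line.strip()).ids)

seq_len = 128
sequences = [ids[i:i + seq_len] for i in range(0, len(ids) - seq_len + 1, seq_len)]
```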

Scaling down?

The clearest way of scaling down training is to limit the capacity of the model. However, the authors present experiments showing that varying the transformer type and size has only minimal effect on the final loss. While models with larger capacity learn more efficiently, with their MLM loss decreasing faster on a per-gradient basis, smaller architectures make up for their slower learning by pushing more data through in the same time. One might imagine this poses a problem: the authors have shown that changes in transformer size and type offer little to no gain. But because per-gradient efficiency remains nearly constant across models of the same size, regardless of their shape, they can instead focus on optimizations that speed up gradient computation.
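A hypothetical back-of-the-envelope calculation illustrates the trade-off; the throughput numbers below are invented purely for illustration and are not measurements from the paper.

```python
# Smaller models learn less per gradient step, but within a fixed one-day
# budget they see proportionally more tokens, which offsets the difference.
seconds_per_day = 24 * 60 * 60

large_model_tokens_per_sec = 20_000   # hypothetical: learns more per step
small_model_tokens_per_sec = 60_000   # hypothetical: learns less per step, sees 3x the data

print("large model tokens/day:", large_model_tokens_per_sec * seconds_per_day)
print("small model tokens/day:", small_model_tokens_per_sec * seconds_per_day)
```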

Architectural Optimizations

The proposed architectural changes all fall into the category of speeding up gradient computation. Below, I list the most interesting modifications and describe them briefly.

Attention Block
All query, key and value biases are removed, exploiting the scaling law by cutting a layer of computation without noticeably changing the model size. Multi-head self-attention is a costly operation, but the authors find the heads are ultimately worth the cost and keep all 12 of them. Efficient transformers have seen a lot of research work, but due to the limited sequence length the authors see no real gain from mechanisms such as FLASH attention, Fourier attention, rotary embeddings, etc.
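Below is a minimal sketch of what a bias-free multi-head self-attention block could look like in PyTorch; the hidden size of 768 and the 12 heads follow the BERT-base-like setup described here, while everything else is my own simplification rather than the authors' code.

```python
import torch
import torch.nn as nn

class BiaslessSelfAttention(nn.Module):
    """Multi-head self-attention with all QKV and output-projection biases removed."""

    def __init__(self, hidden: int = 768, heads: int = 12):
        super().__init__()
        self.heads = heads
        self.head_dim = hidden // heads
        # bias=False drops the per-projection bias terms
        self.qkv = nn.Linear(hidden, 3 * hidden, bias=False)
        self.out = nn.Linear(hidden, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, h = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, seq, head_dim)
        q, k, v = (z.view(b, t, self.heads, self.head_dim).transpose(1, 2) for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, h))
```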

Feedforward Block
The original block is largely unchanged. They observe no benefit from changing the activation function to the likes of GELU, but, just as in the attention block, removing all linear-layer biases leverages the scaling law by accelerating gradient computation without noticeably changing the model size.
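Correspondingly, a sketch of a bias-free feed-forward block; the 4x expansion factor is the standard BERT-base choice and an assumption on my part rather than a detail quoted from the paper.

```python
import torch.nn as nn

# Feed-forward block with all linear-layer biases removed, otherwise unchanged.
def feed_forward(hidden: int = 768, expansion: int = 4) -> nn.Sequential:
    return nn.Sequential(
        nn.Linear(hidden, expansion * hidden, bias=False),
        nn.GELU(),
        nn.Linear(expansion * hidden, hidden, bias=False),
    )
```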

Layer Structure
Recent years have shown that pre-normalization with Layer Norms is more beneficial than post-normalization. The key benefit of these layers seems to be stabilized training, enabling larger learning rates and reduced warmup.
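The sketch below shows the pre-normalization layout, where LayerNorm is applied before each sub-layer and the residual is added around it, instead of normalizing after the residual as in the original BERT; `attention` and `ffn` stand in for blocks like the ones sketched above.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Transformer block with pre-normalization (LayerNorm before each sub-layer)."""

    def __init__(self, hidden: int, attention: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.attention = attention
        self.ffn = ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attention(self.norm1(x))   # normalize, then attend
        x = x + self.ffn(self.norm2(x))         # normalize, then feed-forward
        return x
```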

Training Setup Optimizations

The original BERT training recipe doesn't perform well in the cramming setting, so the authors propose a few changes to the training setup.

Optimizer
They keep AdamW as their optimizer, after finding no benefit from other first-order adaptive optimizers or from higher-order ones.
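In PyTorch this is simply torch.optim.AdamW; the model stand-in and the hyperparameters below are common placeholders of my choosing, not the values reported in the paper.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in for the cramming model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,            # placeholder values, not the paper's settings
    betas=(0.9, 0.98),
    weight_decay=0.01,
)
```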

Dropout
In the cramming setup, the training data is large relative to the available compute, the opposite of BERT's original setup. With a single-epoch schedule, overfitting is not possible, and because dropout effectively reduces the number of gradient updates seen by each parameter, the authors remove it entirely.
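One way to realize this is to zero out the dropout probabilities in the model configuration; the HuggingFace BertConfig below is my illustration, not necessarily the authors' setup.

```python
from transformers import BertConfig

# Disable dropout everywhere for the single-epoch cramming regime.
config = BertConfig(
    vocab_size=32768,
    hidden_dropout_prob=0.0,
    attention_probs_dropout_prob=0.0,
)
```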

Dataset Optimizations

Four different datasets are considered, each of which gets its own vocabulary generated with WordPiece. To improve the data, the authors first attempt deduplication, but find that it does not improve overall performance. Next, they filter out data that doesn't compress well under the tokenizer and find that this yields measurable improvements. Finally, empirical tests show that increasing the vocabulary size improves performance up to around 32k.
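A sketch of what the compression-based filter could look like: drop text that needs too many tokens per character. The threshold, the per-line granularity and the file names are assumptions for illustration, not the authors' exact procedure.

```python
from tokenizers import BertWordPieceTokenizer

# "vocab.txt" is assumed to come from the earlier tokenizer-training step.
tokenizer = BertWordPieceTokenizer("vocab.txt")

def keep_line(line: str, max_tokens_per_char: float = 0.3) -> bool:
    # Lines that compress poorly (many tokens per character) are filtered out.
    n_tokens = len(tokenizer.encode(line).ids)
    return n_tokens / max(len(line), 1) <= max_tokens_per_char

with open("corpus.txt") as f:
    filtered = [line for line in f if keep_line(line.strip())]
```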

Results

The results from this cramming setup are impressive, almost matching those of a fully pretrained BERT-base. The authors find that about two percentage points of average GLUE score were gained through architectural changes, one percentage point from data changes and half a percentage point from training modifications.