Scaling Language Models: Methods, Analysis & Insights from Training Gopher

Original paper · Rae et al., 2021

Gopher is the first language model family from DeepMind, released more than a year after GPT-3. DeepMind pushes the parameter limits, training and evaluating six transformer models ranging from 44 million to 280 billion parameters, which establishes them as one of the "top dogs" in the field. There isn't too much to talk about in this paper specifically, as it mostly strikes me as DeepMind's entry into the field: an opportunity to test the waters and see what they can learn from producing such a model. One important thing to note is that even though the paper was published more than a year after GPT-3, the models were actually trained just a couple of months after it.

Models, training and data

So there seems to be very little that separates the methodology of this paper from GPT-3's, which is constantly referenced in the architectural descriptions and so on. There are two key modifications: the use of RMSNorm instead of LayerNorm, and relative positional encodings (introduced in Transformer-XL) for longer contextual understanding. I think this is the first time I'm seeing RMSNorm in a transformer model. RMSNorm is a simplification of LayerNorm: it still normalizes the input across the features, but does away with the mean-centering operation (and the bias), which reduces computational overhead.

Anyway, Gopher uses SentencePiece (an open-source unsupervised text tokenizer developed by Google) with a vocabulary of 32,000 (significantly smaller than GPT-3's) and is trained on 300B tokens with a 2048-token context window. DeepMind also uses the warm-up learning rate strategy we've seen before, combined with an increasing batch size. Finally, the models are trained on MassiveText, a collection of English text documents from the web, books, news articles and code, totaling 10.5TB of text. The authors sub-sample from MassiveText, using about 12.8% of its total tokens.
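To make the LayerNorm/RMSNorm distinction concrete, here is a minimal sketch of RMSNorm in PyTorch. This is my own illustration of the technique, not DeepMind's implementation: the layer rescales each vector by the root mean square of its features and applies a learned gain, dropping the mean subtraction and bias that LayerNorm performs.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: rescale by the root mean square of the features.

    Compared to LayerNorm there is no mean subtraction and no bias term,
    which removes one reduction over the features and a few parameters.
    """
    def __init__(self, dim: int, eps: float = 1e-8):
        super().__init__()
        self.eps = eps
        self.gain = nn.Parameter(torch.ones(dim))  # learned per-feature scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.gain * x / rms
```

It drops in wherever LayerNorm would otherwise sit in the transformer block.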
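For the tokenizer, SentencePiece can be trained and applied with just a few calls. The sketch below is only illustrative: the corpus path, model prefix and model type are placeholders of mine, with only the 32,000 vocabulary size taken from the paper.

```python
import sentencepiece as spm

# Train a 32k-vocabulary tokenizer on a plain-text corpus.
# 'corpus.txt', the model prefix and model_type are illustrative choices.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="tokenizer",
    vocab_size=32_000,
    model_type="unigram",  # SentencePiece's default; not a detail confirmed by the paper
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Scaling laws keep paying off.", out_type=str))  # subword pieces
print(sp.encode("Scaling laws keep paying off."))                # token ids
```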
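The learning-rate and batch-size handling can be written as two small schedule functions. The shape (linear warm-up followed by a decay, plus a step increase in batch size partway through training) follows the common pattern the post refers to; all the specific constants below are placeholders, not the values DeepMind used.

```python
import math

def learning_rate(step: int,
                  max_lr: float = 4e-5,        # placeholder values throughout
                  warmup_steps: int = 1_500,
                  total_steps: int = 100_000,
                  final_frac: float = 0.1) -> float:
    """Linear warm-up from ~0 to max_lr, then cosine decay to final_frac * max_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return max_lr * (final_frac + (1.0 - final_frac) * cosine)

def batch_size_tokens(step: int,
                      initial: int = 3_000_000,  # tokens per batch, illustrative
                      final: int = 6_000_000,
                      switch_step: int = 50_000) -> int:
    """Step-wise batch-size increase partway through training."""
    return initial if step < switch_step else final
```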

Results

DeepMind does a massive evaluation across 124 tasks, providing an immense amount of data and results that will probably be drawn on extensively when designing future models. Gopher proves very capable and outperforms SOTA LMs on 100 of these tasks, where the baselines include GPT-3, Jurassic-1 and Megatron-Turing NLG. These are great results which continue to indicate the power of the scaling laws, as performance just keeps getting better and better. Some interesting ablation studies are performed in the appendices and look at:

  • Adam vs Adafactor - finding that Adam was much more stable
  • Low precision training - finding that training in bfloat16 while keeping a float32 copy of the parameters in the optimizer state saves memory while mitigating most of the performance loss (a rough sketch of this pattern follows the list)
  • Scaling context length - showing that performance improves roughly in proportion to the square root of the context length
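To illustrate the low-precision point from the list above, here is a rough PyTorch sketch of the general pattern: run the model in bfloat16 but keep a float32 master copy of the weights for the optimizer update, so tiny updates aren't rounded away. This is my own toy illustration (plain SGD, a single tensor), not DeepMind's actual setup.

```python
import torch

def step_with_fp32_master(bf16_params, fp32_master, grads, lr=1e-3):
    """One plain-SGD update that keeps a float32 master copy of bfloat16 weights.

    The forward/backward pass uses the bfloat16 copies (cheaper memory and
    compute); the update itself is accumulated in float32 so small changes
    survive, then rounded back down into the bfloat16 weights.
    """
    with torch.no_grad():
        for p_bf16, p_fp32, g in zip(bf16_params, fp32_master, grads):
            p_fp32 -= lr * g.float()                  # update in full precision
            p_bf16.copy_(p_fp32.to(torch.bfloat16))   # round back for compute

# Toy usage with a single weight matrix.
w_fp32 = torch.randn(4, 4)                # float32 master copy (optimizer side)
w_bf16 = w_fp32.to(torch.bfloat16)        # what the model computes with
grad = torch.randn(4, 4, dtype=torch.bfloat16)
step_with_fp32_master([w_bf16], [w_fp32], [grad])
```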