LLaMA: Open and Efficient Foundation Language Models
Original paper ยท Touvron et al 2023
I love two-column papers, I don't know what it is about them but I just really enjoy how they read. Anyway, today we're taking a look at LLaMA, a model that's gotten a lot of coverage, at least on my timeline, since its release. This is the first paper I've covered in the post-chinchilla era, so let's see how the LLM landscape has changed.
LLaMA
Meta presents 4 models as part of the LLaMA family, ranging from 7B to 65B. Right of the bat it's clear how much the Chinchilla scaling laws have inspired this work; the entire family is trained for at least 1T tokens. That is however not the only motivation here, they note that the optimal training compute budget objective completely disregards inference costs which is vital for actual use of the produced model, and they state clearly how the objective of LLaMA is to produce the best possible performance at various inference costs. This makes a lot of sense given the fact that LLaMA, in line with Meta's research philosophy, is open-sourced and available to the community, along-side being trained on completely public domain data. I love seeing this from a company that's at the forefront of AI research, open-sourcing code and models is crucial to democratizing the space we're in and making it more available to anyone who wants to evolve the field. If anyone can afford making their work public it's going to be one of the big guys, teams like OpenAI, Anthropic, Mosiac etc are most likely too pressed to actually release something completely open. Side-tangent done, let's check out how LLaMA is modeled and trained.
Training
There is really very little novelty surrounding the training and architecture of LLaMA. The dataset used is inspired by previous LLM work with a collection of CommonCrawl, ArXiv, Books3, Github, Stack Exchange and Wikipedia. The Transformer decoder uses pre-normalization of the input to every transformer block (GPT-3), SwiGLU activation functions (PaLM) and rotary embeddings (PaLM). AdamW optimizer with LR scheduling, efficient attention implementation and activation checkpointing to avoid expensive re-computations during the backward pass. This effective yet simple setup provides an excellent recipe for SOTA LLM training.
Comparison to PaLM, Chinchilla and GPT-3
LLaMA is foremost inspired by PaLM, Chinchilla and GPT-3, and is therefore compared to these during evaluation - a combination of common sense reasoning, QA, mathematical reasoning, code generation and reading comprehension. Throughout LLaMA 65B proves to be one of the best performing models, consistently outperforming models considerable larger. The models are especially competitive when compared to pre-chinchilla models such as GPT-3, with LLaMA 7B winning out on multiple benchmarks despite being 20x smaller.
Final Thoughts
There isn't too much more to touch on really, LLaMA-13B outperforms GPT-3 and LLaMA-65B seems competitive with Chinchilla-70B and PaLM-540B. Unlike previous work this paper shows that it is possible to produce state of the art foundational models using only publicly available data, without resorting to proprietary datasets. I would of liked to see more of a discussion surrounding inference costs, I feel like that was something the authors forgot especially considering how they framed their introduction. Glad that more SOTA models are being open-sourced as it really opens up for a lot of great work from the whole research community.