LoRA: Low-Rank Adaptation of Large Language Models

Original paper · Hu et al., 2021

Models keep growing, and interest in task-specific use cases grows with them. Many people want to fine-tune models on their own data. Unfortunately, models are not very capable until we reach the billion-parameter range. For a user who wants the latest foundation model for their downstream task, this means re-training billions of parameters every few months (whenever a new, better model is released). It quickly becomes clear how infeasible this is. Additionally, if I want to use models for several downstream tasks, I need to re-train and store a separate instance of the foundation model for each one...

Several attempts have been made to mitigate this issue by adapting only some parameters or learning external modules for new tasks. This way, we only need to store and load a small number of task-specific parameters on top of the pre-trained model for each task, greatly improving operational efficiency when deployed. This is similar to pre-trained backbones in Computer Vision, such as ResNet. However, these existing techniques often introduce additional inference latency, a critical concern for deployed LLMs.

The authors propose Low-Rank Adaptation (LoRA), a fine-tuning approach that is more efficient than existing techniques, merges completely with the frozen pre-trained weights (inducing zero additional inference latency), and makes task-switching overhead significantly lower. LoRA builds on the hypothesis that the adjustments needed to adapt (think fine-tune) a large language model to a specific task can be effectively represented by a small number of parameters. LoRA tunes the dense layers of a pre-trained network indirectly, by optimizing rank decomposition matrices of each dense layer's change during adaptation. Optimizing rank decomposition matrices means finding a compact, low-rank representation of the changes: the weight adjustment is decomposed into two smaller matrices that, when multiplied together, approximate the necessary change to the model's large weight matrix.

LoRA possesses several key advantages:

  • A pre-trained model can be shared and used to build many small LoRA modules for different tasks. The modules themselves can be shared and swapped in and out on top of the frozen weights with very little overhead.
  • Training is more efficient, lowering the hardware barrier to entry by up to 3x when using adaptive optimizers, since gradients and optimizer states are not needed for the frozen parameters.
  • A simple linear design enables merging the LoRA modules with the frozen weights, introducing zero additional inference latency.

Low-rank update

For a pre-trained network, in this case a Transformer, $W_0$ refers to the pre-trained weight matrix and $\Delta W$ to its accumulated gradient update during adaptation. As such, a forward pass through a fine-tuned network is formalized as

$$h = W_0 x + \Delta W x$$

LoRA constrains this update by representing it with a low-rank decomposition

$$h = W_0 x + BAx$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$, so $BA$ has significantly lower rank than $W_0$. During training, $W_0$ is frozen, while $A$ and $B$ contain the trainable parameters. $A$ is initialized from a random Gaussian distribution and $B$ to zero, so $\Delta W = BA$ is zero at the start of training and the model begins exactly at the pre-trained weights.
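To make this concrete, here is a minimal PyTorch sketch of a LoRA-adapted linear layer (the class and attribute names are mine, not from the paper or any particular library). It wraps a frozen `nn.Linear` and adds the trainable low-rank path $BAx$, scaled by $\alpha / r$ as in the paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch: h = W0 x + (alpha / r) * B A x, with W0 frozen."""

    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 4):
        super().__init__()
        self.base = base
        # Freeze the pre-trained weight W0 (and bias, if any).
        for p in self.base.parameters():
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A is initialized from a random Gaussian, B to zeros, so BA = 0 at the start.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank update: W0 x + B(Ax), scaled by alpha / r.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```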

When deploying to production, you can explicitly compute and store $W = W_0 + BA$ and perform inference as usual. What's especially neat is that $W_0$ and $BA$ have the same dimensions: if you need to switch to another downstream task, you simply recover $W_0$ by subtracting $BA$ and then add a different $B'A'$.
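Continuing the sketch above, merging and un-merging could look like the following (`merge` and `unmerge` are hypothetical helper names, not an official API):

```python
@torch.no_grad()
def merge(layer: LoRALinear) -> None:
    # Fold the low-rank update into the frozen weight: W = W0 + (alpha / r) * B A.
    layer.base.weight += (layer.lora_B @ layer.lora_A) * layer.scaling

@torch.no_grad()
def unmerge(layer: LoRALinear) -> None:
    # Recover W0 by subtracting B A, e.g. before loading another task's B'A'.
    layer.base.weight -= (layer.lora_B @ layer.lora_A) * layer.scaling
```

A full implementation would also skip the LoRA branch in `forward` once the update has been merged; task switching then amounts to un-merging, loading a different $(A', B')$ pair, and merging again.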

LoRA applied to Transformers

In principle, LoRA can be applied to any subset of weight matrices in a neural network to reduce the number of trainable parameters. To make its use case concrete, the authors present a LoRA implementation for the Transformer architecture. The study is limited to adapting only the attention weights for downstream tasks, while completely freezing the MLP modules. When evaluated in a GPT-3 175B setting, VRAM consumption during training drops from 1.2TB to 350GB with $r = 4$. Keep in mind that $r$ needs to be significantly smaller than $d_{model}$ for these benefits to be reaped. Additionally, the authors are able to store fine-tuning checkpoints that are roughly 10,000x smaller (35MB instead of 350GB), all thanks to the fact that the LoRA modules can be saved separately from the pre-trained weights. Finally, switching between tasks while deployed comes at a much lower cost, as we only swap the LoRA weights (35MB) rather than all parameters.
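As a sketch of what this could look like in code, the snippet below wraps only the attention query and value projections (one of the configurations studied in the paper) with the `LoRALinear` class from earlier. The module names `q_proj` and `v_proj` are an assumption and vary between implementations:

```python
def add_lora_to_attention(model: nn.Module, r: int = 4, alpha: int = 4) -> nn.Module:
    # Collect (parent, name, child) triples first, then replace, to avoid
    # mutating the module tree while iterating over it.
    targets = []
    for parent in model.modules():
        for name, child in parent.named_children():
            if name in ("q_proj", "v_proj") and isinstance(child, nn.Linear):
                targets.append((parent, name, child))
    for parent, name, child in targets:
        setattr(parent, name, LoRALinear(child, r=r, alpha=alpha))
    # Only the LoRA matrices remain trainable; MLP blocks and all other
    # weights stay frozen and untouched.
    for name, param in model.named_parameters():
        if "lora_" not in name:
            param.requires_grad = False
    return model
```

Saving a checkpoint then only requires storing the `lora_A` / `lora_B` tensors, which is what makes the tiny task-specific checkpoints possible.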

Targeting Weight Matrices for Maximum Impact

A crucial element of LoRA's strategy is determining which weight matrices in a pre-trained Transformer are most beneficial to adapt, since under a limited parameter budget we want to spend trainable parameters where they matter most. In the paper's experiments on GPT-3, adapting $W_q$ and $W_v$ together gave the best results for a fixed budget, and surprisingly small ranks (even $r = 1$ or $2$) were often sufficient. At the heart of LoRA's efficiency is this rank-deficient adaptation matrix $\Delta W$: investigating its optimal rank is what balances parameter reduction against learning capability, allowing LoRA to maintain model performance while drastically cutting the number of parameters needed for adaptation.
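The parameter budget itself is easy to reason about: each adapted $d_{model} \times d_{model}$ projection contributes $2 \cdot d_{model} \cdot r$ trainable parameters. A back-of-the-envelope helper, using GPT-3 175B's publicly reported 96 layers and $d_{model} = 12288$ as illustrative numbers:

```python
def lora_param_count(d_model: int, r: int, n_layers: int, matrices_per_layer: int) -> int:
    # Each adapted d_model x d_model matrix contributes B (d_model x r) and A (r x d_model).
    return n_layers * matrices_per_layer * 2 * d_model * r

# Adapting W_q and W_v in every layer of a GPT-3-175B-sized model with r = 4:
print(lora_param_count(d_model=12288, r=4, n_layers=96, matrices_per_layer=2))
# ≈ 18.9M trainable parameters (18,874,368), versus ~175 billion for full fine-tuning.
```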

Understanding the ∆W and W Relationship

A fascinating aspect of the paper is its analysis of the relationship between the adaptation matrix $\Delta W$ and the original weight matrix $W$: how strongly $\Delta W$ correlates with $W$, and how large it is relative to $W$. The authors find that $\Delta W$ does not simply repeat the top singular directions of $W$; instead, it amplifies directions that are already present in $W$ but not emphasized by it. This sheds light on how LoRA's updates leverage and complement the pre-existing structure of the model, and helps explain why LoRA can adapt large models effectively with so little retraining.
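In the spirit of that analysis (a rough sketch of the kind of measurement involved, not the paper's exact procedure), one can project $W$ onto the top singular directions of $\Delta W$ and compare Frobenius norms:

```python
import torch

def subspace_overlap(delta_w: torch.Tensor, w: torch.Tensor, r: int) -> float:
    # Project W onto the top-r singular directions of Delta W and report
    # how much of W's Frobenius norm falls inside that subspace.
    u, _, vh = torch.linalg.svd(delta_w, full_matrices=False)
    u_r, vh_r = u[:, :r], vh[:r, :]
    projected = u_r.T @ w @ vh_r.T  # r x r projection of W
    return (projected.norm() / w.norm()).item()
```

A ratio near zero would mean the two matrices occupy nearly orthogonal subspaces; comparing it against the same ratio computed for a random matrix gives a sense of how much stronger than chance the correlation is.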

Striking a Balance Between Efficiency and Interpretability

The technical explorations around LoRA highlight a balance between computational efficiency and model interpretability. Its low-rank structure not only makes adaptation computationally feasible but also enhances our understanding of the interaction between new updates and pre-trained models. This blend of efficiency and clarity in understanding model adaptations is a significant advancement in the field.

In essence, LoRA's approach to adapting large language models, as unveiled through these studies, showcases a path to harnessing the power of advanced AI models in a resource-efficient and transparent manner.