ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations
Original paper · Lan et al., 2019
This paper introduces ALBERT, another BERT-inspired pre-training model, released a few months after XLNet and RoBERTa. ALBERT brings its own unique advantages to the table. As I review ALBERT, I can't help but feel that it marks the end of my exploration of BERT-like models, and that it's time to move on. It has been fascinating to observe the connections and influences among the authors of these papers, each offering their own analysis of the strengths and challenges of these models. Having read three BERT-based papers with similar objectives, I also found it intriguing to compare the authors' writing styles. Among them, I have to commend the RoBERTa paper for its concise and clear writing; the two-column format made the content very easy to assimilate. Anyway, let's move on to ALBERT.
ALBERT's approach is motivated by the goal of providing a Lite BERT model that improves on both scalability and parameter efficiency. To this end, the authors propose three modifications that empirically improve the scalability of BERT, and as always these modifications are evaluated on GLUE, RACE and SQuAD.
Architectural Modifications
As in all of the previous BERT-like papers I've covered, the authors of ALBERT use an encoder-only transformer network similar to BERT as their backbone. To this they introduce three distinct modifications, all with the goal of improving performance per parameter.
Factorized embedding parameterization involves decoupling the embedding dimension (E) from the hidden state dimension (H) to make more efficient use of the total model parameters. The authors propose decomposing the embedding parameters into two smaller matrices, reducing the embedding parameter count from (O(V \times H)) to (O(V \times E + E \times H)), where (V) is the vocabulary size. This reduction is significant when (H > E).
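To make the factorization concrete, here is a minimal PyTorch sketch of the idea (not the official ALBERT code): tokens are first looked up in a small (V \times E) embedding table and then projected up to the hidden size with an (E \times H) matrix. The dimension values in the docstring are illustrative.

```python
import torch
import torch.nn as nn


class FactorizedEmbedding(nn.Module):
    """Token embedding factorized into a V x E lookup plus an E x H projection.

    Illustrative sketch: with V=30000, E=128, H=4096 this stores
    30000*128 + 128*4096 parameters instead of 30000*4096.
    """

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, embed_dim)        # V x E
        self.project = nn.Linear(embed_dim, hidden_dim, bias=False)   # E x H

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # (batch, seq) -> (batch, seq, E) -> (batch, seq, H)
        return self.project(self.token_embed(token_ids))
```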
Cross-layer parameter sharing is performed across the entire transformer module, meaning that all parameters (both attention and feed-forward) are shared across layers. This strategy draws inspiration from the Universal Transformer network published in 2018.
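A rough sketch of what cross-layer sharing looks like in PyTorch, assuming a stock `nn.TransformerEncoderLayer` rather than ALBERT's actual layer implementation: a single layer's weights are reused at every depth, so parameter count no longer grows with the number of layers.

```python
import torch
import torch.nn as nn


class SharedLayerEncoder(nn.Module):
    """Applies the same encoder layer `num_layers` times, so the attention and
    feed-forward weights are shared across depth (ALBERT-style sharing, sketched)."""

    def __init__(self, hidden_dim: int = 768, num_heads: int = 12, num_layers: int = 12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True
        )
        self.num_layers = num_layers

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for _ in range(self.num_layers):
            x = self.layer(x)  # identical parameters at every depth
        return x
```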
Inter-sentence coherence loss is the final method employed by ALBERT, and once again this is a modification that addresses the NSP loss proposed in the original BERT paper. A lot of research at the time pointed to the ineffectiveness, and perhaps even hindering effect, of the NSP loss on downstream performance. While the authors of ALBERT agree on NSP's ineffectiveness, they attribute it to an inherent lack of difficulty in the task itself. I find this reasoning really intriguing; it's not something I've read before, but it makes a lot of sense. They state that, in its original formulation, NSP conflates topic prediction and coherence prediction in a single task. This occurs because negative examples are generated from separate documents. The problem is that topic prediction is a much easier task to learn than coherence prediction, and it also overlaps more with what is learned via the MLM loss.
As a solution the authors propose a loss based solely on coherence prediction, as they hypothesize that inter-sentence modelling is still an important aspect of language modelling. To be exact, they propose the sentence-order prediction (SOP) loss, which uses positive examples constructed the same way as in BERT, and as negative examples uses the same two consecutive segments but with their order swapped. I really like this proposal: it is simple to execute and removes almost all of the topic-prediction signal.
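Here is a small sketch of how SOP training examples could be constructed, assuming the two segments have already been drawn consecutively from the same document (names and the 50/50 split are illustrative, not ALBERT's exact data pipeline):

```python
import random
from typing import List, Tuple


def make_sop_example(seg_a: List[str], seg_b: List[str]) -> Tuple[List[str], List[str], int]:
    """Build one sentence-order-prediction example from two consecutive segments.

    Returns (first_segment, second_segment, label), where label 1 means the
    segments appear in their original order and 0 means they were swapped.
    Because both segments come from the same document, topic gives no clue;
    only coherence/ordering does.
    """
    if random.random() < 0.5:
        return seg_a, seg_b, 1   # positive: original order
    return seg_b, seg_a, 0       # negative: same segments, order swapped
```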
Final thoughts
The proposed modifications all appear to improve the parameter efficiency of the network: they are evaluated separately and compared to BERT under controls for both training steps and training time. The largest ALBERT model achieves state-of-the-art performance on the evaluation datasets with fewer parameters. I really liked the different approaches the authors took in this paper; it was very interesting to read, especially the part about the inter-sentence coherence loss.