Reading Roundup
November 16, 2023.
Short format collection covering a couple of blog posts / papers I've read recently.
The Reversal Curse: A Stark Reflection on LLMs' Limitations in Reasoning
The paper on the "Reversal Curse" in Auto-regressive Large Language Models (LLMs) casts a sobering light on the overestimated generalization capabilities of these models. It's a striking revelation that LLMs, long thought to be adept at extrapolating and reasoning beyond their training data, actually demonstrate a significant shortfall in this regard. This failure to generalize is exemplified in the Reversal Curse, where LLMs, trained on statements in a specific order, fail to infer the logical reverse.
This limitation was starkly demonstrated through two types of experiments. In the first, the models were fine-tuned on descriptions of fictitious celebrities in a particular order. Surprisingly, when asked to infer the reverse logic, these models faltered, unable to apply simple reasoning to reverse the order of the information. The second experiment involved real-world knowledge, particularly the relationships between celebrities and their parents. Here again, the models showed a marked inability to reason in reverse; they could identify a celebrity’s parent but struggled to do the reverse, identifying a child from the parent’s name. This indicates a deeper, more systemic issue in the way LLMs process and generalize information, challenging the previously held belief in their expansive reasoning capabilities.
RoFormer and Rotary Positional Embedding: A New Paradigm in Position Encoding
Moving to the second paper, "RoFormer: Enhanced Transformer with Rotary Position Embedding," we encounter a groundbreaking development in position encoding within transformer models. The introduction of Rotary Positional Embedding (RoPE) represents a significant leap, unifying the absolute and relative approaches in a novel and effective manner.
RoPE ingeniously uses the concept of rotations in complex number space to encode positional information. It treats token embeddings as complex numbers, applying rotations to represent positions. This method ensures that when both the query and key in the transformer's self-attention mechanism are shifted by the same amount, the relative position remains consistent, even as the absolute position changes. This is because the rotations applied to both representations are identical, leaving the angle between them—and thus the dot product used in self-attention—unaltered. RoPE's design elegantly preserves relative positional information while disregarding absolute position, solving a longstanding challenge in transformer model design. By leveraging the nature of rotations, RoFormer, the transformer model enhanced with RoPE, not only improves upon existing models but also provides a new framework for understanding and implementing position encoding in language models.
Refining Sampling Methods in Large Language Models
I've also read a post from r/LocalLlama which offered a critical examination of the standard sampling methods in Large Language Models (LLMs).
Top P Sampling: A Popular but Flawed Method The author points out that Top P sampling, despite its popularity (used by OpenAI's API), has inherent flaws. This method selects tokens to reach a cumulative probability sum, which can inadvertently include many low-probability options. This approach may lead to models considering irrelevant choices, potentially contributing to hallucinations. The author argues that this happens especially when the model's confidence is spread thinly across several options, causing it to consider a wider array of less likely tokens.
Top K Sampling: Limitations in Linearity Top K sampling, another method discussed in the post, is more linear than Top P. It only considers a fixed number of top tokens (e.g., the top 5 tokens for Top K 5). This method can be overly restrictive, as it always limits the choices to a set number, potentially overlooking viable options that fall outside the top selections.
Introducing Min P: A Balanced Approach The blog introduces Min P, a novel sampling method created by the author to address the limitations of Top P and Top K. Min P sets a minimum threshold for tokens to be considered, based on the confidence level of the highest probability token. For instance, with a Min P of 0.1, only tokens at least 1/10th as probable as the top token are considered. This method, as per the author's experiments, not only improves the model's performance but also allows for more diverse choices than Top P, ensuring a better balance between too many and too few token considerations.
The Impact of Min P on Model Performance The author asserts that Min P particularly enhances model performance at higher temperatures, where it helps to include more diverse choices without falling into the trap of considering irrelevant low-probability options. This method is posited as a more balanced approach, allowing for a more nuanced and effective sampling that adapts to the confidence levels of the top choices.