rlvr entropy collapse
June 10, 2025.
there's more to talk about in the rlvr space today; the wave of papers hasn't slowed down, although we're getting fewer hill-climbs on math now, so i think we're moving in a solid direction. i'm certainly an rl devotee. i think scaling rl—and particularly long-horizon rl—is going to accelerate automation big time. the open research community probably doesn't have the compute to scale rl to the moon, but there is a lot of work that needs to be done on long-horizon rl, and especially multi-turn rl, which has yet to see the attention it deserves. if you're interested in material related to where the rl space is heading and why it's important, i recommend: (1) a blog post from nathan lambert, what comes next with reinforcement learning, (2) a podcast episode with sholto douglas and trenton bricken from anthropic, and (3) a blog post from semianalysis, scaling reinforcement learning: environments, reward hacking, agents, scaling data.
but this isn't what i want to talk about today, hence the links away from this post. today is more small-picture, more technical. it builds on, and perhaps even refines, a line of thought that i started in a previous post, rl elicits the procedural art of reasoning, where i argued that hill-climbing rlvr papers exemplified models learning a scaffolding for how to approach general reasoning problems. this is why models were able to improve test-time performance from only a single training example. since that writeup, new information has come to light, and things really got out of hand with the paper spurious rewards: rethinking training signals in rlvr, which showed that models improved even when rewarded for incorrect responses, and even under completely random (!!) reward assignments. this is where my mental model kind of broke and things got really confusing. as it turns out, the qwen model family, particularly qwen-math, has some intrinsic reasoning behavior learned during pretraining that grpo amplifies even when the reward signal carries no information. this paper was probably the first to put a small damper on recent open rlvr work, and i wanted to give it the recognition it deserves because it's good work. there have been way too many papers over-indexed on the qwen model family alone; we need to validate results across multiple models, especially when those models are not truly open source (i.e., their training data isn't released).
anyway, that was a tangent. what i really wanted to get to is the discussion of entropy and exploration. i've covered three papers in recent literature reviews that bear on this discussion, and i want to give them a unified picture. in the entropy mechanism of reinforcement learning for reasoning language models (referred to as p1 henceforth), the authors provide evidence for an entropy collapse phenomenon which, according to the paper, occurs during rl when you don't explicitly include entropy or kl regularization. this entropy collapse leads to a predictable upper bound on rl performance for a given task. for a softmax policy like an llm's, entropy is directly tied to exploration: as entropy decreases, exploration diminishes, and so does the model's ability to find novel solutions. similar behavior, or perhaps we should say a similar limitation, was noted in does reinforcement learning really incentivize reasoning capacity in llms beyond the base model?, whose authors argued that rlvr trades reasoning boundary for sampling efficiency (pass@1 goes up while pass@256 goes down). this is a direct consequence of the entropy-reward exchange.
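to make the object concrete: the entropy in question is just the shannon entropy of the next-token distribution. here's a minimal pytorch sketch (shapes and names are mine, not from p1) showing that a peaked distribution has near-zero entropy while a flat one sits at ln(vocab_size):

```python
# per-token entropy of a softmax policy (e.g. an llm head).
# a peaked next-token distribution has low entropy; a flat one has high entropy.
# this is the quantity that "collapses" during rlvr training.
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """logits: [batch, seq_len, vocab] -> per-token entropy in nats, [batch, seq_len]."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

# toy vocab of 8 tokens: one peaked distribution, one uniform
peaked = torch.tensor([[[10.0, 0, 0, 0, 0, 0, 0, 0]]])
uniform = torch.zeros(1, 1, 8)
print(token_entropy(peaked))   # ~0.004 nats, nearly deterministic
print(token_entropy(uniform))  # ~2.079 nats, i.e. ln(8), maximum exploration
```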
the experiments in p1 are great, and they make a strong case for why entropy is a crucial metric to track during rlvr, but i find the suggested solutions unnecessary given that we already have proven methods to control entropy collapse. in the dapo paper, the authors noted grpo's tendency toward entropy collapse (figure 2b) and proposed the clip-higher technique as a solution. this line of thought is solidified in prorl: prolonged reinforcement learning expands reasoning boundaries in large language models. to quote the authors:
a key challenge in prolonged policy optimization is entropy collapse, a phenomenon where the model’s output distribution becomes overly peaked early in training, resulting in sharply reduced entropy. when entropy collapses, the policy prematurely commits to a narrow set of outputs, severely limiting exploration. this is particularly detrimental in methods like grpo, where the learning signal depends on having a diverse set of sampled outputs to effectively estimate relative advantages.
they address this with (1) dapo and (2) kl regularization with a reference-policy reset. ultimately, their training scheme delivers consistent improvement in both pass@1 and pass@16 throughout training, which is compelling evidence that extended, stable rl training develops novel reasoning patterns beyond a base model's initial capabilities.
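for concreteness, here's roughly what those two ingredients look like in code. this is a hedged sketch, not the papers' exact recipes: the clip bounds, kl coefficient, and reset schedule are illustrative values i picked, and the function names are my own.

```python
import torch

def clipped_surrogate(logp, logp_old, advantages, eps_low=0.2, eps_high=0.28):
    """ppo/grpo-style clipped loss with an asymmetric clip range (dapo's clip-higher).
    raising the upper bound lets low-probability tokens be pushed up harder,
    which counteracts entropy collapse. all tensors are per-token, same shape."""
    ratio = (logp - logp_old).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * advantages
    return -torch.minimum(unclipped, clipped).mean()

def kl_to_reference(logp, logp_ref, beta=0.01):
    """simple (k1-style) sampled estimate of kl(policy || reference), scaled by beta,
    added to the loss to keep the policy from drifting too far and too peaked."""
    return beta * (logp - logp_ref).mean()

def maybe_reset_reference(step, policy, ref_policy, reset_every=200):
    """reference-policy reset (schedule is illustrative): periodically snapshot the
    online policy as the new reference so the kl term keeps regularizing without
    anchoring training to a stale model."""
    if step > 0 and step % reset_every == 0:
        ref_policy.load_state_dict(policy.state_dict())
```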
finally, to really hammer this point home, the qwen team published beyond the 80/20 rule: high-entropy minority tokens drive effective reinforcement learning for llm reasoning last week, which i haven't had time to cover yet. the gist is that during cot reasoning, only a small fraction of tokens exhibit high entropy, and these tokens act as critical "forks" that steer the model toward diverse reasoning pathways. crucially, slight increases in the entropy of these fork tokens improve performance. additionally, restricting policy gradient updates to these forking tokens (about 20% of the total) improves rlvr performance and generalization. this work also uses dapo.
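the mechanics are simple to sketch, reusing the `token_entropy` helper from the earlier snippet: compute per-token entropy, keep roughly the top 20% of positions, and mask the policy-gradient loss everywhere else. the threshold choice and masking details here are my guesses, not the paper's exact code.

```python
import torch

def forking_token_mask(entropies: torch.Tensor, keep_frac: float = 0.2) -> torch.Tensor:
    """entropies: [batch, seq_len] per-token entropy -> boolean mask selecting
    roughly the top `keep_frac` highest-entropy ("forking") token positions."""
    k = max(1, int(keep_frac * entropies.numel()))
    threshold = entropies.flatten().topk(k).values.min()
    return entropies >= threshold

# usage inside a grpo/dapo update, with per_token_loss of shape [batch, seq_len]:
# mask = forking_token_mask(token_entropy(logits))
# loss = (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
```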
all in all, my takeaway is to track your entropy. entropy collapse is a strong signal that performance will stagnate. you should at least track the average entropy over the generation; even better, track the entropy distribution across tokens.
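in practice that can be as little as a small logging helper (again a sketch; the names and percentile choices are my own):

```python
import torch

def entropy_stats(logits: torch.Tensor) -> dict:
    """logits: [batch, seq_len, vocab] -> summary stats of per-token entropy,
    suitable for logging every rl step."""
    logp = torch.log_softmax(logits, dim=-1)
    ent = -(logp.exp() * logp).sum(dim=-1).flatten()
    q = torch.quantile(ent, torch.tensor([0.1, 0.5, 0.9]))
    return {
        "entropy/mean": ent.mean().item(),
        "entropy/p10": q[0].item(),
        "entropy/p50": q[1].item(),
        "entropy/p90": q[2].item(),
    }

# a steadily shrinking mean and p90 is the collapse signature; a healthy run keeps
# a high-entropy tail (the forking tokens from the 80/20 paper).
```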