Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

June 10, 2025.

This paper follows the trajectory of recent studies questioning the nature of reinforcement learning (RL) in LLMs: does RL unlock genuinely new capabilities, or does it merely surface behaviors already embedded during pretraining? A growing body of literature, such as Dang et al. (2025) and the pass@k analyses from Echo Chamber and SimpleRL, suggests the latter. This paper re-engages with that core question:

Does reinforcement learning truly expand reasoning capabilities in language models, or does it merely improve the sampling efficiency of pre-existing solutions?

Prior results, such as the pass@k analyses and work using intrinsic or even misaligned reward signals, have fueled the argument that RL may merely reweight the output distribution. In contrast, this paper argues that such conclusions arise from methodological limitations:

  1. Overreliance on domains like mathematics, where models are often saturated through pretraining and fine-tuning.
  2. Premature termination of RL training—often capped at a few hundred steps—before reasoning exploration can mature.

ProRL: Prolonged Reinforcement Learning

The authors introduce ProRL, a training methodology that extends Group Relative Policy Optimization (GRPO) over long time horizons. The key contribution is prolonged training, accompanied by mitigation strategies for entropy collapse and training instability.
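
For reference, GRPO's surrogate objective in its commonly stated form (this is the standard formulation rather than necessarily the paper's exact variant; the asymmetric clipping bounds anticipate the decoupled clipping described below):

$$
\mathcal{J}_{\text{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|} \min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(r_{i,t}(\theta),\,1-\epsilon_{\text{low}},\,1+\epsilon_{\text{high}}\big)\,\hat{A}_i\Big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)
$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t}\mid q, o_{i,<t}) / \pi_{\theta_{\text{old}}}(o_{i,t}\mid q, o_{i,<t})$ is the token-level importance ratio and $\hat{A}_i = \big(R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})\big) / \operatorname{std}(\{R_j\}_{j=1}^{G})$ is the group-relative advantage over the $G$ rollouts sampled for the same prompt $q$.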

Entropy Collapse

Prolonged training amplifies a known issue: entropy collapse. As with prior studies (e.g., The Entropy Mechanism of Reinforcement Learning for Reasoning LMs), low-entropy output distributions harm exploration. GRPO in particular depends on diverse rollouts for robust advantage estimation. If entropy collapses early, the policy update becomes biased and training stagnates.
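
A minimal sketch (not the authors' code) of why this matters: with group-normalized advantages, a group of rollouts that all succeed or all fail yields zero advantage everywhere, so the update carries no signal.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each rollout's reward standardized against
    the mean and std of the rewards in its group (same prompt)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Diverse rollouts (some pass the verifier, some don't): useful signal.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # [ 1, -1,  1, -1]

# Collapsed policy: near-identical rollouts agree on the reward, so every
# advantage is ~0 and the policy update stalls.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # [0, 0, 0, 0]
```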

Mitigation Techniques

Rollout Temperature The method samples rollouts at a high temperature (1.2) to encourage early exploration, though this merely delays entropy collapse rather than resolving it.

DAPO Components The method borrows two ideas from the DAPO framework:

  • Decoupled Clipping: Separate $\epsilon_{\text{low}}$ and $\epsilon_{\text{high}}$ values (0.2 and 0.4) bound the PPO-style ratio asymmetrically; the larger upper bound leaves more headroom for raising the probability of low-likelihood tokens, which aids exploration.
  • Dynamic Sampling: Prompts whose rollouts are all correct or all incorrect are filtered out, as they provide little learning signal. This focuses learning on examples in the intermediate difficulty band (see the sketch below).
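
A hedged sketch of both ideas, assuming binary verifier rewards and the (0.2, 0.4) bounds reported above; names and structure are illustrative, not taken from the paper's code:

```python
import torch

EPS_LOW, EPS_HIGH = 0.2, 0.4  # decoupled clipping bounds reported in the paper

def decoupled_clip_loss(log_probs, old_log_probs, advantages):
    """PPO-style surrogate with asymmetric clipping: the ratio is bounded
    to [1 - EPS_LOW, 1 + EPS_HIGH], giving extra headroom to increase the
    probability of low-likelihood tokens."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - EPS_LOW, 1.0 + EPS_HIGH)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

def keep_prompt(group_rewards):
    """Dynamic sampling: keep a prompt only if its rollouts are neither all
    correct nor all wrong, i.e. it still produces a learning signal."""
    return 0.0 < sum(group_rewards) < len(group_rewards)
```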

KL Penalty and Reference Policy Reset While prior work advocates discarding the KL term, ProRL keeps it—though not blindly. To prevent the KL loss from dominating, they periodically reset the reference model to the current online policy, thereby refreshing the anchor without losing the stabilizing effect of KL regularization.
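
A minimal sketch of the idea, assuming a sample-based KL estimator; the penalty coefficient and reset interval below are illustrative placeholders, not values from the paper:

```python
import torch

KL_COEF = 0.1          # illustrative coefficient, not from the paper
RESET_INTERVAL = 200   # illustrative number of steps between resets

def kl_penalty(log_probs, ref_log_probs):
    """Per-token estimate of KL(pi_theta || pi_ref) from sampled tokens
    (the non-negative "k3" estimator widely used in RLHF-style trainers)."""
    log_ratio = ref_log_probs - log_probs
    return (log_ratio.exp() - 1.0 - log_ratio).mean()

def regularized_loss(step, policy, ref_policy, surrogate, log_probs, ref_log_probs):
    loss = surrogate + KL_COEF * kl_penalty(log_probs, ref_log_probs)
    # Periodically re-anchor the reference to the current online policy so
    # the KL term keeps stabilizing training without eventually dominating it.
    if step > 0 and step % RESET_INTERVAL == 0:
        ref_policy.load_state_dict(policy.state_dict())
    return loss
```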

Experimental Setup

Training is conducted on a curated 136K-task dataset spanning five domains: math, code, STEM, logic puzzles, and instruction following. The base model is DeepSeek-R1-Distill-Qwen-1.5B, which already generates long Chain-of-Thought (CoT) traces. Training is implemented in the verl framework using the modified GRPO described above, with a batch size of 256, a rollout temperature of 1.2, and 16 rollouts per prompt. It runs across four H100 nodes (~16K GPU hours over ~20 days).
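
The reported settings, gathered in one place for convenience (only values stated above; anything a real run needs beyond these would come from verl's own configuration):

```python
# Hyperparameters reported in this write-up; all other settings are omitted.
prorl_config = {
    "base_model": "DeepSeek-R1-Distill-Qwen-1.5B",
    "framework": "verl",
    "train_batch_size": 256,
    "rollout_temperature": 1.2,
    "rollouts_per_prompt": 16,
    "clip_eps_low": 0.2,
    "clip_eps_high": 0.4,
    "domains": ["math", "code", "STEM", "logic puzzles", "instruction following"],
    "dataset_size": 136_000,
}
```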

According to Figure 1 (Page 2), the model's pass@1 and pass@16 scores increase consistently across training steps. The entropy is tracked throughout (Figure 2, Page 5), showing that the model avoids collapse across eight training runs, aided by resets and reward shaping.
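
For the pass@k numbers discussed here and below, the standard unbiased estimator (Chen et al., 2021) is the conventional choice; whether the paper uses exactly this estimator is an assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 16 generations per problem, 3 of which pass the verifier.
print(pass_at_k(n=16, c=3, k=1))   # 0.1875
print(pass_at_k(n=16, c=3, k=8))   # 0.90
```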

Does ProRL Elicit New Reasoning Patterns?

Yes. The model's pass@128 score increases on tasks where the base model fails completely, implying that ProRL leads to genuine reasoning expansion. These gains are not uniformly distributed, however. The authors categorize benchmarks into three regimes:

1. Diminished Reasoning Boundary

Observed in some math tasks. Here, although pass@1 improves, pass@128 declines, suggesting the model narrows its output distribution around known solutions. These tasks typically had high initial pass@128 scores—i.e., the base model already did well.

2. Gains Plateau with RL

For many tasks, RL yields early gains in pass@1 and pass@128, but further training offers diminishing returns. The model appears to saturate its learning capacity for these domains early on.

3. Sustained Gains from ProRL

On more complex tasks, especially coding and instruction following, the model continues to improve throughout training. These tasks benefit from long-horizon exploration, and the model generalizes more effectively to test prompts. Figures 4 and 12 (Pages 7 and 22) showcase continued gains across increasing k.

Out-of-Distribution Generalization

A key strength of ProRL is its ability to generalize:

  • BoxNet: A novel task not included in training. The base model fails entirely; the ProRL-trained model achieves near-perfect performance (Figure 5, Page 8).
  • Graph Coloring: Tested on graph sizes up to 20 nodes (training used 10). The ProRL model maintains higher pass@1 and pass@128 across increasing graph size (Figure 6, Page 8).

Distributional Shifts

Unlike previous work (e.g., Dang et al.), which observed performance degradation due to reduced diversity, ProRL induces rightward shifts in the pass@1 distributions across multiple domains (Figures 14–17). The model does not merely concentrate probability on already-known "winning answers"; it produces correct outputs across a broader range of inputs.


Final Takeaway

ProRL demonstrates that reinforcement learning, when applied with sufficient training duration and stability techniques, can expand the reasoning boundary of language models in meaningful, quantifiable ways. This paper counters the prevailing pessimism around RL’s effectiveness by showing that its limitations are often procedural, not fundamental.

Critically, the methodology—KL-controlled GRPO, reference resets, high-temperature rollouts, and diverse task domains—serves as a blueprint for future long-horizon reasoning-centric RL work. While ProRL is compute-intensive, it sets a strong precedent that extended RL can deliver qualitatively new reasoning strategies, even in small models.