rl elicits the procedural art of reasoning

May 15, 2025.

I've been a big fan of the RL wave that has drenched the LLM space since the release of DeepSeek-R1. The level of personal experimentation it has enabled has been fascinating; lots of fun low-level nitty-gritty details have been discussed thoroughly, and I've enjoyed following all of it. Recently, there's been some interesting - should we call it "backlash"? - against the RL enjoyers, with some arguing that RL isn't doing anything particularly novel and that we're wasting our time with it. I posit that RL is doing exactly what it is designed to do: exploration and exploitation in the most basic sense, applied to the incredibly rich, pre-existing landscape of a foundation LLM. In this post, I'll lay out how I conceptualize these "thinking models" and what I believe RL's true contribution is: RL as an explorative elicitation tool. I want to discuss what these models are truly learning through RL, and how this perspective makes sense of several recently published papers.

Base (large language) models, by virtue of their massive pretraining, are already vast repositories of latent knowledge and potential reasoning steps. They "know" a lot, but they don't necessarily know how to effectively assemble that knowledge into coherent, multi-step reasoning to solve complex problems. Reinforcement Learning, particularly RL with Verifiable Rewards (RLVR), doesn't teach them new fundamental facts or entirely novel logical axioms. Instead, RL is the critical process that teaches them how to navigate their own internal knowledge space, how to structure their thought processes, and how to deploy effective problem-solving procedures. It transforms a model that can potentially reason into one that does reason, reliably and efficiently.
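
To make the "verifiable" part of RLVR concrete, here is a minimal sketch of what such a reward can look like for math problems. The \boxed{} answer convention, the function names, and the exact-match check are illustrative assumptions rather than any particular paper's implementation; the point is simply that the reward is computed by a program checking correctness, not by a learned judge.

```python
import re


def extract_final_answer(completion: str) -> str | None:
    # Assume the model is prompted to put its final answer in \boxed{...};
    # this convention is an illustrative assumption, not a fixed standard.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None


def verifiable_reward(completion: str, ground_truth: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the known solution, else 0.0.

    Correctness is checked programmatically, which is the defining property of
    RLVR-style training signals: only the final outcome is rewarded, so any
    useful intermediate reasoning procedure must be discovered, not imitated.
    """
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0


print(verifiable_reward("... so the area is \\boxed{42}.", "42"))  # 1.0
```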

What these models truly learn through RL is the construction and refinement of a dynamic procedural scaffolding. This isn't about memorizing specific problem-solution pairs; it's about internalizing a flexible framework for approaching tasks. This framework includes learning when to express uncertainty, how to validate a hypothesis by generating examples, when to backtrack and correct course, and how to string together deductive steps into a coherent chain of thought. The RL from one training example paper provides compelling evidence for this. Their finding that even a single training example can dramatically lift a model's performance on complex mathematical benchmarks is astounding. A model isn't learning advanced algebra from one problem; rather, that single rewarded instance is enough to "ignite" or "elicit" a more general, pre-existing reasoning capability. It's learning that a certain method of applying its knowledge is fruitful. The phenomenon they term "post-saturation generalization", where test performance continues to climb long after training accuracy on the (single) example has maxed out, strongly indicates that the model is refining this general problem-solving scaffolding, making it more robust and broadly applicable, not just better at that one problem. The observed increase in self-reflection behaviors, like the use of terms such as "rethink" or "recheck", is a direct manifestation of this learned procedural refinement.

In Understanding Reasoning in Thinking Language Models via Steering Vectors, the authors identify specific reasoning behaviors like "expressing uncertainty," "example testing," and "backtracking" as characteristic of these thinking models. Crucially, they demonstrate that these behaviors are not just abstract descriptions but are mediated by linear directions in the model's activation space and can be actively controlled. These identifiable and controllable behaviors are the very components, the struts and joints, of the procedural scaffolding that RL helps to build or solidify.
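
Mechanically, this kind of steering is easy to sketch: estimate a direction for a behavior (for example, as a difference of mean activations between generations that do and don't exhibit it) and add it to the residual stream at inference time. The PyTorch snippet below is a generic illustration under those assumptions; the layer index, scaling coefficient, and hook placement are placeholders, not the paper's exact procedure.

```python
import torch


def compute_steering_vector(acts_with: torch.Tensor, acts_without: torch.Tensor) -> torch.Tensor:
    """Difference-of-means direction for a behavior (e.g., "backtracking").

    acts_with / acts_without: [num_examples, hidden_dim] residual-stream activations
    collected at one layer, from generations that do / don't exhibit the behavior.
    """
    return acts_with.mean(dim=0) - acts_without.mean(dim=0)


def make_steering_hook(direction: torch.Tensor, coeff: float):
    """Forward hook that nudges a layer's output along `direction`.

    coeff > 0 amplifies the behavior, coeff < 0 suppresses it.
    """
    unit = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + coeff * unit.to(device=hidden.device, dtype=hidden.dtype)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook


# Usage sketch (model and layer index are placeholders):
#   v = compute_steering_vector(acts_with, acts_without)
#   handle = model.model.layers[20].register_forward_hook(make_steering_hook(v, coeff=4.0))
#   ... generate, observing more (or less) of the targeted behavior ...
#   handle.remove()
```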

This perspective also neatly explains the findings of the pass@k paper everyone was talking about last month. For the uninformed, the authors critically examined whether RLVR grants LLMs reasoning abilities fundamentally beyond those of their base models. They found that while RLVR improves sampling efficiency for correct solutions (better pass@k at small k), the base models often retain broader reasoning coverage when many samples are allowed (higher pass@k at large k). This suggests that RLVR is exceptionally good at teaching the model to reliably find and follow effective reasoning paths that were already within its potential, essentially optimizing the deployment of its existing scaffolding for problems it can already, in principle, solve. It makes the model more focused and efficient, but it does not necessarily endow it with entirely new primitive cognitive tools beyond what pretraining afforded. The reasoning paths generated by RLVR models, as the authors show, are largely already present in the base model's sampling distribution.
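
As a refresher, pass@k is typically computed with the standard unbiased estimator: sample n completions per problem, count the c correct ones, and estimate the probability that at least one of k drawn samples is correct. A small sketch below; the example numbers are made up to show the qualitative pattern.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn without replacement from n total samples of which c are correct,
    is correct: 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so a correct one is always drawn
    return 1.0 - comb(n - c, k) / comb(n, k)


# The paper's pattern in these terms: an RLVR-tuned model concentrates probability
# on correct paths (high pass@1), while a base model with fewer correct samples
# can still approach it at large k because its wider distribution covers a solution.
print(pass_at_k(n=64, c=8, k=1))   # 0.125
print(pass_at_k(n=64, c=8, k=32))  # ~0.998
```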

Therefore, the argument that RL is "doing nothing" for reasoning misses the point. Discovering and internalizing an effective methodology for problem-solving is an incredibly valuable form of learning. Teaching a model how to systematically and adaptively apply its vast declarative knowledge to arrive at correct solutions is precisely the leap from a knowledgeable machine to a reasoning one. RL, acting as this explorative elicitation tool, uses rewards not to impart new factual knowledge, but to guide the model in its search for, and reinforcement of, these effective procedural scaffolds. It's the difference between having a library full of books and knowing how to research a topic, synthesize information, and write an essay. RL, in this context, is teaching the LLM the art of intellectual inquiry and structured thought, using the raw materials already provided by its pretraining.