DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

February 14, 2025.

Supervised fine-tuning (SFT) has long been the go-to method for refining large-scale language models. While effective, it suffers from well-known bottlenecks: high dependency on curated annotations, scalability constraints, and an inherent ceiling on generalization. DeepSeek-R1-Zero turns this paradigm on its head, proving that reinforcement learning (RL), when applied correctly, can independently drive substantial improvements in reasoning and logical inference.

This article dissects DeepSeek-R1-Zero—an approach that ditches SFT entirely in favor of pure reinforcement learning—and DeepSeek-R1, which integrates RL into a more conventional pipeline with an initial fine-tuning phase. The goal? To assess RL’s viability as a standalone reasoning mechanism and explore how it synergizes with traditional alignment techniques.


DeepSeek-R1-Zero: Pure Reinforcement Learning

Historically, reinforcement learning in large language models has been seen as a fine-tuning mechanism—useful for behavioral alignment but lacking the power to drive fundamental improvements in reasoning. DeepSeek-R1-Zero challenges that assumption. Instead of relying on neural reward models, it leans on rule-based, outcome-driven rewards to guide training.

The reinforcement learning framework is built on Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper. It adopts an outcome-based supervision strategy: rather than providing dense, stepwise supervision, a reward is assigned to each final output, normalized against the other outputs sampled for the same prompt, and applied as the advantage for every token of that output. This eliminates the need for manually labeled intermediate reasoning steps. Importantly, R1-Zero forgoes supervised data entirely, relying instead on deterministic, rule-based reward functions (a minimal code sketch follows the list below):

  1. Accuracy rewards – Applied in domains where correctness is objectively verifiable (e.g., mathematics, competitive programming). Outputs must adhere to predefined formats, allowing automated validation via compilers or symbolic verification.
  2. Format rewards – Encouraging structured reasoning through explicit token markers (e.g., <think> and </think> tags) to delineate reasoning steps from final answers.
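To make these rewards concrete, here is a minimal Python sketch of an accuracy reward, a format reward, and GRPO's group-relative advantage. The function names, the exact tag layout (including the <answer> wrapper, which mirrors the prompt template shown further below), and the normalization details are illustrative assumptions on my part, not DeepSeek's released code.

```python
import re
import statistics

def accuracy_reward(completion: str, reference: str) -> float:
    """Rule-based accuracy reward: 1.0 if the final answer matches the
    reference. A real verifier would use symbolic math checks or code
    execution rather than plain string comparison."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference.strip() else 0.0

def format_reward(completion: str) -> float:
    """Format reward: reasoning must sit inside <think> tags and the final
    answer inside <answer> tags."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage: normalize each sampled completion's reward by
    the mean and standard deviation of its group; the resulting scalar is
    then applied to every token of that completion."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: one prompt, a group of four sampled completions.
group = [
    "<think>2x + 3 = 11, so 2x = 8 and x = 4.</think> <answer>4</answer>",
    "<think>2x = 14, so x = 7.</think> <answer>7</answer>",
    "The answer is 4.",
    "<think>Subtract 3, divide by 2: x = 4.</think> <answer>4</answer>",
]
rewards = [accuracy_reward(c, "4") + format_reward(c) for c in group]
print(group_relative_advantages(rewards))
```

In the actual training loop these advantages weight the GRPO policy-gradient objective; the sketch only covers the reward and normalization side.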

The significance of this shift cannot be overstated. Unlike prior approaches, which incorporated reasoning supervision explicitly, R1-Zero's reward is tied purely to outcome correctness. This suggests that at scale, reinforcement learning alone can induce structured reasoning capabilities, even when the intermediate reasoning process itself is not explicitly rewarded. This contradicts prior assumptions that correct reasoning must be guided explicitly during training.

Many of the RL details are likely interchangeable; the key most likely lies in a dataset containing a large number of verifiable prompts with answers. For the open-source community to move forward in this domain, we need open versions of such datasets.

This reward scheme is only feasible in domains with deterministic outcomes. Mathematical problem-solving, programming challenges, and other structured tasks with formal verification mechanisms fit this paradigm. However, extending this methodology to open-ended tasks such as creative writing remains an open question. Without a verifiable notion of correctness, defining effective reward functions for broader reasoning capabilities remains nontrivial.

Beyond outcome-driven rewards, structured outputs are encouraged through format rewards. The model adheres to a strict prompt template:

A conversation between User and Assistant. The User asks a question, and the Assistant solves it. The Assistant first thinks through the reasoning process and then provides the answer. The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively.

Empirical Performance of R1-Zero

A key insight from the DeepSeekMath paper was that traditional RL tuning in LLMs generally improves answer confidence (Maj@K) but does not enhance fundamental reasoning capability (Pass@K). This was not surprising: RL had mostly been used to enforce human-aligned behavior. But the vision has long been for RL to deliver genuine training-time improvement, replicating what o1 demonstrated back in the fall. This leads us to what is, in my mind, the key figure of the paper.

This is the first time we've seen training-time improvements through pure RL, and it is a foundational result we've been waiting for. It confirms that RL, when structured correctly, is sufficient to induce robust reasoning behaviors without any supervised training data.
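For readers less familiar with these metrics, here is a small sketch of Pass@K and Maj@K under their standard definitions (the usual unbiased estimator for Pass@K and a majority vote over K sampled answers for Maj@K); this is my rendering of the conventional formulas, not code from the paper.

```python
from collections import Counter
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Pass@K via the standard unbiased estimator: the probability that at
    least one of k samples, drawn from n generations of which c are correct,
    solves the problem."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def maj_at_k(sampled_answers: list[str], reference: str) -> float:
    """Maj@K: majority-vote the K sampled final answers, then score the
    single voted answer against the reference."""
    voted_answer, _ = Counter(sampled_answers).most_common(1)[0]
    return 1.0 if voted_answer == reference else 0.0

# Example: 16 generations, 6 correct; estimate Pass@4 and a Maj@4 vote.
print(pass_at_k(n=16, c=6, k=4))            # ~0.88
print(maj_at_k(["7", "7", "9", "7"], "7"))  # 1.0
```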

Another notable outcome is the model’s increasing response length throughout training. As shown below, R1-Zero produces progressively longer reasoning sequences as training progresses:

This suggests that the model is leveraging extended inference time to refine its thought process, an emergent behavior rather than an explicitly programmed heuristic. The ability to iteratively refine responses and reevaluate previous steps emerges naturally from the RL training paradigm. This recalls early-stage pretraining scaling laws, but with RL as the principal driver of reasoning capability. It also bears on the whole o1 debate: it may indicate that there was never any Monte Carlo-style inference search inside o1, and that increased response length alone produced more test-time search and thus better results.

One particularly intriguing aspect is the spontaneous emergence of complex problem-solving behaviors. With extended test-time computation, R1-Zero exhibits reflection—revisiting prior steps—and alternative solution exploration, neither of which were explicitly trained for. This underscores the potential of reinforcement learning as a mechanism for self-evolving reasoning models.

Given these findings, an exciting direction for future work would be integrating this rule-based RL framework with MuZero-style approaches, allowing for implicit planning within the RL optimization process. Such an approach could refine RL-driven reasoning without explicit correctness signals, potentially broadening its applicability to less structured domains.


DeepSeek-R1: Reinforcement Learning with Cold Start

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? DeepSeek-R1 builds on R1-Zero by introducing an initial supervised fine-tuning phase to mitigate instability in early-stage RL training. The pipeline follows a structured, multi-stage approach:

Cold start. Unlike DeepSeek-R1-Zero, which starts RL directly from the base model, DeepSeek-R1 introduces a supervised fine-tuning (SFT) phase on a small set of high-quality long-form Chain-of-Thought (CoT) examples to mitigate the instability of early-stage RL training. This allows the model to establish an initial reasoning framework before reinforcement learning is applied.
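In essence, this cold-start step is ordinary next-token cross-entropy on prompt-plus-CoT text. The sketch below uses "gpt2" only as a runnable stand-in for the actual base model, and the data, tags, and hyperparameters are placeholders of mine.

```python
# Minimal cold-start SFT sketch: causal-LM cross-entropy on prompt + long CoT.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # stand-in for the real base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

cold_start_data = [
    {"prompt": "Solve: 2x + 3 = 11.",
     "cot": "<think>Subtract 3: 2x = 8. Divide by 2: x = 4.</think> <answer>4</answer>"},
]

model.train()
for example in cold_start_data:
    text = example["prompt"] + "\n" + example["cot"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # Labels equal inputs: the model learns to reproduce the full CoT trace.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```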

Reasoning RL. Following the cold-start phase, the model undergoes large-scale reinforcement learning with the same rule-based reward methodology used in DeepSeek-R1-Zero. The focus here is on further refining the model’s reasoning capabilities, particularly in domains requiring structured problem-solving such as coding, mathematics, science, and logical reasoning.

Rejection sampling & SFT. Once RL-based reasoning training converges, the model is used to generate a dataset for a secondary supervised fine-tuning (SFT) phase. Unlike the initial cold-start dataset, which is narrowly focused on reasoning tasks, this dataset broadens coverage to include domains such as writing, role-playing, and general-purpose text generation. This stage results in a dataset of approximately 800k samples.
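A rough sketch of this rejection-sampling step is below; `generate_k` (sampling several completions per prompt from the RL checkpoint) and `verify` (the correctness/readability filter) are hypothetical stand-ins for the actual machinery.

```python
from typing import Callable

def rejection_sample_sft_data(
    prompts: list[str],
    generate_k: Callable[[str, int], list[str]],  # prompt -> k sampled completions
    verify: Callable[[str, str], bool],           # (prompt, completion) -> accept?
    k: int = 8,
    keep_per_prompt: int = 2,
) -> list[dict]:
    """Sample k completions per prompt, keep only verified ones, and return
    them as (prompt, completion) pairs for the second SFT round."""
    sft_dataset = []
    for prompt in prompts:
        accepted = [c for c in generate_k(prompt, k) if verify(prompt, c)]
        # Cap how many accepted samples each prompt contributes to avoid
        # flooding the dataset with near-duplicates.
        for completion in accepted[:keep_per_prompt]:
            sft_dataset.append({"prompt": prompt, "completion": completion})
    return sft_dataset
```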

RL(HF). To align the model with human preferences while retaining its reasoning strength, a final reinforcement learning phase is applied. This stage integrates a mixture of rule-based rewards (for structured domains like math and coding) and reward models trained on human feedback (for tasks requiring subjective assessment, such as creative writing or open-ended dialogue). By doing so, DeepSeek-R1 balances strong logical reasoning with a more user-friendly and instruction-following behavior.
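A compact sketch of how such a reward mix could be dispatched; `rule_based_reward` and `reward_model_score` are hypothetical hooks, not DeepSeek's actual interfaces.

```python
from typing import Callable, Optional

def combined_reward(
    prompt: str,
    completion: str,
    reference: Optional[str],
    rule_based_reward: Callable[[str, str], float],
    reward_model_score: Callable[[str, str], float],
) -> float:
    """Dispatch between deterministic and learned rewards."""
    if reference is not None:
        # Verifiable domains (math, code): rule-based check against the reference.
        return rule_based_reward(completion, reference)
    # Open-ended domains (writing, dialogue): learned preference model.
    return reward_model_score(prompt, completion)
```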

DeepSeek-R1 demonstrates that while RL alone can drive reasoning improvements (as seen in R1-Zero), incorporating high-quality supervision in a structured pipeline allows for both accelerated convergence and broader applicability across general tasks.


Distillation

One of the most intriguing aspects of this study is its exploration of distillation as a means of transferring reasoning capabilities to smaller, more efficient models. To extend DeepSeek-R1’s capabilities to lightweight models, the authors fine-tuned open-source models such as Qwen and Llama using the 800k samples curated in the SFT stage. The results indicate that even this straightforward distillation approach leads to significant improvements in the reasoning abilities of smaller models.

The base models used for this evaluation include Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. For the distilled versions, only SFT was applied—no reinforcement learning. While incorporating RL could further enhance performance, the primary objective was to isolate and evaluate the effectiveness of pure distillation.

Distilling DeepSeek-R1’s outputs directly resulted in models such as DeepSeek-R1-7B (DeepSeek-R1-Distill-Qwen-7B), which outperformed non-reasoning models like GPT-4o-0513 across multiple benchmarks. Additionally, DeepSeek-R1-14B surpassed QwQ-32B-Preview on all evaluation metrics, demonstrating the efficiency of this transfer process.

A fundamental question remains: can the performance achieved through distillation be replicated by direct RL training on smaller models? Two key conclusions emerge. First, distilling reasoning-optimized models into smaller architectures consistently yields strong results, while training smaller models directly with large-scale RL requires extensive computational resources and may not even reach the same level of performance. Second, while distillation is a practical and effective strategy, achieving breakthroughs beyond existing intelligence thresholds will likely require both larger base models and more advanced reinforcement learning techniques.

Failures on the path to R1

Process Reward Models (PRMs) have been proposed as a mechanism to guide models toward better reasoning strategies (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023). However, the practical implementation of PRMs presents several fundamental challenges:

  1. Defining intermediate reasoning steps – Precisely delineating fine-grained reasoning steps is inherently difficult, as general reasoning does not always follow a discrete, well-defined progression.
  2. Evaluating intermediate steps – Determining correctness at each intermediate step is nontrivial. Automated model-based annotation is often unreliable, while manual annotation is infeasible at scale.
  3. Reward hacking risks – Introducing a model-based PRM inevitably leads to optimization exploits (Gao et al., 2022). Additionally, retraining the reward model incurs extra computational costs and increases pipeline complexity.

While PRMs can effectively rerank top-N responses or assist in guided search (Snell et al., 2024), their computational overhead limits their viability in large-scale RL settings. The authors' experiments suggest that while PRMs have merit in structured applications, they fail to provide consistent advantages over simpler rule-based reward mechanisms in reinforcement-driven reasoning tasks.
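For context on the reranking use case, here is a rough sketch of PRM-based best-of-N selection; `prm_score_step` stands in for a trained process reward model, and taking the minimum step score is one common (but not the only) aggregation choice.

```python
from typing import Callable

def rerank_with_prm(
    question: str,
    candidates: list[str],
    prm_score_step: Callable[[str, list[str], str], float],  # (question, prior steps, step) -> P(correct)
) -> str:
    """Score each candidate's reasoning steps with a PRM, aggregate per
    candidate, and return the best-scoring candidate."""
    def aggregate(candidate: str) -> float:
        steps = [s for s in candidate.split("\n") if s.strip()]
        if not steps:
            return 0.0
        scores = [prm_score_step(question, steps[:i], step)
                  for i, step in enumerate(steps)]
        # A chain of reasoning is only as strong as its weakest step.
        return min(scores)
    return max(candidates, key=aggregate)
```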

Monte Carlo Tree Search (MCTS). Inspired by AlphaGo (Silver et al., 2017b) and AlphaZero (Silver et al., 2017a), Monte Carlo Tree Search (MCTS) was explored as a means of enhancing test-time compute efficiency. This involved segmenting solutions into smaller components, allowing the model to explore the search space systematically. The process consisted of two main phases:

  1. Prompting the model to generate structured reasoning steps, which were then mapped to specific search branches.
  2. Using collected prompts to guide MCTS exploration via a pre-trained value model, iteratively refining the actor and value models.
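To illustrate the two-phase procedure above, here is a heavily simplified step-level MCTS loop. `propose_steps` (a policy proposing candidate next reasoning steps) and `value` (a value model scoring a partial solution) are hypothetical hooks, and real implementations bound node expansion much more carefully.

```python
import math
import random

class Node:
    def __init__(self, prefix, parent=None):
        self.prefix = prefix            # list of reasoning steps so far
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_value = 0.0

def ucb(node, c=1.4):
    """Upper-confidence bound used during selection."""
    if node.visits == 0:
        return float("inf")
    exploit = node.total_value / node.visits
    explore = c * math.sqrt(math.log(node.parent.visits) / node.visits)
    return exploit + explore

def mcts(question, propose_steps, value, n_simulations=100, max_depth=6):
    root = Node(prefix=[])
    for _ in range(n_simulations):
        # 1) Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # 2) Expansion: ask the policy for candidate next reasoning steps.
        if len(node.prefix) < max_depth:
            for step in propose_steps(question, node.prefix):
                node.children.append(Node(node.prefix + [step], parent=node))
            if node.children:
                node = random.choice(node.children)
        # 3) Evaluation: the value model scores the partial solution.
        v = value(question, node.prefix)
        # 4) Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.total_value += v
            node = node.parent
    # Commit to the most-visited first step as the chosen continuation.
    best_child = max(root.children, key=lambda n: n.visits)
    return best_child.prefix
```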

However, scaling this approach introduced several key challenges:

  1. Exponential search space expansion – Unlike board games, where the search space is relatively constrained, the search space of token generation grows exponentially. Imposing a maximum extension limit on search nodes helped mitigate this but led to local optima issues.
  2. Value model dependency – The quality of MCTS outputs depended heavily on the value model’s accuracy. Training a sufficiently granular value model proved difficult, leading to inconsistencies in search effectiveness.
  3. Generalization difficulties – While MCTS is highly effective in environments with well-defined rules, applying it to open-ended reasoning tasks posed fundamental challenges due to the fluid nature of language generation.