DeepSeek-R1: Incentivizing Reasoning Capability in Large Language Models via Reinforcement Learning
Abstract
General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) (Brown et al., 2020; OpenAI, 2023) and chain-of-thought prompting (Wei et al., 2022b), have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.
Introduction
Reasoning capability, the cornerstone of human intelligence, enables complex cognitive tasks ranging from mathematical problem-solving to logical deduction and programming. Recent advances in artificial intelligence have demonstrated that large language models (LLMs) can exhibit emergent behaviors, including reasoning abilities, when scaled to a sufficient size. However, achieving such capabilities in pre-training typically demands substantial computational resources. In parallel, a complementary line of research has demonstrated that large language models can be effectively augmented through chain-of-thought (CoT) prompting. This technique, which involves either providing carefully designed few-shot examples or using minimalistic prompts such as “Let’s think step by step”, enables models to produce intermediate reasoning steps, thereby substantially enhancing their performance on complex tasks. Similarly, further performance gains have been observed when models learn high-quality, multi-step reasoning trajectories during the post-training phase. Despite their effectiveness, these approaches exhibit notable limitations. Their dependence on human-annotated reasoning traces hinders scalability and introduces cognitive biases. Furthermore, by constraining models to replicate human thought processes, their performance is inherently capped by the human-provided exemplars, which prevents the exploration of superior, non-human-like reasoning pathways.
To tackle these issues, we aim to explore the potential of LLMs for developing reasoning abilities through self-evolution in an RL framework, with minimal reliance on human labeling efforts. Specifically, we build upon DeepSeek-V3-Base and employ Group Relative Policy Optimization (GRPO) as our RL framework. The reward signal is solely based on the correctness of final predictions against ground-truth answers, without imposing constraints on the reasoning process itself. Notably, we bypass the conventional supervised fine-tuning (SFT) phase before RL training. This design choice stems from our hypothesis that human-defined reasoning patterns may limit model exploration, whereas unrestricted RL training can better incentivize the emergence of novel reasoning capabilities in LLMs. Through this process, detailed in Section 2, our model (referred to as DeepSeek-R1-Zero) naturally developed diverse and sophisticated reasoning behaviors. In solving reasoning problems, the model exhibits a tendency to generate longer responses, incorporating verification, reflection, and the exploration of alternative approaches within each response. Although we do not explicitly teach the model how to reason, it successfully learns improved reasoning strategies through reinforcement learning.
Although DeepSeek-R1-Zero demonstrates excellent reasoning capabilities, it faces challenges such as poor readability and language mixing, occasionally combining English and Chinese within a single chain-of-thought response. Furthermore, the rule-based RL training stage of DeepSeek-R1-Zero is narrowly focused on reasoning tasks, resulting in limited performance in broader areas such as writing and open-domain question answering. To address these challenges, we introduce DeepSeek-R1, a model trained through a multi-stage learning framework that integrates rejection sampling, reinforcement learning, and supervised fine-tuning, detailed in Section 3. This training pipeline enables DeepSeek-R1 to inherit the reasoning capabilities of its predecessor, DeepSeek-R1-Zero, while aligning model behavior with human preferences through additional non-reasoning data.
To enable broader access to powerful AI at a lower energy cost, we have distilled several smaller models and made them publicly available. These distilled models exhibit strong reasoning capabilities, surpassing the performance of their original instruction-tuned counterparts. We believe that these instruction-tuned versions will also significantly contribute to the research community by providing a valuable resource for understanding the mechanisms underlying long chain-of-thought (CoT) reasoning models and for fostering the development of more powerful reasoning models.
DeepSeek-R1-Zero
We begin by elaborating on the training of DeepSeek-R1-Zero, which relies exclusively on reinforcement learning without supervised fine-tuning. To facilitate large-scale RL efficiency, we adopt Group Relative Policy Optimization (GRPO).
2.1. Group Relative Policy Optimization
GRPO is the reinforcement learning algorithm that we adopt to train DeepSeek-R1-Zero and DeepSeek-R1. It was originally proposed to simplify the training process and reduce the resource consumption of Proximal Policy Optimization (PPO), which is widely used in the RL stage of LLMs.
For each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{\mathrm{old}}}$ and then optimizes the policy model $\pi_\theta$ by maximizing the following objective:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left( \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},\ 1-\varepsilon,\ 1+\varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) \right) \right],$$

where $\varepsilon$ and $\beta$ are hyperparameters, $\pi_{\mathrm{ref}}$ is the reference model, and the advantage $A_i$ is computed from the group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:

$$A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}.$$
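The core of GRPO, computing each output's advantage relative to its sampled group rather than against a learned value baseline, can be sketched as follows. This is a minimal illustration with our own helper names, not any released implementation; the clipped-surrogate term follows the standard PPO form.

```python
import math

def group_advantages(rewards):
    """Normalize a group of rollout rewards to zero mean and unit std.

    GRPO replaces PPO's learned value baseline with this group statistic:
    A_i = (r_i - mean(r)) / std(r).
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(ratio, advantage, eps):
    """PPO-style clipped objective term for one output:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped * advantage)
```

Note that when all rewards in a group are identical, every advantage is zero, so such groups contribute no learning signal.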
We give a comparison of GRPO and PPO in Supplementary A.3. To train DeepSeek-R1-Zero, we set the learning rate to 3e-6, the KL coefficient to 0.001, and the sampling temperature to 1 for rollout. For each question, we sample 16 outputs with a maximum length of 32,768 tokens before the 8.2k step and 65,536 tokens afterward. As a result, both the performance and response length of DeepSeek-R1-Zero exhibit a significant jump at the 8.2k step, with training continuing for a total of 10,400 steps, corresponding to 1.6 training epochs. Each training step consists of 32 unique questions, resulting in a training batch size of 512. Every 400 steps, we replace the reference model with the latest policy model. To accelerate training, each rollout generates 8,192 outputs, which are randomly split into 16 mini-batches and trained for only a single inner epoch.
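The rollout bookkeeping described above can be sketched as follows (shapes only; splitting 8,192 outputs into 16 mini-batches yields the stated batch size of 512 per update):

```python
import random

def split_minibatches(outputs, n_minibatches=16, seed=None):
    """Randomly partition one rollout's outputs into equal mini-batches.

    Each mini-batch is used for a single inner training epoch, so every
    sampled output contributes to exactly one gradient update.
    """
    rng = random.Random(seed)
    shuffled = list(outputs)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_minibatches
    return [shuffled[i * size:(i + 1) * size] for i in range(n_minibatches)]
```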
Our high-performance RL infrastructure is described in Supplementary B.1, ensuring scalable and efficient training.
2.2. Reward Design
The reward is the source of the training signal, which decides the direction of RL optimization. For DeepSeek-R1-Zero, we employ rule-based rewards to deliver precise feedback for data in mathematical, coding, and logical reasoning domains. Our rule-based reward system mainly consists of two types of rewards: accuracy rewards and format rewards.
Accuracy rewards evaluate whether the response is correct. For example, in the case of math problems with deterministic results, the model is required to provide the final answer in a specified format (e.g., within a box), enabling reliable rule-based verification of correctness. Similarly, for code competition prompts, a compiler can be utilized to evaluate the model's responses against a suite of predefined test cases, thereby generating objective feedback on correctness.
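A rule-based accuracy check of this kind might look as follows. This is a deliberately minimal string-matching sketch: a production verifier would also normalize equivalent mathematical forms, and the \boxed{} convention stands in for the "specified format" mentioned above.

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Rule-based accuracy reward: extract the last \\boxed{...} answer
    from the response and compare it with the ground truth."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0  # no parseable final answer
    predicted = matches[-1].strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0
```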
Format rewards complement the accuracy reward model by enforcing specific formatting requirements. In particular, the model is incentivized to encapsulate its reasoning process within designated tags, specifically <think> and </think>. This ensures that the model's thought process is explicitly delineated, enhancing interpretability and facilitating subsequent analysis.
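A format reward of this kind reduces to a structural check on the response. The exact rule used in training is not published, so the regex below is an assumption: reasoning wrapped in <think> tags, followed by a final answer outside them.

```python
import re

# Assumed structure: reasoning inside <think>...</think>, answer after it.
THINK_PATTERN = re.compile(r"^<think>.+?</think>.+$", re.DOTALL)

def format_reward(response: str) -> float:
    """Return 1.0 iff the reasoning is wrapped in <think> tags and a
    final answer follows outside the tags, else 0.0."""
    return 1.0 if THINK_PATTERN.match(response) else 0.0
```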
The accuracy reward and format reward are combined with equal weight. Notably, we abstain from applying neural reward models, whether outcome-based or process-based, to reasoning tasks. This decision is predicated on our observation that neural reward models are susceptible to reward hacking during large-scale reinforcement learning. Moreover, retraining such models necessitates substantial computational resources and introduces additional complexity into the training pipeline, thereby complicating the overall optimization process.
2.3. Incentivizing Reasoning Capability in LLMs
Specifically, we apply the RL technique to the DeepSeek-V3-Base model to train DeepSeek-R1-Zero. During training, we design a straightforward template that requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases, to ensure that we can accurately observe the model's natural progression during the RL process.
Figure 1(a) depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process, where the average pass@1 score shows a significant increase, jumping from an initial 15.6% to 77.9%. In addition, by leveraging self-consistency decoding, the model's performance can be further improved, achieving an accuracy of 86.7%. This performance significantly surpasses the average performance of all human competitors. Beyond math competitions, as shown in Figure 10, DeepSeek-R1-Zero also achieves remarkable performance in coding competitions and on graduate-level biology, physics, and chemistry problems. These results underscore the effectiveness of RL in enhancing the reasoning capabilities of large language models.
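Self-consistency decoding is simply a majority vote over many sampled final answers. A sketch, assuming the final answers have already been extracted from each sampled response:

```python
from collections import Counter

def self_consistency(final_answers):
    """Return the most frequent final answer among sampled responses.

    Sampling several reasoning chains at non-zero temperature and
    majority-voting on their final answers trades extra inference
    compute for accuracy.
    """
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    return Counter(final_answers).most_common(1)[0][0]
```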
The self-evolution of DeepSeek-R1-Zero exemplifies how RL can autonomously enhance a model's reasoning capabilities.
As shown in Figure 1(b), DeepSeek-R1-Zero exhibits a steady increase in thinking time throughout training, driven solely by intrinsic adaptation rather than external modifications. Leveraging long CoT, the model progressively refines its reasoning, generating hundreds to thousands of tokens to explore and improve its problem-solving strategies.
The increase in thinking time fosters the autonomous development of sophisticated behaviors. Specifically, DeepSeek-R1-Zero increasingly exhibits advanced reasoning strategies such as reflective reasoning and systematic exploration of alternative solutions (see Figure 9(a) in Supplementary C.2 for details), significantly boosting its performance on verifiable tasks like math and coding. Notably, during training, DeepSeek-R1-Zero exhibits an "aha moment" (Table 2), characterized by a sudden increase in the use of the word "wait" during reflections (see Figure 9(b) in Supplementary C.2 for details). This moment marks a distinct change in reasoning patterns and clearly shows the self-evolution process of DeepSeek-R1-Zero.
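The "aha moment" was detected by tracking word-usage statistics over training. A minimal proxy for such monitoring (the single-keyword heuristic here is our own simplification):

```python
def reflection_keyword_rate(responses, keyword="wait"):
    """Fraction of responses whose chain of thought contains a
    reflection cue such as "wait" (case-insensitive).

    Plotting this rate per training step surfaces sudden shifts in
    reasoning patterns, like the jump described above.
    """
    if not responses:
        return 0.0
    hits = sum(1 for r in responses if keyword in r.lower())
    return hits / len(responses)
```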
The self-evolution of DeepSeek-R1-Zero underscores the power and beauty of RL: rather than explicitly teaching the model how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. This serves as a reminder of the potential of RL to unlock higher levels of capabilities in LLMs, paving the way for more autonomous and adaptive models in the future.
DeepSeek-R1
Although DeepSeek-R1-Zero exhibits strong reasoning capabilities, it faces several issues. DeepSeek-R1-Zero struggles with challenges such as poor readability and language mixing, as DeepSeek-V3-Base is trained on multiple languages, especially English and Chinese. To address these issues, we develop DeepSeek-R1, whose pipeline is illustrated in Figure 2.
In the initial stage, we collect thousands of cold-start examples that exhibit a conversational, human-aligned thinking process. RL training is then applied to improve model performance with this conversational thinking process while enforcing language consistency. Subsequently, we apply rejection sampling and SFT once more. This stage incorporates both reasoning and non-reasoning datasets into the SFT process, enabling the model to not only excel in reasoning tasks but also demonstrate advanced writing capabilities. To further align the model with human preferences, we implement a secondary RL stage designed to enhance the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities.
The remainder of this section details the key components of this pipeline: Section 3.1 introduces the Reward Model utilized in our RL stages, and Section 3.2 elaborates on the specific training methodologies and implementation details. Data we used in this stage is detailed in Supplementary B.3.
3.1. Model-based Rewards
For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process.
Helpful Reward Model Regarding helpful reward model training, we first generate preference pairs by prompting DeepSeek-V3 using the arena-hard prompt format, listed in Supplementary B.2, where each pair consists of a user query along with two candidate responses. For each preference pair, we query DeepSeek-V3 four times, randomly assigning the responses as either Response A or Response B to mitigate positional bias. The final preference score is determined by averaging the four independent judgments, retaining only those pairs where the score difference exceeds 1 to ensure a meaningful distinction. In addition, to minimize length-related bias, we ensure that the chosen and rejected responses have comparable lengths across the entire dataset. In total, we curated 66,000 data pairs for reward model training. The prompts in this dataset are all non-reasoning questions, sourced either from publicly available open-source datasets or from users who explicitly consented to share their data for model-improvement purposes. The architecture of our reward model is consistent with that of DeepSeek-R1, with the addition of a reward head that predicts a scalar preference score.
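The position-debiased judging procedure can be sketched as follows. Here `judge` stands in for the black-box DeepSeek-V3 call and is assumed to return a score in favor of its first-listed response (positive) or its second (negative); the signature and filtering logic are our assumptions.

```python
import random

def preference_score(query, resp_a, resp_b, judge, n_trials=4, threshold=1.0):
    """Average several position-randomized judgments of a response pair.

    Randomizing which response is shown first mitigates positional bias;
    pairs whose averaged score magnitude does not exceed `threshold`
    are discarded (returned as None).
    """
    total = 0.0
    for _ in range(n_trials):
        if random.random() < 0.5:
            total += judge(query, resp_a, resp_b)   # a shown first
        else:
            total -= judge(query, resp_b, resp_a)   # b shown first, sign flipped
    score = total / n_trials
    return score if abs(score) > threshold else None
```

A judge with pure positional bias (always favoring whichever response is listed first) averages out to roughly zero under this scheme and the pair is filtered away, whereas a genuine preference survives the randomization.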
The helpful reward models were trained for a single epoch over the training dataset with a batch size of 256 and a learning rate of 6e-6. The maximum sequence length during training was set to 8,192 tokens, whereas no explicit limit is imposed during reward model inference.
Safety Reward Model To assess and improve model safety, we curated a dataset of 106,000 prompts with model-generated responses annotated as "safe" or "unsafe" according to predefined safety guidelines. Unlike the pairwise loss employed in the helpfulness reward model, the safety reward model was trained using a point-wise methodology to distinguish between safe and unsafe responses. The training hyperparameters are the same as the helpful reward model.
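The two training objectives differ as follows. The text names them only "pairwise" and "point-wise", so the Bradley-Terry and binary cross-entropy forms below are the standard instantiations, assumed here for illustration.

```python
import math

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss (helpfulness RM): pushes the model to
    score the chosen response above the rejected one:
    -log sigmoid(s_chosen - s_rejected)."""
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

def pointwise_loss(score: float, is_safe: bool) -> float:
    """Binary cross-entropy point-wise loss (safety RM): the scalar head
    classifies each single response as safe or unsafe."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -math.log(p if is_safe else 1.0 - p)
```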
For general queries, each instance is categorized as belonging to either the safety dataset or the helpfulness dataset. The general reward assigned to each query corresponds to the reward defined for its associated dataset.
3.2. Training Details
3.2.1. Training Details of the First RL Stage
In the first stage of RL, we set the learning rate to 3e-6, the KL coefficient to 0.001, the GRPO clip ratio
Although ablation experiments in Supplementary B.6 show that enforcing this alignment results in a slight degradation in the model's performance, the language consistency reward aligns the output with human preferences, making it more readable. We apply the language consistency reward to both reasoning and non-reasoning data by adding it directly to the final reward.
Note that the clip ratio plays a crucial role in training. A lower value can lead to the truncation of gradients for a significant number of tokens, thereby degrading the model's performance, while a higher value may cause instability during training.
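This effect can be made concrete: under the clipped surrogate, a token whose importance ratio has left the trust region contributes zero gradient. A sketch of the per-token gradient scale, obtained by differentiating the standard clipped objective with respect to the ratio (our derivation, not code from the paper):

```python
def surrogate_gradient_scale(ratio, advantage, eps):
    """d/d(ratio) of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).

    Returns 0.0 when the token's gradient is truncated by clipping:
    for positive advantages this happens once ratio >= 1 + eps, and for
    negative advantages once ratio <= 1 - eps. A small eps therefore
    silences more tokens, while a large eps lets big off-policy updates
    through, which can destabilize training.
    """
    if advantage >= 0:
        return advantage if ratio < 1 + eps else 0.0
    return advantage if ratio > 1 - eps else 0.0
```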
3.2.2. Training Details of the Second RL Stage
Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we follow the methodology outlined in DeepSeek-R1-Zero, which employs rule-based rewards to guide learning in mathematical, coding, and logical reasoning domains. During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. For general data, we utilize reward models to guide training. Ultimately, the integration of reward signals with diverse data distributions enables us to develop a model that not only excels in reasoning but also prioritizes helpfulness and harmlessness. Given a batch of data, each sample's reward is therefore either the rule-based reward, for reasoning prompts, or the model-based reward, for general prompts.
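Per-batch reward assignment therefore reduces to routing each sample to its reward source. A sketch, where the callables stand in for the rule-based verifier and the two reward models, and the `type` tags are our own labels:

```python
def batch_reward(sample, rule_reward, helpful_rm, safety_rm):
    """Route one training sample to its reward source.

    Reasoning data gets the rule-based reward; general data gets the
    model-based reward from whichever preference dataset it belongs to.
    """
    if sample["type"] == "reasoning":
        return rule_reward(sample)
    if sample["type"] == "safety":
        return safety_rm(sample)
    return helpful_rm(sample)
```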
The second stage of RL retains most of the parameters from the first stage, with the key difference being a reduced temperature of 0.7, as we find that higher temperatures in this stage lead to incoherent generation. The stage comprises a total of 1,700 training steps, during which general instruction data and preference-based rewards are incorporated exclusively in the final 400 steps. We find that more training steps with the model based preference reward signal may lead to reward hacking, which is documented in Supplementary B.5. The total training cost is listed in Supplementary B.4.4.
Ethics and Safety Statement
With the advancement in the reasoning capabilities of DeepSeek-R1, we deeply recognize the potential ethical risks. For example, R1 can be subject to jailbreak attacks that elicit dangerous content such as plans for manufacturing explosives, and its enhanced reasoning capabilities enable it to provide plans with greater operational feasibility and executability. In addition, a publicly released model is also vulnerable to further fine-tuning that could compromise its inherent safety protections.
In Supplementary D.3, we present a comprehensive safety report from multiple perspectives, including performance on open-source and in-house safety evaluation benchmarks, and safety levels across multiple languages and against jailbreak attacks. These comprehensive safety analyses conclude that the inherent safety level of the DeepSeek-R1 model, compared to other state-of-the-art models, is generally at a moderate level (comparable to GPT-4o (2024-05-13)). Besides, when coupled with the risk control system, the model's safety level is elevated to a superior standard.
Conclusion, Limitation, and Future Work
We present DeepSeek-R1-Zero and DeepSeek-R1, which rely on large-scale RL to incentivize model reasoning behaviors. Our results demonstrate that pre-trained checkpoints inherently possess substantial potential for complex reasoning tasks. We believe that the key to unlocking this potential lies not in large-scale human annotation but in the provision of hard reasoning questions, a reliable verifier, and sufficient computational resources for reinforcement learning. Sophisticated reasoning behaviors, such as self-verification and reflection, appeared to emerge organically during the reinforcement learning process.
Although DeepSeek-R1 achieves frontier results on reasoning benchmarks, it still faces several capability limitations, as outlined below:
Structured Output and Tool Use: Currently, the structured output capabilities of DeepSeek-R1 remain suboptimal compared with existing models. Moreover, DeepSeek-R1 cannot leverage tools, such as search engines and calculators, to improve its outputs. However, since it is not hard to build an RL environment for structured output and tool use, we believe the issue will be addressed in the next version.
Token efficiency: Unlike conventional test-time computation scaling approaches, such as majority voting or Monte Carlo Tree Search (MCTS), DeepSeek-R1 dynamically allocates computational resources during inference according to the complexity of the problem at hand. Specifically, it uses fewer tokens to solve simple tasks, while generating more tokens for complex tasks. Nevertheless, there remains room for further optimization in terms of token efficiency, as instances of excessive reasoning—manifested as overthinking—are still observed in response to simpler questions.
Language Mixing: DeepSeek-R1 is currently optimized for Chinese and English, which may result in language mixing issues when handling queries in other languages. For instance, DeepSeek-R1 might use English for reasoning and responses even if the query is in a language other than English or Chinese. We aim to address this limitation in future updates. The limitation may be related to the fact that the base checkpoint, DeepSeek-V3-Base, is trained mainly on Chinese and English, so the model reasons more effectively in these two languages.
Prompt Engineering: When evaluating DeepSeek-R1, we observe that it is sensitive to prompts. Few-shot prompting consistently degrades its performance. Therefore, we recommend that users directly describe the problem and specify the output format in a zero-shot setting for optimal results.
Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively to software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a substantial improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency.
Beyond specific capability limitations, the pure RL methodology itself also presents inherent challenges:
Reward Hacking: The success of pure RL depends on reliable reward signals. In this study, we ensure reward reliability through a reasoning-domain rule-based reward model (RM). However, such dependable RMs are difficult to construct for certain tasks, such as writing. If the reward signal is assigned by a model instead of predefined rules, it becomes more susceptible to exploitation as training progresses, which means the policy model may find shortcuts to hack the reward model. Consequently, for complex tasks that cannot be effectively evaluated by a reliable reward model, scaling up pure RL methods remains an open challenge.
In this work, for tasks where a reliable reward signal cannot be obtained, DeepSeek-R1 relies on human annotation to create supervised data and conducts RL for only hundreds of steps. We hope that in the future a robust reward model can be obtained to address such issues.
With the advent of pure RL methods like DeepSeek-R1, the future holds immense potential for solving any task that can be effectively evaluated by a verifier, regardless of its complexity for humans. Machines equipped with such advanced RL techniques are poised to surpass human capabilities in these domains, driven by their ability to optimize performance iteratively through trial and error. However, challenges remain for tasks where constructing a reliable reward model is inherently difficult. In such cases, the lack of a robust feedback mechanism may hinder progress, suggesting that future research should focus on developing innovative approaches to define and refine reward structures for these complex, less verifiable problems.
Furthermore, leveraging tools during the reasoning process holds significant promise. Whether it is using tools such as compilers or search engines to retrieve or compute necessary information, or employing external instruments, such as biological or chemical reagents, to validate final results in the real world, this integration of tool-augmented reasoning could dramatically enhance the scope and accuracy of machine-driven solutions.