DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek
General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) (Brown et al., 2020; OpenAI, 2023) and chain-of-thought prompting (Wei et al., 2022b), have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.
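The pure-RL recipe hinges on rewards that can be checked mechanically on verifiable tasks, rather than on human-labeled reasoning trajectories. The Python sketch below illustrates that idea under two assumptions not stated in the abstract: a rule-based reward that only verifies a final boxed answer, and a group-relative advantage computation; the function names and the `\boxed{}` convention are illustrative, not the paper's implementation.

```python
import re


def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Rule-based reward: 1.0 if the final boxed answer matches, else 0.0.

    Only the final answer is checked -- no annotated reasoning trace is
    needed, so intermediate reasoning patterns are free to emerge.
    """
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer else 0.0


def group_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each sampled completion for the same
    prompt is scored against the mean reward of its sampling group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

With rewards of this shape, completions that reach the correct answer get a positive advantage and the rest a negative one, which is the signal a policy-gradient update would amplify.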
MiniMax-01: Scaling Foundation Models with Lightning Attention
We introduce the MiniMax-01 series, including MiniMax-Text-01 and MiniMax-VL-01, which are comparable to top-tier models while offering superior capabilities in processing longer contexts. The core lies in lightning attention and its efficient scaling. To maximize computational capacity, we integrate it with Mixture of Experts (MoE), creating a model with 32 experts and 456 billion total parameters, of which 45.9 billion are activated for each token. We develop an optimized parallel strategy and highly efficient computation-communication overlap techniques for MoE and lightning attention. This approach enables us to conduct efficient training and inference on models with hundreds of billions of parameters across contexts spanning millions of tokens. The context window of MiniMax-Text-01 can reach up to 1 million tokens during training and extrapolate to 4 million tokens during inference at an affordable cost. Our vision-language model, MiniMax-VL-01, is built through continued training with 512 billion vision-language tokens. Experiments on both standard and in-house benchmarks show that our models match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32 times longer context window.
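Lightning attention belongs to the family of linear attention mechanisms, which avoid the quadratic cost of softmax attention by pushing queries and keys through a positive feature map and reassociating the matrix product. The NumPy sketch below shows only that generic reassociation trick; the actual lightning attention kernel, its feature map, and its I/O-aware blocked implementation differ.

```python
import numpy as np


def linear_attention(Q, K, V):
    """Kernelized linear attention: softmax(QK^T)V is approximated by
    phi(Q) @ (phi(K)^T @ V), reducing cost from O(n^2 * d) to O(n * d^2)
    because the (d x d_v) summary phi(K)^T @ V is sequence-length free."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6  # illustrative positive feature map
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                    # (d, d_v) summary of keys and values
    z = Qf @ Kf.sum(axis=0)          # per-query normalizer
    return (Qf @ kv) / z[:, None]
```

Because the feature map is positive, each output row is a convex combination of value rows, mirroring what softmax attention produces while never materializing the n-by-n attention matrix.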
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1, without distillation from superior models. rStar-Math achieves this by exercising “deep thinking” through Monte Carlo Tree Search (MCTS), where a math policy SLM performs test-time search guided by an SLM-based process reward model. rStar-Math introduces three innovations to tackle the challenges in training the two SLMs: (1) a novel code-augmented CoT data synthesis method, which performs extensive MCTS rollouts to generate step-by-step verified reasoning trajectories used to train the policy SLM; (2) a novel process reward model training method that avoids naïve step-level score annotation, yielding a more effective process preference model (PPM); (3) a self-evolution recipe in which the policy SLM and PPM are built from scratch and iteratively evolved to improve reasoning capabilities. Through 4 rounds of self-evolution with millions of synthesized solutions for 747k math problems, rStar-Math boosts SLMs’ math reasoning to state-of-the-art levels. On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students.
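The interplay between the policy SLM (which proposes candidate next steps) and the process reward model (which scores partial trajectories) can be illustrated with a stripped-down greedy search. rStar-Math itself runs full MCTS rollouts with backpropagated values; the sketch below, with hypothetical `candidates_fn` and `score_fn` stubs standing in for the policy and the PPM, only shows how a step proposer and a step scorer cooperate at test time.

```python
def search_step_by_step(candidates_fn, score_fn, max_steps):
    """Greedy test-time search sketch: at each step, the policy proposes
    candidate next steps and a process reward model scores each extended
    partial trajectory; the best-scoring step is committed. (A full MCTS
    would instead expand, roll out, and backpropagate values over a tree.)"""
    trajectory = []
    for _ in range(max_steps):
        candidates = candidates_fn(trajectory)
        if not candidates:          # policy signals the solution is complete
            break
        best = max(candidates, key=lambda step: score_fn(trajectory + [step]))
        trajectory.append(best)
    return trajectory
```

The same two-component structure is what the self-evolution recipe iterates on: trajectories found by search become training data that improves both the proposer and the scorer for the next round.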
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning Large Language Models (LLMs). The dominant algorithm, Proximal Policy Optimization (PPO), employs a critic network to estimate advantages, which introduces significant computational and memory overhead. To address this, a family of critic-free algorithms (e.g., GRPO, RLOO) has emerged. However, these methods typically rely on prompt-level (local) advantage normalization, which suffers from inaccurate advantage estimation, a tendency to overfit, and, as we show, is a theoretically biased estimator. To solve these challenges, we introduce REINFORCE++, a critic-free framework centered on Global Advantage Normalization. By normalizing advantages across the entire global batch rather than small, prompt-specific groups, our method provides a more stable and theoretically sound, effectively unbiased estimate (whose bias vanishes as batch size increases). We introduce two variants: REINFORCE++, a highly efficient and general algorithm for general-domain RLHF; and REINFORCE++-baseline, a robust variant for reinforcement learning with verifiable rewards (RLVR).
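The contrast between prompt-level and global advantage normalization is easy to state in code. A minimal sketch, assuming scalar per-sample rewards grouped by prompt (function names are illustrative, not the paper's API):

```python
import statistics


def prompt_level_advantages(rewards_by_prompt):
    """GRPO/RLOO-style: normalize within each prompt's small group.
    With tiny groups, the per-group mean/std is a noisy scale, and a
    group with identical rewards collapses to zero signal."""
    advs = []
    for rewards in rewards_by_prompt:
        mu = statistics.mean(rewards)
        sd = statistics.pstdev(rewards) or 1.0  # guard against zero std
        advs.append([(r - mu) / sd for r in rewards])
    return advs


def global_advantages(rewards_by_prompt):
    """REINFORCE++-style global normalization: a single mean/std over
    the entire batch, whose estimation error shrinks as batch size grows."""
    flat = [r for group in rewards_by_prompt for r in group]
    mu = statistics.mean(flat)
    sd = statistics.pstdev(flat) or 1.0
    return [[(r - mu) / sd for r in group] for group in rewards_by_prompt]
```

Note how a prompt whose samples all earn the same reward yields all-zero advantages under prompt-level normalization, whereas global normalization still assigns it a meaningful advantage relative to the rest of the batch.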
Hunyuan3D 2.0: Scaling Diffusion Models for High Resolution Textured 3D Assets Generation
Tencent
We present Hunyuan3D 2.0, an advanced large-scale 3D synthesis system for generating high-resolution textured 3D assets. This system includes two foundation components: a large-scale shape generation model – Hunyuan3D-DiT, and a large-scale texture synthesis model – Hunyuan3D-Paint. The shape generative model, built on a scalable flow-based diffusion transformer, aims to create geometry that properly aligns with a given condition image, laying a solid foundation for downstream applications. The texture synthesis model, benefiting from strong geometric and diffusion priors, produces high-resolution and vibrant texture maps for either generated or hand-crafted meshes. Furthermore, we build Hunyuan3D-Studio – a versatile, user-friendly production platform that simplifies the re-creation process of 3D assets. It allows both professional and amateur users to manipulate or even animate their meshes efficiently. We systematically evaluate our models, showing that Hunyuan3D 2.0 outperforms previous state-of-the-art models, including both open-source and closed-source models, in geometry details, condition alignment, texture quality, and more. Hunyuan3D 2.0 is publicly released in order to fill the gaps in the open-source 3D community for large-scale foundation generative models.