Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
We explore a method for improving the performance of large language models through self-reflection and reinforcement learning. By incentivizing the model to generate better self-reflections when it answers incorrectly, we demonstrate that a model's ability to solve complex, verifiable tasks can be enhanced even when generating synthetic data is infeasible and only binary feedback is available. Our framework operates in two stages: first, upon failing a given task, the model generates a self-reflective commentary analyzing its previous attempt; second, the model is given another attempt at the task with the self-reflection in context. If the subsequent attempt succeeds, the tokens generated during the self-reflection phase are rewarded. Our experimental results show substantial performance gains across a variety of model architectures, with improvements of up to 34.7% on math equation writing and 18.1% on function calling. Notably, smaller fine-tuned models (1.5 billion to 7 billion parameters) outperform models in the same family that are 10 times larger. Our novel paradigm is thus an exciting pathway to more useful and reliable language models that can self-improve on challenging tasks with limited external feedback.
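The two-stage loop above can be sketched in a few lines. This is a minimal, self-contained toy, not the paper's implementation: the "model" is a plain function returning a string, the verifier is a string check standing in for binary task feedback, and "rewarding" the reflection tokens is recorded in a list rather than applied as an RL gradient update.

```python
# Toy sketch of the reflect-retry-reward loop: generate, verify, and the
# reward bookkeeping are all hypothetical stand-ins for an LLM sampler,
# a binary task verifier, and an RL update over the reflection tokens.

def reflect_retry_reward(generate, verify, prompt, rewarded_reflections):
    first = generate(prompt)
    if verify(first):                      # only binary feedback is available
        return first                       # success on the first try: no reward
    # Stage 1: on failure, ask the model to reflect on its failed attempt.
    reflection = generate(prompt + first + " | reflect")
    # Stage 2: retry the task with the self-reflection in context.
    second = generate(prompt + " | " + reflection)
    if verify(second):
        # Only the tokens generated during the reflection phase would be
        # rewarded; here we just record which reflection earned the reward.
        rewarded_reflections.append(reflection)
    return second

# Toy model: fails without a reflection in context, succeeds with one.
def toy_generate(prompt):
    if "reflect" in prompt:
        return "check the sign of the operand"
    return "42" if "check the sign" in prompt else "-42"

rewarded = []
answer = reflect_retry_reward(toy_generate, lambda a: a == "42", "solve", rewarded)
```

Note that a correct first attempt produces no reward signal at all: the method only trains the reflection, so gradients flow exclusively through reflections that turn a failure into a success.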
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
We introduce MiniMax-M1, the world's first open-weight, large-scale hybrid-attention reasoning model. MiniMax-M1 is powered by a hybrid Mixture-of-Experts (MoE) architecture combined with a lightning attention mechanism. The model is developed based on our previous MiniMax-Text-01 model (MiniMax et al., 2025), which contains a total of 456 billion parameters with 45.9 billion parameters activated per token. The M1 model natively supports a context length of 1 million tokens, 8x the context size of DeepSeek R1. Furthermore, the lightning attention mechanism in MiniMax-M1 enables efficient scaling of test-time compute; for example, compared to DeepSeek R1, M1 consumes only 25% of the FLOPs at a generation length of 100K tokens. These properties make M1 particularly suitable for complex tasks that require processing long inputs and thinking extensively. MiniMax-M1 is trained using large-scale reinforcement learning (RL) on diverse problems ranging from traditional mathematical reasoning to sandbox-based, real-world software engineering environments. In addition to the inherent efficiency advantage of lightning attention for RL training, we propose CISPO, a novel RL algorithm that further enhances RL efficiency. CISPO clips importance sampling weights rather than token updates, and outperforms other competitive RL variants. Combining hybrid attention and CISPO enables MiniMax-M1's full RL training on 512 H800 GPUs to complete in only three weeks, at a rental cost of just $534,700. We release two versions of MiniMax-M1 with 40K and 80K thinking budgets respectively, where the 40K model represents an intermediate phase of the 80K training. Experiments on standard benchmarks show that our models are comparable or superior to strong open-weight models such as the original DeepSeek-R1 and Qwen3-235B, with particular strengths in complex software engineering, tool utilization, and long-context tasks.
Through efficient scaling of test-time compute, MiniMax-M1 serves as a strong foundation for next-generation language model agents to reason and tackle real-world challenges.
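The distinction CISPO draws can be illustrated numerically. This is a toy sketch of the idea stated in the abstract, not the paper's algorithm: it assumes a REINFORCE-style objective in which the clipped importance-sampling (IS) weight acts as a stop-gradient coefficient on each token's log-probability term, and contrasts it with PPO-style clipping, where a token whose ratio leaves the clip range (with positive advantage) contributes no gradient at all. The epsilon values and the per-token coefficient form are illustrative assumptions.

```python
import numpy as np

# Effective per-token gradient coefficients (w.r.t. log-prob) under two
# clipping schemes, for positive advantages only in this toy example.

def ppo_token_coeff(ratio, adv, eps=0.2):
    # PPO clips the token update: when min() selects the clipped branch,
    # the token's gradient contribution is dropped entirely.
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    chosen = np.minimum(unclipped, clipped)
    # Gradient flows only where the unclipped branch is the one chosen.
    return np.where(np.isclose(chosen, unclipped), ratio, 0.0) * adv

def cispo_token_coeff(ratio, adv, eps_low=0.2, eps_high=0.2):
    # CISPO clips the IS weight itself (treated as a detached constant):
    # every token keeps an advantage-weighted log-prob gradient.
    return np.clip(ratio, 1 - eps_low, 1 + eps_high) * adv

ratios = np.array([0.5, 1.0, 1.6])   # off-policy ratios for three tokens
advs = np.array([1.0, 1.0, 1.0])     # positive advantages
```

For the third token (ratio 1.6, outside [0.8, 1.2]), the PPO-style coefficient is zero while the CISPO-style coefficient is the clipped weight 1.2: the token still contributes to the update, just with a bounded weight.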
Reinforcement Pre-Training
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained with RL, in which the model receives a verifiable reward for correctly predicting the next token of a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing next-token reasoning, RPT significantly improves next-token prediction accuracy in language modeling. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. These results position RPT as an effective and promising scaling paradigm for advancing language model pre-training.
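The reward signal RPT describes can be sketched in a few lines. This is a minimal, self-contained toy under stated assumptions: the tokenizer (whitespace split) and the predictor are hypothetical stand-ins, and in RPT the model would reason before emitting its prediction, with only the final predicted token scored against the corpus.

```python
# Toy sketch of RPT's verifiable reward: next-token prediction recast as
# an RL task where the ground-truth next token in the corpus is the
# verifier, so no domain-specific annotation is needed.

def rpt_rewards(predict, corpus_tokens):
    # Every prefix of the corpus yields a training example whose reward
    # is checkable against the true next token: 1 if correct, else 0.
    rewards = []
    for i in range(1, len(corpus_tokens)):
        guess = predict(corpus_tokens[:i])  # RPT would reason here first
        rewards.append(1.0 if guess == corpus_tokens[i] else 0.0)
    return rewards

# Toy predictor: always guesses "the"; rewarded only where that is correct.
corpus = "the cat sat on the mat".split()
rewards = rpt_rewards(lambda ctx: "the", corpus)
```

Because the reward is derived directly from raw text, every document in a pre-training corpus becomes RL training data, which is what makes the paradigm scale with unlabeled data rather than with annotation effort.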