
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

https://huggingface.co/papers/2601.05242

https://arxiv.org/abs/2601.05242

https://nvlabs.github.io/GDPO/

https://github.com/NVlabs/GDPO

NVIDIA


As language models become increasingly capable, users expect them to provide not only accurate responses but also behaviors aligned with diverse human preferences across a variety of scenarios. To achieve this, reinforcement learning (RL) pipelines have begun incorporating multiple rewards, each capturing a distinct preference, to guide models toward these desired behaviors. However, recent work has defaulted to applying Group Relative Policy Optimization (GRPO) in multi-reward settings without examining its suitability. In this paper, we demonstrate that directly applying GRPO to normalize distinct rollout reward combinations causes them to collapse into identical advantage values, reducing the resolution of the training signal and resulting in suboptimal convergence and, in some cases, early training failure. We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method that resolves these issues by decoupling the normalization of individual rewards, more faithfully preserving their relative differences and enabling more accurate multi-reward optimization along with substantially improved training stability. We compare GDPO with GRPO across three tasks: tool calling, math reasoning, and coding reasoning, evaluating both correctness metrics (accuracy, bug ratio) and constraint adherence metrics (format, length). Across all settings, GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.

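The advantage collapse described in the abstract can be seen in a toy example. The sketch below is illustrative, not the paper's implementation: the reward values, group size, and the choice to combine per-reward normalized advantages by summation are all assumptions made here for demonstration.

```python
import numpy as np

# Toy group of 4 rollouts scored by 2 binary rewards (values invented
# for illustration, e.g. "answer correct" and "format followed").
rewards = np.array([
    [1.0, 0.0],  # rollout 0: correct, wrong format
    [0.0, 1.0],  # rollout 1: incorrect, right format
    [0.0, 0.0],  # rollout 2: neither
    [1.0, 0.0],  # rollout 3: correct, wrong format
])

# GRPO-style: sum the rewards first, then group-normalize the scalar.
total = rewards.sum(axis=1)
grpo_adv = (total - total.mean()) / (total.std() + 1e-8)

# GDPO-style sketch: normalize each reward over the group separately,
# then combine the per-reward advantages (here by summation).
per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
gdpo_adv = per_reward.sum(axis=1)

# Rollouts 0 and 1 have distinct reward combinations but the same sum,
# so GRPO assigns them identical advantages; GDPO keeps them apart.
print(grpo_adv)
print(gdpo_adv)
```

Because rollouts 0 and 1 share the same reward sum, GRPO-style normalization gives them identical advantages even though they satisfy different objectives; normalizing each reward against its own group statistics preserves the distinction.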


Agentic Reasoning for Large Language Models

https://huggingface.co/papers/2601.12538

https://arxiv.org/abs/2601.12538

https://github.com/weitianxin/Awesome-Agentic-Reasoning

University of Illinois Urbana-Champaign, Meta, Amazon, Google DeepMind


Reasoning is a fundamental cognitive process underlying inference, problem-solving, and decision-making. While large language models (LLMs) demonstrate strong reasoning capabilities in closed-world settings, exemplified by standard benchmarks in mathematics and code, they struggle in open-ended and dynamic environments. The emergence of agentic reasoning marks a paradigm shift, bridging thought and action by reframing LLMs as autonomous agents that plan, act, and learn through continual interaction. In this survey, we provide a systematic roadmap by organizing agentic reasoning along three complementary dimensions. First, we characterize environmental dynamics through three layers: foundational agentic reasoning establishes core single-agent capabilities, including planning, tool use, and search, that operate in stable environments; self-evolving agentic reasoning examines how agents refine these capabilities through feedback, memory, and adaptation in evolving settings; and collective multi-agent reasoning extends intelligence to collaborative scenarios where multiple agents coordinate roles, share knowledge, and pursue shared goals. Across all layers, we analyze system constraints and optimization settings by distinguishing in-context reasoning, which scales test-time interaction through structured orchestration and adaptive workflow design, from post-training reasoning, which optimizes behaviors through reinforcement learning and supervised fine-tuning. We further review and contextualize agentic reasoning frameworks in real-world applications and benchmarks spanning science, robotics, healthcare, autonomous research, and math, illustrating how different reasoning mechanisms are instantiated and evaluated across domains. This survey synthesizes agentic reasoning methods into a unified roadmap that bridges thought and action, offering actionable guidance for agentic systems across environmental dynamics, optimization settings, and agent interaction. Finally, we outline open challenges and future directions, situating how agentic reasoning has developed while identifying what remains ahead: personalization, long-horizon interaction, world modeling, scalable multi-agent training, and governance frameworks for real-world deployment.



LTX-2: Efficient Joint Audio-Visual Foundation Model

https://huggingface.co/papers/2601.03233

https://arxiv.org/abs/2601.03233

https://app.ltx.studio/ltx-2-playground/i2v

https://github.com/Lightricks/LTX-2



Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent—missing the semantic, emotional, and atmospheric cues that audio provides. We introduce LTX-2, an open-source foundational model capable of generating high-quality, temporally synchronized audiovisual content in a unified manner. LTX-2 consists of an asymmetric dual-stream transformer with a 14B-parameter video stream and a 5B-parameter audio stream, coupled through bidirectional audio-video cross-attention layers with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning. This architecture enables efficient training and inference of a unified audiovisual model while allocating more capacity for video generation than audio generation. We employ a multilingual text encoder for broader prompt understanding and introduce a modality-aware classifier-free guidance (modality-CFG) mechanism for improved audiovisual alignment and controllability. Beyond generating speech, LTX-2 produces rich, coherent audio tracks that follow the characters, environment, style, and emotion of each scene—complete with natural background and foley elements. In our evaluations, the model achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. All model weights and code are publicly released.

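The abstract does not spell out the modality-aware CFG formula. One plausible reading, sketched below under that assumption, is that each modality's token span receives its own classifier-free guidance scale; the helper name `modality_cfg`, the token layout, and the scale values here are all hypothetical, not taken from the LTX-2 code.

```python
import numpy as np

def modality_cfg(uncond, cond, s_video, s_audio, n_video):
    """Apply a separate classifier-free guidance scale to the video and
    audio spans of a joint denoiser output, assumed to be laid out as
    [video tokens | audio tokens] along axis 0 (hypothetical layout)."""
    delta = cond - uncond
    guided = uncond.copy()
    guided[:n_video] += s_video * delta[:n_video]   # video-span guidance
    guided[n_video:] += s_audio * delta[n_video:]   # audio-span guidance
    return guided

# Demo on random stand-in predictions: 6 video tokens + 2 audio tokens.
rng = np.random.default_rng(0)
uncond = rng.normal(size=(8, 4))   # unconditional denoiser output
cond = rng.normal(size=(8, 4))     # text-conditional denoiser output
guided = modality_cfg(uncond, cond, s_video=7.0, s_audio=3.0, n_video=6)
```

With both scales set to 1 this reduces to the conditional prediction, and with both set to 0 to the unconditional one, matching standard CFG behavior; the per-modality scales are what would give independent control over visual fidelity and audiovisual alignment.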