
Group Sequence Policy Optimization

https://huggingface.co/papers/2507.18071

https://arxiv.org/abs/2507.18071

Alibaba


This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
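To make the contrast with token-level methods concrete, here is a minimal Python sketch of a sequence-level clipped objective in the spirit the abstract describes: the importance ratio is computed once per response from its total sequence likelihood (length-normalized), and clipping is applied to whole sequences rather than to individual tokens. The function names, the group-normalized advantage, and the `1e-8` stabilizer are illustrative assumptions, not the paper's exact implementation.

```python
import math

def gspo_objective(logp_new, logp_old, rewards, eps=0.2):
    """Sequence-level clipped surrogate over one group of sampled responses.

    logp_new / logp_old: per-token log-probs of each response under the
    current and old policies; rewards: one scalar reward per response."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5 + 1e-8
    advantages = [(r - mean) / std for r in rewards]  # group-normalized

    objs = []
    for lp_new, lp_old, a in zip(logp_new, logp_old, advantages):
        # Sequence-level, length-normalized importance ratio:
        #   s = (pi_new(y|x) / pi_old(y|x)) ** (1 / |y|)
        s = math.exp((sum(lp_new) - sum(lp_old)) / len(lp_new))
        clipped = max(1.0 - eps, min(1.0 + eps, s))
        objs.append(min(s * a, clipped * a))  # clipping applied per sequence
    return sum(objs) / len(objs)
```

With identical old and new policies the ratio is exactly 1 for every response, so the objective reduces to the mean advantage, which is zero by construction of the group normalization.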



A Survey of Context Engineering for Large Language Models

https://huggingface.co/papers/2507.13334

https://arxiv.org/abs/2507.13334

https://github.com/Meirtz/Awesome-Context-Engineering



The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational Components and the sophisticated Implementations that integrate them into intelligent systems. We first examine the foundational Components: (1) Context Retrieval and Generation, encompassing prompt-based generation and external knowledge acquisition; (2) Context Processing, addressing long sequence processing, self-refinement, and structured information integration; and (3) Context Management, covering memory hierarchies, compression, and optimization. We then explore how these components are architecturally integrated to create sophisticated System Implementations: (1) Retrieval-Augmented Generation (RAG), including modular, agentic, and graph-enhanced architectures; (2) Memory Systems, enabling persistent interactions; (3) Tool-Integrated Reasoning, for function calling and environmental interaction; and (4) Multi-Agent Systems, coordinating communication and orchestration. Through this systematic analysis of over 1400 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.
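The three foundational components can be illustrated with a toy payload-assembly pipeline: retrieve the most relevant documents, compress them to fit a budget, and assemble the context handed to the model. Everything below (the function names, the word-overlap scorer, the word budget) is an illustrative assumption, not an artifact of the survey.

```python
def score(query, doc):
    # Context retrieval: rank documents by crude word overlap with the query.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split()))

def build_context(query, docs, budget_words=30, top_k=2):
    # Retrieval: keep the top-k most relevant documents.
    ranked = sorted(docs, key=lambda d: score(query, d), reverse=True)[:top_k]
    # Processing + management: compress each document to a shared word budget.
    per_doc = budget_words // max(len(ranked), 1)
    compressed = [" ".join(d.split()[:per_doc]) for d in ranked]
    # Assemble the final information payload handed to the LLM.
    return "Context:\n" + "\n".join(compressed) + "\n\nQuestion: " + query
```

Real systems replace the overlap scorer with dense retrieval and the truncation step with learned compression, but the shape of the pipeline, select then condense then assemble, is the same.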



GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

https://huggingface.co/papers/2507.01006

https://arxiv.org/abs/2507.01006

https://github.com/zai-org/GLM-V

Tsinghua University


We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive—achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window.
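The abstract names RLCS without detailing it, but curriculum sampling in RL is commonly implemented by weighting training tasks toward difficulty levels where the policy is neither failing outright nor already saturated. The sketch below is a hypothetical illustration of that idea, not the paper's algorithm: it weights each task by `p * (1 - p)`, which peaks at a 50% success rate.

```python
import random

def curriculum_weights(success_rates):
    # Informative tasks sit between "too hard" (p ~ 0) and "solved" (p ~ 1);
    # p * (1 - p) peaks at p = 0.5. The 1e-3 floor keeps every task sampleable.
    return [p * (1.0 - p) + 1e-3 for p in success_rates]

def sample_task(tasks, success_rates, rng=random):
    # Draw one task with probability proportional to its curriculum weight.
    weights = curriculum_weights(success_rates)
    return rng.choices(tasks, weights=weights, k=1)[0]
```

As training progresses and success rates shift, the weights shift with them, so the sampler keeps steering compute toward the current frontier of the model's ability.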
