
The Dragon Hatchling: The Missing Link between the Transformer and Models of the Brain

https://arxiv.org/abs/2509.26507

https://github.com/pathwaycom/bdh



The relationship between computing systems and the brain has served as motivation for pioneering theoreticians since John von Neumann and Alan Turing. Uniform, scale-free biological networks such as the brain have powerful properties, including the ability to generalize over time, a capability that remains the main barrier for Machine Learning on its path to Universal Reasoning Models.


We introduce ‘Dragon Hatchling’ (BDH), a new Large Language Model architecture based on a scale-free biologically inspired network of n locally-interacting neuron particles. BDH couples strong theoretical foundations and inherent interpretability without sacrificing Transformer-like performance.


BDH is a practical, performant, state-of-the-art attention-based state-space sequence-learning architecture. In addition to being a graph model, BDH admits a GPU-friendly formulation. It exhibits Transformer-like scaling laws: we find empirically that BDH rivals GPT-2-architecture Transformer performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data.


BDH provides theoretical foundations for understanding model behavior in the limit of large size and reasoning time. Our results, formalized as a chain of reductions of expressiveness in the framework of computational Complexity Theory and Distributed Computing, and combined with findings on the BDH model, show a macro-to-micro correspondence of function between the general attention mechanisms in state-of-the-art Language Models, and attention mechanisms observed in the brain. These attention mechanisms formally converge as closed-form local graph dynamics at neurons and synapses: “the equations of reasoning”.


BDH can be represented as a brain model. It contains n neurons, organized as an excitatory circuit and an inhibitory circuit with integrate-and-fire thresholding of input signals at neurons. The working memory of BDH during inference relies entirely on synaptic plasticity with Hebbian learning using spiking neurons, at potentiation scales corresponding to minutes in the brain (up to hundreds of tokens). We confirm empirically that specific, individual synapses strengthen their connections whenever BDH hears or reasons about a specific concept while processing language inputs. The neuron interaction network of BDH is a graph of high modularity with a heavy-tailed degree distribution. The BDH model is biologically plausible, explaining one possible mechanism that human neurons could use to achieve speech.

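The synapse-level working memory described above can be caricatured in a few lines of NumPy: a Hebbian outer-product update with decay strengthens exactly those synapses whose pre- and post-synaptic neurons co-fire on a concept. This is an illustrative sketch under simplified assumptions (rate-coded activity, a single plasticity rule), not the BDH implementation; all names are made up.

```python
import numpy as np

def hebbian_step(S, pre, post, eta=0.1, decay=0.99):
    """One plasticity step: decay existing synapses, then strengthen
    synapse S[i, j] when post-neuron i and pre-neuron j co-fire."""
    return decay * S + eta * np.outer(post, pre)

n = 6
S = np.zeros((n, n))                     # synaptic state = working memory
concept = np.zeros(n)
concept[[1, 4]] = 1.0                    # a sparse, positive "concept" pattern
for _ in range(3):                       # each exposure potentiates the same synapses
    S = hebbian_step(S, concept, concept)
```

After repeated exposures, only the synapses between the concept's co-firing neurons carry weight, which is the state an interpretability probe could read out.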

BDH is designed for interpretability. Activation vectors of BDH are sparse and positive. We demonstrate monosemanticity in BDH on language tasks, including the representation of concept abstractions, which emerges even in small models below the 100M-parameter scale. Interpretability of state, which goes beyond interpretability of neurons and model parameters, is an inherent feature of the BDH architecture.

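Why sparse, positive activations help interpretability is easy to see with ReLU-style thresholding: roughly half the pre-activations are clipped to zero, and what remains is nonnegative, so each active neuron can be read as a positive vote. This toy check is illustrative only and not tied to the BDH codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
pre = rng.normal(size=1000)          # pre-activation values, roughly zero-mean
act = np.maximum(pre, 0.0)           # ReLU: activations positive by construction

sparsity = float((act == 0).mean())  # fraction of silent neurons
```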

We believe BDH opens the door to a new theory of “Thermodynamic Limit” behavior for language and reasoning models, with the ultimate goal of Probably Approximately Correct (PAC)-like bounds for generalization of reasoning over time.



Sharing is Caring: Efficient LM Post-Training with Collective RL Experience Sharing

https://huggingface.co/papers/2509.08721

https://arxiv.org/abs/2509.08721

https://github.com/gensyn-ai/rl-swarm

https://blog.gensyn.ai/sapo-efficient-lm-post-training-with-collective-rl/



Post-training language models (LMs) with reinforcement learning (RL) can enhance their complex reasoning capabilities without supervised fine-tuning, as demonstrated by DeepSeek-R1-Zero (DeepSeek-AI et al., 2025). However, effectively utilizing RL for LMs requires significant parallelization to scale up inference, which introduces non-trivial technical challenges (e.g., latency, memory, and reliability) alongside ever-growing financial costs. We present Swarm sAmpling Policy Optimization (SAPO), a fully decentralized and asynchronous RL post-training algorithm. SAPO is designed for decentralized networks of heterogeneous compute nodes, where each node manages its own policy model(s) while "sharing" rollouts with others in the network; no explicit assumptions about latency, model homogeneity, or hardware are required, and nodes can operate in isolation if desired. As a result, the algorithm avoids common bottlenecks in scaling RL post-training while also allowing (and even encouraging) new possibilities. By sampling rollouts "shared" across the network, it enables "Aha moments" to propagate, thereby bootstrapping the learning process. In this paper we show that SAPO achieved cumulative reward gains of up to 94% in controlled experiments. We also share insights from tests on a network with thousands of nodes contributed by Gensyn community members, who ran the algorithm on diverse hardware and models during an open-source demo.

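The rollout-sharing idea can be sketched as a sampling policy over a shared pool: each node trains on a mix of its own rollouts and rollouts sampled from peers, so a high-reward trajectory found anywhere in the swarm can enter every node's batch. The data structures below are hypothetical placeholders, not SAPO's actual interfaces.

```python
import random

def build_batch(local, shared_pool, k_local=4, k_shared=4, rng=random):
    """Mix a node's own rollouts with rollouts sampled from the swarm pool.
    A rollout is a (prompt_id, completion, reward) tuple; all illustrative."""
    batch = local[:k_local]
    if shared_pool:
        batch += rng.sample(shared_pool, min(k_shared, len(shared_pool)))
    return batch

rng = random.Random(0)
local = [("p0", f"ans{i}", 0.1) for i in range(4)]   # this node's mediocre rollouts
pool = [("p1", "aha", 1.0)] * 6                      # a high-reward rollout shared by peers
batch = build_batch(local, pool, rng=rng)
```

Because the pool is only sampled, not synchronized, nodes need no agreement on latency, model family, or hardware, matching the paper's "no explicit assumptions" claim.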


A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code

https://huggingface.co/papers/2508.18106

https://arxiv.org/abs/2508.18106

https://github.com/Tencent/AICGSecEval

https://aicgseceval.tencent.com/home

Tencent, Peking University, Fudan University, Shanghai Jiao Tong University, Tsinghua University, Zhejiang University


The increasing adoption of large language models (LLMs) in software engineering necessitates rigorous security evaluation of their generated code. However, existing benchmarks often lack relevance to real-world AI-assisted programming scenarios, making them inadequate for assessing the practical security risks associated with AI-generated code in production environments. To address this gap, we introduce A.S.E (AI Code Generation Security Evaluation), a repository-level evaluation benchmark designed to closely mirror real-world AI programming tasks, offering a comprehensive and reliable framework for assessing the security of AI-generated code. Our evaluation of leading LLMs on A.S.E reveals several key findings. In particular, current LLMs still struggle with secure coding. The complexity in repository-level scenarios presents challenges for LLMs that typically perform well on snippet-level tasks. Moreover, a larger reasoning budget does not necessarily lead to better code generation. These observations offer valuable insights into the current state of AI code generation and help developers identify the most suitable models for practical tasks. They also lay the groundwork for refining LLMs to generate secure and efficient code in real-world applications.



VLA-Adapter: An Effective Paradigm for Tiny-Scale Vision-Language-Action Model

https://huggingface.co/papers/2509.09372

https://arxiv.org/abs/2509.09372

https://github.com/OpenHelix-Team/VLA-Adapter

https://vla-adapter.github.io/


Vision-Language-Action (VLA) models typically bridge the gap between perceptual and action spaces by pre-training a large-scale Vision-Language Model (VLM) on robotic data. While this approach greatly enhances performance, it also incurs significant training costs. In this paper, we investigate how to effectively bridge vision-language (VL) representations to action (A). We introduce VLA-Adapter, a novel paradigm designed to reduce the reliance of VLA models on large-scale VLMs and extensive pre-training. To this end, we first systematically analyze the effectiveness of various VL conditions and present key findings on which conditions are essential for bridging perception and action spaces. Based on these insights, we propose a lightweight Policy module with Bridge Attention, which autonomously injects the optimal condition into the action space. In this way, our method achieves high performance using only a 0.5B-parameter backbone, without any robotic data pre-training. Extensive experiments on both simulated and real-world robotic benchmarks demonstrate that VLA-Adapter not only achieves state-of-the-art performance, but also offers the fastest inference speed reported to date. Furthermore, thanks to the proposed advanced bridging paradigm, VLA-Adapter enables the training of a powerful VLA model in just 8 hours on a single consumer-grade GPU, greatly lowering the barrier to deploying VLA models.

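As a rough sketch of what "injecting a VL condition into the action space" via cross-attention could look like, the fragment below attends from action queries to condition tokens. The real Bridge Attention module differs in its details (layer selection, gating, training); every name and shape here is illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bridge_attention(action_q, vl_tokens, Wq, Wk, Wv):
    """Cross-attention from action queries to VL condition tokens:
    the VL condition is injected into the action representation."""
    Q, K, V = action_q @ Wq, vl_tokens @ Wk, vl_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return attn @ V

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
action_q = rng.normal(size=(8, d))     # 8 action-chunk queries from the Policy module
vl_tokens = rng.normal(size=(32, d))   # condition tokens taken from the VLM
out = bridge_attention(action_q, vl_tokens, Wq, Wk, Wv)
```

Because only the small Policy-side projections are trained, a module of this shape stays lightweight relative to the frozen or compact VLM backbone.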


The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

https://huggingface.co/papers/2509.02547

https://arxiv.org/abs/2509.02547

https://github.com/xhyumiracle/Awesome-AgenticLLM-RL-Papers



The emergence of agentic reinforcement learning (Agentic RL) marks a paradigm shift from conventional reinforcement learning applied to large language models (LLM RL), reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds. This survey formalizes this conceptual shift by contrasting the degenerate single-step Markov Decision Processes (MDPs) of LLM RL with the temporally extended Partially Observable Markov Decision Processes (POMDPs) that define Agentic RL. Building on this foundation, we propose a comprehensive twofold taxonomy: one organized around core agentic capabilities, including planning, tool use, memory, reasoning, self-improvement, and perception, and the other around their applications across diverse task domains. Central to our thesis is that reinforcement learning serves as the critical mechanism for transforming these capabilities from static, heuristic modules into adaptive, robust agentic behavior. To support and accelerate future research, we consolidate the landscape of open-source environments, benchmarks, and frameworks into a practical compendium. By synthesizing over five hundred recent works, this survey charts the contours of this rapidly evolving field and highlights the opportunities and challenges that will shape the development of scalable, general-purpose AI agents.

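The survey's central contrast, a degenerate single-step MDP versus a temporally extended POMDP, can be made concrete with two toy episode loops. The environment, policy, and reward details below are invented purely for illustration.

```python
def llm_rl_episode(policy, prompt, reward_fn):
    """Degenerate single-step MDP: one action (the full completion), one reward."""
    response = policy(prompt)
    return reward_fn(prompt, response)

def agentic_rl_episode(policy, env, max_steps=10):
    """Temporally extended POMDP: the agent sees only observations,
    acts repeatedly, and accumulates reward over the horizon."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
        if done:
            break
    return total

class CountdownEnv:
    """Toy environment: reward 1.0 iff the agent says "stop" at step 3."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        done = action == "stop"
        return self.t, (1.0 if done and self.t == 3 else 0.0), done

one_step = llm_rl_episode(lambda p: p.upper(), "hi", lambda p, r: float(r == "HI"))
multi = agentic_rl_episode(lambda obs: "stop" if obs == 2 else "wait", CountdownEnv())
```

The second loop is where RL must shape planning, tool use, and memory: reward depends on a sequence of partially observed interactions rather than a single emission.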


A Survey of Reinforcement Learning for Large Reasoning Models

https://huggingface.co/papers/2509.08827

https://arxiv.org/abs/2509.08827

https://github.com/TsinghuaC3I/Awesome-RL-for-LRMs

Tsinghua University

In this paper, we survey recent advances in Reinforcement Learning (RL) for reasoning with Large Language Models (LLMs). RL has achieved remarkable success in advancing the frontier of LLM capabilities, particularly in addressing complex logical tasks such as mathematics and coding. As a result, RL has emerged as a foundational methodology for transforming LLMs into Large Reasoning Models (LRMs). With the rapid progress of the field, further scaling of RL for LRMs now faces foundational challenges not only in computational resources but also in algorithm design, training data, and infrastructure. To this end, it is timely to revisit the development of this domain, reassess its trajectory, and explore strategies to enhance the scalability of RL toward Artificial SuperIntelligence (ASI). In particular, we examine research applying RL to LLMs and LRMs for reasoning abilities, especially since the release of DeepSeek-R1, including foundational components, core problems, training resources, and downstream applications, to identify future opportunities and directions for this rapidly evolving area. We hope this review will promote future research on RL for broader reasoning models.



Qwen3-Omni Technical Report

https://huggingface.co/papers/2509.17765

https://arxiv.org/abs/2509.17765

https://github.com/QwenLM/Qwen3-Omni

Alibaba


We present Qwen3-Omni, a single multimodal model that for the first time maintains state-of-the-art performance across text, image, audio, and video without any degradation relative to single-modal counterparts. Qwen3-Omni matches the performance of same-sized single-modal models within the Qwen series and excels particularly on audio tasks. Across 36 audio and audio-visual benchmarks, Qwen3-Omni achieves open-source state-of-the-art (SOTA) on 32 benchmarks and overall SOTA on 22, outperforming strong closed-source models such as Gemini-2.5-Pro, Seed-ASR, and GPT-4o-Transcribe. Qwen3-Omni adopts a Thinker–Talker Mixture-of-Experts (MoE) architecture that unifies perception and generation across text, images, audio, and video, yielding fluent text and natural real-time speech. It supports text interaction in 119 languages, speech understanding in 19 languages and speech generation in 10 languages. The system can process audio recordings up to 40 minutes per instance for ASR and spoken-language understanding, enabling high-quality audio and audiovisual experiences across locales. It demonstrates strong instruction following and allows fine-grained customization of conversational tone and persona via user-defined system prompts. To reduce first-packet latency in streaming synthesis, the Talker autoregressively predicts discrete speech codecs using a multi-codebook scheme. Leveraging the representational capacity of these codebooks, we replace computationally intensive block-wise diffusion with a lightweight causal ConvNet, enabling streaming from the first codec frame. In cold-start settings (no prior context), Qwen3-Omni achieves a theoretical end-to-end first-packet latency of 234 ms. To further strengthen multimodal reasoning, we introduce a Thinking model that explicitly reasons over inputs from any modality. 
Since the research community currently lacks a general-purpose audio captioning model, we fine-tuned Qwen3-Omni-30B-A3B to obtain Qwen3-Omni-30B-A3B-Captioner, which produces detailed, low-hallucination captions for arbitrary audio inputs. Qwen3-Omni-30B-A3B, Qwen3-Omni-30B-A3B-Thinking and Qwen3-Omni-30B-A3B-Captioner are publicly released under the Apache 2.0 license.

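The claim that a causal ConvNet enables streaming from the first codec frame rests on a simple property: a causal convolution's output at frame t depends only on frames up to t, so it can be evaluated incrementally with a small left-context buffer. A minimal single-channel sketch, not Qwen3-Omni's actual decoder:

```python
import numpy as np

class CausalConv1d:
    """Causal 1-D convolution evaluated one frame at a time: the output at
    frame t uses only frames <= t, so synthesis can begin immediately."""
    def __init__(self, kernel):
        self.kernel = np.asarray(kernel, dtype=float)
        self.buf = np.zeros(len(self.kernel) - 1)   # left context kept between calls

    def step(self, frame):
        ctx = np.append(self.buf, frame)            # [x[t-2], x[t-1], x[t]] for K=3
        self.buf = ctx[1:]                          # slide the context window
        return float(ctx @ self.kernel)

kernel = [0.25, 0.5, 1.0]
frames = [1.0, 2.0, 3.0, 4.0]
conv = CausalConv1d(kernel)
stream = [conv.step(x) for x in frames]             # frame-by-frame streaming output

# offline reference: same causal convolution computed on the whole sequence
offline = np.convolve(frames, kernel[::-1], mode="full")[:4]
```

The streaming and offline outputs agree exactly, which is why replacing block-wise diffusion with a causal ConvNet removes the need to wait for a block of frames before emitting audio.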


MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

https://huggingface.co/papers/2509.22186

https://arxiv.org/abs/2509.22186

https://opendatalab.github.io/MinerU/

https://github.com/opendatalab/MinerU



We introduce MinerU2.5, a 1.2B-parameter document parsing vision-language model that achieves state-of-the-art recognition accuracy while maintaining exceptional computational efficiency. Our approach employs a coarse-to-fine, two-stage parsing strategy that decouples global layout analysis from local content recognition. In the first stage, the model performs efficient layout analysis on downsampled images to identify structural elements, circumventing the computational overhead of processing high-resolution inputs. In the second stage, guided by the global layout, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, we developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. Ultimately, MinerU2.5 demonstrates strong document parsing ability, achieving state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead.

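The coarse-to-fine strategy can be sketched as a two-stage function: layout analysis runs on a downsampled view, and recognition runs on native-resolution crops mapped back from the detected boxes. The model interfaces below are hypothetical placeholders, not MinerU2.5's API.

```python
def parse_document(image, layout_model, recognizer, scale=4):
    """Stage 1: layout on a downsampled view (cheap global pass).
    Stage 2: recognition on native-resolution crops (detail preserved)."""
    small = [row[::scale] for row in image[::scale]]   # downsample by `scale`
    boxes = layout_model(small)                        # boxes in downsampled coords
    results = []
    for (y0, x0, y1, x1) in boxes:
        # map the box back to native resolution and crop the original image
        crop = [row[x0 * scale:x1 * scale] for row in image[y0 * scale:y1 * scale]]
        results.append(recognizer(crop))
    return results

# toy demo: an 8x8 "image", one detected region, a recognizer that reports crop size
image = [[i * 8 + j for j in range(8)] for i in range(8)]
regions = parse_document(image,
                         layout_model=lambda small: [(0, 0, 1, 1)],
                         recognizer=lambda crop: (len(crop), len(crop[0])))
```

The expensive full-resolution pass is confined to the detected regions, which is where the efficiency claim comes from.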


UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

https://huggingface.co/papers/2509.02544

https://arxiv.org/abs/2509.02544

https://seed-tars.com/showcase/ui-tars-2/

https://github.com/bytedance/ui-tars



The development of autonomous agents for graphical user interfaces (GUIs) presents major challenges in artificial intelligence. While recent advances in native agent models have shown promise by unifying perception, reasoning, action, and memory through end-to-end learning, open problems remain in data scalability, multi-turn reinforcement learning (RL), the limitations of GUI-only operation, and environment stability. In this technical report, we present UI-TARS-2, a native GUI-centered agent model that addresses these challenges through a systematic training methodology: a data flywheel for scalable data generation, a stabilized multi-turn RL framework, a hybrid GUI environment that integrates file systems and terminals, and a unified sandbox platform for large-scale rollouts. Empirical evaluation demonstrates that UI-TARS-2 achieves significant improvements over its predecessor UI-TARS-1.5. On GUI benchmarks, it reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld, outperforming strong baselines such as Claude and OpenAI agents. In game environments, it attains a mean normalized score of 59.8 across a 15-game suite (roughly 60% of human-level performance) and remains competitive with frontier proprietary models (e.g., OpenAI o3) on LMGame-Bench. Additionally, the model can generalize to long-horizon information-seeking tasks and software engineering benchmarks, highlighting its robustness across diverse agent tasks. Detailed analyses of training dynamics further provide insights into achieving stability and efficiency in large-scale agent RL. These results underscore UI-TARS-2's potential to advance the state of GUI agents and its strong generalization to real-world interactive scenarios.



Scaling Agents via Continual Pre-training

https://huggingface.co/papers/2509.13310

https://arxiv.org/abs/2509.13310

https://tongyi-agent.github.io/blog/

https://github.com/Alibaba-NLP/DeepResearch

Alibaba


Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundation models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
