DINOv3
Meta
https://ai.meta.com/blog/dinov3-self-supervised-vision-model/
Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images—using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
自监督学习有望消除对人工数据标注的需求,使模型能够轻松扩展到海量数据集和更大架构。由于不针对特定任务或领域,这种训练范式有潜力使用单一算法从自然图像到航拍图像等各种来源学习视觉表示。本技术报告介绍了 DINOv3,这是通过利用简单而有效的策略实现这一愿景的重要里程碑。首先,我们通过精心的数据准备、设计和优化,利用了扩展数据集和模型规模的好处。其次,我们引入了一种名为 Gram anchoring 的新方法,该方法有效解决了密集特征图在长训练过程中退化这一已知但尚未解决的问题。最后,我们应用了事后策略,进一步增强了模型在分辨率、模型规模和与文本对齐方面的灵活性。因此,我们提出了一个通用的视觉基础模型,它在广泛设置下无需微调即可超越专门的最先进模型。DINOv3 生成的高质量密集特征在各种视觉任务上取得了卓越的性能,显著超越了之前的自监督和弱监督基础模型。我们还分享了 DINOv3 视觉模型套件,旨在通过为不同的资源约束和部署场景提供可扩展的解决方案,推动广泛任务和数据的先进水平。
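The abstract names Gram anchoring as the fix for dense feature maps degrading over long training runs: the student's patch-to-patch similarity (Gram) matrix is pulled toward that of an earlier checkpoint, a "Gram teacher". A minimal pure-Python sketch of that objective; the exact loss form and function names here are illustrative assumptions, not the paper's implementation:

```python
import math

def gram(features):
    """Pairwise cosine-similarity (Gram) matrix of patch features.

    features: list of patch-embedding vectors (lists of floats).
    """
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    unit = [[x / n for x in f] for f, n in zip(features, norms)]
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]

def gram_anchoring_loss(student, gram_teacher):
    """Mean squared difference between the student's Gram matrix and that of
    an earlier 'Gram teacher' checkpoint: the similarity *structure* is
    anchored, leaving individual features free to keep improving."""
    gs, gt = gram(student), gram(gram_teacher)
    n = len(gs)
    return sum((gs[i][j] - gt[i][j]) ** 2
               for i in range(n) for j in range(n)) / (n * n)
```

The key design point is that the loss constrains pairwise similarities rather than the raw features, so global representation quality can continue to improve while the local consistency needed for dense tasks is preserved.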
Qwen-Image Technical Report
Qwen
Qwen-Image 技术报告
We present Qwen-Image, an image generation foundation model in the Qwen series that achieves significant advances in complex text rendering and precise image editing. To address the challenges of complex text rendering, we design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Moreover, we adopt a progressive training strategy that starts with non-text-to-text rendering, evolves from simple to complex textual inputs, and gradually scales up to paragraph-level descriptions. This curriculum learning approach substantially enhances the model's native text rendering capabilities. As a result, Qwen-Image not only performs exceptionally well in alphabetic languages such as English, but also achieves remarkable progress on more challenging logographic languages like Chinese. To enhance image editing consistency, we introduce an improved multi-task training paradigm that incorporates not only traditional text-to-image (T2I) and text-image-to-image (TI2I) tasks but also image-to-image (I2I) reconstruction, effectively aligning the latent representations between Qwen2.5-VL and MMDiT. Furthermore, we separately feed the original image into Qwen2.5-VL and the VAE encoder to obtain semantic and reconstructive representations, respectively. This dual-encoding mechanism enables the editing module to strike a balance between preserving semantic consistency and maintaining visual fidelity. We present a comprehensive evaluation of Qwen-Image across multiple public benchmarks, including GenEval, DPG, and OneIG-Bench for general image generation, as well as GEdit, ImgEdit, and GSO for image editing. Qwen-Image achieves state-of-the-art performance, demonstrating its strong capabilities in both image generation and editing.
Furthermore, results on LongText-Bench, ChineseWord, and CVTG-2K show that it excels in text rendering—particularly in Chinese text generation—outperforming existing state-of-the-art models by a significant margin. This highlights Qwen-Image's unique position as a leading image generation model that combines broad general capability with exceptional text rendering precision.
我们推出了 Qwen-Image,这是 Qwen 系列中的一个图像生成基础模型,在复杂文本渲染和精确图像编辑方面取得了显著进展。为了应对复杂文本渲染的挑战,我们设计了一个全面的数据流程,包括大规模数据收集、过滤、标注、合成和平衡。此外,我们采用了一种渐进式训练策略,从非文本到文本渲染开始,从简单到复杂的文本输入逐步演进,并逐渐扩展到段落级描述。这种课程学习方法显著增强了模型的原生文本渲染能力。因此,Qwen-Image 不仅在英语等字母语言中表现异常出色,而且在更具挑战性的中文等表意语言上也取得了显著进步。为了增强图像编辑的一致性,我们引入了一种改进的多任务训练范式,不仅包括传统的文本到图像和文本-图像到图像任务,还包括图像到图像重建,有效地对齐了 Qwen2.5-VL 和 MMDiT 之间的潜在表示。此外,我们将原始图像分别输入到 Qwen2.5-VL 和 VAE 编码器中,分别获得语义表示和重建表示。这种双编码机制使编辑模块能够在保持语义一致性和维持视觉保真度之间取得平衡。我们在多个公共基准上对 Qwen-Image 进行了全面评估,包括用于通用图像生成的 GenEval、DPG 和 OneIG-Bench,以及用于图像编辑的 GEdit、ImgEdit 和 GSO。Qwen-Image 取得了最先进的性能,展示了其在图像生成和编辑方面的强大能力。此外,在 LongText-Bench、ChineseWord 和 CVTG-2K 上的结果表明,它在文本渲染方面表现出色,尤其是在中文文本生成方面,显著优于现有的最先进模型。这凸显了 Qwen-Image 作为领先图像生成模型的独特地位,它结合了广泛的通用能力和卓越的文本渲染精度。
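The dual-encoding mechanism above — one semantic stream from Qwen2.5-VL, one reconstructive stream from the VAE encoder — can be sketched abstractly. The encoder callables below are placeholders for illustration, not the actual model APIs:

```python
def dual_encode(image, semantic_encoder, vae_encoder):
    """Feed the same source image through both encoders: the semantic stream
    captures what the image means (guiding edit intent), while the
    reconstructive stream preserves pixel-level detail."""
    return {
        "semantic": semantic_encoder(image),
        "reconstructive": vae_encoder(image),
    }

def editing_condition(image, instruction, semantic_encoder, vae_encoder, fuse):
    """Hypothetical conditioning for the editing module (MMDiT in the paper):
    combine the edit instruction with both representations of the original
    image, so edits can change semantics without losing visual fidelity."""
    streams = dual_encode(image, semantic_encoder, vae_encoder)
    return fuse(instruction, streams["semantic"], streams["reconstructive"])
```

The design trade-off this models: conditioning only on the semantic stream would drift visually, while conditioning only on the VAE latent would resist semantic change; feeding both lets the editor balance the two.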
Intern-S1: A Scientific Multimodal Foundation Model
Intern-S1:一个科学多模态基础模型
In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in widely followed fields, with performance quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly behind that in popular areas, remaining far from sufficient for transforming scientific research and leaving a substantial gap between open-source and closed-source models in these scientific domains. To mitigate this gap and take a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist that combines general understanding and reasoning capabilities with the expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and predicting the thermodynamic stability of crystals.
近年来,涌现了大量开源基础模型,在一些广泛关注的领域取得了显著进展,性能已非常接近闭源模型。然而,在高价值但更具挑战性的科学专业领域,要么该领域仍依赖专家模型,要么通用基础模型的进展相较于热门领域明显滞后,远不足以变革科学研究,导致开源模型与闭源模型在这些科学领域之间存在巨大差距。为缩小这一差距并向通用人工智能迈进一步,我们推出了 Intern-S1,一个具备通用理解和推理能力、并专精于分析多种科学模态数据的专业化通才模型。Intern-S1 是一个多模态混合专家模型,具有 280 亿激活参数和 2410 亿总参数,在 5T 词元上进行了持续预训练,其中包括超过 2.5T 来自科学领域的词元。在后训练阶段,Intern-S1 在 InternBootCamp 中依次经历了离线强化学习和在线强化学习,我们在此提出了奖励混合来协同处理超过 1000 个任务的同时强化学习训练。通过算法、数据和训练系统的集成创新,Intern-S1 在在线强化学习训练中达到了顶级性能。在全面的评估基准上,Intern-S1 在通用推理任务中展现出与开源模型相当的性能,并在科学领域显著优于开源模型,在分子合成规划、反应条件预测、晶体热力学稳定性预测等专业任务上超越了最先进的闭源模型。
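Mixture-of-Rewards, as described, synergizes RL training across more than 1000 tasks at once. The abstract does not give the mechanism, but a plausible minimal form is a dispatcher that routes each rollout to the reward function registered for its task; everything in this sketch is an assumption for illustration:

```python
class MixtureOfRewards:
    """Illustrative multi-task reward router: each task registers its own
    verifier/reward function, and rollouts are scored by task id, so one RL
    loop can consume rewards from heterogeneous tasks."""

    def __init__(self):
        self.reward_fns = {}

    def register(self, task_id, fn):
        # fn(response, reference) -> float in [0, 1]
        self.reward_fns[task_id] = fn

    def score(self, task_id, response, reference):
        if task_id not in self.reward_fns:
            raise KeyError(f"no reward function registered for task {task_id!r}")
        return self.reward_fns[task_id](response, reference)
```

The point of the abstraction is that exact-match verifiers, graded scorers, and learned reward models can all sit behind one interface, which is what simultaneous training on 1000+ tasks requires.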
Is Chain-of-Thought Reasoning of LLMs a Mirage? A Data Distribution Lens
大语言模型的思维链推理是海市蜃楼吗?一个数据分布透镜视角
Chain-of-Thought (CoT) prompting has been shown to be effective in eliciting structured reasoning (i.e., CoT reasoning) from large language models (LLMs). Despite its popularity, recent studies expose its failures in some reasoning tasks, raising fundamental questions about the nature of CoT reasoning. In this work, we propose a data distribution lens to understand when and why CoT reasoning succeeds or fails. We hypothesize that CoT reasoning reflects a structured inductive bias learned from in-distribution data, enabling models to conditionally generate reasoning trajectories that approximate those observed during training. As such, the effectiveness of CoT reasoning is fundamentally governed by the nature and degree of distribution discrepancy between training data and test queries. Guided by this lens, we dissect CoT reasoning via three dimensions: task, length, and format. To test the hypothesis, we introduce DataAlchemy, an abstract and fully controllable environment that trains LLMs from scratch and systematically probes them under various distribution conditions. Through rigorous controlled experiments, we reveal that CoT reasoning is a brittle mirage when it is pushed beyond training distributions, emphasizing the ongoing challenge of achieving genuine and generalizable reasoning.
思维链提示已被证明能有效激发大语言模型的结构化推理。尽管它广受欢迎,但近期研究揭示了它在某些推理任务上的失败,引发了对思维链推理本质的根本性质疑。在本工作中,我们提出一个数据分布透镜来理解思维链推理何时以及为何成功或失败。我们假设思维链推理反映了从分布内数据中学到的结构化归纳偏置,使模型能够有条件地生成近似于训练中观察到的推理轨迹。因此,思维链推理的有效性根本上受制于训练数据与测试查询之间分布差异的性质和程度。在此透镜指导下,我们从任务、长度和格式三个维度剖析了思维链推理。为了验证这一假设,我们引入了 DataAlchemy,一个抽象且完全可控的环境,用于从头训练大语言模型并在各种分布条件下系统地探测它们。通过严格的受控实验,我们发现当思维链推理被推向训练分布之外时,它是一个脆弱的幻象,这强调了实现真正且可泛化推理的持续挑战。
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Tsinghua University
GLM-4.5:具备智能体能力、推理与编码的基础模型
We introduce GLM-4.5, a new generation of foundation models designed to excel in agentic capabilities, reasoning, and coding. Building upon the success of the GLM series, GLM-4.5 integrates advanced techniques in architecture, data curation, and training to achieve state-of-the-art performance across a wide range of tasks. The model demonstrates strong proficiency in complex reasoning, code generation, and interactive agent scenarios, making it a versatile tool for both research and real-world applications. GLM-4.5 is trained on a massive and diverse corpus, leveraging a mixture-of-experts architecture to balance efficiency and capacity. We evaluate GLM-4.5 on numerous benchmarks, showing significant improvements over previous versions and competitive results against leading models in the field. The release of GLM-4.5 aims to foster innovation in AI and provide the community with a powerful foundation for building agentic systems and beyond.
我们推出了GLM-4.5,新一代旨在智能体能力、推理和编码方面表现卓越的基础模型。基于GLM系列的成功,GLM-4.5整合了架构、数据整理和训练方面的高级技术,在广泛任务上实现了最先进的性能。该模型在复杂推理、代码生成和交互式智能体场景中展现出强大的能力,使其成为研究和实际应用的通用工具。GLM-4.5在海量多样化语料库上训练,利用混合专家架构来平衡效率与容量。我们在众多基准上评估了GLM-4.5,展示了相较于前代版本的显著改进,并与该领域领先模型取得了竞争性结果。GLM-4.5的发布旨在推动AI领域的创新,并为社区构建智能体系统等应用提供强大的基础。
AgentFly: Fine-tuning LLM Agents without Fine-tuning LLMs
AgentFly:无需微调大语言模型的智能体微调
In this paper, we introduce a novel learning paradigm for Adaptive Large Language Model (LLM) agents that eliminates the need for fine-tuning the underlying LLMs. Existing approaches are often either rigid, relying on static, handcrafted reflection workflows, or computationally intensive, requiring gradient updates of LLM model parameters. In contrast, our method enables low-cost continual adaptation via memory-based online reinforcement learning. We formalise this as a Memory-augmented Markov Decision Process (M-MDP), equipped with a neural case-selection policy to guide action decisions. Past experiences are stored in an episodic memory, either differentiable or non-parametric. The policy is continually updated based on environmental feedback through a memory rewriting mechanism, whereas policy improvement is achieved through efficient memory reading (retrieval). We instantiate our agent model in the deep research setting, namely Memento, which attains top-1 on GAIA validation (87.88% Pass@3) and 79.40% on the test set. It reaches 66.6% F1 and 80.4% PM on the DeepResearcher dataset, outperforming the state-of-the-art training-based method, while case-based memory adds 4.7% to 9.6% absolute points on out-of-distribution tasks. Our approach offers a scalable and efficient pathway for developing generalist LLM agents capable of continuous, real-time learning without gradient updates, advancing machine learning towards open-ended skill acquisition and deep research scenarios.
在本文中,我们介绍了一种用于自适应大语言模型智能体的新型学习范式,该范式消除了对底层大语言模型进行微调的需求。现有方法要么是僵化的,依赖于静态的、手工制作的反思工作流;要么是计算密集的,需要对大语言模型参数进行梯度更新。相比之下,我们的方法通过基于记忆的在线强化学习实现了低成本的持续适应。我们将其形式化为一个记忆增强的马尔可夫决策过程,配备了一个神经案例选择策略来指导行动决策。过去的经验存储在一个情景记忆(可微分或非参数)中。策略通过记忆重写机制根据环境反馈持续更新,而策略改进则通过高效的记忆读取来实现。我们在深度研究环境中实例化了我们的智能体模型,即 Memento,它在 GAIA 验证集上获得了第一名,并在测试集上获得了 79.40% 的成绩。它在 DeepResearcher 数据集上达到了 66.6% 的 F1 分数和 80.4% 的 PM 分数,优于最先进的基于训练的方法,而基于案例的记忆在分布外任务上增加了 4.7% 到 9.6% 的绝对点数。我们的方法为开发能够进行持续、实时学习而无需梯度更新的通才大语言模型智能体提供了一条可扩展且高效的路径,推动机器学习向开放式技能获取和深度研究场景发展。
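The memory-based policy described above — write past cases into episodic memory, improve by retrieval rather than by gradient updates — can be sketched with a non-parametric nearest-case reader. Cosine similarity as the retrieval metric and the field layout are assumptions for illustration, not Memento's actual design:

```python
import math

class EpisodicMemory:
    """Non-parametric episodic memory: store (embedding, case) pairs and
    retrieve the k most similar past cases for a new query. 'Learning' is
    just writing new cases; no model parameters are updated."""

    def __init__(self):
        self.entries = []  # list of (embedding, case) pairs

    def write(self, embedding, case):
        self.entries.append((embedding, case))

    def read(self, query, k=1):
        def cos(a, b):
            num = sum(x * y for x, y in zip(a, b))
            den = (math.sqrt(sum(x * x for x in a))
                   * math.sqrt(sum(y * y for y in b)))
            return num / den
        ranked = sorted(self.entries, key=lambda e: cos(query, e[0]),
                        reverse=True)
        return [case for _, case in ranked[:k]]
```

This is why the approach is low-cost: adapting to a new environment is a memory write plus a retrieval at decision time, with the underlying LLM left frozen.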
VibeVoice Technical Report
Microsoft
VibeVoice 技术报告
This report presents VIBEVOICE, a novel model designed to synthesize long-form speech with multiple speakers by employing next-token diffusion [SBW+24], a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion. To enable this, we introduce a novel continuous speech tokenizer that, compared to the popular Encodec model, improves data compression by 80 times while maintaining comparable performance. The tokenizer effectively preserves audio fidelity while significantly boosting computational efficiency for processing long sequences. Thus, VIBEVOICE can synthesize long-form speech of up to 90 minutes (in a 64K context window) with a maximum of 4 speakers, capturing the authentic conversational "vibe" and surpassing open-source and proprietary dialogue models.
本报告介绍了 VIBEVOICE,一种新颖的模型,旨在通过采用下一词元扩散来合成具有多个说话者的长篇语音。这是一种统一的方法,通过扩散自回归地生成潜在向量来建模连续数据。为实现这一点,我们引入了一种新颖的连续语音分词器,与流行的 Encodec 模型相比,它在保持相当性能的同时,将数据压缩提升了 80 倍。该分词器有效地保留了音频保真度,同时显著提高了处理长序列的计算效率。因此,VIBEVOICE 可以合成长达 90 分钟的长篇语音,并支持最多 4 个说话者,捕捉到真实的对话“氛围”,超越了开源和专有对话模型。
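The 90-minute / 64K-context claim implies a very low token rate per second of audio, which is where the tokenizer's compression pays off. A quick back-of-envelope check, assuming the full window is spent on speech tokens:

```python
# Figures stated in the abstract
context_tokens = 64_000   # context window length
minutes = 90              # maximum synthesis length

# Implied budget: tokens available per second of synthesized audio
tokens_per_second = context_tokens / (minutes * 60)

# Roughly 11.9 tokens/s of audio, orders of magnitude below the frame
# rates of typical acoustic tokenizers, consistent with the claimed
# 80x compression over Encodec.
```

This is only an upper bound on the effective rate, since some of the window is consumed by text and speaker conditioning rather than audio tokens.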
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/
Alibaba
WebWatcher:开创视觉-语言深度研究智能体新前沿
Web agents such as Deep Research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains primarily text-centric, overlooking visual information in the real world. This makes multimodal Deep Research highly challenging, as such agents require much stronger reasoning abilities in perception, logic, and knowledge, as well as the use of more sophisticated tools, compared to text-based agents. To address this limitation, we introduce WebWatcher, a multimodal agent for Deep Research equipped with enhanced visual-language reasoning capabilities. It leverages high-quality synthetic multimodal trajectories for efficient cold-start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. To better evaluate the capabilities of multimodal agents, we propose BrowseComp-VL, a BrowseComp-style benchmark that requires complex information retrieval involving both visual and textual information. Experimental results show that WebWatcher significantly outperforms proprietary baselines, RAG workflows, and open-source agents on four challenging VQA benchmarks, paving the way for solving complex multimodal information-seeking tasks.
像 Deep Research 这样的网络智能体已经展现出超人类的认知能力,能够解决极具挑战性的信息搜寻问题。然而,大多数研究仍以文本为中心,忽略了现实世界中的视觉信息。这使得多模态深度研究极具挑战性,因为与基于文本的智能体相比,此类智能体需要在感知、逻辑、知识方面具备更强的推理能力,并需要使用更复杂的工具。为了解决这一局限性,我们引入了 WebWatcher,一个配备增强型视觉-语言推理能力的多模态深度研究智能体。它利用高质量的合成多模态轨迹进行高效的冷启动训练,使用各种工具进行深度推理,并通过强化学习进一步增强泛化能力。为了更好地评估多模态智能体的能力,我们提出了 BrowseComp-VL,一个 BrowseComp 风格的基准,需要涉及视觉和文本信息的复杂信息检索。实验结果表明,WebWatcher 在四个具有挑战性的 VQA 基准上显著优于专有基线、RAG 工作流和开源智能体,这为解决复杂的多模态信息搜寻任务铺平了道路。
Agent Lightning: Train ANY AI Agents with Reinforcement Learning
https://www.microsoft.com/en-us/research/project/agent-lightning/
Agent Lightning:使用强化学习训练任何 AI 智能体
We present Agent Lightning, a flexible and extensible framework that enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent. Unlike existing methods that tightly couple RL training with the agent or rely on sequence concatenation with masking, Agent Lightning achieves complete decoupling between agent execution and training, allowing seamless integration with existing agents developed in diverse ways (e.g., using frameworks like LangChain, OpenAI Agents SDK, or AutoGen, or built from scratch) with almost ZERO code modifications. By formulating agent execution as a Markov decision process, we define a unified data interface and propose a hierarchical RL algorithm, LightningRL, which contains a credit assignment module, allowing us to decompose trajectories generated by ANY agents into training transitions. This enables RL to handle complex interaction logic, such as multi-agent scenarios and dynamic workflows. For the system design, we introduce a Training-Agent Disaggregation architecture and bring agent observability frameworks into the agent runtime, providing a standardized agent finetuning interface. Experiments across text-to-SQL, retrieval-augmented generation, and math tool-use tasks demonstrate stable, continuous improvements, showcasing the framework's potential for real-world agent training and deployment.
我们提出了 Agent Lightning,一个灵活且可扩展的框架,能够为任何 AI 智能体实现基于强化学习的大语言模型训练。与现有将强化学习训练与智能体紧密耦合或依赖于带掩码的序列拼接的方法不同,Agent Lightning 实现了智能体执行与训练的完全解耦,允许通过多种方式开发的现有智能体几乎无需修改代码即可无缝集成。通过将智能体执行形式化为马尔可夫决策过程,我们定义了一个统一的数据接口,并提出了一种分层强化学习算法 LightningRL,该算法包含一个信用分配模块,使我们能够将任何智能体生成的轨迹分解为训练转移。这使得强化学习能够处理复杂的交互逻辑,例如多智能体场景和动态工作流。在系统设计方面,我们引入了一种训练-智能体分离架构,并将智能体可观测性框架引入智能体运行时,提供了一个标准化的智能体微调接口。在文本到 SQL、检索增强生成和数学工具使用任务上的实验证明了稳定、持续的改进,展示了该框架在实际智能体训练和部署中的潜力。
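Formulating agent execution as a Markov decision process means each LLM call in a trajectory becomes one transition (state = the context at that call, action = the model's output), with the credit-assignment module distributing the episode's reward across calls. A minimal sketch of that decomposition; the uniform credit rule is an assumption for illustration, not LightningRL's actual algorithm:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: str     # context fed to the LLM at this call
    action: str    # model output at this call
    reward: float  # credit assigned to this step

def decompose(calls, episode_reward, credit_fn=None):
    """Turn one agent trajectory (a list of (state, action) LLM calls,
    possibly interleaved with tool use the trainer never sees) into
    per-call training transitions."""
    if credit_fn is None:
        # Assumed default: spread the episode reward uniformly over calls
        credit_fn = lambda i, n, r: r / n
    n = len(calls)
    return [Transition(s, a, credit_fn(i, n, episode_reward))
            for i, (s, a) in enumerate(calls)]
```

Because the trainer only needs (state, action, reward) triples, the agent's internal orchestration — multi-agent hand-offs, dynamic workflows, framework-specific plumbing — stays entirely outside the training loop, which is the decoupling the abstract emphasizes.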
Mobile-Agent-v3: Foundational Agents for GUI Automation
Alibaba
Mobile-Agent-v3:用于 GUI 自动化的基础智能体
This paper introduces GUI-Owl, a foundational GUI agent model that achieves new state-of-the-art performance among open-source end-to-end models across ten GUI benchmarks spanning both desktop and mobile environments, covering grounding, question answering, planning, decision-making, and general procedural knowledge in GUI automation scenarios. Notably, GUI-Owl-7B achieves a score of 66.4 on the AndroidWorld benchmark and 29.4 on the OSWorld-Verified benchmark. Building on this model, we propose a general-purpose GUI agent framework, Mobile-Agent-v3, which further enhances GUI-Owl's performance (73.3 on AndroidWorld and 37.7 on OSWorld-Verified), achieving a new state-of-the-art among GUI agent frameworks based on open-source models. GUI-Owl incorporates several key innovations: 1) Large-scale Environment Infrastructure: We introduce a cloud-based virtual environment infrastructure spanning different operating systems (including Android, Ubuntu, macOS, and Windows). This underpins our Self-Evolving GUI Trajectory Production framework, which generates high-quality interaction data through sophisticated query generation and correctness judgment. The framework leverages GUI-Owl's capabilities to continuously refine trajectories, creating a self-reinforcing improvement cycle. It supports multiple downstream data pipelines, enabling robust data collection while reducing manual annotation needs. 2) Diverse Foundational Agents Capability Construction: by incorporating foundational UI data—such as grounding, planning, and action semantic recognition—alongside diverse reasoning and reflecting patterns, GUI-Owl not only supports end-to-end decision making but can also serve as a specialized module integrated into multi-agent frameworks; 3) Scalable Environment RL: we also develop a scalable reinforcement learning framework that enables fully asynchronous training and better aligns the model's decision with real-world usage. 
In addition, we introduce Trajectory-aware Relative Policy Optimization (TRPO) for online environment RL, which achieves 34.9 on the OSWorld-Verified benchmark.
本文介绍了 GUI-Owl,一个基础性的 GUI 智能体模型,它在涵盖桌面和移动环境的十个 GUI 基准测试中,在开源端到端模型中取得了新的最先进性能,这些基准涉及 GUI 自动化场景中的基础、问答、规划、决策和通用程序性知识。值得注意的是,GUI-Owl-7B 在 AndroidWorld 基准上取得了 66.4 分,在 OSWorld-Verified 基准上取得了 29.4 分。在此模型基础上,我们提出了一个通用 GUI 智能体框架 Mobile-Agent-v3,它进一步增强了 GUI-Owl 的性能,实现了基于开源模型的 GUI 智能体框架中的新最先进水平。GUI-Owl 融合了几项关键创新:1) 大规模环境基础设施:我们引入了一个跨越不同操作系统的基于云的虚拟环境基础设施。这支撑了我们的自演进 GUI 轨迹生成框架,该框架通过复杂的查询生成和正确性判断来生成高质量的交互数据。该框架利用 GUI-Owl 的能力持续优化轨迹,形成一个自我强化的改进循环。它支持多个下游数据流程,能够在减少人工标注需求的同时实现稳健的数据收集。2) 多样化基础智能体能力构建:通过整合基础的 UI 数据以及多样化的推理和反思模式,GUI-Owl 不仅支持端到端决策,还可以作为专用模块集成到多智能体框架中。3) 可扩展的环境强化学习:我们还开发了一个可扩展的强化学习框架,支持完全异步的训练,并使模型的决策更符合实际使用情况。此外,我们引入了用于在线环境强化学习的轨迹感知相对策略优化,它在 OSWorld-Verified 基准上达到了 34.9 分。
AgentScope 1.0: A Developer-Centric Framework for Building Agentic Applications
Alibaba
AgentScope 1.0:一个以开发者为中心的智能体应用构建框架
Driven by rapid advancements of Large Language Models (LLMs), agents are empowered to combine intrinsic knowledge with dynamic tool use, greatly enhancing their capacity to address real-world tasks. In line with such an evolution, AgentScope introduces major improvements in a new version (1.0), towards comprehensively supporting flexible and efficient tool-based agent-environment interactions for building agent applications. Specifically, we abstract foundational components essential for agentic applications and provide unified interfaces and extensible modules, enabling developers to easily leverage the latest progress, such as new models and MCPs. Furthermore, we ground agent behaviors in the ReAct paradigm and offer advanced agent-level infrastructure based on a systematic asynchronous design, which enriches both human-agent and agent-agent interaction patterns while improving execution efficiency. Building on this foundation, we integrate several built-in agents tailored to specific practical scenarios. AgentScope also includes robust engineering support for developer-friendly experiences. We provide a scalable evaluation module with a visual studio interface, making the development of long-trajectory agentic applications more manageable and easier to trace. In addition, AgentScope offers a runtime sandbox to ensure safe agent execution and facilitates rapid deployment in production environments. With these enhancements, AgentScope provides a practical foundation for building scalable, adaptive, and effective agentic applications.
受大语言模型快速进步的驱动,智能体得以将内在知识与动态工具使用相结合,极大地增强了它们处理现实世界任务的能力。顺应这一演进趋势,AgentScope 在新版本中引入了重大改进,旨在全面支持构建智能体应用时灵活且高效的基于工具的环境交互。具体而言,我们对智能体应用所需的基础组件进行了抽象,并提供了统一的接口和可扩展的模块,使开发者能够轻松利用最新的进展。此外,我们将智能体行为根植于 ReAct 范式,并基于系统的异步设计提供了先进的智能体级基础设施,这丰富了人-智能体和智能体-智能体的交互模式,同时提高了执行效率。在此基础之上,我们集成了多个针对特定实际场景量身定制的内置智能体。AgentScope 还包含强大的工程支持,以提供开发者友好的体验。我们提供了一个带有可视化界面的可扩展评估模块,使得长轨迹智能体应用的开发更易于管理和追踪。此外,AgentScope 提供了一个运行时沙箱,以确保智能体的安全执行,并促进在生产环境中的快速部署。借助这些增强功能,AgentScope 为构建可扩展、自适应且高效的智能体应用提供了一个实用基础。
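AgentScope grounds agent behavior in the ReAct paradigm, whose core loop is simple to sketch: reason, act through a tool, observe, and repeat until the model emits a final answer. The `llm` and tool interfaces below are placeholders for illustration, not AgentScope's actual API:

```python
def react_loop(llm, tools, query, max_steps=8):
    """Minimal ReAct sketch: at each step the model either names a tool to
    call (its observation is appended to the context) or returns a final
    answer. `llm(context)` is assumed to return ("final", answer) or
    ("tool", name, args)."""
    context = [("user", query)]
    for _ in range(max_steps):
        step = llm(context)
        if step[0] == "final":
            return step[1]
        _, name, args = step
        observation = tools[name](**args)
        context.append(("observation", observation))
    raise RuntimeError("agent did not finish within max_steps")
```

The framework-level work described in the abstract sits around this loop: unified tool/model interfaces feed the `tools` side, while the asynchronous infrastructure lets many such loops (and human interventions) run concurrently.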