Qwen2 Technical Report
Alibaba
This report introduces the Qwen2 series, the latest addition to our large language models and large multimodal models. We release a comprehensive suite of foundational and instruction-tuned language models, encompassing a parameter range from 0.5 to 72 billion, featuring dense models and a Mixture-of-Experts model. Qwen2 surpasses most prior open-weight models, including its predecessor Qwen1.5, and exhibits competitive performance relative to proprietary models across diverse benchmarks on language understanding, generation, multilingual proficiency, coding, mathematics, and reasoning.
The flagship model, Qwen2-72B, showcases remarkable performance: 84.2 on MMLU, 37.9 on GPQA, 64.6 on HumanEval, 89.5 on GSM8K, and 82.4 on BBH as a base language model. The instruction-tuned variant, Qwen2-72B-Instruct, attains 9.1 on MT-Bench, 48.1 on Arena-Hard, and 35.7 on LiveCodeBench. Moreover, Qwen2 demonstrates robust multilingual capabilities, proficient in approximately 30 languages, spanning English, Chinese, Spanish, French, German, Arabic, Russian, Korean, Japanese, Thai, Vietnamese, and more, underscoring its versatility and global reach.
To foster community innovation and accessibility, we have made the Qwen2 model weights openly available on Hugging Face and ModelScope, and the supplementary materials including example code on GitHub. These platforms also include resources for quantization, fine-tuning, and deployment, facilitating a wide range of applications and research endeavors.
OpenDevin: An Open Platform for AI Software Developers as Generalist Agents
Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to improvements in large language models (LLMs), there has also been rapid development of AI agents that interact with and effect change in their surrounding environments. In this paper, we introduce OpenHands (f.k.a. OpenDevin), a platform for the development of powerful and flexible AI agents that interact with the world in ways similar to those of a human developer: by writing code, interacting with a command line, and browsing the web. We describe how the platform allows for the implementation of new agents, safe interaction with sandboxed environments for code execution, coordination between multiple agents, and incorporation of evaluation benchmarks. Based on our currently incorporated benchmarks, we perform an evaluation of agents over 15 challenging tasks, including software engineering (e.g., SWE-Bench) and web browsing (e.g., WebArena), among others. Released under the permissive MIT license, OpenHands is a community project spanning academia and industry with more than 2.1K contributions from over 188 contributors.
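The interaction style described above, an agent acting on a sandboxed environment through commands and observations, can be sketched as a minimal observe-act loop. The class and method names below are illustrative placeholders, not OpenHands' actual API:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    content: str

@dataclass
class SandboxEnv:
    """Toy stand-in for a sandboxed execution environment."""
    history: list = field(default_factory=list)

    def step(self, command: str) -> Observation:
        # A real sandbox would run the command in isolation;
        # here we just record it and echo it back.
        self.history.append(command)
        return Observation(content=f"ran: {command}")

class Agent:
    """Minimal agent policy: maps the last observation to the next command."""
    def act(self, obs: Observation) -> str:
        return "echo done" if "ran:" in obs.content else "ls"

def run_episode(agent: Agent, env: SandboxEnv, max_steps: int = 3) -> list:
    """Drive the observe-act loop for a fixed number of steps."""
    obs = Observation(content="")
    for _ in range(max_steps):
        command = agent.act(obs)
        obs = env.step(command)
    return env.history
```

A real platform layers on safety checks, multi-agent coordination, and browser actions, but the control flow reduces to this loop.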
NeedleBench: Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?
The capability of large language models to handle long-context information plays a crucial role across various real-world applications. Existing methods for evaluating long-context abilities often rely either on real-world long texts, making it difficult to exclude the influence of models' inherent knowledge, or introduce large amounts of irrelevant filler content to artificially reach target lengths, reducing the relevance and effectiveness of assessments. To address these limitations, we introduce NeedleBench, a comprehensive synthetic framework designed to assess retrieval and reasoning performance in bilingual long-context tasks with adaptive context lengths (e.g., 32k, 128k, and beyond). NeedleBench systematically embeds key data points at varying depths to rigorously test models' capabilities in diverse settings. Tasks within NeedleBench are categorized into two distinct scenarios: information-sparse, characterized by minimal relevant details embedded within extensive irrelevant text to simulate simpler real-world retrieval tasks; and information-dense, implemented as the Ancestral Trace Challenge, where relevant information is continuously distributed throughout the context to simulate more complex real-world reasoning tasks. Our experiments show that, while recent reasoning models such as DeepSeek-R1 and OpenAI's o3 have demonstrated strong performance on mathematical reasoning benchmarks, they still struggle to generalize their reasoning abilities and perform poorly on our information-dense tasks, frequently encountering difficulties with continuous retrieval and reasoning even at relatively shorter context lengths. Furthermore, we identify and characterize a phenomenon termed 'under-thinking', wherein models prematurely conclude their reasoning processes despite the availability of relevant information. NeedleBench thus provides critical insights and targeted evaluation tools essential for understanding and improving the long-context capabilities of LLMs.
All code and resources are publicly available at OpenCompass.
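The "embed key data points at varying depths" procedure can be sketched as a small helper that places each needle at a fractional depth of the filler text. `insert_needles` is a hypothetical illustration, not part of the released NeedleBench code:

```python
def insert_needles(filler: str, needles: list[str], depths: list[float]) -> str:
    """Embed each needle at a fractional depth (0.0 = start, 1.0 = end)
    of the filler text, as a synthetic long-context benchmark might.

    Insertion positions are computed on the original filler; inserting
    from deepest to shallowest keeps earlier offsets valid, because each
    shallower position lies before every already-inserted needle.
    """
    assert len(needles) == len(depths)
    text = filler
    for needle, depth in sorted(zip(needles, depths), key=lambda p: -p[1]):
        pos = int(len(filler) * depth)
        text = text[:pos] + needle + text[pos:]
    return text
```

Sweeping `depths` over a grid while growing `filler` to the target context length yields the depth-by-length evaluation matrix the framework describes.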
MindSearch: Mimicking Human Minds Elicits Deep AI Searcher
Information seeking and integration is a complex cognitive task that consumes enormous time and effort. Search engines reshape the way of seeking information but often fail to align with complex human intentions. Inspired by the remarkable progress of Large Language Models (LLMs), recent works attempt to solve the information-seeking and integration task by combining LLMs and search engines. However, these methods still obtain unsatisfying performance due to three challenges: (1) complex requests often cannot be accurately and completely retrieved by the search engine in a single query; (2) the corresponding information to be integrated is spread over multiple web pages along with massive noise; and (3) a large number of web pages with long contents may quickly exceed the maximum context length of LLMs. Inspired by the cognitive process when humans solve these problems, we introduce MindSearch (思·索) to mimic the human mind in web information seeking and integration, which can be instantiated by a simple yet effective LLM-based multi-agent framework consisting of WebPlanner and WebSearcher. The WebPlanner models the human mind of multi-step information seeking as a dynamic graph construction process: it decomposes the user query into atomic sub-questions as nodes in the graph and progressively extends the graph based on the search results from WebSearcher. Tasked with each sub-question, WebSearcher performs hierarchical information retrieval with search engines and collects valuable information for WebPlanner. The multi-agent design of MindSearch enables the whole framework to seek and integrate information in parallel from larger-scale sets of web pages (e.g., more than 300) in 3 minutes, equivalent to roughly 3 hours of human effort. Based on either GPT-4o or InternLM2.5-7B models, MindSearch demonstrates significant improvement in response quality in terms of depth and breadth, on both closed-set and open-set QA problems.
Besides, responses from MindSearch based on InternLM2.5-7B are preferred by human evaluators over those of the ChatGPT-Web (powered by GPT-4o) and Perplexity.ai applications, which implies that MindSearch with open-source models can already deliver a competitive alternative to proprietary AI search engines.
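The dynamic graph construction that WebPlanner performs can be sketched as a toy loop over sub-question nodes: decompose, search, expand, repeat. Here `stub_searcher`, `plan_and_search`, and the `decompose` callback are illustrative stand-ins, not MindSearch's real interfaces:

```python
def stub_searcher(question: str) -> str:
    """Hypothetical stand-in for WebSearcher: returns a canned answer
    instead of performing hierarchical web retrieval."""
    return f"answer({question})"

def plan_and_search(query: str, decompose) -> dict:
    """Toy WebPlanner: grow a graph of sub-questions, answering each
    node with the searcher and expanding the graph based on results.

    `decompose(node, result)` returns follow-up sub-questions; it is
    called with result=None for the initial user query.
    """
    graph = {}     # node -> list of child sub-questions
    answers = {}   # node -> searcher result
    frontier = decompose(query, None)
    graph[query] = list(frontier)
    while frontier:
        node = frontier.pop(0)
        answers[node] = stub_searcher(node)
        children = decompose(node, answers[node])
        graph[node] = children
        frontier.extend(children)
    return {"graph": graph, "answers": answers}
```

In the actual system the decomposition and expansion decisions are made by an LLM, and the sub-questions in the frontier can be dispatched to searcher agents in parallel, which is where the reported speedup comes from.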
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning codes released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology.
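The speech-to-speech translation application mentioned above is a composition of three stages: recognize, translate, synthesize. Every function below is a stub with a made-up signature, standing in only for the shape of the real SenseVoice, LLM, and CosyVoice components:

```python
def sense_voice_asr(audio: bytes) -> dict:
    """Stub for a SenseVoice-style recognizer: a transcript plus
    emotion/event labels. Placeholder, not the real interface."""
    return {"text": "hello world", "emotion": "neutral", "events": []}

def llm_translate(text: str, target_lang: str) -> str:
    """Stub for the LLM translation step."""
    return f"[{target_lang}] {text}"

def cosy_voice_tts(text: str, speaker: str = "default") -> bytes:
    """Stub for a CosyVoice-style synthesizer returning audio bytes."""
    return f"<audio:{speaker}:{text}>".encode()

def speech_to_speech(audio: bytes, target_lang: str) -> bytes:
    """ASR -> LLM translation -> TTS: the pipeline described above."""
    asr = sense_voice_asr(audio)
    translated = llm_translate(asr["text"], target_lang)
    return cosy_voice_tts(translated)
```

The emotion and event labels from the recognizer are what let downstream applications such as emotional voice chat condition the response style, not just its text.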
Very Large-Scale Multi-Agent Simulation in AgentScope
Alibaba
Recent advances in large language models (LLMs) have opened new avenues for applying multi-agent systems in very large-scale simulations. However, there remain several challenges when conducting multi-agent simulations with existing platforms, such as limited scalability and low efficiency, unsatisfactory agent diversity, and effort-intensive management processes. To address these challenges, we develop several new features and components for AgentScope, a user-friendly multi-agent platform, enhancing its convenience and flexibility for supporting very large-scale multi-agent simulations. Specifically, we propose an actor-based distributed mechanism as the underlying technological infrastructure to achieve high scalability and efficiency, and provide flexible environment support for simulating various real-world scenarios, which enables parallel execution of multiple agents, automatic workflow conversion for distributed deployment, and both inter-agent and agent-environment interactions. Moreover, we integrate an easy-to-use configurable tool and an automatic background generation pipeline in AgentScope, simplifying the process of creating agents with diverse yet detailed background settings. Last but not least, we provide a web-based interface for conveniently monitoring and managing a large number of agents that might be deployed across multiple devices. We conduct a comprehensive simulation to demonstrate the effectiveness of these proposed enhancements in AgentScope, and provide detailed observations and insightful discussions to highlight the great potential of applying multi-agent systems in large-scale simulations. The source code is released on GitHub to inspire further research and development in large-scale multi-agent simulations.
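The actor-based parallel execution idea can be sketched with a thread pool: each agent handles its own messages independently of the others, so a single environment event fans out to all agents concurrently. The function names below are illustrative, not AgentScope's API:

```python
from concurrent.futures import ThreadPoolExecutor

def agent_step(agent_id: int, message: str) -> str:
    """One 'actor' handling one message. Because actors share no
    mutable state, steps for different agents can run concurrently;
    in a real deployment each call could land on a different device."""
    return f"agent-{agent_id}: {message} handled"

def broadcast(message: str, n_agents: int) -> list:
    """Deliver one environment message to every agent in parallel
    and gather the replies in agent order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        futures = [pool.submit(agent_step, i, message) for i in range(n_agents)]
        return [f.result() for f in futures]
```

Replacing the thread pool with a distributed actor runtime is what lifts this pattern from a handful of agents to the very large-scale simulations the platform targets.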