Qwen3 Technical Report
Alibaba
In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3 comprises a series of large language models (LLMs) designed to advance performance, efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense and Mixture-of-Experts (MoE) architectures, ranging from 0.6 billion to 235 billion parameters. A key innovation in Qwen3 is the integration of thinking mode (for complex, multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a unified framework. This eliminates the need to switch between different models—such as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-32B)—and enables dynamic mode switching based on user queries or chat templates. Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate computational resources adaptively during inference, thereby balancing latency and performance based on task complexity. Moreover, by leveraging the knowledge from the flagship models, we significantly reduce the computational resources required to build smaller-scale models, while ensuring their highly competitive performance. Empirical evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse benchmarks, including code generation, mathematical reasoning, and agentic tasks, remaining competitive with larger MoE models and proprietary models. Compared to its predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages and dialects, enhancing global accessibility through improved cross-lingual understanding and generation capabilities. To facilitate reproducibility and community-driven research and development, all Qwen3 models are publicly accessible under the Apache 2.0 license.
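The thinking budget mechanism described above can be sketched as a simple generation loop that caps the reasoning phase. This is an illustrative sketch, not Qwen3's actual inference API: the delimiter tokens follow the Qwen3 convention, but `generate_next_token` and the budget logic are hypothetical stand-ins.

```python
# Sketch: cap the "thinking" phase at a user-set token budget, then force
# the transition to the answer phase. Token generation is mocked.

THINK_END = "</think>"  # Qwen3-style delimiter closing the reasoning span

def generate_with_budget(prompt, thinking_budget, generate_next_token):
    """Generate reasoning tokens, forcing the model out of thinking mode
    once `thinking_budget` tokens have been spent."""
    output = ["<think>"]
    spent = 0
    while spent < thinking_budget:
        tok = generate_next_token(prompt, output)
        if tok == THINK_END:   # model finished reasoning early on its own
            break
        output.append(tok)
        spent += 1
    output.append(THINK_END)   # close the reasoning span; answer follows
    return output, spent

# Mock generator: emits reasoning tokens forever, never stopping on its own,
# so the budget is what bounds latency.
mock = lambda prompt, out: "step"
tokens, used = generate_with_budget("2+2?", thinking_budget=5,
                                    generate_next_token=mock)
print(used)        # → 5: reasoning tokens actually consumed
print(tokens[-1])  # → </think>: span is closed regardless of the model
```

A larger budget trades latency for performance on harder tasks; a budget of zero degenerates to non-thinking mode, which is one way to view the two modes as ends of a single dial.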
Mutarjim: Advancing Bidirectional Arabic-English Translation with a Small Language Model
We introduce Mutarjim, a compact yet powerful language model for bidirectional Arabic-English translation. While large-scale LLMs have shown impressive progress in natural language processing tasks, including machine translation, smaller models still offer distinct value. Leveraging this insight, we developed Mutarjim based on Kuwait-1.5B (Hennara et al., 2025), a language model tailored for both Arabic and English. Despite its modest size, Mutarjim outperforms much larger models on several established benchmarks, a result achieved through an optimized two-phase training approach and a carefully curated, high-quality training corpus. Experimental results show that Mutarjim rivals models up to 20 times larger while significantly reducing computational costs and training requirements. We also introduce Tarjama-25, a new benchmark designed to overcome limitations in existing Arabic-English benchmarking datasets, such as domain narrowness, short sentence lengths, and English-source bias. Tarjama-25 comprises 5,000 expert-reviewed sentence pairs and spans a wide range of domains, offering a more comprehensive and balanced evaluation framework. Notably, Mutarjim achieves state-of-the-art performance on the English-to-Arabic task in Tarjama-25, surpassing even significantly larger and proprietary models like GPT-4o mini. We publicly release Tarjama-25 to support future research and advance the evaluation of Arabic-English translation systems.
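Benchmark evaluation of the kind Tarjama-25 enables can be illustrated with a minimal, self-contained harness. The character-bigram F1 below is a toy stand-in for standard metrics (published results use established scorers such as BLEU or chrF); the benchmark pairs are placeholders, not Tarjama-25 data.

```python
# Toy corpus-level translation scoring: a chrF-like character-bigram F1,
# illustrative only, averaged over a benchmark of (source, reference) pairs.

from collections import Counter

def char_bigrams(text):
    s = text.replace(" ", "")
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def bigram_f1(hypothesis, reference):
    h, r = char_bigrams(hypothesis), char_bigrams(reference)
    overlap = sum((h & r).values())  # clipped bigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)

# Placeholder benchmark; real sets like Tarjama-25 hold thousands of
# expert-reviewed pairs spanning many domains in both directions.
pairs = [("مرحبا بالعالم", "hello world")]
system_output = ["hello world"]

score = sum(bigram_f1(hyp, ref)
            for hyp, (_, ref) in zip(system_output, pairs)) / len(pairs)
print(round(score, 3))  # → 1.0 for an exact match
```

Character-level metrics are a common choice for morphologically rich languages such as Arabic, where word-level n-gram overlap penalizes valid inflectional variants.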
Absolute Zero: Reinforced Self-play Reasoning with Zero Data
Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from rule-based outcome rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate self-proposed code reasoning tasks and verify answers, serving as a unified source of verifiable feedback to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing “zero” models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.
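The propose-and-solve loop at the heart of Absolute Zero can be sketched in miniature. This is a hedged illustration, not AZR's implementation: in AZR the proposer and solver are the same LLM and proposals are additionally scored for learnability, whereas here both roles are stubs and only the executor-based reward is shown.

```python
# Sketch: a code executor turns self-proposed tasks into verifiable rewards,
# so no human labels are needed. Sandboxing is assumed; real systems
# isolate execution rather than calling exec() directly.

def execute(program, inp):
    """Code executor: run a proposed program `f` on an input."""
    env = {}
    exec(program, env)  # trusted here only because this is a toy example
    return env["f"](inp)

def verify(program, inp, predicted_output):
    """Verifiable reward: 1.0 iff the solver's answer matches execution."""
    return 1.0 if execute(program, inp) == predicted_output else 0.0

# Proposer role (stubbed): emits a task as (program, input). Ground truth
# comes from executing the program, not from a human annotator.
task_program = "def f(x):\n    return x * x + 1"
task_input = 3

# Solver role (stubbed): predicts the output by reasoning about the code.
solver_answer = 10

reward = verify(task_program, task_input, solver_answer)
print(reward)  # → 1.0: the executor confirms the deduction
```

The same executor serves both directions of the loop: it validates that a proposed task is well-formed and deterministic, and it grades the solver's answer, which is what makes the feedback "grounded" despite the absence of external data.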
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers
Academic poster generation is a crucial yet challenging task in scientific communication, requiring the compression of long-context interleaved documents into a single, visually coherent page. To address this challenge, we introduce the first benchmark and metric suite for poster generation, which pairs recent conference papers with author-designed posters and evaluates outputs on (i) Visual Quality—semantic alignment with human posters, (ii) Textual Coherence—language fluency, (iii) Holistic Assessment—six fine-grained aesthetic and informational criteria scored by a VLM-as-judge, and notably (iv) PaperQuiz—the poster's ability to convey core paper content as measured by VLMs answering generated quizzes. Building on this benchmark, we propose PosterAgent, a top-down, visual-in-the-loop multi-agent pipeline: the (a) Parser distills the paper into a structured asset library; the (b) Planner aligns text–visual pairs into a binary-tree layout that preserves reading order and spatial balance; and the (c) Painter–Commenter loop refines each panel by executing rendering code and using VLM feedback to eliminate overflow and ensure alignment. In our comprehensive evaluation, we find that GPT-4o outputs—though visually appealing at first glance—often exhibit noisy text and poor PaperQuiz scores, and we find that reader engagement is the primary aesthetic bottleneck, as human-designed posters rely largely on visual semantics to convey meaning. Our fully open-source variants (e.g., based on the Qwen-2.5 series) outperform existing 4o-driven multi-agent systems across nearly all metrics, while using 87% fewer tokens. The pipeline transforms a 22-page paper into a finalized yet editable ‘.pptx’ poster — all for just $0.005. These findings chart clear directions for the next generation of fully automated poster-generation models.
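The Painter–Commenter refinement loop can be sketched as render-then-critique iteration. This is an illustrative sketch only: PosterAgent's painter executes real rendering code and its commenter is a VLM inspecting the rendered image, whereas both are stubbed here with a crude height estimate, and all function names are hypothetical.

```python
# Sketch: refine one poster panel by rendering it, asking a critic whether
# the text overflows its box, and shrinking the font until it fits.

def render_panel(text, font_size, chars_per_line=40, line_height_factor=1.2):
    """Stub painter: estimate rendered height from text length and font size
    (a real painter executes layout/rendering code)."""
    lines = max(1, -(-len(text) // chars_per_line))  # ceiling division
    return lines * font_size * line_height_factor

def commenter(rendered_height, panel_height=100):
    """Stub commenter: flag overflow (a real one is a VLM judging the image)."""
    return "overflow" if rendered_height > panel_height else "ok"

def refine_panel(text, font_size=24, max_rounds=10):
    """Painter-Commenter loop: iterate until the critic stops objecting."""
    for _ in range(max_rounds):
        if commenter(render_panel(text, font_size)) == "ok":
            return font_size
        font_size -= 2  # a real loop can also trim or rephrase content
    return font_size

final_size = refine_panel("x" * 300, font_size=24)
print(final_size)  # → 10: largest even size at which 300 chars fit the panel
```

Closing the loop on rendered output rather than on the layout plan is what catches overflow and misalignment that a purely top-down planner cannot see.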