InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3:探索开源多模态模型的先进训练与测试时方案
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multimodal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
我们推出了 InternVL3,这是 InternVL 系列的一次重大进步,其特点是采用了原生多模态预训练范式。与将纯文本大语言模型(LLM)适配成支持视觉输入的多模态大语言模型(MLLM)不同,InternVL3 在单一预训练阶段中,同时从多样化的多模态数据和纯文本语料库中获取多模态和语言能力。这种统一的训练范式有效解决了传统 MLLM 事后训练流程中常见的复杂性和对齐挑战。为了进一步提高性能和可扩展性,InternVL3 集成了可变视觉位置编码(V2PE)以支持扩展的多模态上下文,采用了监督微调(SFT)和混合偏好优化(MPO)等先进的训练后技术,并采用了测试时扩展策略以及优化的训练基础设施。广泛的实证评估表明,InternVL3 在广泛的多模态任务中表现卓越。特别是,InternVL3-78B 在 MMMU 基准上取得了 72.2 分,在开源 MLLM 中树立了新的最先进水平。其能力与包括 ChatGPT-4o、Claude 3.5 Sonnet 和 Gemini 2.5 Pro 在内的领先专有模型相比仍保持高度竞争力,同时保持了强大的纯语言能力。秉承开放科学原则,我们将公开发布训练数据和模型权重,以促进下一代 MLLM 的进一步研究和开发。
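The variable visual position encoding (V2PE) mentioned above can be illustrated with a minimal sketch: visual tokens advance the position index by a fractional step smaller than 1, so long multimodal sequences stay within the position range the LLM was trained on. The function name and the step value `delta` below are illustrative assumptions, not the InternVL3 implementation.

```python
# Sketch of variable visual position encoding (V2PE): text tokens advance
# the position by 1, while visual tokens advance it by a fractional step
# delta < 1, compressing long visual contexts into fewer positions.
def v2pe_positions(token_types, delta=0.25):
    """token_types: list of 'text' or 'visual'; returns per-token positions."""
    positions = []
    pos = 0.0
    for t in token_types:
        positions.append(pos)
        pos += 1.0 if t == "text" else delta
    return positions

tokens = ["text", "text", "visual", "visual", "visual", "visual", "text"]
print(v2pe_positions(tokens))  # → [0.0, 1.0, 2.0, 2.25, 2.5, 2.75, 3.0]
```

With `delta=0.25`, four visual tokens consume only one position's worth of range, which is the intuition behind supporting extended multimodal contexts.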
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
https://github.com/FoundationAgents/awesome-foundation-agents
基础智能体的进展与挑战:从脑启发智能到进化、协作与安全系统
The advent of large language models (LLMs) has catalyzed a transformative shift in artificial intelligence, paving the way for advanced intelligent agents capable of sophisticated reasoning, robust perception, and versatile action across diverse domains. As these agents increasingly drive AI research and practical applications, their design, evaluation, and continuous improvement present intricate, multifaceted challenges. This book provides a comprehensive overview, framing intelligent agents within modular, brain-inspired architectures that integrate principles from cognitive science, neuroscience, and computational research. We structure our exploration into four interconnected parts. First, we investigate the modular foundation of intelligent agents, systematically mapping their cognitive, perceptual, and operational modules onto analogous human brain functionalities and elucidating core components such as memory, world modeling, reward processing, goal setting, and emotion. Second, we discuss self-enhancement and adaptive evolution mechanisms, exploring how agents autonomously refine their capabilities, adapt to dynamic environments, and achieve continual learning through automated optimization paradigms. Third, we examine collaborative and evolutionary multi-agent systems, investigating the collective intelligence emerging from agent interactions, cooperation, and societal structures, highlighting parallels to human social dynamics. Finally, we address the critical imperative of building safe and beneficial AI systems, emphasizing intrinsic and extrinsic security threats, ethical alignment, robustness, and practical mitigation strategies necessary for trustworthy real-world deployment. By synthesizing modular AI architectures with insights from different disciplines, this survey identifies key research gaps, challenges, and opportunities, encouraging innovations that harmonize technological advancement with meaningful societal benefit.
大语言模型的出现催化了人工智能的变革性转变,为能够跨领域进行复杂推理、稳健感知和灵活行动的先进智能体铺平了道路。随着这些智能体日益推动AI研究和实际应用,其设计、评估和持续改进呈现出错综复杂的多方面挑战。本书提供了一个全面的视角,将智能体置于模块化的、受大脑启发的架构中进行阐述,该架构融合了认知科学、神经科学和计算研究的原理。我们将探索分为四个相互关联的部分。首先,我们系统研究了智能体的模块化基础,将其认知、感知和操作模块系统地映射到类似的人类大脑功能上,并阐明了记忆、世界建模、奖励处理、目标和情感等核心组件。其次,我们讨论了自我增强与自适应进化机制,探索智能体如何通过自动化优化范式自主完善其能力、适应动态环境并实现持续学习。第三,我们审视了协作与进化的多智能体系统,研究从智能体交互、合作和社会结构中涌现的集体智能,并强调了与人类社会动态的相似之处。最后,我们探讨了构建安全且有益的AI系统这一关键任务,强调内在和外在的安全威胁、伦理对齐、鲁棒性以及实现可信赖的现实世界部署所必需的实用缓解策略。通过将模块化AI架构与不同学科的见解相结合,本综述识别了关键的研究空白、挑战和机遇,鼓励那些能够将技术进步与有意义的社会利益相协调的创新。
OmniSVG: A Unified Scalable Vector Graphics Generation Model
OmniSVG:统一的SVG生成模型
Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The development of autonomous SVG generation workflows continues to draw attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality, complex SVGs that adhere to multimodal instructions, we propose OmniSVG, a unified SVG generation framework that inherits knowledge from a pre-trained Vision-Language Model (VLM). By parameterizing SVG commands and coordinates into discrete token sequences, this auto-regressive formulation enables us to seamlessly adapt modern VLMs to direct SVG generation. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
可缩放矢量图形因其分辨率无关性和可编辑性而成为平面设计中广泛采用的重要图像格式。自主SVG生成流程的发展持续吸引着AIGC社区设计师和研究人员的关注。然而,现有方法要么以巨大的计算成本生成非结构化输出,要么局限于生成结构过于简化的单色图标。为了生成符合多模态指令的高质量复杂SVG,我们提出了OmniSVG,一个统一的SVG生成框架,它继承了预训练视觉语言模型的知识。通过将SVG命令和坐标参数化为离散的词元序列,自回归的特性使我们能够将现代视觉语言模型无缝适配到直接SVG生成中。为了进一步推动SVG合成的发展,我们引入了MMSVG-2M,一个包含两百万个丰富标注SVG资源的多模态数据集,以及一个用于条件SVG生成任务的标准化评估协议。大量实验表明,OmniSVG优于现有方法,并展示了其集成到专业SVG设计工作流程中的潜力。
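As a rough illustration of parameterizing SVG commands and coordinates into discrete token sequences, the sketch below quantizes path coordinates onto a grid and emits one token per command and per coordinate. The token vocabulary and the quantization grid are hypothetical assumptions, not OmniSVG's actual tokenizer.

```python
# Illustrative sketch: turn SVG path commands and coordinates into a
# discrete token sequence an autoregressive VLM could predict.
def tokenize_path(commands, grid=200):
    """commands: list of (op, points) tuples; coordinates quantized to a grid."""
    tokens = []
    for op, points in commands:
        tokens.append(f"<{op}>")  # command token, e.g. <M>, <L>, <C>
        for x, y in points:
            tokens.append(f"<x{min(int(x), grid - 1)}>")  # quantized x coordinate
            tokens.append(f"<y{min(int(y), grid - 1)}>")  # quantized y coordinate
    tokens.append("<EOS>")
    return tokens

path = [("M", [(10.7, 20.2)]), ("L", [(30.0, 40.9)])]
print(tokenize_path(path))  # → ['<M>', '<x10>', '<y20>', '<L>', '<x30>', '<y40>', '<EOS>']
```

Because the sequence is discrete and left-to-right, it slots directly into the next-token objective a pre-trained VLM already uses.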
Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning
Paper2Code:从机器学习科学论文自动化生成代码
Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into operational code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins.
尽管机器学习研究发展迅速,但相应的代码实现往往不可用,这使得研究人员复现结果和在先前工作的基础上进行开发变得缓慢且劳动密集。与此同时,近期的大语言模型在理解科学文档和生成高质量代码方面表现出色。受此启发,我们推出了 PaperCoder,一个多智能体大语言模型框架,能够将机器学习论文转化为可运行的代码仓库。PaperCoder 分三个阶段运行:规划阶段,它构建高级路线图,用图表设计系统架构,识别文件依赖关系,并生成配置文件;分析阶段,专注于解读实现特定的细节;以及生成阶段,产生模块化、感知依赖的代码。此外,每个阶段都通过一组专门的智能体来实现,这些智能体旨在整个流程中有效协作。随后,我们基于模型评估和人工评估,特别是来自论文作者的评估,以作者发布的代码仓库作为真实基准,对 PaperCoder 从机器学习论文生成代码实现的能力进行了评估。我们的结果证明了 PaperCoder 在创建高质量、忠实实现方面的有效性。此外,它在近期发布的 PaperBench 基准上持续展现出优势,以显著差距超越了强大的基线模型。
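The three-stage pipeline described above (planning, analysis, generation) can be sketched as three composed functions. The helpers below are hypothetical keyword-level stand-ins for the specialized LLM agents in the paper, shown only to make the data flow between stages concrete.

```python
# Toy sketch of a paper-to-code pipeline in the spirit of PaperCoder:
# planning -> analysis -> generation. Each function is a stand-in for an
# LLM-driven agent, not the authors' implementation.
def plan(paper_text):
    # Stand-in planner: a real agent would derive these from the paper.
    return {"roadmap": ["data", "model", "train"],
            "files": ["config.yaml", "model.py", "train.py"]}

def analyze(paper_text, plan_result):
    # Interpret implementation-specific details for each planned file.
    return {f: f"specs for {f}" for f in plan_result["files"]}

def generate(specs):
    # Emit modular, dependency-aware code stubs from the per-file specs.
    return {f: f"# generated from: {s}" for f, s in specs.items()}

paper = "…paper text…"
repo = generate(analyze(paper, plan(paper)))
print(sorted(repo))  # → ['config.yaml', 'model.py', 'train.py']
```

The key design point is that each stage hands the next a structured artifact (roadmap, per-file specs, code), so the specialized agents can collaborate without sharing hidden state.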
Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
在视频生成的下一帧预测模型中打包输入帧上下文
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frame contexts with frame-wise importance so that more frames can be encoded within a fixed context length, with more important frames having longer contexts. The frame importance can be measured using time proximity, feature similarity, or hybrid metrics. The packing method allows for inference with thousands of frames and training with relatively large batch sizes. We also present drift prevention methods to address observation bias (error accumulation), including early-established endpoints, adjusted sampling orders, and discrete history representation. Ablation studies validate the effectiveness of the anti-drifting methods in both single-directional video streaming and bi-directional video generation. Finally, we show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules.
我们提出了一种名为 FramePack 的神经网络结构,用于训练视频生成的下一帧(或下一帧片段)预测模型。FramePack 通过逐帧重要性压缩输入帧上下文,从而在固定上下文长度内编码更多帧,且更重要的帧拥有更长的上下文。帧重要性可以通过时间接近度、特征相似性或混合度量来衡量。这种打包方法允许使用数千帧进行推理,并以相对较大的批量大小进行训练。我们还提出了防漂移方法来处理观测偏差(误差累积),包括提前建立的端点、调整的采样顺序和离散历史表示。消融研究验证了防漂移方法在单向视频流和双向视频生成中的有效性。最后,我们展示了现有视频扩散模型可以使用 FramePack 进行微调,并分析了不同打包调度之间的差异。
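The importance-weighted packing idea can be sketched with a simple budget schedule: frames closer in time to the frame being predicted get more context tokens, older frames get fewer, so the total stays bounded no matter how many frames are packed. The geometric schedule and all constants below are illustrative choices, not the paper's exact packing schedule.

```python
# Sketch of FramePack-style context budgeting using time proximity as the
# importance measure: frame 0 is the most recent, and each older frame
# receives geometrically fewer tokens, down to a floor.
def pack_budgets(num_frames, base_tokens=1536, ratio=0.5, min_tokens=16):
    """Return a per-frame token budget, most recent frame first."""
    budgets = []
    for i in range(num_frames):
        budgets.append(max(int(base_tokens * ratio ** i), min_tokens))
    return budgets

b = pack_budgets(8)
print(b, sum(b))  # → [1536, 768, 384, 192, 96, 48, 24, 16] 3064
```

Because the tail of the schedule is bounded by the floor, thousands of history frames add only a small, predictable number of tokens each, which is what makes long-horizon inference feasible under a fixed context length.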
Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory
Mem0:构建具备可扩展长期记忆的生产级AI智能体
Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining consistency over prolonged multi-session dialogues. We introduce Mem0, a scalable memory-centric architecture that addresses this issue by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. Building on this foundation, we further propose an enhanced variant that leverages graph-based memory representations to capture complex relational structures among conversational elements. Through comprehensive evaluations on the LOCOMO benchmark, we systematically compare our approaches against six baseline categories: (i) established memory-augmented systems, (ii) retrieval-augmented generation (RAG) with varying chunk sizes and
大语言模型在生成上下文连贯的响应方面展现出卓越能力,然而其固定的上下文窗口对维持长时间多轮对话的一致性构成了根本性挑战。我们引入了 Mem0,一种可扩展的、以记忆为中心的架构,通过动态地从持续对话中提取、整合和检索关键信息来解决这一问题。在此基础之上,我们进一步提出了一种增强变体,利用基于图的记忆表示来捕捉会话元素之间复杂的关联结构。通过在 LOCOMO 基准上的全面评估,我们系统地比较了我们的方法与六类基线方法:既有的记忆增强系统、不同块大小和
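The extract–consolidate–retrieve loop described above can be sketched as a toy memory store. Extraction and relevance scoring here are keyword-based stand-ins for the LLM-driven operations in the paper; the class name and scoring scheme are assumptions for illustration only.

```python
# Toy sketch of a memory-centric loop in the spirit of Mem0: extract
# candidate facts from each turn, consolidate by deduplication, and
# retrieve by word overlap with the query.
class MemoryStore:
    def __init__(self):
        self.facts = []

    def extract_and_add(self, turn):
        # Stand-in extraction: treat each sentence as a candidate fact.
        for sent in turn.split("."):
            sent = sent.strip()
            if sent and sent not in self.facts:  # consolidation: skip duplicates
                self.facts.append(sent)

    def retrieve(self, query, k=2):
        # Stand-in retrieval: rank stored facts by word overlap with the query.
        qwords = set(query.lower().split())
        ranked = sorted(self.facts,
                        key=lambda f: -len(qwords & set(f.lower().split())))
        return ranked[:k]

mem = MemoryStore()
mem.extract_and_add("Alice lives in Paris. She works as a chemist.")
mem.extract_and_add("Alice lives in Paris. Her sister is visiting.")
print(mem.retrieve("Where does Alice live"))
```

The point of the sketch is the shape of the loop: salient information is lifted out of the transcript into a compact store that outlives any single context window, then pulled back in on demand.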