SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
Hugging Face
While large language models have facilitated breakthroughs in many applications of artificial intelligence, their inherent largeness makes them computationally expensive and challenging to deploy in resource-constrained settings. In this paper, we document the development of SmolLM2, a state-of-the-art “small” (1.7 billion parameter) language model (LM). To attain strong performance, we overtrain SmolLM2 on ~11 trillion tokens of data using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We additionally introduce new specialized datasets (FineMath, Stack-Edu, and SmolTalk) at stages where we found existing datasets to be problematically small or low-quality. To inform our design decisions, we perform both small-scale ablations and a manual refinement process that updates the dataset mixing rates at each stage based on performance at the previous stage. Ultimately, we demonstrate that SmolLM2 outperforms other recent small LMs, including Qwen2.5-1.5B and Llama3.2-1B. To facilitate future research on LM development as well as applications of small LMs, we release both SmolLM2 and all of the datasets we prepared in the course of this project.
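To make the multi-stage mixing idea concrete, below is a minimal Python sketch of a staged training plan whose dataset mixing rates are nudged between stages based on evaluation results from the previous stage. The stage names, token budgets, mixing weights, target scores, and the adjust_mix rule are all hypothetical illustrations, not the actual SmolLM2 recipe.

```python
# Minimal sketch of a multi-stage data-mixing plan with manual refinement.
# All names, weights, and the adjustment rule are hypothetical illustrations.
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    tokens: float          # token budget for this stage (in trillions)
    mix: dict[str, float]  # fraction of tokens drawn from each data source


def renormalize(mix: dict[str, float]) -> dict[str, float]:
    total = sum(mix.values())
    return {k: v / total for k, v in mix.items()}


def adjust_mix(mix: dict[str, float], eval_scores: dict[str, float],
               targets: dict[str, float], step: float = 0.05) -> dict[str, float]:
    """Nudge a stage's mixing rates toward domains that lag their target score.

    Mimics, in spirit, the manual refinement loop: after a stage, domains
    whose benchmarks fall short get a larger share in the next stage.
    """
    new_mix = dict(mix)
    for domain, score in eval_scores.items():
        if domain in new_mix and score < targets[domain]:
            new_mix[domain] += step
    return renormalize(new_mix)


# Hypothetical three-stage plan mixing web text with math/code/instruction data.
stages = [
    Stage("stage1", tokens=6.0, mix={"web": 0.85, "code": 0.10, "math": 0.05}),
    Stage("stage2", tokens=4.0, mix={"web": 0.75, "code": 0.15, "math": 0.10}),
    Stage("stage3", tokens=1.0, mix={"web": 0.55, "code": 0.20, "math": 0.15, "instruct": 0.10}),
]

# Pretend evaluation after stage1 showed math and code lagging; bump their shares.
stage1_scores = {"web": 0.62, "code": 0.41, "math": 0.18}
targets = {"web": 0.60, "code": 0.45, "math": 0.30}
stages[1].mix = adjust_mix(stages[1].mix, stage1_scores, targets)

for s in stages:
    per_source = {k: round(s.tokens * v, 2) for k, v in renormalize(s.mix).items()}
    print(s.name, "token budget (T) per source:", per_source)
```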
OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models
ByteDance
End-to-end human animation, such as audio-driven talking human generation, has seen notable advances in recent years. However, existing methods still struggle to scale up the way large general video generation models do, which limits their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven, and combined driving signals).
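As a rough illustration of what mixing motion-related conditions into training can look like, the sketch below assigns each training sample a mask over its available conditioning signals (text, audio, pose) using per-condition keep probabilities, so clips with weaker annotations can still contribute. The condition names, probabilities, and masking rule are assumptions for illustration only and are not taken from the OmniHuman paper.

```python
# Illustrative sketch of mixed-condition training data construction.
# Condition names and keep probabilities are hypothetical.
import random

# Probability of keeping each condition for a given training sample.
# Cheap/ubiquitous conditions (text) are kept more often so loosely annotated
# video can still be used; stronger conditions (pose) are kept less often.
KEEP_PROB = {"text": 0.9, "audio": 0.5, "pose": 0.25}


def build_condition_mask(available: set[str]) -> dict[str, bool]:
    """Decide which of a sample's available conditions are fed to the model."""
    return {c: (c in available and random.random() < KEEP_PROB[c])
            for c in KEEP_PROB}


def training_batch(samples: list[set[str]]) -> list[dict[str, bool]]:
    """Turn per-sample condition availability into per-sample condition masks."""
    return [build_condition_mask(avail) for avail in samples]


# Hypothetical batch: some clips have only text, some text+audio, some all three.
random.seed(0)
batch = training_batch([
    {"text"},
    {"text", "audio"},
    {"text", "audio", "pose"},
    {"audio"},
])
for i, mask in enumerate(batch):
    print(f"sample {i}: use", [c for c, keep in mask.items() if keep])
```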
Qwen2.5-VL Technical Report
Alibaba
We introduce Qwen2.5-VL, the latest flagship model of Qwen vision-language series, which demonstrates significant advancements in both foundational capabilities and innovative functionalities. Qwen2.5-VL achieves a major leap forward in understanding and interacting with the world through enhanced visual recognition, precise object localization, robust document parsing, and long-video comprehension. A standout feature of Qwen2.5-VL is its ability to accurately localize objects using bounding boxes or points. It provides robust structured data extraction from invoices, forms, and tables, as well as detailed analysis of charts, diagrams, and layouts. To handle complex inputs, Qwen2.5-VL introduces dynamic resolution processing and absolute time encoding, enabling it to process images of varying sizes and videos of extended durations (up to hours) with second-level event localization. This allows the model to natively perceive spatial scales and temporal dynamics without relying on traditional normalization techniques. By training a native dynamic-resolution Vision Transformer (ViT) from scratch and incorporating Window Attention, we have significantly reduced computational overhead while maintaining native resolution. As a result, Qwen2.5-VL excels not only in static image and document understanding but also as an interactive visual agent capable of reasoning, tool usage, and task execution in real-world scenarios such as operating computers and mobile devices. The model achieves strong generalization across domains without requiring task-specific fine-tuning. Qwen2.5-VL is available in three sizes, addressing diverse use cases from edge AI to high-performance computing. The flagship Qwen2.5-VL-72B model matches state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, particularly excelling in document and diagram understanding. The smaller Qwen2.5-VL-7B and Qwen2.5-VL-3B models outperform comparable competitors, offering strong capabilities even in resource-constrained environments. Additionally, Qwen2.5-VL maintains robust linguistic performance, preserving the core language competencies of the Qwen2.5 LLM.
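The snippet below sketches one way dynamic resolution processing can work: the number of visual tokens is derived from an image's native size rather than from resizing everything to a fixed square, and is only reduced when it exceeds a budget. The patch size, token budget, and rounding rule are assumptions for illustration and are not taken from the Qwen2.5-VL implementation.

```python
# Rough sketch of dynamic-resolution token allocation for a ViT.
# PATCH and MAX_TOKENS are assumed values, not Qwen2.5-VL's settings.
import math

PATCH = 14           # ViT patch edge in pixels (assumed)
MAX_TOKENS = 4096    # per-image visual token budget (assumed)


def dynamic_grid(width: int, height: int) -> tuple[int, int]:
    """Return a (cols, rows) patch grid that preserves the image's aspect ratio.

    The image is only downscaled when its native grid would exceed the token
    budget, so small images keep their native resolution.
    """
    cols = max(1, round(width / PATCH))
    rows = max(1, round(height / PATCH))
    if cols * rows > MAX_TOKENS:
        scale = math.sqrt(MAX_TOKENS / (cols * rows))
        cols = max(1, int(cols * scale))
        rows = max(1, int(rows * scale))
    return cols, rows


for w, h in [(336, 336), (1280, 720), (4000, 3000)]:
    c, r = dynamic_grid(w, h)
    print(f"{w}x{h} -> grid {c}x{r} = {c * r} visual tokens")
```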