Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

https://huggingface.co/papers/2511.22699

https://arxiv.org/abs/2511.22699

https://tongyi-mai.github.io/Z-Image-blog/

https://github.com/Tongyi-MAI/Z-Image

Alibaba

The landscape of high-performance image generation models is currently dominated by proprietary systems, such as Nano Banana Pro and Seedream 4.0. Leading open-source alternatives, including Qwen-Image, Hunyuan-Image-3.0 and FLUX.2, are characterized by massive parameter counts (20B to 80B), making them impractical for inference and fine-tuning on consumer-grade hardware. To address this gap, we propose Z-Image, an efficient 6B-parameter foundation generative model built upon a Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture that challenges the "scale-at-all-costs" paradigm. By systematically optimizing the entire model lifecycle, from a curated data infrastructure to a streamlined training curriculum, we complete the full training workflow in just 314K H800 GPU hours (approx. $630K). Our few-step distillation scheme with reward post-training further yields Z-Image-Turbo, offering both sub-second inference latency on an enterprise-grade H800 GPU and compatibility with consumer-grade hardware (<16GB VRAM). Additionally, our omni-pre-training paradigm enables efficient training of Z-Image-Edit, an editing model with impressive instruction-following capabilities. Both qualitative and quantitative experiments demonstrate that our model achieves performance comparable to or surpassing that of leading competitors across various dimensions. Most notably, Z-Image exhibits exceptional capabilities in photorealistic image generation and bilingual text rendering, delivering results that rival top-tier commercial models and thereby demonstrating that state-of-the-art results are achievable with significantly reduced computational overhead. We publicly release our code, weights, and online demo to foster the development of accessible, budget-friendly, yet state-of-the-art generative models.
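
As a quick sanity check on the training budget quoted above, the two figures in the abstract (314K H800 GPU hours, approx. $630K) imply an hourly rate of about $2 per GPU. The sketch below only redoes that arithmetic. The cluster size passed to `wall_clock_days` is a hypothetical value chosen for illustration and is not stated in the abstract.

```python
# Figures quoted in the Z-Image abstract.
GPU_HOURS = 314_000
TOTAL_COST_USD = 630_000


def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Return the cost per GPU hour implied by a total budget."""
    return total_cost_usd / gpu_hours


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Return elapsed days if the work were spread evenly over num_gpus GPUs."""
    return gpu_hours / num_gpus / 24


rate = implied_hourly_rate(TOTAL_COST_USD, GPU_HOURS)
# 1024 GPUs is an assumed cluster size, used only to illustrate the conversion.
days = wall_clock_days(GPU_HOURS, num_gpus=1024)

print(f"Implied rate: ${rate:.2f} per H800 GPU hour")
print(f"Wall-clock time on a hypothetical 1024-GPU cluster: {days:.1f} days")
```

The second number assumes perfect utilization with no restarts or idle time, so it is a lower bound rather than an estimate of the actual schedule.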


From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

https://huggingface.co/papers/2511.18538

https://arxiv.org/abs/2511.18538

Beihang University

Large language models (LLMs) have fundamentally transformed automated software development by enabling direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval rising from single digits to over 95%. In this work, we provide a comprehensive synthesis and practical guide (a series of analytic and probing experiments) to code LLMs, systematically examining the complete model life cycle from data curation to post-training through advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the code capability of general LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically examining the techniques, design decisions, and trade-offs. Further, we articulate the gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), including code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Last, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling laws, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
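
Success rates on HumanEval are commonly reported as pass@k: the probability that at least one of k sampled completions passes the unit tests. The abstract does not say which metric its quoted figures use, so reading them as pass@k is an assumption. A minimal sketch of the standard unbiased estimator, given n samples per problem of which c are correct:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: completions sampled for one problem
    c: completions that passed all tests
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For k = 1 the estimate reduces to c / n, the plain fraction of passing samples. Benchmark scores are the mean of this value over all problems.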


Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

https://arxiv.org/abs/2511.04570

https://huggingface.co/papers/2511.04570

https://thinking-with-video.github.io/

https://github.com/tongjingqi/Thinking-with-Video

Fudan OpenMOSS

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning ability of large language models (LLMs) and vision-language models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision are treated as separate modalities, which hinders unified multimodal understanding and generation. To overcome these limitations, we introduce "Thinking with Video", a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Games), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings suggest that video generation models are potential unified multimodal understanding and generation models, positioning "Thinking with Video" as a unified multimodal reasoning paradigm.

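The self-consistency technique mentioned in the abstract is usually implemented as a majority vote over several independently sampled answers. The sketch below shows only that voting step. How a final answer is read out of each generated video is not described in the abstract, so the answer strings here are hypothetical placeholders.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the most frequent answer; ties go to the earliest-seen answer."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")


# Hypothetical answers extracted from five independent generations.
sampled = ["18", "18", "20", "18", "17"]
print(majority_vote(sampled))
```

The vote helps when errors are scattered across different wrong answers while correct generations agree with one another.
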

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

https://huggingface.co/papers/2511.14993

https://arxiv.org/abs/2511.14993

https://github.com/kandinskylab/kandinsky-5

https://kandinskylab.ai/


This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.


SAM 3: Segment Anything with Concepts

https://huggingface.co/papers/2511.16719

https://arxiv.org/abs/2511.16719

https://ai.meta.com/research/sam3/

https://github.com/facebookresearch/sam3

Meta

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., "yellow school bus"), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone. Recognition and localization are decoupled with a presence head, which boosts detection accuracy. SAM 3 doubles the accuracy of existing systems in both image and video PCS, and improves previous SAM capabilities on visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark for promptable concept segmentation.
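
One way to picture the decoupling of recognition and localization is a two-part score: a single image-level presence probability ("is the concept anywhere in this image?") and a per-candidate localization probability ("does this candidate match?"). The sketch below multiplies the two, which is an assumption made for illustration. The abstract states only that a presence head decouples the two and does not give the combination rule. All function names and the threshold are hypothetical.

```python
def combine_scores(presence_prob: float, candidate_probs: list[float]) -> list[float]:
    """Scale each candidate's localization score by the global presence score."""
    return [presence_prob * p for p in candidate_probs]


def select_candidates(
    presence_prob: float,
    candidate_probs: list[float],
    threshold: float = 0.5,
) -> list[int]:
    """Return indices of candidates whose combined score reaches the threshold."""
    scores = combine_scores(presence_prob, candidate_probs)
    return [i for i, score in enumerate(scores) if score >= threshold]


# Same candidate scores, different image-level presence (illustrative values).
candidates = [0.8, 0.3, 0.6]
print(select_candidates(0.9, candidates))  # concept likely present
print(select_candidates(0.1, candidates))  # concept likely absent
```

Under this scheme a low presence score suppresses every candidate at once, so individual candidates need not each learn to reject concepts that are absent from the image.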


SAM 3D: 3Dfy Anything in Images

https://huggingface.co/papers/2511.16624

https://arxiv.org/abs/2511.16624

https://ai.meta.com/research/sam3d/

https://github.com/facebookresearch/sam-3d-objects

Meta

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D “data barrier”. We obtain significant gains over recent work, with at least a 5 : 1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
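
For readers who prefer percentages, a 5 : 1 win rate corresponds to the method being preferred in roughly 83% of decided comparisons. The conversion below assumes that ties are excluded from the ratio, which the abstract does not specify.

```python
def preference_share(wins: float, losses: float) -> float:
    """Convert a wins:losses ratio into the winner's share of decided comparisons."""
    if wins < 0 or losses < 0 or wins + losses == 0:
        raise ValueError("wins and losses must be non-negative and not both zero")
    return wins / (wins + losses)


share = preference_share(5, 1)
print(f"{share:.1%} of decided comparisons")
```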


Depth Anything 3: Recovering the Visual Space from Any Views

https://huggingface.co/papers/2511.10647

https://arxiv.org/abs/2511.10647

https://depth-anything-3.github.io/

https://github.com/ByteDance-Seed/depth-anything-3

ByteDance

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.
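
To see why a depth-plus-ray target can be enough to describe geometry, note that a 3D point is fully determined by a ray (origin and direction) and a distance along it. The sketch below shows that unprojection for a single pixel. It assumes depth is measured as a multiple of the direction vector. The exact parameterization used by DA3 is not given in the abstract, and the function name is hypothetical.

```python
Vec3 = tuple[float, float, float]


def unproject(origin: Vec3, direction: Vec3, depth: float) -> Vec3:
    """Return the 3D point at origin + depth * direction."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    return (ox + depth * dx, oy + depth * dy, oz + depth * dz)


# A camera at the world origin looking down +z sees a point 2.5 units away.
print(unproject((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 2.5))

# A second view with a different ray origin lands in the same world frame.
print(unproject((1.0, 2.0, 3.0), (0.5, 0.0, 1.0), 2.0))
```

Because each ray already carries its origin and direction in a shared frame, points from different views can be compared directly without a separate camera pose to transform them.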
