VecGlypher: Unified Vector Glyph Generation with Language Models
CVPR 2026 | GitHub Stars 124
Abstract
Vector glyphs are the atomic units of digital typography, yet most learning-based pipelines still depend on carefully curated exemplar sheets and raster-to-vector postprocessing, which limits accessibility and editability. We introduce VecGlypher, a single multimodal language model that generates high-fidelity vector glyphs directly from text descriptions or image exemplars. Given a style prompt, optional reference glyph images, and a target character, VecGlypher autoregressively emits SVG path tokens, avoiding raster intermediates and producing editable, watertight outlines in one pass. A typography-aware data and training recipe makes this possible: (i) a large-scale continuation stage on 39K noisy Envato fonts to master SVG syntax and long-horizon geometry, followed by (ii) post-training on 2.5K expert-annotated Google Fonts with descriptive tags and exemplars to align language and imagery with geometry; preprocessing normalizes coordinate frames, canonicalizes paths, de-duplicates families, and quantizes coordinates for stable long-sequence decoding. On cross-family OOD evaluation, VecGlypher substantially outperforms both general-purpose LLMs and specialized vector-font baselines for text-only generation, while image-referenced generation reaches state-of-the-art performance, with marked gains over DeepVecFont-v2 and DualVector. Ablations show that model scale and the two-stage recipe are critical and that absolute-coordinate serialization yields the best geometry. VecGlypher lowers the barrier to font creation by letting users design with words or exemplars, and provides a scalable foundation for future multimodal design tools.
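The preprocessing mentioned in the VecGlypher abstract (coordinate quantization and absolute-coordinate serialization) can be illustrated with a minimal sketch. The grid size of 256, the `Command` tuple layout, and the function names are assumptions made for this example; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List, Tuple

# Hypothetical in-memory form of a path command: (operator, coordinates).
# Inputs are assumed to be absolute and already normalized to [0, units_per_em].
Command = Tuple[str, List[float]]


def quantize(value: float, units_per_em: float, grid: int = 256) -> int:
    """Map a font-unit coordinate onto an integer grid (grid size is an assumption)."""
    scaled = round(value / units_per_em * (grid - 1))
    return max(0, min(grid - 1, scaled))  # clamp to the grid


def serialize_path(commands: List[Command], units_per_em: float, grid: int = 256) -> str:
    """Emit an SVG path string using uppercase (absolute) operators and integer coordinates."""
    parts = []
    for op, coords in commands:
        op = op.upper()  # uppercase operators are absolute in SVG path syntax
        if op == "Z":
            parts.append("Z")
            continue
        quantized = [str(quantize(c, units_per_em, grid)) for c in coords]
        parts.append(op + " " + " ".join(quantized))
    return " ".join(parts)


# A toy triangular outline in a 1000-unit em square.
glyph = [("M", [0, 0]), ("L", [1000, 0]), ("L", [400, 1000]), ("Z", [])]
path_d = serialize_path(glyph, units_per_em=1000, grid=256)
print(path_d)
```

Integer coordinates on a fixed grid keep the token vocabulary small, which is one plausible reason quantization helps long-sequence decoding.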
InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
ICLR 2026 | GitHub Stars 110 | MLLM
Abstract
General SVG modeling remains challenging due to fragmented datasets, limited transferability of methods across tasks, and the difficulty of handling structural complexity. In response, we leverage the strong transfer and generalization capabilities of multimodal large language models (MLLMs) to achieve unified modeling for SVG understanding, editing, and generation. We present the InternSVG family, an integrated data-benchmark-model suite. At its core is SAgoge, the largest and most comprehensive multimodal dataset for SVG tasks, encompassing both static graphics and dynamic animations. It covers icons, long-sequence illustrations, scientific diagrams, and dynamic animations, supporting tasks of varied difficulty levels and providing deeper hierarchies with richer attributes compared to previous datasets. Based on this resource, we introduce SArena, a companion benchmark with comprehensive task definitions and standardized evaluation that aligns with the domains and difficulty spectrum covered by SAgoge. Building on these foundations, we propose InternSVG, a unified MLLM for SVG understanding, editing, and generation with SVG-specific special tokens, subword-based embedding initialization, and a two-stage training strategy that progresses from short static SVGs to long-sequence illustrations and complex animations. This unified formulation induces positive transfer and improves overall performance. Experiments on SArena and a prior benchmark confirm that InternSVG achieves substantial gains and consistently outperforms leading open and proprietary counterparts.
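The "subword-based embedding initialization" named in the InternSVG abstract is commonly realized by starting a new special token's embedding from the embeddings of the subwords it would otherwise split into. The sketch below assumes plain averaging, a toy vocabulary, and made-up subword ids; the paper may aggregate differently.

```python
import numpy as np


def init_special_token_embedding(embedding_matrix: np.ndarray, subword_ids: list) -> np.ndarray:
    """Average the existing subword embeddings (averaging is an assumption of this sketch)."""
    return embedding_matrix[subword_ids].mean(axis=0)


rng = np.random.default_rng(0)
vocab = rng.normal(size=(10, 4))  # toy 10-token vocabulary with 4-dimensional embeddings

# Hypothetical: a new SVG-specific token that the base tokenizer splits into ids [2, 5, 7].
new_row = init_special_token_embedding(vocab, [2, 5, 7])

# Append the new row so the special token gets its own id (here, id 10).
extended = np.vstack([vocab, new_row])
print(extended.shape)
```

Starting from the subword average places the new token near its textual meaning instead of at a random point, so early training steps are less disruptive.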
AutoFigure: Generating and Refining Publication-Ready Scientific Illustrations
ICLR 2026 | GitHub Stars 1.3k
Abstract
High-quality scientific illustrations are crucial for effectively communicating complex scientific and technical concepts, yet their manual creation remains a well-recognized bottleneck in both academia and industry. We present FigureBench, the first large-scale benchmark for generating scientific illustrations from long-form scientific texts. It contains 3,300 high-quality scientific text-figure pairs, covering diverse text-to-illustration tasks from scientific papers, surveys, blogs, and textbooks. Moreover, we propose AutoFigure, the first agentic framework that automatically generates high-quality scientific illustrations based on long-form scientific text. Specifically, before rendering the final result, AutoFigure engages in extensive thinking, recombination, and validation to produce a layout that is both structurally sound and aesthetically refined, outputting a scientific illustration that achieves both structural completeness and aesthetic appeal. Leveraging the high-quality data from FigureBench, we conduct extensive experiments to test the performance of AutoFigure against various baseline methods. The results demonstrate that AutoFigure consistently surpasses all baseline methods, producing publication-ready scientific illustrations.
AutoFigure-Edit: Generating Editable Scientific Illustration
arXiv 2026 | GitHub Stars 3.1k
Abstract
High-quality scientific illustrations are essential for communicating complex scientific and technical concepts, yet existing automated systems remain limited in editability, stylistic controllability, and efficiency. We present AutoFigure-Edit, an end-to-end system that generates fully editable scientific illustrations from long-form scientific text while enabling flexible style adaptation through user-provided reference images. By combining long-context understanding, reference-guided styling, and native SVG editing, it enables efficient creation and refinement of high-quality scientific illustrations.
ACPV-Net: All-Class Polygonal Vectorization for Seamless Vector Map Generation from Aerial Imagery
CVPR 2026 | GitHub Stars 25
Abstract
We tackle the problem of generating a complete vector map representation from aerial imagery in a single run: producing polygons for all land-cover classes with shared boundaries and without gaps or overlaps. Existing polygonization methods are typically class-specific; extending them to multiple classes via per-class runs commonly leads to topological inconsistencies, such as duplicated edges, gaps, and overlaps. We formalize this new task as All-Class Polygonal Vectorization (ACPV) and release the first public benchmark, Deventer-512, with standardized metrics jointly evaluating semantic fidelity, geometric accuracy, vertex efficiency, per-class topological fidelity and global topological consistency. To realize ACPV, we propose ACPV-Net, a unified framework introducing a novel Semantically Supervised Conditioning (SSC) mechanism coupling semantic perception with geometric primitive generation, along with a topological reconstruction that enforces shared-edge consistency by design. While enforcing such strict topological constraints, ACPV-Net surpasses all class-specific baselines in polygon quality across classes on Deventer-512. It also applies to single-class polygonal vectorization without any architectural modification, achieving the best-reported results on WHU-Building.
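The topological issues the ACPV-Net abstract lists (duplicated edges, gaps, overlaps) can be probed with a simple directed-edge count. This is a rough sketch of a necessary (not sufficient) consistency check, not the benchmark's metric: it assumes every polygon is given as a counter-clockwise vertex list with exactly matching vertices on shared boundaries, and the function names are hypothetical.

```python
from collections import Counter


def directed_edges(polygon):
    """Return the directed edges of a closed polygon given as a list of (x, y) vertices."""
    n = len(polygon)
    return [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]


def edge_report(polygons):
    """Classify edges of a set of consistently oriented polygons.

    In a gap-free, overlap-free partition, each interior edge appears once in each
    direction, and no directed edge appears twice.
    """
    counts = Counter(e for poly in polygons for e in directed_edges(poly))
    duplicated = [e for e, c in counts.items() if c > 1]
    paired = [e for e in counts if (e[1], e[0]) in counts]
    unmatched = [e for e in counts if (e[1], e[0]) not in counts]
    return {
        "duplicated": duplicated,          # same directed edge used more than once
        "shared_count": len(paired) // 2,  # each shared edge is seen in both directions
        "unmatched": unmatched,            # outer boundary, or a possible gap
    }


# Two unit squares that share the edge x = 1.
square_a = [(0, 0), (1, 0), (1, 1), (0, 1)]
square_b = [(1, 0), (2, 0), (2, 1), (1, 1)]
report = edge_report([square_a, square_b])

# Emitting the same polygon twice (as independent per-class runs might) duplicates every edge.
bad_report = edge_report([square_a, square_a])
print(report["shared_count"], len(bad_report["duplicated"]))
```

Unmatched edges are expected along the map border; unmatched edges in the interior would point at a gap or a mismatched boundary.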
Clair Obscur: an Illumination-Aware Method for Real-World Image Vectorization
CVPR 2026 Highlight | GitHub Stars 4
Abstract
Image vectorization aims to convert raster images into editable, scalable vector representations while preserving visual fidelity. Existing vectorization methods struggle to represent complex real-world images, often producing fragmented shapes at the cost of semantic conciseness. In this paper, we propose COVec, an illumination-aware vectorization framework inspired by the Clair-Obscur principle of light-shade contrast. COVec is the first to introduce intrinsic image decomposition in the vector domain, separating an image into albedo, shade, and light layers in a unified vector representation. A semantic-guided initialization and two-stage optimization refine these layers with differentiable rendering. Experiments on various datasets demonstrate that COVec achieves higher visual fidelity and significantly improved editability compared to existing methods.
DuetSVG: Unified Multimodal SVG Generation with Internal Visual Guidance
arXiv 2025 | VLM
Abstract
Recent vision-language model (VLM)-based approaches have achieved impressive results on SVG generation. However, because they generate only text and lack visual signals during decoding, they often struggle with complex semantics and fail to produce visually appealing or geometrically coherent SVGs. We introduce DuetSVG, a unified multimodal model that jointly generates image tokens and corresponding SVG tokens in an end-to-end manner. DuetSVG is trained on both image and SVG datasets. At inference, we apply a novel test-time scaling strategy that leverages the model's native visual predictions as guidance to improve SVG decoding quality. Extensive experiments show that our method outperforms existing methods, producing visually faithful, semantically aligned, and syntactically clean SVGs across a wide range of applications.
Reliable Reasoning in SVG-LLMs via Multi-Task Multi-Reward Reinforcement Learning
GitHub Stars 19 | RL
Abstract
With the rapid advancement of vision-language models, an increasing number of studies have explored their potential for SVG generation tasks. Although existing approaches improve performance by constructing large-scale SVG datasets and introducing SVG-specific tokens, they still suffer from limited generalization, redundant paths in code outputs, and a lack of explicit reasoning. In this work, we present CTRL-S (Chain-of-Thought Reinforcement Learning for SVG), a unified framework that introduces a chain-of-thought mechanism to explicitly expose the model's reasoning process during SVG generation. To support this structured reasoning, we construct SVG-Sophia, a high-quality dataset containing 145K samples across SVG code refinement, Text-to-SVG, and Image-to-SVG tasks. By training the model to generate group-level structured SVG code, CTRL-S significantly improves structural coherence and visual fidelity. Furthermore, we adopt the GRPO algorithm and design a multi-reward optimization framework, incorporating DINO, image-text similarity, format, and code efficiency rewards. Through joint multi-reward optimization and multi-task training, our approach systematically enhances overall generation capabilities. Extensive experiments show that CTRL-S outperforms existing methods, achieving higher task success rates, superior SVG code quality, and exceptional visual fidelity.
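The multi-reward GRPO setup described in the CTRL-S abstract can be sketched as a weighted sum of per-sample rewards followed by group-relative normalization, which is the standard GRPO advantage. The reward values, the equal weights, and the key names below are illustrative assumptions; the abstract does not give the actual weighting.

```python
import numpy as np


def combine_rewards(rewards: dict, weights: dict) -> np.ndarray:
    """Weighted sum of per-sample rewards; each entry has shape (group_size,)."""
    return sum(weights[name] * np.asarray(values, dtype=float) for name, values in rewards.items())


def group_relative_advantages(total: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: normalize each reward by its group's mean and standard deviation."""
    total = np.asarray(total, dtype=float)
    return (total - total.mean()) / (total.std() + eps)


# One prompt, a group of four sampled SVGs, four reward signals (made-up numbers).
rewards = {
    "dino": [0.9, 0.5, 0.7, 0.3],        # visual similarity
    "image_text": [0.8, 0.6, 0.6, 0.4],  # image-text similarity
    "format": [1.0, 1.0, 0.0, 1.0],      # 1 if the SVG parses, else 0
    "efficiency": [0.5, 0.9, 0.7, 0.2],  # shorter code scores higher
}
weights = {name: 0.25 for name in rewards}  # hypothetical equal weighting

total = combine_rewards(rewards, weights)
advantages = group_relative_advantages(total)
print(total, advantages)
```

Because advantages are computed within the group, no separate value network is needed; samples are only ranked against their siblings for the same prompt.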
DiffBMP: Differentiable Rendering with Bitmap Primitives
DiffBMP: Differentiable Rendering with Bitmap Primitives. arXiv preprint arXiv:2602.22625, 2026.
CVPR 2026 | GitHub Stars 2
Abstract
We introduce DiffBMP, a scalable and efficient differentiable rendering engine for a collection of bitmap images. Our work addresses a limitation that traditional differentiable renderers are constrained to vector graphics, given that most images in the world are bitmaps. Our core contribution is a highly parallelized rendering pipeline, featuring a custom CUDA implementation for calculating gradients. This system can, for example, optimize the position, rotation, scale, color, and opacity of thousands of bitmap primitives, all in under one minute on a consumer GPU. We employ and validate several techniques to facilitate the optimization: soft rasterization via Gaussian blur, structure-aware initialization, noisy canvas, and specialized losses/heuristics for videos or spatially constrained images. We demonstrate that DiffBMP is not just an isolated tool, but a practical one designed to integrate into creative workflows. It supports exporting compositions to a native, layered file format, and the entire framework is publicly accessible via an easy-to-hack Python package.
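To make "soft rasterization via Gaussian blur" from the DiffBMP abstract concrete, the sketch below blurs a hard-edged alpha mask with a separable Gaussian kernel in NumPy. It is a CPU illustration only, with assumed function names and an assumed 3-sigma kernel radius; it does not reproduce the package's CUDA pipeline or its gradient computation.

```python
import numpy as np


def gaussian_kernel(sigma: float) -> np.ndarray:
    """1D Gaussian kernel truncated at about 3 sigma and normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()


def soft_mask(mask: np.ndarray, sigma: float) -> np.ndarray:
    """Blur a hard alpha mask with a separable Gaussian (rows first, then columns).

    Assumes the kernel is shorter than each image side, so mode="same" keeps the shape.
    """
    kernel = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="same"), 1, mask)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, rows)


# A hard-edged 8x8 square primitive on a 32x32 canvas.
mask = np.zeros((32, 32))
mask[12:20, 12:20] = 1.0
soft = soft_mask(mask, sigma=2.0)
print(soft[16, 10:14])
```

With a hard mask, moving the primitive by a fraction of a pixel changes nothing, so a position gradient would be zero almost everywhere; the blurred edge gives neighboring pixels partial coverage that varies smoothly with position.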