StarVector: Generating Scalable Vector Graphics Code from Images and Text

StarVector: Generating Scalable Vector Graphics Code from Images and Text[J]. arXiv preprint arXiv:2312.11556, 2023.

https://github.com/joanrod/star-vector

CVPR 2025 | GitHub Stars: 4.4k | MLLM

Abstract

Scalable Vector Graphics (SVGs) are vital for modern image rendering due to their scalability and versatility. Previous SVG generation methods have focused on curve-based vectorization, lacking semantic understanding, often producing artifacts, and struggling with SVG primitives beyond path curves. To address these issues, we introduce StarVector, a multimodal large language model for SVG generation. It performs image vectorization by understanding image semantics and using SVG primitives for compact, precise outputs. Unlike traditional methods, StarVector works directly in the SVG code space, leveraging visual understanding to apply accurate SVG primitives. To train StarVector, we create SVG-Stack, a diverse dataset of 2M samples that enables generalization across vectorization tasks and precise use of primitives like ellipses, polygons, and text. We address challenges in SVG evaluation, showing that pixel-based metrics like MSE fail to capture the unique qualities of vector graphics. We introduce SVG-Bench, a benchmark spanning 10 datasets and 3 tasks: Image-to-SVG, Text-to-SVG generation, and diagram generation. Using this setup, StarVector achieves state-of-the-art performance, producing more compact and semantically rich SVGs.
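To make the compactness argument concrete, here is a minimal sketch that writes the same circle twice: once as a `<circle>` primitive and once as a `<path>` built from four cubic Bézier segments, which is roughly what a curve-only vectorizer has to emit. The function names are hypothetical and nothing here is taken from the StarVector code base; the constant is the usual quarter-circle Bézier approximation.

```python
# Standard control-point offset for approximating a quarter circle
# with a single cubic Bezier segment.
KAPPA = 0.5522847498


def circle_as_primitive(cx: float, cy: float, r: float) -> str:
    """Express a circle with the dedicated SVG primitive."""
    return f'<circle cx="{cx:g}" cy="{cy:g}" r="{r:g}"/>'


def circle_as_path(cx: float, cy: float, r: float) -> str:
    """Express (approximately) the same circle as four cubic Bezier curves."""
    k = KAPPA * r
    segments = [
        f"M {cx + r:g} {cy:g}",
        f"C {cx + r:g} {cy + k:g} {cx + k:g} {cy + r:g} {cx:g} {cy + r:g}",
        f"C {cx - k:g} {cy + r:g} {cx - r:g} {cy + k:g} {cx - r:g} {cy:g}",
        f"C {cx - r:g} {cy - k:g} {cx - k:g} {cy - r:g} {cx:g} {cy - r:g}",
        f"C {cx + k:g} {cy - r:g} {cx + r:g} {cy - k:g} {cx + r:g} {cy:g}",
        "Z",
    ]
    d = " ".join(segments)
    return f'<path d="{d}"/>'


if __name__ == "__main__":
    primitive = circle_as_primitive(50, 50, 40)
    path = circle_as_path(50, 50, 40)
    print(len(primitive), primitive)
    print(len(path), path)
```

The primitive names the shape and its three parameters directly, while the path version carries 24 control-point coordinates from which a reader must infer that the shape is a circle.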


OmniSVG: A Unified Scalable Vector Graphics Generation Model

OmniSVG: A Unified Scalable Vector Graphics Generation Model[J]. arXiv preprint arXiv:2504.06263, 2025.

https://github.com/OmniSVG/OmniSVG

NeurIPS 2025 | GitHub Stars: 2.5k | VLM

Abstract

Scalable Vector Graphics (SVG) is an important image format widely adopted in graphic design because of its resolution independence and editability. The study of generating high-quality SVG has continuously drawn attention from both designers and researchers in the AIGC community. However, existing methods either produce unstructured outputs at huge computational cost or are limited to generating monochrome icons with over-simplified structures. To produce high-quality and complex SVG, we propose OmniSVG, a unified framework that leverages pre-trained Vision-Language Models (VLMs) for end-to-end multimodal SVG generation. By parameterizing SVG commands and coordinates into discrete tokens, OmniSVG decouples structural logic from low-level geometry for efficient training while maintaining the expressiveness of complex SVG structure. To further advance the development of SVG synthesis, we introduce MMSVG-2M, a multimodal dataset with two million richly annotated SVG assets, along with a standardized evaluation protocol for conditional SVG generation tasks. Extensive experiments show that OmniSVG outperforms existing methods and demonstrates its potential for integration into professional SVG design workflows.
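The idea of turning commands and coordinates into discrete tokens can be sketched as follows. This is only an illustration under stated assumptions: the canvas size, the command subset, the vocabulary layout, and the function names are all hypothetical, and the abstract does not specify how OmniSVG actually lays out its vocabulary.

```python
CANVAS = 200                      # assumed integer canvas size (hypothetical)
COMMANDS = ["M", "L", "C", "Z"]   # assumed command subset (hypothetical)
CMD_TOKEN = {c: i for i, c in enumerate(COMMANDS)}
COORD_OFFSET = len(COMMANDS)      # coordinate tokens come after command tokens


def coord_token(x: int, y: int) -> int:
    """Flatten an integer (x, y) pair into a single token id."""
    if not (0 <= x < CANVAS and 0 <= y < CANVAS):
        raise ValueError(f"point ({x}, {y}) is outside the canvas")
    return COORD_OFFSET + x * CANVAS + y


def tokenize(path):
    """path: list of (command, [(x, y), ...]) tuples -> list of token ids."""
    tokens = []
    for cmd, points in path:
        tokens.append(CMD_TOKEN[cmd])
        tokens.extend(coord_token(x, y) for x, y in points)
    return tokens


def detokenize(tokens):
    """Inverse of tokenize: rebuild the (command, points) list."""
    path = []
    for t in tokens:
        if t < COORD_OFFSET:
            path.append((COMMANDS[t], []))
        else:
            x, y = divmod(t - COORD_OFFSET, CANVAS)
            path[-1][1].append((x, y))
    return path


if __name__ == "__main__":
    triangle = [("M", [(10, 20)]), ("L", [(30, 40)]), ("L", [(10, 40)]), ("Z", [])]
    ids = tokenize(triangle)
    print(ids)
    print(detokenize(ids) == triangle)
```

In this layout each point costs one token instead of a string of digit characters, which shortens sequences; whether the real model uses one token per point or per coordinate is not stated in the abstract.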


NeuralSVG: An Implicit Representation for Text-to-Vector Generation

NeuralSVG: An Implicit Representation for Text-to-Vector Generation[J]. arXiv preprint arXiv:2501.03992, 2025.

https://github.com/SagiPolaczek/NeuralSVG

ICCV 2025 | GitHub Stars: 1.4k

Abstract

Vector graphics are essential in design, providing artists with a versatile medium for creating resolution-independent and highly editable visual content. Recent advancements in vision-language and diffusion models have fueled interest in text-to-vector graphics generation. However, existing approaches often suffer from over-parameterized outputs or treat the layered structure, a core feature of vector graphics, as a secondary goal, diminishing their practical use. Recognizing the importance of layered SVG representations, we propose NeuralSVG, an implicit neural representation for generating vector graphics from text prompts. Inspired by Neural Radiance Fields (NeRFs), NeuralSVG encodes the entire scene into the weights of a small MLP network, optimized using Score Distillation Sampling (SDS). To encourage a layered structure in the generated SVG, we introduce a dropout-based regularization technique that strengthens the standalone meaning of each shape. We additionally demonstrate that utilizing a neural representation provides an added benefit of inference-time control, enabling users to dynamically adapt the generated SVG based on user-provided inputs, all with a single learned representation. Through extensive qualitative and quantitative evaluations, we demonstrate that NeuralSVG outperforms existing methods in generating structured and flexible SVG.
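The abstract does not spell out how the dropout-based regularization works. One plausible reading, sketched below purely as an assumption, is that each optimization step renders either the full ordered shape list or only a random prefix of it, so that the early shapes must already form a meaningful image on their own. The function name and the `p_full` parameter are hypothetical.

```python
import random


def truncate_shapes(shapes, rng, p_full=0.5):
    """Return the shapes to render at one optimization step.

    With probability p_full all shapes are kept; otherwise only a random
    non-empty prefix is kept and the trailing shapes are dropped.
    Assumes `shapes` is a non-empty, ordered list (back to front).
    """
    if rng.random() < p_full:
        return list(shapes)
    keep = rng.randint(1, len(shapes))  # inclusive on both ends
    return list(shapes[:keep])


if __name__ == "__main__":
    rng = random.Random(0)
    shapes = ["background", "body", "head", "eyes", "highlight"]
    for step in range(5):
        print(step, truncate_shapes(shapes, rng))
```

Under this reading, the loss at each step would be computed on the rendering of the returned subset, which pushes coarse, semantically important shapes toward the front of the list.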


Empowering LLMs to Understand and Generate Complex Vector Graphics

Xing X, Wang H, Zhou Z, et al. Empowering LLMs to Understand and Generate Complex Vector Graphics[J]. arXiv preprint arXiv:2412.11102, 2024.

https://github.com/ximinng/LLM4SVG

CVPR 2025 | GitHub Stars: 634 | LLM

Abstract

The unprecedented advancements in Large Language Models (LLMs) have profoundly impacted natural language processing but have yet to fully embrace the realm of scalable vector graphics (SVG) generation. While LLMs encode partial knowledge of SVG data from web pages during training, recent findings suggest that semantically ambiguous and tokenized representations within LLMs may result in hallucinations in vector primitive predictions. Additionally, LLM training typically lacks modeling and understanding of the rendering sequence of vector paths, which can lead to occlusion between output vector primitives. In this paper, we present LLM4SVG, an initial yet substantial step toward bridging this gap by enabling LLMs to better understand and generate vector graphics. LLM4SVG facilitates a deeper understanding of SVG components through learnable semantic tokens, which precisely encode these tokens and their corresponding properties to generate semantically aligned SVG outputs. Using a series of learnable semantic tokens, a structured dataset for instruction following is developed to support comprehension and generation across two primary tasks. Our method introduces a modular architecture to existing large language models, integrating semantic tags, vector instruction encoders, fine-tuned commands, and powerful LLMs to tightly combine geometric, appearance, and language information. To overcome the scarcity of SVG-text instruction data, we developed an automated data generation pipeline that collected our SVGX-SFT Dataset, consisting of high-quality human-designed SVGs and 580k SVG instruction following data specifically crafted for LLM training, which facilitated the adoption of the supervised fine-tuning strategy popular in LLM development.
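The occlusion problem mentioned above follows from how SVG is rendered: elements are painted in document order, so a later element covers an earlier one wherever they overlap. The toy sketch below uses axis-aligned rectangles to show how emitting the same primitives in the wrong order hides one of them; the helper name and the tuple layout are hypothetical.

```python
def topmost_at(shapes, x, y):
    """Return the name of the shape visible at point (x, y), or None.

    `shapes` is a list of (name, x0, y0, x1, y1) rectangles in document
    order. Later shapes are painted over earlier ones, so the last
    rectangle containing the point is the visible one.
    """
    visible = None
    for name, x0, y0, x1, y1 in shapes:
        if x0 <= x <= x1 and y0 <= y <= y1:
            visible = name
    return visible


if __name__ == "__main__":
    face = ("face", 0, 0, 100, 100)
    eye = ("eye", 30, 30, 40, 40)

    # Correct order: the eye is drawn after the face and stays visible.
    print(topmost_at([face, eye], 35, 35))
    # Wrong order: the face is drawn last and occludes the eye.
    print(topmost_at([eye, face], 35, 35))
```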


SVGDreamer++: Advancing Editability and Diversity in Text-Guided SVG Generation

SVGDreamer++: Advancing Editability and Diversity in Text-Guided SVG Generation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

https://github.com/ximinng/SVGDreamerV2

TPAMI 2025 | GitHub Stars: 449

Abstract

Recently, text-guided scalable vector graphics (SVG) synthesis has demonstrated significant potential in domains such as iconography and sketching. However, SVGs generated from existing Text-to-SVG methods often lack editability and exhibit deficiencies in visual quality and diversity. In this paper, we propose a novel text-guided vector graphics synthesis method to address these limitations. To enhance the editability of output SVGs, we introduce a Hierarchical Image VEctorization (HIVE) framework that operates at the semantic object level and supervises the optimization of components within the vector object. This approach facilitates the decoupling of vector graphics into distinct objects and component levels. Our proposed HIVE algorithm, informed by image segmentation priors, not only ensures a more precise representation of vector graphics but also enables fine-grained editing capabilities within vector objects. To improve the diversity of output SVGs, we present a Vectorized Particle-based Score Distillation (VPSD) approach. VPSD addresses over-saturation issues in existing methods and enhances sample diversity. A pre-trained reward model is incorporated to re-weight vector particles, improving aesthetic appeal and enabling faster convergence. Additionally, we design a novel adaptive vector primitives control strategy, which allows for the dynamic adjustment of the number of primitives, thereby enhancing the presentation of graphic details. Extensive experiments validate the effectiveness of the proposed method, demonstrating its superiority over baseline methods in terms of editability, visual quality, and diversity. We also show that our new method supports up to six distinct vector styles, capable of generating high-quality vector assets suitable for stylized vector design and poster design.


Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models

Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models[J]. arXiv preprint arXiv:2411.16602, 2024.

https://github.com/kingnobro/Chat2SVG

CVPR 2025 | GitHub Stars: 230 | Diffusion

Abstract

Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite its advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.


LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer[J]. arXiv preprint arXiv:2502.01105, 2025.

https://github.com/showlab/LayerTracer

ICCV 2025 | GitHub Stars: 94 | DiT

Abstract

Generating cognitive-aligned layered SVGs remains challenging due to existing methods' tendencies toward either oversimplified single-layer outputs or optimization-induced shape redundancies. We propose LayerTracer, a diffusion transformer based framework that bridges this gap by learning designers' layered SVG creation processes from a novel dataset of sequential design operations. Our approach operates in two phases: First, a text-conditioned DiT generates multi-phase rasterized construction blueprints that simulate human design workflows. Second, layer-wise vectorization with path deduplication produces clean, editable SVGs. For image vectorization, we introduce a conditional diffusion mechanism that encodes reference images into latent tokens, guiding hierarchical reconstruction while preserving structural integrity. Extensive experiments demonstrate LayerTracer's superior performance against optimization-based and neural baselines in both generation quality and editability, effectively aligning AI-generated vectors with professional design cognition.
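The second phase, layer-wise vectorization with path deduplication, can be pictured with a small sketch. It assumes that each construction stage has already been vectorized into a list of path `d` strings and that stages are cumulative, so paths from earlier stages reappear later. Matching paths by whitespace-normalized string equality is a simplification chosen here for illustration; the abstract does not say how LayerTracer compares paths, and the function name is hypothetical.

```python
def dedupe_layers(stages):
    """Split cumulative stages into layers holding only newly added paths.

    `stages` is a list of lists of SVG path data strings, one list per
    construction stage. A path already seen in an earlier stage is
    dropped from later ones.
    """
    seen = set()
    layers = []
    for paths in stages:
        layer = []
        for d in paths:
            key = " ".join(d.split())  # normalize whitespace before comparing
            if key not in seen:
                seen.add(key)
                layer.append(d)
        layers.append(layer)
    return layers


if __name__ == "__main__":
    stages = [
        ["M0 0 L10 0 L10 10 Z"],
        ["M0 0  L10 0 L10 10 Z", "M5 5 L6 6"],
        ["M0 0 L10 0 L10 10 Z", "M5 5 L6 6", "M1 1 L2 2"],
    ]
    for i, layer in enumerate(dedupe_layers(stages)):
        print(i, layer)
```

A real pipeline would likely need a geometric tolerance rather than exact string matching, since independently vectorized stages rarely reproduce coordinates exactly.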


LayerD: Decomposing Raster Graphic Designs into Layers

LayerD: Decomposing Raster Graphic Designs into Layers[J]. arXiv preprint arXiv:2509.25134, 2025.

https://github.com/CyberAgentAILab/LayerD

ICCV 2025 | GitHub Stars: 76

Abstract

Designers craft and edit graphic designs in a layer representation, but layer-based editing becomes impossible once the design is composited into a raster image. In this work, we propose LayerD, a method to decompose raster graphic designs into layers for a re-editable creative workflow. LayerD addresses the decomposition task by iteratively extracting unoccluded foreground layers. We propose a simple yet effective refinement approach taking advantage of the assumption that layers often exhibit uniform appearance in graphic designs. As decomposition is ill-posed and the ground-truth layer structure may not be reliable, we develop a quality metric that addresses the difficulty. In experiments, we show that LayerD successfully achieves high-quality decomposition and outperforms baselines. We also demonstrate the use of LayerD with state-of-the-art image generators and layer-based editing.
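The iterative extraction described above amounts to a peel-and-repeat loop, sketched here as control flow only. The two callables are hypothetical stand-ins: `extract_top` would be a model that finds the unoccluded foreground layer, and `inpaint` would fill in the region it leaves behind. Neither corresponds to a real LayerD API, and the `max_layers` guard is an added assumption.

```python
def decompose(image, extract_top, inpaint, max_layers=16):
    """Peel unoccluded foreground layers off `image` one at a time.

    extract_top(image) -> the top layer, or None when only background
                          remains (hypothetical callable).
    inpaint(image, layer) -> the image with that layer removed and the
                             hole filled (hypothetical callable).
    Returns the layers ordered back to front, background first.
    """
    layers = []
    for _ in range(max_layers):
        top = extract_top(image)
        if top is None:
            break
        layers.append(top)
        image = inpaint(image, top)
    layers.append(image)   # whatever remains is treated as the background
    return layers[::-1]    # reverse so the background comes first


if __name__ == "__main__":
    # Toy stand-in: the "image" is a list of labels, bottom to top.
    composite = ["background", "card", "headline"]
    result = decompose(
        composite,
        extract_top=lambda im: im[-1] if len(im) > 1 else None,
        inpaint=lambda im, top: im[:-1],
    )
    print(result)
```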


HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery

HoliTracer: Holistic Vectorization of Geographic Objects from Large-Size Remote Sensing Imagery[J]. arXiv preprint arXiv:2507.16251, 2025.

https://github.com/vvangfaye/HoliTracer

ICCV 2025 | GitHub Stars: 45

Abstract

With the increasing resolution of remote sensing imagery (RSI), large-size RSI has emerged as a vital data source for high-precision vector mapping of geographic objects. Existing methods are typically constrained to processing small image patches, which often leads to the loss of contextual information and produces fragmented vector outputs. To address these issues, this paper introduces HoliTracer, the first framework designed to holistically extract vectorized geographic objects from large-size RSI. In HoliTracer, we enhance segmentation of large-size RSI using the Context Attention Net (CAN), which employs a local-to-global attention mechanism to capture contextual dependencies. Furthermore, we achieve holistic vectorization through a robust pipeline that leverages the Mask Contour Reformer (MCR) to reconstruct polygons and the Polygon Sequence Tracer (PST) to trace vertices. Extensive experiments on large-size RSI datasets, including buildings, water bodies, and roads, demonstrate that HoliTracer outperforms state-of-the-art methods. Our code and data are available in the GitHub repository linked above.


Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility

Empowering Vector Graphics with Consistently Arbitrary Viewing and View-dependent Visibility[J]. arXiv preprint arXiv:2505.21377, 2025.

https://github.com/chenxinl/Dream3DVG

CVPR 2025 | GitHub Stars: 12

Abstract

This work presents a novel text-to-vector graphics generation approach, Dream3DVG, allowing for arbitrary viewpoint viewing, progressive detail optimization, and view-dependent occlusion awareness. Our approach is a dual-branch optimization framework, consisting of an auxiliary 3D Gaussian Splatting optimization branch and a 3D vector graphics optimization branch. The introduced 3DGS branch can bridge the domain gaps between text prompts and vector graphics with more consistent guidance. Moreover, 3DGS allows for progressive detail control by scheduling classifier-free guidance, so that the vector graphics are guided by coarse shapes in the initial stages and by finer details in later stages. We also improve the handling of view-dependent occlusions by devising a visibility-awareness rendering module. Extensive results on 3D sketches and 3D iconographies demonstrate the superiority of the method in terms of detail at different abstraction levels, cross-view consistency, and occlusion-aware stroke culling.
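Scheduling classifier-free guidance can be illustrated with a short sketch. Classifier-free guidance combines an unconditional and a text-conditioned noise prediction as `uncond + scale * (cond - uncond)`; a schedule simply makes `scale` depend on the optimization step. The linear schedule, its endpoint values, and the function names below are assumptions for illustration, since the abstract states neither the shape nor the direction of the schedule Dream3DVG uses.

```python
def guidance_scale(step, total_steps, start, end):
    """Linearly interpolate the guidance scale from `start` to `end`."""
    if total_steps < 2:
        return end
    t = step / (total_steps - 1)
    return start + t * (end - start)


def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance on two noise predictions.

    Both predictions are plain lists of floats here to keep the sketch
    dependency-free; in practice they would be tensors.
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    total = 5
    for step in range(total):
        # Endpoint values are placeholders, not values from the paper.
        w = guidance_scale(step, total, start=1.0, end=7.5)
        print(step, w, cfg_combine([0.0, 1.0], [1.0, 3.0], w))
```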
