Abstract
We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops—through increased transformer depth/width or increased number of input tokens—consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 256×256 benchmark, achieving a state-of-the-art FID of 2.27.
Introduction
Machine learning is experiencing a renaissance powered by transformers. Over the past five years, neural architectures for natural language processing [42, 8], vision and several other domains have been subsumed by transformers. Many classes of image-level generative models remain holdouts to the trend, though—while transformers see widespread use in autoregressive models, they have seen less adoption in other generative modeling frameworks. For example, diffusion models have been at the forefront of recent advances in image generation [9, 46]; yet, they all adopt a convolutional U-Net architecture as the de-facto choice of backbone.
The seminal work of Ho et al. first introduced the U-Net backbone for diffusion models. Having initially seen success within pixel-level autoregressive models and conditional GANs, the U-Net was inherited from PixelCNN++ with a few changes. The model is convolutional, consisting primarily of ResNet blocks. In contrast to the standard U-Net, additional spatial self-attention blocks, which are essential components in transformers, are interspersed at lower resolutions. Dhariwal and Nichol ablated several architecture choices for the U-Net, such as the use of adaptive normalization layers to inject conditional information and channel counts for convolutional layers. However, the high-level design of the U-Net from Ho et al. has largely remained intact.
With this work, we aim to demystify the significance of architectural choices in diffusion models and offer empirical baselines for future generative modeling research. We show that the U-Net inductive bias is not crucial to the performance of diffusion models, and U-Nets can be readily replaced with standard designs such as transformers. As a result, diffusion models are well-poised to benefit from the recent trend of architecture unification—e.g., by inheriting best practices and training recipes from other domains, as well as retaining favorable properties like scalability, robustness and efficiency. A standardized architecture would also open up new possibilities for cross-domain research.
In this paper, we focus on a new class of diffusion models based on transformers. We call them Diffusion Transformers, or DiTs for short. DiTs adhere to the best practices of Vision Transformers (ViTs), which have been shown to scale more effectively for visual recognition than traditional convolutional networks (e.g., ResNet).
More specifically, we study the scaling behavior of transformers with respect to network complexity vs. sample quality. We show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) framework, where diffusion models are trained within a VAE’s latent space, we can successfully replace the U-Net backbone with a transformer. We further show that DiTs are scalable architectures for diffusion models: there is a strong correlation between network complexity (measured by Gflops) and sample quality (measured by FID). By simply scaling-up DiT and training an LDM with a high-capacity backbone (118.6 Gflops), we are able to achieve a state-of-the-art result of 2.27 FID on the class-conditional 256 × 256 ImageNet generation benchmark.
Related Work
Transformers. Transformers have replaced domain-specific architectures across language, vision, reinforcement learning and meta-learning. They have shown remarkable scaling properties under increasing model size, training compute and data in the language domain, as generic autoregressive models and as ViTs. Beyond language, transformers have been trained to autoregressively predict pixels. They have also been trained on discrete codebooks as both autoregressive models and masked generative models; the former has shown excellent scaling behavior up to 20B parameters. Finally, transformers have been explored in DDPMs to synthesize non-spatial data; e.g., to generate CLIP image embeddings in DALL·E 2. In this paper, we study the scaling properties of transformers when used as the backbone of diffusion models of images.
Denoising diffusion probabilistic models (DDPMs). Diffusion and score-based generative models have been particularly successful as generative models of images, in many cases outperforming generative adversarial networks (GANs) which had previously been state-of-the-art. Improvements to DDPMs over the past two years have largely been driven by improved sampling techniques, most notably classifier-free guidance, reformulating diffusion models to predict noise instead of pixels and using cascaded pipelines where low-resolution base diffusion models are trained in parallel with upsamplers. For all the diffusion models listed above, convolutional U-Nets are the de-facto choice of backbone architecture. Concurrent work introduced a novel, efficient architecture based on attention for DDPMs; we explore pure transformers.
Architecture complexity. When evaluating architecture complexity in the image generation literature, it is common practice to use parameter counts. In general, parameter counts are poor proxies for the complexity of image models since they do not account for, e.g., image resolution which significantly impacts performance [44, 45]. Instead, much of the analysis in this paper is through the lens of compute. This brings us in-line with the architecture design literature where flops are widely-used to gauge complexity. In practice, the golden metric will depend on particular application scenarios. Nichol and Dhariwal's seminal work improving diffusion models [36, 9] is most related to us—there, they analyzed the scalability properties of the U-Net architecture class. In this paper, we focus on the transformer class.
Diffusion Transformers
3.1. Preliminaries
Diffusion formulation. Before introducing our architecture, we briefly review some basic concepts needed to understand diffusion models (DDPMs). Gaussian diffusion models assume a forward noising process which gradually applies noise to real data $x_0$: $q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I})$, where the $\bar{\alpha}_t$ are hyperparameters. By applying the reparameterization trick, we can sample $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$. Diffusion models are then trained to learn a reverse process that inverts the forward corruptions.
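As a concrete sketch of the forward noising process, the snippet below samples $x_t$ via the reparameterization trick. The linear variance schedule (β from 1e-4 to 0.02 over 1000 steps) is a common DDPM default used here purely for illustration; it is not prescribed by this paper.

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention coefficients: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 32, 4))   # a latent-sized "real" sample
# alpha_bar decreases monotonically, so x_t carries less signal as t grows
xt, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

A noise-prediction network is then trained to recover eps from (xt, t), which is the learned reverse process.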
Classifier-free guidance. Conditional diffusion models take extra information as input, such as a class label $c$, and the reverse process becomes conditioned on $c$. Classifier-free guidance evaluates the noise predictor both with and without the conditioning signal and extrapolates toward the conditional direction: $\hat{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))$, where $s > 1$ indicates the scale of the guidance (and $s = 1$ recovers the standard conditional model).
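The guidance extrapolation itself is a one-liner. In this sketch, eps_cond and eps_uncond are stand-ins for the model's predictions with and without the class label; only the arithmetic is shown:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, scale):
    """Classifier-free guidance: eps_hat = eps_uncond + scale * (eps_cond - eps_uncond).
    scale=1 recovers the conditional prediction; scale>1 strengthens conditioning
    (typically improving fidelity at some cost to diversity)."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 2.0])   # stand-in conditional prediction
eps_u = np.array([0.0, 0.0])   # stand-in unconditional prediction
print(cfg_noise(eps_c, eps_u, scale=1.0))   # [1. 2.]  -> just the conditional model
print(cfg_noise(eps_c, eps_u, scale=4.0))   # [4. 8.]  -> pushed past the conditional model
```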
Latent diffusion models. Training diffusion models directly in high-resolution pixel space can be computationally prohibitive. Latent diffusion models (LDMs) tackle this issue with a two-stage approach: (1) learn an autoencoder that compresses images into smaller spatial representations with a learned encoder E; (2) train a diffusion model of representations z = E(x) instead of a diffusion model of images x. New images can then be generated by sampling a representation z from the diffusion model and decoding it to an image with the learned decoder, x = D(z).
3.2. Diffusion Transformer Design Space
We introduce Diffusion Transformers (DiTs), a new architecture for diffusion models. We aim to be as faithful to the standard transformer architecture as possible to retain its scaling properties. Since our focus is training DDPMs of images (specifically, spatial representations of images), DiT is based on the Vision Transformer (ViT) architecture which operates on sequences of patches. DiT retains many of the best practices of ViTs. Figure 3 shows an overview of the complete DiT architecture. In this section, we describe the forward pass of DiT, as well as the components of the design space of the DiT class.
Figure 3: The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.
Patchify. The input to DiT is a spatial representation z (for 256×256×3 images, z has shape 32×32×4). The first layer of DiT is "patchify," which converts the spatial input into a sequence of T tokens, each of dimension d, by linearly embedding each p × p patch. Halving p quadruples T, and thus at least quadruples total transformer Gflops, although it has no meaningful impact on parameter counts. Following patchify, we apply standard ViT frequency-based positional embeddings to all input tokens.
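A minimal numpy sketch of the patchify rearrangement (the learned linear embedding to width d is omitted), showing how halving the patch size p quadruples the token count T for the 32×32×4 latent:

```python
import numpy as np

def patchify(z, p):
    """Split an (H, W, C) spatial latent into T = (H//p)*(W//p) flattened p*p*C patches."""
    H, W, C = z.shape
    assert H % p == 0 and W % p == 0
    # group pixels into non-overlapping p x p blocks, then flatten each block
    blocks = z.reshape(H // p, p, W // p, p, C)
    return blocks.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

z = np.zeros((32, 32, 4))   # latent for a 256x256x3 image under an 8x autoencoder
for p in (8, 4, 2):
    print(p, patchify(z, p).shape)   # 8 (16, 256) / 4 (64, 64) / 2 (256, 16)
```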
DiT block design. Following patchify, the input tokens are processed by a sequence of transformer blocks. In addition to noised image inputs, diffusion models sometimes process additional conditional information such as noise timesteps t, class labels c, natural language, etc. We explore four variants of transformer blocks that process conditional inputs differently. The designs introduce small, but important, modifications to the standard ViT block design.
- In-context conditioning. We simply append the vector embeddings of t and c as two additional tokens in the input sequence, treating them no differently from the image tokens. This is similar to cls tokens in ViTs, and it allows us to use standard ViT blocks without modification. After the final block, we remove the conditioning tokens from the sequence. This approach introduces negligible new Gflops to the model.
- Cross-attention block. We concatenate the embeddings of t and c into a length-two sequence, separate from the image token sequence. The transformer block is modified to include an additional multi-head cross-attention layer following the multi-head self-attention block, similar to the original design from Vaswani et al., and also similar to the one used by LDM for conditioning on class labels. Cross-attention adds the most Gflops to the model, roughly a 15% overhead.
- Adaptive layer norm (adaLN) block. Following the widespread usage of adaptive normalization layers in GANs and diffusion models with U-Net backbones, we explore replacing standard layer norm layers in transformer blocks with adaptive layer norm (adaLN). Rather than directly learning dimension-wise scale and shift parameters γ and β, we regress them from the sum of the embedding vectors of t and c. Of the three block designs we explore, adaLN adds the least Gflops and is thus the most compute-efficient. It is also the only conditioning mechanism that is restricted to apply the same function to all tokens.
- adaLN-Zero block. Prior work on ResNets has found that initializing each residual block as the identity function is beneficial. For example, Goyal et al. found that zero-initializing the final batch norm scale factor γ in each block accelerates large-scale training in the supervised learning setting. Diffusion U-Net models use a similar initialization strategy, zero-initializing the final convolutional layer in each block prior to any residual connections. We explore a modification of the adaLN DiT block which does the same. In addition to regressing γ and β, we also regress dimension-wise scaling parameters α that are applied immediately prior to any residual connections within the DiT block. We initialize the MLP to output the zero-vector for all α; this initializes the full DiT block as the identity function. As with the vanilla adaLN block, adaLN-Zero adds negligible Gflops to the model. We include the in-context, cross-attention, adaptive layer norm and adaLN-Zero blocks in the DiT design space.
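To make the identity-at-initialization property of adaLN-Zero concrete, here is a toy numpy sketch. The attention and MLP sublayers are stand-in linear maps (not the real architecture), and the conditioning "MLP" is a single zero-initialized linear layer standing in for the real nonlinear one; what matters is that the zero-initialized gates α make the whole block the identity at initialization:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Per-token layer norm over the feature dimension (no learned affine here)."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

class AdaLNZeroBlock:
    """Toy DiT block with adaLN-Zero conditioning. The conditioning layer regresses
    shift (beta), scale (gamma) and gate (alpha) for each of the two residual
    branches; its output weights are zero-initialized, so all gates start at 0 and
    the block computes the identity function at initialization."""
    def __init__(self, d, rng):
        # stand-in token-mixing / MLP weights (a real block uses self-attention)
        self.w_attn = rng.standard_normal((d, d)) / np.sqrt(d)
        self.w_mlp = rng.standard_normal((d, d)) / np.sqrt(d)
        # d -> 6d: (beta1, gamma1, alpha1, beta2, gamma2, alpha2), zero-initialized
        self.w_cond = np.zeros((d, 6 * d))
        self.b_cond = np.zeros(6 * d)

    def __call__(self, x, c):
        # x: (T, d) tokens; c: (d,) sum of timestep and class embeddings
        b1, g1, a1, b2, g2, a2 = np.split(np.maximum(c, 0) @ self.w_cond + self.b_cond, 6)
        h = layer_norm(x) * (1 + g1) + b1       # adaptive modulation
        x = x + a1 * (h @ self.w_attn)          # gated residual branch 1
        h = layer_norm(x) * (1 + g2) + b2
        x = x + a2 * (h @ self.w_mlp)           # gated residual branch 2
        return x

rng = np.random.default_rng(0)
block = AdaLNZeroBlock(d=16, rng=rng)
x = rng.standard_normal((8, 16))
c = rng.standard_normal(16)
print(np.allclose(block(x, c), x))   # True: zero gates -> identity at init
```

Once training updates w_cond away from zero, the gates open and the residual branches start contributing.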
Model size. We apply a sequence of N DiT blocks, each operating at the hidden dimension size d. Following ViT, we use standard transformer configs that jointly scale N, d and the number of attention heads. Specifically, we use four configs: DiT-S, DiT-B, DiT-L and DiT-XL, covering a wide range of model sizes and flop allocations.
Transformer decoder. After the final DiT block, we need to decode our sequence of image tokens into an output noise prediction and an output diagonal covariance prediction. Both of these outputs have shape equal to the original spatial input. We use a standard linear decoder to do this; we apply the final layer norm (adaptive if using adaLN) and linearly decode each token into a p × p × 2C tensor, where C is the number of channels in the spatial input to DiT. Finally, we rearrange the decoded tokens into their original spatial layout to obtain the predicted noise and covariance.
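A minimal sketch of the rearrangement step (the learned linear decoder itself is replaced by a zero stand-in), using the 32×32×4 latent shapes from the patchify discussion:

```python
import numpy as np

def unpatchify(tokens, H, W, p, C_out):
    """Inverse of patchify: rearrange (T, p*p*C_out) decoded tokens into (H, W, C_out)."""
    hp, wp = H // p, W // p
    x = tokens.reshape(hp, wp, p, p, C_out)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C_out)

C, p, H, W = 4, 2, 32, 32                   # latent channels, patch size, latent H/W
T = (H // p) * (W // p)
decoded = np.zeros((T, p * p * 2 * C))      # stand-in output of the linear decoder
out = unpatchify(decoded, H, W, p, C_out=2 * C)
# with C_out = 2C, the result splits into noise and covariance predictions,
# each shaped like the original spatial input
eps_pred, sigma_pred = out[..., :C], out[..., C:]
print(eps_pred.shape, sigma_pred.shape)     # (32, 32, 4) (32, 32, 4)
```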
Conclusion
We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL·E 2 and Stable Diffusion.