Abstract

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops—through increased transformer depth/width or increased number of input tokens—consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512×512 and 256×256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

Introduction

Machine learning is experiencing a renaissance powered by transformers. Over the past five years, neural architectures for natural language processing [42, 8], vision and several other domains have been subsumed by transformers. Many classes of image-level generative models remain holdouts to the trend, though—while transformers see widespread use in autoregressive models, they have seen less adoption in other generative modeling frameworks. For example, diffusion models have been at the forefront of recent advances in image generation [9, 46]; yet, they all adopt a convolutional U-Net architecture as the de-facto choice of backbone.

The seminal work of Ho et al. first introduced the U-Net backbone for diffusion models. Having initially seen success within pixel-level autoregressive models and conditional GANs, the U-Net was inherited from PixelCNN++ with a few changes. The model is convolutional, comprised primarily of ResNet blocks. In contrast to the standard U-Net, additional spatial self-attention blocks, which are essential components in transformers, are interspersed at lower resolutions. Dhariwal and Nichol ablated several architecture choices for the U-Net, such as the use of adaptive normalization layers to inject conditional information and the channel counts of convolutional layers. However, the high-level design of the U-Net from Ho et al. has largely remained intact.

With this work, we aim to demystify the significance of architectural choices in diffusion models and offer empirical baselines for future generative modeling research. We show that the U-Net inductive bias is not crucial to the performance of diffusion models, and that it can be readily replaced with standard designs such as transformers. As a result, diffusion models are well-poised to benefit from the recent trend of architecture unification—e.g., by inheriting best practices and training recipes from other domains, as well as retaining favorable properties like scalability, robustness and efficiency. A standardized architecture would also open up new possibilities for cross-domain research.

In this paper, we focus on a new class of diffusion models based on transformers. We call them Diffusion Transformers, or DiTs for short. DiTs adhere to the best practices of Vision Transformers (ViTs), which have been shown to scale more effectively for visual recognition than traditional convolutional networks (e.g., ResNet).

More specifically, we study the scaling behavior of transformers with respect to network complexity vs. sample quality. We show that by constructing and benchmarking the DiT design space under the Latent Diffusion Models (LDMs) framework, where diffusion models are trained within a VAE's latent space, we can successfully replace the U-Net backbone with a transformer. We further show that DiTs are scalable architectures for diffusion models: there is a strong correlation between network complexity (measured by Gflops) and sample quality (measured by FID). By simply scaling-up DiT and training an LDM with a high-capacity backbone (118.6 Gflops), we are able to achieve a state-of-the-art result of 2.27 FID on the class-conditional 256 × 256 ImageNet generation benchmark.

Transformers. Transformers have replaced domain-specific architectures across language, vision, reinforcement learning and meta-learning. They have shown remarkable scaling properties under increasing model size, training compute and data in the language domain, as generic autoregressive models and as ViTs. Beyond language, transformers have been trained to autoregressively predict pixels. They have also been trained on discrete codebooks as both autoregressive models and masked generative models; the former has shown excellent scaling behavior up to 20B parameters. Finally, transformers have been explored in DDPMs to synthesize non-spatial data; e.g., to generate CLIP image embeddings in DALL·E 2. In this paper, we study the scaling properties of transformers when used as the backbone of diffusion models of images.

Denoising diffusion probabilistic models (DDPMs). Diffusion and score-based generative models have been particularly successful as generative models of images, in many cases outperforming generative adversarial networks (GANs) which had previously been state-of-the-art. Improvements to DDPMs over the past two years have largely been driven by improved sampling techniques, most notably classifier-free guidance, reformulating diffusion models to predict noise instead of pixels and using cascaded pipelines where low-resolution base diffusion models are trained in parallel with upsamplers. For all the diffusion models listed above, convolutional U-Nets are the de-facto choice of backbone architecture. Concurrent work introduced a novel, efficient architecture based on attention for DDPMs; we explore pure transformers.

Architecture complexity. When evaluating architecture complexity in the image generation literature, it is common practice to use parameter counts. In general, parameter counts are poor proxies for the complexity of image models since they do not account for, e.g., image resolution which significantly impacts performance [44, 45]. Instead, much of the analysis in this paper is through the lens of compute. This brings us in-line with the architecture design literature where flops are widely-used to gauge complexity. In practice, the golden metric will depend on particular application scenarios. Nichol and Dhariwal's seminal work improving diffusion models [36, 9] is most related to us—there, they analyzed the scalability properties of the U-Net architecture class. In this paper, we focus on the transformer class.

Diffusion Transformers

3.1. Preliminaries

Diffusion formulation. Before introducing our architecture, we briefly review some basic concepts needed to understand diffusion models (DDPMs). Gaussian diffusion models assume a forward noising process which gradually applies noise to real data $x_0$: $q(x_t|x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t)\mathbf{I})$, where the constants $\bar\alpha_t$ are hyperparameters. By applying the reparameterization trick, we can sample $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon_t$, where $\epsilon_t \sim \mathcal{N}(0, \mathbf{I})$. Diffusion models are trained to learn the reverse process that inverts forward process corruptions: $p_\theta(x_{t-1}|x_t) = \mathcal{N}(\mu_\theta(x_t), \Sigma_\theta(x_t))$, where neural networks are used to predict the statistics of $p_\theta$. The reverse process model is trained with the variational lower bound of the log-likelihood of $x_0$, which reduces to $\mathcal{L}(\theta) = -\log p_\theta(x_0|x_1) + \sum_t \mathcal{D}_{KL}(q(x_{t-1}|x_t, x_0)\,\|\,p_\theta(x_{t-1}|x_t))$, excluding an additional term irrelevant for training. By reparameterizing $\mu_\theta$ as a noise prediction network $\epsilon_\theta$, the model can be trained using simple mean-squared error between the predicted noise $\epsilon_\theta(x_t)$ and the ground-truth sampled Gaussian noise $\epsilon_t$: $\mathcal{L}_{\text{simple}}(\theta) = \|\epsilon_\theta(x_t) - \epsilon_t\|_2^2$. But, in order to train diffusion models with a learned reverse process covariance $\Sigma_\theta$, the full $\mathcal{D}_{KL}$ term needs to be optimized. We follow Nichol and Dhariwal's approach: train $\epsilon_\theta$ with $\mathcal{L}_{\text{simple}}$, and train $\Sigma_\theta$ with the full $\mathcal{L}$. Once $p_\theta$ is trained, new images can be sampled by initializing $x_{t_{\max}} \sim \mathcal{N}(0, \mathbf{I})$ and sampling $x_{t-1} \sim p_\theta(x_{t-1}|x_t)$ via the reparameterization trick.
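
The forward process and $\mathcal{L}_{\text{simple}}$ above can be sketched numerically; the linear $\beta$ schedule and array shapes below are illustrative choices, not this paper's training hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear noise schedule over T steps.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)          # cumulative product: alpha_bar_t

def q_sample(x0, t, eps):
    """Sample x_t ~ q(x_t | x_0) via the reparameterization trick."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def l_simple(eps_pred, eps):
    """L_simple: mean-squared error between predicted and true noise."""
    return np.mean((eps_pred - eps) ** 2)

x0 = rng.standard_normal((4, 4))             # stand-in for a data sample
eps = rng.standard_normal(x0.shape)
xt = q_sample(x0, t=500, eps=eps)

# As t grows, alpha_bar_t -> 0 and x_t approaches pure Gaussian noise.
assert alpha_bar[-1] < 1e-3
# A perfect noise predictor drives L_simple to zero.
assert l_simple(eps, eps) == 0.0
```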

Classifier-free guidance. Conditional diffusion models take extra information as input, such as a class label $c$. In this case, the reverse process becomes $p_\theta(x_{t-1}|x_t, c)$, where $\epsilon_\theta$ and $\Sigma_\theta$ are conditioned on $c$. In this setting, classifier-free guidance can be used to encourage the sampling procedure to find $x$ such that $\log p(c|x)$ is high. By Bayes' rule, $\log p(c|x) \propto \log p(x|c) - \log p(x)$, and hence $\nabla_x \log p(c|x) \propto \nabla_x \log p(x|c) - \nabla_x \log p(x)$. By interpreting the output of diffusion models as the score function, the DDPM sampling procedure can be guided to sample $x$ with high $p(x|c)$ by: $\hat\epsilon_\theta(x_t, c) = \epsilon_\theta(x_t, \emptyset) + s \cdot \nabla_x \log p(x|c) \propto \epsilon_\theta(x_t, \emptyset) + s \cdot (\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \emptyset))$, where $s > 1$ indicates the scale of the guidance (note that $s = 1$ recovers standard sampling). Evaluating the diffusion model with $c = \emptyset$ is done by randomly dropping out $c$ during training and replacing it with a learned "null" embedding $\emptyset$. Classifier-free guidance is widely known to yield significantly improved samples over generic sampling techniques, and the trend holds for our DiT models.
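
The guided estimate is just a linear combination of the conditional and unconditional noise predictions; a minimal sketch, with stub arrays standing in for a trained network's outputs:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, s):
    """Classifier-free guidance: eps_uncond + s * (eps_cond - eps_uncond)."""
    return eps_uncond + s * (eps_cond - eps_uncond)

# Stub predictions: eps_theta(x_t, c) and eps_theta(x_t, null).
eps_cond = np.ones(4)
eps_uncond = np.zeros(4)

# s = 1 recovers standard conditional sampling.
assert np.allclose(guided_eps(eps_cond, eps_uncond, s=1.0), eps_cond)
# s > 1 extrapolates past the conditional prediction, away from the
# unconditional one, pushing samples toward high p(x|c).
assert np.all(guided_eps(eps_cond, eps_uncond, s=4.0) > eps_cond)
```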

Latent diffusion models. Training diffusion models directly in high-resolution pixel space can be computationally prohibitive. Latent diffusion models (LDMs) tackle this issue with a two-stage approach: (1) learn an autoencoder that compresses images into smaller spatial representations with a learned encoder $E$; (2) train a diffusion model of representations $z = E(x)$ instead of a diffusion model of images $x$ ($E$ is frozen). New images can then be generated by sampling a representation $z$ from the diffusion model and subsequently decoding it to an image with the learned decoder $x = D(z)$. As shown in Figure 2, LDMs achieve good performance while using a fraction of the Gflops of pixel-space diffusion models like ADM. Since we are concerned with compute efficiency, this makes them an appealing starting point for architecture exploration. In this paper, we apply DiTs to latent space, although they could also be applied to pixel space without modification. This makes our image generation pipeline a hybrid-based approach; we use off-the-shelf convolutional VAEs and transformer-based DDPMs.
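
The two-stage structure can be sketched with stubs; `E` and `D` below are placeholders for the pretrained VAE encoder/decoder, with shapes matching the 256×256×3 image to 32×32×4 latent mapping used in this paper:

```python
import numpy as np

def E(x):
    """Stub encoder: image -> latent (frozen during diffusion training)."""
    return np.zeros((32, 32, 4))

def D(z):
    """Stub decoder: latent -> image."""
    return np.zeros((256, 256, 3))

x = np.zeros((256, 256, 3))
z = E(x)    # Stage 2 trains the diffusion model on z, never on x directly.

# The latent has 48x fewer elements than the pixel grid, which is where
# the Gflop savings over pixel-space models like ADM come from.
assert x.size == 48 * z.size
assert D(z).shape == x.shape
```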

3.2. Diffusion Transformer Design Space

We introduce Diffusion Transformers (DiTs), a new architecture for diffusion models. We aim to be as faithful to the standard transformer architecture as possible to retain its scaling properties. Since our focus is training DDPMs of images (specifically, spatial representations of images), DiT is based on the Vision Transformer (ViT) architecture which operates on sequences of patches. DiT retains many of the best practices of ViTs. Figure 3 shows an overview of the complete DiT architecture. In this section, we describe the forward pass of DiT, as well as the components of the design space of the DiT class.

Figure 3: The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of our DiT blocks. We experiment with variants of standard transformer blocks that incorporate conditioning via adaptive layer norm, cross-attention and extra input tokens. Adaptive layer norm works best.

Patchify. The input to DiT is a spatial representation z (for 256×256×3 images, z has shape 32×32×4). The first layer of DiT is "patchify," which converts the spatial input into a sequence of T tokens, each of dimension d, by linearly embedding each patch in the input. Following patchify, we apply standard ViT frequency-based positional embeddings (the sine-cosine version) to all input tokens. The number of tokens T created by patchify is determined by the patch size hyperparameter p. As shown in Figure 4, halving p will quadruple T, and thus at least quadruple total transformer Gflops. Although it has a significant impact on Gflops, note that changing p has no meaningful impact on downstream parameter counts. We add p=2,4,8 to the DiT design space.
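
Patchify amounts to a strided reshape followed by a shared linear embedding; a sketch with a random matrix standing in for the learned projection (the hidden size `d = 64` is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(z, p, W):
    """Split z (H x W x C) into p x p patches and linearly embed each to dim d."""
    H, Wd, C = z.shape
    T = (H // p) * (Wd // p)
    patches = (z.reshape(H // p, p, Wd // p, p, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(T, p * p * C))
    return patches @ W                      # (T, d) token sequence

z = rng.standard_normal((32, 32, 4))        # latent for a 256x256x3 image
d = 64                                      # illustrative hidden size

for p in (8, 4, 2):
    W = rng.standard_normal((p * p * 4, d))
    tokens = patchify(z, p, W)
    print(p, tokens.shape)                  # halving p quadruples T

# T = (32/p)^2: p=8 -> 16 tokens, p=4 -> 64, p=2 -> 256.
assert patchify(z, 2, rng.standard_normal((16, d))).shape == (256, d)
```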

DiT block design. Following patchify, the input tokens are processed by a sequence of transformer blocks. In addition to noised image inputs, diffusion models sometimes process additional conditional information such as noise timesteps t, class labels c, natural language, etc. We explore four variants of transformer blocks that process conditional inputs differently. The designs introduce small, but important, modifications to the standard ViT block design. The designs of all blocks are shown in Figure 3.

  • In-context conditioning. We simply append the vector embeddings of t and c as two additional tokens in the input sequence, treating them no differently from the image tokens. This is similar to cls tokens in ViTs, and it allows us to use standard ViT blocks without modification. After the final block, we remove the conditioning tokens from the sequence. This approach introduces negligible new Gflops to the model.
  • Cross-attention block. We concatenate the embeddings of t and c into a length-two sequence, separate from the image token sequence. The transformer block is modified to include an additional multi-head cross-attention layer following the multi-head self-attention block, similar to the original design from Vaswani et al., and also similar to the one used by LDM for conditioning on class labels. Cross-attention adds the most Gflops to the model, roughly a 15% overhead.
  • Adaptive layer norm (adaLN) block. Following the widespread usage of adaptive normalization layers in GANs and diffusion models with U-Net backbones, we explore replacing standard layer norm layers in transformer blocks with adaptive layer norm (adaLN). Rather than directly learn dimension-wise scale and shift parameters γ and β, we regress them from the sum of the embedding vectors of t and c. Of the three block designs we explore, adaLN adds the least Gflops and is thus the most compute-efficient. It is also the only conditioning mechanism that is restricted to apply the same function to all tokens.
  • adaLN-Zero block. Prior work on ResNets has found that initializing each residual block as the identity function is beneficial. For example, Goyal et al. found that zero-initializing the final batch norm scale factor γ in each block accelerates large-scale training in the supervised learning setting. Diffusion U-Net models use a similar initialization strategy, zero-initializing the final convolutional layer in each block prior to any residual connections. We explore a modification of the adaLN DiT block which does the same. In addition to regressing γ and β, we also regress dimension-wise scaling parameters α that are applied immediately prior to any residual connections within the DiT block. We initialize the MLP to output the zero-vector for all α; this initializes the full DiT block as the identity function. As with the vanilla adaLN block, adaLN-Zero adds negligible Gflops to the model. We include the in-context, cross-attention, adaptive layer norm and adaLN-Zero blocks in the DiT design space.
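
A minimal numpy sketch of the adaLN-Zero mechanism around one residual branch. A single zero-initialized linear layer stands in for the regression MLP, and the `(1 + gamma)` scaling is a common implementation convention assumed here; what matters is that `alpha = 0` at initialization makes the block the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # illustrative hidden size

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# adaLN-Zero: regress gamma, beta and the residual gate alpha from the
# conditioning embedding. Zero-initializing the regression layer gives
# alpha = 0, so the gated residual branch starts as the identity.
W_mod = np.zeros((d, 3 * d))
b_mod = np.zeros(3 * d)

def adaln_zero_branch(x, cond, branch):
    gamma, beta, alpha = np.split(cond @ W_mod + b_mod, 3)
    h = branch((1 + gamma) * layer_norm(x) + beta)   # modulated sub-layer input
    return x + alpha * h                             # gated residual connection

x = rng.standard_normal((4, d))              # 4 tokens
cond = rng.standard_normal(d)                # sum of the t and c embeddings
sublayer = lambda h: np.tanh(h)              # stand-in for attention/MLP

# At initialization alpha == 0, so the output equals the input exactly.
assert np.allclose(adaln_zero_branch(x, cond, sublayer), x)
```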

Model size. We apply a sequence of N DiT blocks, each operating at the hidden dimension size d. Following ViT, we use standard transformer configs that jointly scale N, d and attention heads. Specifically, we use four configs: DiT-S, DiT-B, DiT-L and DiT-XL. They cover a wide range of model sizes and flop allocations, from 0.3 to 118.6 Gflops, allowing us to gauge scaling performance. Table 1 gives details of the configs. We add B, S, L and XL configs to the DiT design space.

Transformer decoder. After the final DiT block, we need to decode our sequence of image tokens into an output noise prediction and an output diagonal covariance prediction. Both of these outputs have shape equal to the original spatial input. We use a standard linear decoder to do this; we apply the final layer norm (adaptive if using adaLN) and linearly decode each token into a p×p×2C tensor, where C is the number of channels in the spatial input to DiT. Finally, we rearrange the decoded tokens into their original spatial layout to get the predicted noise and covariance. The complete DiT design space we explore is patch size, transformer block architecture and model size.
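
The decoder inverts patchify; a sketch with a random matrix in place of the learned linear layer, assuming the 32×32×4 latent and p = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
p, C, d = 2, 4, 64                          # patch size, latent channels, hidden size
H = Wd = 32                                 # latent spatial size
T = (H // p) * (Wd // p)                    # 256 tokens

def decode(tokens, W_dec):
    """Linearly decode each token to p*p*2C values, then unpatchify."""
    out = tokens @ W_dec                    # (T, p*p*2C)
    out = (out.reshape(H // p, Wd // p, p, p, 2 * C)
              .transpose(0, 2, 1, 3, 4)
              .reshape(H, Wd, 2 * C))
    # First C channels: predicted noise; last C: diagonal covariance params.
    return out[..., :C], out[..., C:]

tokens = rng.standard_normal((T, d))
W_dec = rng.standard_normal((d, p * p * 2 * C))
eps_pred, sigma_pred = decode(tokens, W_dec)

# Both outputs match the original spatial input shape, as required.
assert eps_pred.shape == (32, 32, 4) and sigma_pred.shape == (32, 32, 4)
```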

Conclusion

We introduce Diffusion Transformers (DiTs), a simple transformer-based backbone for diffusion models that outperforms prior U-Net models and inherits the excellent scaling properties of the transformer model class. Given the promising scaling results in this paper, future work should continue to scale DiTs to larger models and token counts. DiT could also be explored as a drop-in backbone for text-to-image models like DALL·E 2 and Stable Diffusion.
