High-Resolution Image Synthesis with Latent Diffusion Models

Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022.
https://github.com/CompVis/latent-diffusion

Diffusion | Latent Diffusion | Image Synthesis | 35360+ | CVPR 2022 | CCF-A | LMU | Heidelberg | Runway ML

Abstract

By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve new state-of-the-art scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

1. Introduction

Figure 1: Boosting the upper bound on achievable quality with less aggressive downsampling. Since diffusion models offer excellent inductive biases for spatial data, we do not need the heavy spatial downsampling of related generative models in latent space, but can still greatly reduce the dimensionality of the data via suitable autoencoding models. Images are from the DIV2K validation set, evaluated at $512^{2}$ px. We denote the spatial downsampling factor by $f$. Reconstruction FIDs and PSNR are calculated on ImageNet-val; see also the appendix table. (Panels: first-stage input example; first-stage reconstruction by our model.)

Image synthesis is one of the computer vision fields with the most spectacular recent development, but also among those with the greatest computational demands. In particular, high-resolution synthesis of complex, natural scenes is presently dominated by scaling up likelihood-based models, potentially containing billions of parameters in autoregressive (AR) transformers. In contrast, the promising results of GANs have been revealed to be mostly confined to data with comparably limited variability, as their adversarial learning procedure does not easily scale to modeling complex, multi-modal distributions. Recently, diffusion models, which are built from a hierarchy of denoising autoencoders, have been shown to achieve impressive results in image synthesis and beyond, and define the state of the art in class-conditional image synthesis and super-resolution. Moreover, even unconditional DMs can readily be applied to tasks such as inpainting, colorization or stroke-based synthesis, in contrast to other types of generative models. Being likelihood-based models, they do not exhibit the mode collapse and training instabilities of GANs and, by heavily exploiting parameter sharing, they can model highly complex distributions of natural images without involving billions of parameters as in AR models.

Democratizing High-Resolution Image Synthesis. DMs belong to the class of likelihood-based models, whose mode-covering behavior makes them prone to spend excessive amounts of capacity (and thus compute resources) on modeling imperceptible details of the data. Although the reweighted variational objective aims to address this by undersampling the initial denoising steps, DMs are still computationally demanding, since training and evaluating such a model requires repeated function evaluations (and gradient computations) in the high-dimensional space of RGB images. As an example, training the most powerful DMs often takes hundreds of GPU days (e.g. 150-1000 V100 days), and repeated evaluations on a noisy version of the input space also make inference expensive, so that producing 50k samples takes approximately 5 days on a single A100 GPU. This has two consequences for the research community and users in general. Firstly, training such a model requires massive computational resources available only to a small fraction of the field, and leaves a huge carbon footprint. Secondly, evaluating an already trained model is also expensive in time and memory, since the same model architecture must run sequentially for a large number of steps (e.g. 25-1000 steps). To increase the accessibility of this powerful model class and at the same time reduce its significant resource consumption, a method is needed that reduces the computational complexity of both training and sampling. Reducing the computational demands of DMs without impairing their performance is, therefore, key to enhancing their accessibility.

Figure 2: Illustrating perceptual and semantic compression. Most bits of a digital image correspond to imperceptible details. While DMs allow this semantically meaningless information to be suppressed by minimizing the responsible loss term, gradients (during training) and the neural network backbone (during training and inference) still need to be evaluated on all pixels, leading to superfluous computations and unnecessarily expensive optimization and inference. We propose latent diffusion models as an effective generative model and a separate mild compression stage that only eliminates imperceptible details.

Departure to Latent Space. Our approach starts with the analysis of already trained diffusion models in pixel space: Figure 2 shows the rate-distortion trade-off of a trained model. As with any likelihood-based model, learning can be roughly divided into two stages. The first is a perceptual compression stage, which removes high-frequency details but still learns little semantic variation. In the second stage, the actual generative model learns the semantic and conceptual composition of the data (semantic compression). We thus aim to first find a perceptually equivalent, but computationally more suitable space, in which we will train diffusion models for high-resolution image synthesis. Following common practice, we separate training into two distinct phases: First, we train an autoencoder which provides a lower-dimensional (and thereby efficient) representational space which is perceptually equivalent to the data space. Importantly, and in contrast to previous work, we do not need to rely on excessive spatial compression, as we train DMs in the learned latent space, which exhibits better scaling properties with respect to the spatial dimensionality. The reduced complexity also provides efficient image generation from the latent space with a single network pass. We dub the resulting model class Latent Diffusion Models (LDMs). A notable advantage of this approach is that we need to train the universal autoencoding stage only once and can therefore reuse it for multiple DM trainings or to explore possibly completely different tasks. This enables efficient exploration of a large number of diffusion models for various image-to-image and text-to-image tasks. For the latter, we design an architecture that connects transformers to the DM's UNet backbone and enables arbitrary types of token-based conditioning mechanisms; see Section 3.3.

In sum, our work makes the following contributions:

(i) In contrast to purely transformer-based approaches, our method scales more gracefully to higher-dimensional data and can thus (a) work on a compression level which provides more faithful and detailed reconstructions than previous work (see Figure 1) and (b) be efficiently applied to high-resolution synthesis of megapixel images.
(ii) We achieve competitive performance on multiple tasks (unconditional image synthesis, inpainting, stochastic super-resolution) and datasets while significantly lowering computational costs. Compared to pixel-based diffusion approaches, we also significantly decrease inference costs.
(iii) We show that, in contrast to previous work which learns both an encoder/decoder architecture and a score-based prior simultaneously, our approach does not require a delicate weighting of reconstruction and generative abilities. This ensures extremely faithful reconstructions and requires very little regularization of the latent space.
(iv) We find that for densely conditioned tasks such as super-resolution, inpainting and semantic synthesis, our model can be applied in a convolutional fashion and render large, consistent images of $\sim 1024^{2}$ px.
(v) Moreover, we design a general-purpose conditioning mechanism based on cross-attention, enabling multi-modal training. We use it to train class-conditional, text-to-image and layout-to-image models.
(vi) Finally, we release pretrained latent diffusion and autoencoding models at https://github.com/CompVis/latent-diffusion which might be reusable for various tasks besides training of DMs.

Generative Models for Image Synthesis. The high dimensional nature of images presents distinct challenges to generative modeling. Generative Adversarial Networks (GAN) allow for efficient sampling of high resolution images with good perceptual quality, but are difficult to optimize and struggle to capture the full data distribution. In contrast, likelihood-based methods emphasize good density estimation which renders optimization more well-behaved. Variational autoencoders (VAE) and flow-based models enable efficient synthesis of high resolution images, but sample quality is not on par with GANs. While autoregressive models (ARM) achieve strong performance in density estimation, computationally demanding architectures and a sequential sampling process limit them to low resolution images. Because pixel based representations of images contain barely perceptible, high-frequency details, maximum-likelihood training spends a disproportionate amount of capacity on modeling them, resulting in long training times. To scale to higher resolutions, several two-stage approaches use ARMs to model a compressed latent image space instead of raw pixels.

Recently, Diffusion Probabilistic Models (DMs) have achieved state-of-the-art results in density estimation as well as in sample quality. The generative power of these models stems from a natural fit to the inductive biases of image-like data when their underlying neural backbone is implemented as a UNet. The best synthesis quality is usually achieved when a reweighted objective is used for training. In this case, the DM corresponds to a lossy compressor and allows image quality to be traded for compression capabilities. Evaluating and optimizing these models in pixel space, however, has the downside of low inference speed and very high training costs. While the former can be partially addressed by advanced sampling strategies and hierarchical approaches, training on high-resolution image data always requires calculating expensive gradients. We address both drawbacks with our proposed LDMs, which work on a compressed latent space of lower dimensionality. This renders training computationally cheaper and speeds up inference with almost no reduction in synthesis quality (see Figure 1).

Two-Stage Image Synthesis. To mitigate the shortcomings of individual generative approaches, a lot of research has gone into combining the strengths of different methods into more efficient and performant models via a two-stage approach. VQ-VAEs use autoregressive models to learn an expressive prior over a discretized latent space. DALL-E extends this approach to text-to-image generation by learning a joint distribution over discretized image and text representations. More generally, conditionally invertible networks have been used to provide a generic transfer between latent spaces of diverse domains. Different from VQ-VAEs, VQGANs employ a first stage with an adversarial and perceptual objective to scale autoregressive transformers to larger images. However, the high compression rates required for feasible ARM training, which introduces billions of trainable parameters, limit the overall performance of such approaches, and less compression comes at the price of high computational cost. Our work avoids such trade-offs, as our proposed LDMs scale more gently to higher-dimensional latent spaces due to their convolutional backbone. Thus, we are free to choose the level of compression which optimally mediates between learning a powerful first stage, without leaving too much perceptual compression up to the generative diffusion model, and guaranteeing high-fidelity reconstructions (see Figure 1). While approaches to jointly or separately learn an encoding/decoding model together with a score-based prior exist, the former still require a difficult weighting between reconstruction and generative capabilities and are outperformed by our approach (Section 4), and the latter focus on highly structured images such as human faces.

3. Method

To lower the computational demands of training diffusion models for high-resolution image synthesis, we observe that although diffusion models allow perceptually irrelevant details to be ignored by undersampling the corresponding loss terms, they still require costly function evaluations in pixel space, which causes huge demands in computation time and energy resources. We propose to circumvent this drawback by introducing an explicit separation of the compressive from the generative learning phase (see Figure 2). To achieve this, we utilize an autoencoding model which learns a space that is perceptually equivalent to the image space, but offers significantly reduced computational complexity. Such an approach offers several advantages: (i) By leaving the high-dimensional image space, we obtain DMs which are computationally much more efficient because sampling is performed on a low-dimensional space. (ii) We exploit the inductive bias of DMs inherited from their UNet architecture, which makes them particularly effective for data with spatial structure and therefore alleviates the need for aggressive, quality-reducing compression levels as required by previous approaches. (iii) Finally, we obtain general-purpose compression models whose latent space can be used to train multiple generative models and which can also be utilized for other downstream applications such as single-image CLIP-guided synthesis.

3.1 Perceptual Image Compression

Our perceptual compression model is based on previous work and consists of an autoencoder trained with a combination of a perceptual loss and a patch-based adversarial objective. This ensures that the reconstructions are confined to the image manifold by enforcing local realism, and avoids the blurriness introduced by relying solely on pixel-space losses such as $L_{2}$ or $L_{1}$ objectives. More precisely, given an image $x \in \mathbb{R}^{H \times W \times 3}$ in RGB space, the encoder $E$ encodes $x$ into a latent representation $z = E (x)$, and the decoder $D$ reconstructs the image from the latent, giving $\tilde{x} = D (z) = D (E (x))$, where $z \in \mathbb{R}^{h \times w \times c}$. Importantly, the encoder downsamples the image by a factor $f = H / h = W / w$, and we investigate different downsampling factors $f = 2^{m}$, with $m \in \mathbb{N}$. In order to avoid arbitrarily high-variance latent spaces, we experiment with two different kinds of regularization. The first variant, KL-reg., imposes a slight KL-penalty towards a standard normal on the learned latent, similar to a VAE, whereas VQ-reg. uses a vector quantization layer within the decoder. This model can be interpreted as a VQGAN but with the quantization layer absorbed by the decoder. Because our subsequent DM is designed to work with the two-dimensional structure of our learned latent space $z = E (x)$, we can use relatively mild compression rates and achieve very good reconstructions. This is in contrast to previous works, which relied on an arbitrary 1D ordering of the learned space $z$ to model its distribution autoregressively and thereby ignored much of the inherent structure of $z$. Hence, our compression model preserves details of $x$ better (see the appendix table). The full objective and training details can be found in the supplement.
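The relation $f = H / h = W / w$ can be made concrete with a small calculation. The following is a minimal sketch, not code from the released repository: the helper names `latent_shape` and `compression_ratio` are hypothetical, the $256 \times 256$ input resolution is chosen only for illustration, and the channel counts $c$ are read from the KL rows of the appendix table.

```python
def latent_shape(height, width, f, c):
    """Shape (h, w, c) of z = E(x) for an H x W x 3 image and downsampling factor f."""
    # The text restricts f to powers of two, f = 2**m.
    if f < 1 or (f & (f - 1)) != 0:
        raise ValueError("f must be a power of two (f = 2**m)")
    if height % f or width % f:
        raise ValueError("image size must be divisible by f")
    return height // f, width // f, c


def compression_ratio(height, width, f, c):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, f, c)
    return (height * width * 3) / (h * w * c)


if __name__ == "__main__":
    # (f, c) pairs taken from KL-regularized rows of the appendix table;
    # the 256 x 256 resolution is an illustrative assumption.
    for f, c in [(4, 3), (8, 4), (16, 16)]:
        shape = latent_shape(256, 256, f, c)
        ratio = compression_ratio(256, 256, f, c)
        print(f"f={f}: latent shape {shape}, {ratio:.0f}x fewer values")
```

Note that the ratio depends on both $f$ and $c$: a larger $f$ with more channels can store as many values as a smaller $f$ with fewer channels, while changing how much spatial structure the diffusion model sees.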

Appendix Table A: Complete autoencoder zoo trained on OpenImages, evaluated on ImageNet-Val. † denotes an attention-free autoencoder. This table is defined in the supplement but cited repeatedly by the main text, so it is included here as an appendix table and does not occupy a main-paper table number.

$f$	$\| Z \|$ / reg.	$c$	R-FID ↓	R-IS ↑	PSNR ↑	PSIM ↓	SSIM ↑
16 VQGAN	16384	256	4.98	--	19.9 ± 3.4	1.83 ± 0.42	0.51 ± 0.18
16 VQGAN	1024	256	7.94	--	19.4 ± 3.3	1.98 ± 0.43	0.50 ± 0.18
8 DALL-E	8192	-	32.01	--	22.8 ± 2.1	1.95 ± 0.51	0.73 ± 0.13
32	16384	16	31.83	40.40 ± 1.07	17.45 ± 2.90	2.58 ± 0.48	0.41 ± 0.18
16	16384	8	5.15	144.55 ± 3.74	20.83 ± 3.61	1.73 ± 0.43	0.54 ± 0.18
8	16384	4	1.14	201.92 ± 3.97	23.07 ± 3.99	1.17 ± 0.36	0.65 ± 0.16
8	256	4	1.49	194.20 ± 3.87	22.35 ± 3.81	1.26 ± 0.37	0.62 ± 0.16
4	8192	3	0.58	224.78 ± 5.35	27.43 ± 4.26	0.53 ± 0.21	0.82 ± 0.10
4 $^{†}$	8192	3	1.06	221.94 ± 4.58	25.21 ± 4.17	0.72 ± 0.26	0.76 ± 0.12
4	256	3	0.47	223.81 ± 4.58	26.43 ± 4.22	0.62 ± 0.24	0.80 ± 0.11
2	2048	2	0.16	232.75 ± 5.09	30.85 ± 4.12	0.27 ± 0.12	0.91 ± 0.05
2	64	2	0.40	226.62 ± 4.83	29.13 ± 3.46	0.38 ± 0.13	0.90 ± 0.05
32	KL	64	2.04	189.53 ± 3.68	22.27 ± 3.93	1.41 ± 0.40	0.61 ± 0.17
32	KL	16	7.3	132.75 ± 2.71	20.38 ± 3.56	1.88 ± 0.45	0.53 ± 0.18
16	KL	16	0.87	210.31 ± 3.97	24.08 ± 4.22	1.07 ± 0.36	0.68 ± 0.15
16	KL	8	2.63	178.68 ± 4.08	21.94 ± 3.92	1.49 ± 0.42	0.59 ± 0.17
8	KL	4	0.90	209.90 ± 4.92	24.19 ± 4.19	1.02 ± 0.35	0.69 ± 0.15
4	KL	3	0.27	227.57 ± 4.89	27.53 ± 4.54	0.55 ± 0.24	0.82 ± 0.11
2	KL	2	0.086	232.66 ± 5.16	32.47 ± 4.19	0.20 ± 0.09	0.93 ± 0.04

3.2 Latent Diffusion Models

Diffusion Models are probabilistic models designed to learn a data distribution $p (x)$ by gradually denoising a normally distributed variable, which corresponds to learning the reverse process of a fixed Markov Chain of length $T$ . For image synthesis, the most successful models rely on a reweighted variant of the variational lower bound on $p (x)$ , which mirrors denoising score-matching. These models can be interpreted as an equally weighted sequence of denoising autoencoders $ϵ_{θ} (x_{t}, t); t = 1 \dots T$ , which are trained to predict a denoised variant of their input $x_{t}$ , where $x_{t}$ is a noisy version of the input $x$ . The corresponding objective can be simplified to:

$$L_{DM} = \mathbb{E}_{x, ϵ \sim \mathcal{N} (0, 1), t} \left[ \| ϵ - ϵ_{θ} (x_{t}, t) \|_{2}^{2} \right],$$

with $t$ uniformly sampled from $\{1, \dots, T\}$.

Generative Modeling of Latent Representations. With our trained perceptual compression models consisting of $E$ and $D$, we now have access to an efficient, low-dimensional latent space in which high-frequency, imperceptible details are abstracted away. Compared to the high-dimensional pixel space, this space is more suitable for likelihood-based generative models, as they can now (i) focus on the important, semantic bits of the data and (ii) train in a lower-dimensional, computationally much more efficient space. Unlike previous work that relied on autoregressive, attention-based transformer models in a highly compressed, discrete latent space, we can take advantage of image-specific inductive biases that our model offers. This includes the ability to build the underlying UNet primarily from 2D convolutional layers, and further focusing the objective on the perceptually most relevant bits using the reweighted bound, which now reads:

$$L_{LDM} := \mathbb{E}_{E (x), ϵ \sim \mathcal{N} (0, 1), t} \left[ \| ϵ - ϵ_{θ} (z_{t}, t) \|_{2}^{2} \right] .$$

Figure 3: We condition LDMs either via concatenation or by a more general cross-attention mechanism. See Section 3.3.

The neural backbone $ϵ_{θ} (\circ, t)$ of our model is realized as a time-conditional UNet. Since the forward process is fixed, $z_{t}$ can be efficiently obtained from $E$ during training, and samples from $p (z)$ can be decoded to image space with a single pass through $D$ .
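A single Monte Carlo estimate of $L_{LDM}$ can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the authors' training code: the function names are hypothetical, the linear noise schedule and its endpoints are an assumption borrowed from common DDPM practice (the text above only says the forward process is fixed), and `eps_model` stands in for the time-conditional UNet $ϵ_{θ}$.

```python
import numpy as np


def make_alpha_bar(T, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for an assumed linear schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)


def ldm_loss(z0, eps_model, alpha_bar, rng):
    """One estimate of L_LDM for a batch of latents z0 = E(x), shape (B, h, w, c)."""
    batch = z0.shape[0]
    T = alpha_bar.shape[0]
    # t is sampled uniformly from {1, ..., T}, one value per batch element.
    t = rng.integers(1, T + 1, size=batch)
    eps = rng.standard_normal(z0.shape)
    # Broadcast the per-sample coefficient over the latent dimensions.
    a = alpha_bar[t - 1].reshape(batch, *([1] * (z0.ndim - 1)))
    # Closed-form forward process: z_t is obtained directly from z0.
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    pred = eps_model(z_t, t)
    # Squared L2 norm per sample, averaged over the batch.
    per_sample = np.sum((eps - pred) ** 2, axis=tuple(range(1, z0.ndim)))
    return float(np.mean(per_sample))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar(T=1000)
    z0 = rng.standard_normal((2, 32, 32, 4))  # stand-in for encoder outputs

    def untrained_model(z_t, t):
        # Placeholder for the UNet: always predicts zero noise.
        return np.zeros_like(z_t)

    print(ldm_loss(z0, untrained_model, alpha_bar, rng))
```

Because $z_{t}$ is computed in closed form from $z_{0}$, the encoder $E$ runs once per training image and the decoder $D$ is not needed during training at all.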

3.3 Conditioning Mechanisms

Similar to other types of generative models, diffusion models are in principle capable of modeling conditional distributions of the form $p (z | y)$ . This can be implemented with a conditional denoising autoencoder $ϵ_{θ} (z_{t}, t, y)$ and paves the way to controlling the synthesis process through inputs $y$ such as text, semantic maps or other image-to-image translation tasks. In the context of image synthesis, however, combining the generative power of DMs with other types of conditionings beyond class-labels or blurred variants of the input image is so far an under-explored area of research. We turn DMs into more flexible conditional image generators by augmenting their underlying UNet backbone with the cross-attention mechanism, which is effective for learning attention-based models of various input modalities. To pre-process $y$ from various modalities (such as language prompts) we introduce a domain specific encoder $τ_{θ}$ that projects $y$ to an intermediate representation $τ_{θ} (y) \in R^{M \times d_{τ}}$ , which is then mapped to the intermediate layers of the UNet via a cross-attention layer implementing $Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d}}) \cdot V$ .

$$\begin{aligned} Q & = W_{Q}^{(i)} \cdot φ_{i} (z_{t}), \\ K & = W_{K}^{(i)} \cdot τ_{θ} (y), \\ V & = W_{V}^{(i)} \cdot τ_{θ} (y) . \end{aligned}$$

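Here, $φ_{i} (z_{t}) \in \mathbb{R}^{N \times d_{ϵ}^{i}}$ denotes a (flattened) intermediate representation of the UNet implementing $ϵ_{θ}$, and $W_{Q}^{(i)} \in \mathbb{R}^{d \times d_{ϵ}^{i}}$, $W_{K}^{(i)} \in \mathbb{R}^{d \times d_{τ}}$ and $W_{V}^{(i)} \in \mathbb{R}^{d \times d_{τ}}$ are learnable projection matrices. These shapes follow from the definitions above: $W_{Q}^{(i)}$ acts on the feature dimension $d_{ϵ}^{i}$ of the UNet representation, while $W_{K}^{(i)}$ and $W_{V}^{(i)}$ act on the feature dimension $d_{τ}$ of $τ_{θ} (y)$. See Figure 3 for a visual depiction. Based on image-conditioning pairs, we then learn the conditional LDM via: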

$$L_{LDM}^{cond} := \mathbb{E}_{E (x), y, ϵ \sim \mathcal{N} (0, 1), t} \left[ \| ϵ - ϵ_{θ} (z_{t}, t, τ_{θ} (y)) \|_{2}^{2} \right],$$

where both $τ_{θ}$ and $ϵ_{θ}$ are jointly optimized via this objective. This conditioning mechanism is flexible as $τ_{θ}$ can be parameterized with domain-specific experts, e.g. (unmasked) transformers when $y$ are text prompts.
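The cross-attention layer of this section can be illustrated with a few lines of NumPy. This is a minimal single-head sketch under stated assumptions, not the released implementation: features are stored as rows, so each projection is applied as a right-multiplication by the transposed matrix, the matrix shapes are chosen so that every product is well-defined ($W_{Q}$ of shape $(d, d_{ϵ})$, $W_{K}$ and $W_{V}$ of shape $(d, d_{τ})$), and multi-head splitting, output projections and residual connections are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(phi, tau, W_Q, W_K, W_V):
    """Single-head cross-attention.

    phi: (N, d_eps) flattened UNet features, one row per latent position.
    tau: (M, d_tau) conditioning tokens produced by the domain encoder.
    Returns the attended features (N, d) and the attention weights (N, M).
    """
    Q = phi @ W_Q.T  # queries come from the UNet features, shape (N, d)
    K = tau @ W_K.T  # keys come from the conditioning, shape (M, d)
    V = tau @ W_V.T  # values come from the conditioning, shape (M, d)
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d), axis=-1)
    return weights @ V, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, d_eps, M, d_tau, d = 16, 8, 5, 6, 4  # illustrative sizes only
    phi = rng.standard_normal((N, d_eps))
    tau = rng.standard_normal((M, d_tau))
    W_Q = rng.standard_normal((d, d_eps))
    W_K = rng.standard_normal((d, d_tau))
    W_V = rng.standard_normal((d, d_tau))
    out, weights = cross_attention(phi, tau, W_Q, W_K, W_V)
    print(out.shape, weights.shape)
```

Each of the $N$ latent positions receives its own distribution over the $M$ conditioning tokens, which is why the same layer can serve text prompts, class labels or layouts once $τ_{θ}$ maps them to a token sequence.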

4. Experiments

Figure 4: Samples for user-defined text prompts from our model for text-to-image synthesis, LDM-8 (KL), which was trained on the LAION database. Samples generated with 200 DDIM steps and $η = 1.0$. We use unconditional guidance with $s = 10.0$. (Panels: text-to-image samples "sign" and "zombie picasso".)

Figure 5: Analyzing the training of class-conditional LDMs with different downsampling factors $f$ over 2M train steps on the ImageNet dataset. Pixel-based LDM-1 requires substantially longer training times than models with larger downsampling factors. Too much perceptual compression, as in LDM-32, limits the overall sample quality. All models are trained on a single NVIDIA A100 with the same computational budget. Results obtained with 100 DDIM steps and $κ = 0$. (Panels: FID over training steps; Inception Score over training steps.)

Figure 6: Comparing LDMs with varying compression on the CelebA-HQ (left) and ImageNet (right) datasets. Different markers indicate $\{10, 20, 50, 100, 200\}$ sampling steps using DDIM, from right to left along each line. The dashed line shows the FID scores for 200 steps, indicating the strong performance of LDM-$\{4\text{-}8\}$. FID scores assessed on 5000 samples. All models were trained for 500k (CelebA) / 2M (ImageNet) steps on an A100. (Panels: speed vs. FID on CelebA-HQ; speed vs. FID on ImageNet.)

LDMs provide a means of flexible and computationally tractable diffusion-based image synthesis for various image modalities, which we show empirically in the following. First, however, we analyze the gains of our models compared to pixel-based diffusion models in both training and inference. Interestingly, we find that LDMs trained in VQ-regularized latent spaces sometimes achieve better sample quality, even though the reconstruction capabilities of VQ-regularized first stage models slightly fall behind those of their continuous counterparts, cf. the appendix table. A visual comparison between the effects of first stage regularization schemes on LDM training and their generalization abilities to resolutions $> 256^{2}$ can be found in the appendix. The appendix also lists details on architecture, implementation, training and evaluation for all results presented in this section.

4.1 On Perceptual Compression Tradeoffs

This section analyzes the behavior of our LDMs with different downsampling factors $f \in \{1, 2, 4, 8, 16, 32\}$ (abbreviated as LDM-$f$, where LDM-1 corresponds to pixel-based DMs). To obtain a comparable test field, we fix the computational resources to a single NVIDIA A100 for all experiments in this section and train all models for the same number of steps and with the same number of parameters. The appendix table shows hyperparameters and reconstruction performance of the first stage models used for the LDMs compared in this section. Figure 5 shows sample quality as a function of training progress for 2M steps of class-conditional models on the ImageNet dataset. We see that i) small downsampling factors for LDM-$\{1, 2\}$ result in slow training progress, whereas ii) overly large values of $f$ cause stagnating fidelity after comparably few training steps. Revisiting the analysis above (Figure 1 and Figure 2), we attribute this to i) leaving most of the perceptual compression to the diffusion model and ii) too strong first stage compression resulting in information loss and thus limiting the achievable quality. LDM-$\{4\text{-}16\}$ strike a good balance between efficiency and perceptually faithful results, which manifests in a significant FID gap of 38 between pixel-based diffusion (LDM-1) and LDM-8 after 2M training steps. In Figure 6, we compare models trained on CelebA-HQ and ImageNet in terms of sampling speed for different numbers of denoising steps with the DDIM sampler and plot it against FID scores. LDM-$\{4\text{-}8\}$ outperform models with unsuitable ratios of perceptual and conceptual compression. Especially compared to pixel-based LDM-1, they achieve much lower FID scores while simultaneously significantly increasing sample throughput. Complex datasets such as ImageNet require reduced compression rates to avoid reducing quality. In summary, LDM-4 and -8 offer the best conditions for achieving high-quality synthesis results.
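To make the downsampling factors concrete, the sketch below lists the spatial grid the diffusion model operates on for a $256 \times 256$ input at each $f$. It assumes that $f$ divides both image sides and it ignores the latent channel count, which this section does not specify. The helper name `latent_grid` is hypothetical.

```python
def latent_grid(height, width, f):
    """Spatial size of the latent grid for downsampling factor f."""
    if height % f or width % f:
        raise ValueError("image size must be divisible by f")
    return (height // f, width // f)


for f in (1, 2, 4, 8, 16, 32):
    h, w = latent_grid(256, 256, f)
    reduction = (256 * 256) // (h * w)  # equals f squared
    print(f"LDM-{f}: {h}x{w} grid, {reduction}x fewer spatial positions")
```

The number of positions shrinks with $f^2$, which is consistent with the observation above that LDM-1 and LDM-2 train slowly while LDM-32 leaves too little spatial detail for the diffusion model to work with.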

Table 1: Evaluation metrics for unconditional image synthesis. CelebA-HQ results reproduced from previous work, FFHQ from previous work. $^{†}$: $N$-s refers to $N$ sampling steps with the DDIM sampler. $^{*}$: trained in KL-regularized latent space. Additional results can be found in the supplementary.

CelebA-HQ $256 \times 256$				FFHQ $256 \times 256$
Method	FID ↓	Prec. ↑	Recall ↑	Method	FID ↓	Prec. ↑	Recall ↑
DC-VAE	15.8	-	-	ImageBART	9.57	-	-
VQGAN+T. (k=400)	10.2	-	-	U-Net GAN (+aug)	10.9 (7.6)	-	-
PGGAN	8.0	-	-	UDM	5.54	-	-
LSGM	7.22	-	-	StyleGAN	4.16	0.71	0.46
UDM	7.16	-	-	ProjectedGAN	3.08	0.65	0.46
LDM-4 (ours, 500-s $^{†}$ )	5.11	0.72	0.49	LDM-4 (ours, 200-s)	4.98	0.73	0.50
LSUN-Churches $256 \times 256$				LSUN-Bedrooms $256 \times 256$
DDPM	7.89	-	-	ImageBART	5.51	-	-
ImageBART	7.32	-	-	DDPM	4.9	-	-
PGGAN	6.42	-	-	UDM	4.57	-	-
StyleGAN	4.21	-	-	StyleGAN	2.35	0.59	0.48
StyleGAN2	3.86	-	-	ADM	1.90	0.66	0.51
ProjectedGAN	1.59	0.61	0.44	ProjectedGAN	1.52	0.61	0.34
LDM-8 $^{*}$ (ours, 200-s)	4.02	0.64	0.52	LDM-4 (ours, 200-s)	2.95	0.66	0.48

Table 2: Evaluation of text-conditional image synthesis on the $256 \times 256$-sized MS-COCO dataset: with 250 DDIM steps our model is on par with the most recent diffusion and autoregressive methods despite using significantly fewer parameters. $^{†}$/$^{*}$: Numbers from prior work.

Text-Conditional Image Synthesis
Method	FID ↓	IS ↑	$N_{params}$	Setting
CogView $^{†}$	27.10	18.20	4B	self-ranking, rejection rate 0.017
LAFITE $^{†}$	26.94	26.02	75M
GLIDE $^{*}$	12.24	-	6B	277 DDIM steps, c.f.g. $s = 3$
Make-A-Scene $^{*}$	11.84	-	4B	c.f.g for AR models $s = 5$
LDM-KL-8	23.31	20.03 ± 0.33	1.45B	250 DDIM steps
LDM-KL-8-G $^{*}$	12.63	30.29 ± 0.42	1.45B	250 DDIM steps, c.f.g. $s=1.5$

4.2 Image Generation with Latent Diffusion

Figure 7 (CelebAHQ sample): Samples from LDMs trained on CelebAHQ, FFHQ, LSUN-Churches, LSUN-Bedrooms and class-conditional ImageNet, each with a resolution of $256 \times 256$. Best viewed when zoomed in. For more samples cf. the supplement.

We train unconditional models of $256^{2}$ images on CelebA-HQ, FFHQ, LSUN-Churches and -Bedrooms and evaluate i) their sample quality and ii) their coverage of the data manifold using i) FID and ii) Precision-and-Recall. Table 1 summarizes our results. On CelebA-HQ, we report a new state-of-the-art FID of $5.11$, outperforming previous likelihood-based models as well as GANs. We also outperform LSGM, where a latent diffusion model is trained jointly with the first stage. In contrast, we train diffusion models in a fixed space and avoid the difficulty of weighing reconstruction quality against learning the prior over the latent space, see Figures 1 and 2. We outperform prior diffusion-based approaches on all but the LSUN-Bedrooms dataset, where our score is close to ADM despite utilizing half its parameters and requiring 4 times fewer training resources. Moreover, LDMs consistently improve upon GAN-based methods in Precision and Recall, thus confirming the advantages of their mode-covering likelihood-based training objective over adversarial approaches. In Figure 7 we also show qualitative results on each dataset.

4.3 Conditional Latent Diffusion

Figure 8 (panels: street layout condition and generated sample): Layout-to-image synthesis with an LDM on COCO, see Section 4.3.1. Quantitative evaluation in the supplement.

Transformer Encoders for LDMs. By introducing cross-attention based conditioning into LDMs we open them up for various conditioning modalities previously unexplored for diffusion models. For text-to-image modeling, we train a 1.45B parameter KL-regularized LDM conditioned on language prompts on LAION-400M. We employ the BERT-tokenizer and implement $τ_{θ}$ as a transformer to infer a latent code which is mapped into the UNet via (multi-head) cross-attention (Section 3.3). This combination of domain-specific experts for learning a language representation and visual synthesis results in a powerful model, which generalizes well to complex, user-defined text prompts, cf. Figure 8 and Figure 4. For quantitative analysis, we follow prior work and evaluate text-to-image generation on the MS-COCO validation set, where our model improves upon powerful AR and GAN-based methods, cf. Table 2. We note that applying classifier-free diffusion guidance greatly boosts sample quality, such that the guided LDM-KL-8-G is on par with the recent state-of-the-art AR and diffusion models for text-to-image synthesis, while substantially reducing parameter count. To further analyze the flexibility of the cross-attention based conditioning mechanism we also train models to synthesize images based on semantic layouts on OpenImages, and finetune on COCO, see Figure 8. See the supplement for the quantitative evaluation and implementation details. Lastly, following prior work, we evaluate our best-performing class-conditional ImageNet models with $f \in \{4, 8\}$ from Section 4.1 in Table 3, Figure 7 and the supplement. Here we outperform the state-of-the-art diffusion model ADM while significantly reducing computational requirements and parameter count.
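The cross-attention conditioning mentioned above can be illustrated with a small, dependency-free sketch of single-head scaled dot-product attention, where the queries come from flattened latent positions and the keys and values come from the encoded prompt tokens. This is a simplification and not the authors' implementation: the learned projection matrices and the multi-head split are omitted, and all names and toy values are assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Each query row attends over all key rows and mixes the value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Toy sizes: 4 latent positions (queries) and 3 prompt tokens (keys/values).
latent_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
token_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_v = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

attended = cross_attention(latent_q, token_k, token_v)
print(len(attended), "positions,", len(attended[0]), "features each")
```

The output has one row per latent position regardless of how many conditioning tokens there are, which is what lets the same UNet accept prompts, layouts or other token sequences of varying length.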

Table 3: Comparison of a class-conditional ImageNet LDM with recent state-of-the-art methods for class-conditional image generation on ImageNet. A more detailed comparison with additional baselines can be found in the supplement. c.f.g. denotes classifier-free guidance with a scale $s$.

Method	FID ↓	IS ↑	Precision ↑	Recall ↑	$N_{params}$	Setting
BigGan-deep	6.95	203.6 ± 2.6	0.87	0.28	340M	-
ADM	10.94	100.98	0.69	0.63	554M	250 DDIM steps
ADM-G	4.59	186.7	0.82	0.52	608M	250 DDIM steps
LDM-4 (ours)	10.56	103.49 ± 1.24	0.71	0.62	400M	250 DDIM steps
LDM-4-G (ours)	3.60	247.67 ± 5.59	0.87	0.48	400M	250 steps, c.f.g., $s = 1.5$
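The guidance scale $s$ listed in the Setting column can be read as an extrapolation weight between an unconditional and a conditional noise prediction. The sketch below uses one common convention, in which $s = 1$ recovers the purely conditional prediction. Whether the paper's $s$ follows exactly this parameterization is an assumption, and the noise values are made up for illustration.

```python
def guided_noise(eps_uncond, eps_cond, s):
    """Move from the unconditional prediction toward (or past) the conditional one."""
    return [u + s * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_u = [0.10, -0.20, 0.05]  # made-up unconditional noise prediction
eps_c = [0.30, -0.10, 0.00]  # made-up conditional noise prediction

for s in (0.0, 1.0, 1.5):
    print(s, guided_noise(eps_u, eps_c, s))
```

Under this convention, values of $s$ above 1 push the sample further in the direction the condition suggests. This is consistent with the guided rows in Table 3 showing higher precision and lower recall than their unguided counterparts.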

Figure 9 (large-resolution landscape synthesis sample): An LDM trained on $256^{2}$ resolution can generalize to larger resolution (here: $512 \times 1024$) for spatially conditioned tasks such as semantic synthesis of landscape images.

Convolutional Sampling Beyond $256^{2}$. By concatenating spatially aligned conditioning information to the input of $ϵ_{θ}$, LDMs can serve as efficient general-purpose image-to-image translation models. We use this to train models for semantic synthesis, super-resolution (Section 4.4) and inpainting (Section 4.5). For semantic synthesis, we use images of landscapes paired with semantic maps and concatenate downsampled versions of the semantic maps with the latent image representation of an $f = 4$ model (VQ-reg., see the appendix table). We train on an input resolution of $256^{2}$ (crops from $384^{2}$) but find that our model generalizes to larger resolutions and can generate images up to the megapixel regime when evaluated in a convolutional manner (see Figure 9). We exploit this behavior to also apply the super-resolution models in Section 4.4 and the inpainting models in Section 4.5 to generate large images between $512^{2}$ and $1024^{2}$. For this application, the signal-to-noise ratio (induced by the scale of the latent space) significantly affects the results. In the supplement we illustrate this when learning an LDM on (i) the latent space as provided by an $f = 4$ model (KL-reg., see the appendix table), and (ii) a rescaled version, scaled by the component-wise standard deviation. The latter, in combination with classifier-free guidance, also enables the direct synthesis of $> 256^{2}$ images for the text-conditional LDM-KL-8-G as in the appendix.
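A small numeric sketch can show why the scale of the latent space changes the signal-to-noise ratio. It assumes the standard forward process $z_t = \sqrt{\bar{α}_t}\, z_0 + \sqrt{1 - \bar{α}_t}\, ϵ$ with unit-variance noise, and it assumes the rescaling uses a single standard deviation estimated over all latent components. The latent values and the choice $\bar{α}_t = 0.5$ are made up for illustration.

```python
import statistics


def rescale_to_unit_std(latent_values):
    """Divide every component by the standard deviation of all components."""
    std = statistics.pstdev(latent_values)
    return [v / std for v in latent_values], std


def snr(signal_values, alpha_bar):
    """SNR of sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * noise, unit-variance noise."""
    signal_power = alpha_bar * statistics.pvariance(signal_values)
    return signal_power / (1.0 - alpha_bar)


latents = [-8.0, -4.0, 0.0, 4.0, 8.0]  # made-up, deliberately large-scale latents
rescaled, std = rescale_to_unit_std(latents)

print("std before rescaling:", std)
print("SNR at alpha_bar=0.5, raw:", snr(latents, 0.5))
print("SNR at alpha_bar=0.5, rescaled:", snr(rescaled, 0.5))
```

With the same noise schedule, the raw latents keep a much higher SNR than the rescaled ones at the same timestep, so the schedule effectively destroys less of the signal when the latent scale is large.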

4.4 Super-Resolution with Latent Diffusion

Figure 10 (panels: cat bicubic super-resolution input and LDM super-resolution result): ImageNet $64 \to 256$ super-resolution on ImageNet-Val. LDM-SR has advantages at rendering realistic textures but SR3 can synthesize more coherent fine structures. SR3 results from prior work.

LDMs can be efficiently trained for super-resolution by directly conditioning on low-resolution images via concatenation (cf. Section 3.3). In a first experiment, we follow SR3 and fix the image degradation to a bicubic interpolation with $4 \times$ downsampling and train on ImageNet following SR3's data processing pipeline. We use the $f = 4$ autoencoding model pretrained on OpenImages (VQ-reg., cf. the appendix table) and concatenate the low-resolution conditioning $y$ and the inputs to the UNet, i.e. $τ_{θ}$ is the identity. Our qualitative and quantitative results (see Figure 10 and Table 5) show competitive performance: LDM-SR outperforms SR3 in FID while SR3 has a better IS. A simple image regression model achieves the highest PSNR and SSIM scores; however, these metrics do not align well with human perception and favor blurriness over imperfectly aligned high-frequency details. Further, we conduct a user study comparing the pixel baseline with LDM-SR. We follow SR3, where human subjects were shown a low-res image in between two high-res images and asked for their preference. The results in Table 4 affirm the good performance of LDM-SR. PSNR and SSIM can be pushed by using a post-hoc guiding mechanism, and we implement this image-based guider via a perceptual loss, see the supplement. Since the bicubic degradation process does not generalize well to images which do not follow this pre-processing, we also train a generic model, LDM-BSR, by using more diverse degradation. The results are shown in the supplement.
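The remark that PSNR favors blurriness over imperfectly aligned high-frequency detail can be reproduced on a toy one-dimensional signal. The sketch below uses the usual definition $\mathrm{PSNR} = 10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The three signals are made up for illustration and are not data from the paper.

```python
import math


def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# A sharp edge, a blurred guess, and a sharp guess misplaced by one pixel.
truth = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
blurred = [0.0, 0.0, 0.25, 0.75, 1.0, 1.0]
shifted = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]

print("blurred:", psnr(truth, blurred))
print("shifted:", psnr(truth, shifted))
```

The blurred guess scores higher than the sharp but misaligned one. This is the behavior that lets a regression model lead on PSNR and SSIM in Table 5 while producing less realistic textures.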

Table 4: Task 1: Subjects were shown ground truth and generated image and asked for preference. Task 2: Subjects had to decide between two generated images. More details in the supplement.

	SR on ImageNet		Inpainting on Places
User Study	Pixel-DM ( $f 1$ )	LDM-4	LAMA	LDM-4
Task 1: Preference vs GT ↑	16.0%	30.4%	13.6%	21.0%
Task 2: Preference Score ↑	29.4%	70.6%	31.9%	68.1%

Table 5: $\times 4$ upscaling results on ImageNet-Val ($256^{2}$); $^{†}$: FID features computed on validation split, $^{‡}$: FID features computed on train split; $^{*}$: Assessed on a NVIDIA A100.

Method	FID ↓	IS ↑	PSNR ↑	SSIM ↑	$N_{params}$	samples/s $^{*}$
Image Regression	15.2	121.1	27.9	0.801	625M	N/A
SR3	5.2	180.1	26.4	0.762	625M	N/A
LDM-4 (ours, 100 steps)	2.8 $^{†}$ /4.8 $^{‡}$	166.3	24.4 ± 3.8	0.69 ± 0.14	169M	4.62
LDM-4 (ours, big, 100 steps)	2.4 $^{†}$ /4.3 $^{‡}$	174.9	24.7 ± 4.1	0.71 ± 0.15	552M	4.5
LDM-4 (ours, 50 steps, guiding)	4.4 $^{†}$ /6.4 $^{‡}$	153.7	25.8 ± 3.7	0.74 ± 0.12	184M	0.38

4.5 Inpainting with Latent Diffusion

Figure 11 (panels: object removal input and result, sample 000007): Qualitative results on object removal with our big, w/ ft inpainting model. For more results, see the supplement.

Inpainting is the task of filling masked regions of an image with new content, either because parts of the image are corrupted or to replace existing but undesired content within the image. We evaluate how our general approach for conditional image generation compares to more specialized, state-of-the-art approaches for this task. Our evaluation follows the protocol of LaMa, a recent inpainting model that introduces a specialized architecture relying on Fast Fourier Convolutions. The exact training and evaluation protocol on Places is described in the supplement. We first analyze the effect of different design choices for the first stage. In particular, we compare the inpainting efficiency of LDM-1 (i.e. a pixel-based conditional DM) with LDM-4, for both KL and VQ regularizations, as well as VQ-LDM-4 without any attention in the first stage (see the appendix table), where the latter reduces GPU memory for decoding at high resolutions. For comparability, we fix the number of parameters for all models. Table 6 reports the training and sampling throughput at resolution $256^{2}$ and $512^{2}$, the total training time in hours per epoch and the FID score on the validation split after six epochs. Overall, we observe a speed-up of at least $2.7 \times$ between pixel- and latent-based diffusion models while improving FID scores by a factor of at least $1.6 \times$.

Table 6: Assessing inpainting efficiency. $^{†}$: Deviations from Figure 6 due to varying GPU settings/batch sizes, cf. the supplement.

Model (reg.-type)	train throughput samples/sec.	sampling throughput $^{†}$		train+val hours/epoch	FID@2k epoch 6
		@256	@512
LDM-1 (no first stage)	0.11	0.26	0.07	20.66	24.74
LDM-4 (KL, w/ attn)	0.32	0.97	0.34	7.66	15.21
LDM-4 (VQ, w/ attn)	0.33	0.97	0.34	7.04	14.99
LDM-4 (VQ, w/o attn)	0.35	0.99	0.36	6.66	15.95
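The speed-up and FID factors quoted in the text can be checked against the rows of Table 6 with a few divisions. The sketch below copies the table values and computes ratios relative to LDM-1. Which column the quoted $2.7 \times$ and $1.6 \times$ figures were derived from is an assumption here, as the text does not say.

```python
# Rows copied from Table 6:
# (train samples/sec, sampling @256, sampling @512, hours/epoch, FID after epoch 6)
table6 = {
    "LDM-1 (no first stage)": (0.11, 0.26, 0.07, 20.66, 24.74),
    "LDM-4 (KL, w/ attn)": (0.32, 0.97, 0.34, 7.66, 15.21),
    "LDM-4 (VQ, w/ attn)": (0.33, 0.97, 0.34, 7.04, 14.99),
    "LDM-4 (VQ, w/o attn)": (0.35, 0.99, 0.36, 6.66, 15.95),
}
base = table6["LDM-1 (no first stage)"]


def ratios(name):
    """Return (train speed-up, epoch-time speed-up, FID ratio) relative to LDM-1."""
    row = table6[name]
    return (row[0] / base[0], base[3] / row[3], base[4] / row[4])


for name in table6:
    if name.startswith("LDM-4"):
        train, epoch, fid = ratios(name)
        print(f"{name}: train x{train:.2f}, epoch time x{epoch:.2f}, FID x{fid:.2f}")
```

On these numbers, the smallest hours-per-epoch ratio is about 2.70 (the KL model), and the FID ratios range from roughly 1.55 (VQ without attention) to 1.65 (VQ with attention).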

The comparison with other inpainting approaches in Table 7 shows that our model with attention improves the overall image quality as measured by FID over that of LaMa. LPIPS between the unmasked images and our samples is slightly higher than that of LaMa. We attribute this to LaMa only producing a single result which tends to recover more of an average image compared to the diverse results produced by our LDM, cf. the supplement. Additionally, in a user study (Table 4), human subjects favor our results over those of LaMa. Based on these initial results, we also trained a larger diffusion model (big in Table 7) in the latent space of the VQ-regularized first stage without attention. Following ADM, the UNet of this diffusion model uses attention layers on three levels of its feature hierarchy, uses the BigGAN residual block for up- and downsampling, and has 387M parameters instead of 215M. After training, we noticed a discrepancy in the quality of samples produced at resolutions $256^{2}$ and $512^{2}$, which we hypothesize to be caused by the additional attention modules. However, fine-tuning the model for half an epoch at resolution $512^{2}$ allows the model to adjust to the new feature statistics and sets a new state-of-the-art FID on image inpainting (big, w/o attn, w/ ft in Table 7, Figure 11).

Table 7: Comparison of inpainting performance on 30k crops of size $512 \times 512$ from test images of Places. The column 40-50% reports metrics computed over hard examples where 40-50% of the image region has to be inpainted. $^{†}$: recomputed on our test set, since the original test set used in LaMa was not available.

	40-50% masked		All samples
Method	FID ↓	LPIPS ↓	FID ↓	LPIPS ↓
LDM-4 (ours, big, w/ ft)	9.39	0.246 ± 0.042	1.50	0.137 ± 0.080
LDM-4 (ours, big, w/o ft)	12.89	0.257 ± 0.047	2.40	0.142 ± 0.085
LDM-4 (ours, w/ attn)	11.87	0.257 ± 0.042	2.15	0.144 ± 0.084
LDM-4 (ours, w/o attn)	12.60	0.259 ± 0.041	2.37	0.145 ± 0.084
LaMa $^{†}$	12.31	0.243 ± 0.038	2.23	0.134 ± 0.080
LaMa	12.0	0.24	2.21	0.14
CoModGAN	10.4	0.26	1.82	0.15
RegionWise	21.3	0.27	4.75	0.15
DeepFill v2	22.1	0.28	5.20	0.16
EdgeConnect	30.5	0.28	8.37	0.16

5. Limitations & Societal Impact

Limitations. While LDMs significantly reduce computational requirements compared to pixel-based approaches, their sequential sampling process is still slower than that of GANs. Moreover, the use of LDMs can be questionable when high precision is required: although the loss of image quality is very small in our $f = 4$ autoencoding models (see Figure 1), their reconstruction capability can become a bottleneck for tasks that require fine-grained accuracy in pixel space. We assume that our super-resolution models (Section 4.4) are already somewhat limited in this respect.

Societal Impact. Generative models for media like imagery are a double-edged sword: on the one hand, they enable various creative applications, and in particular approaches like ours that reduce the cost of training and inference have the potential to facilitate access to this technology and democratize its exploration. On the other hand, it also becomes easier to create and disseminate manipulated data or spread misinformation and spam. In particular, the deliberate manipulation of images ("deep fakes") is a common problem in this context, and women in particular are disproportionately affected by it. Generative models can also reveal their training data, which is of great concern when the data contain sensitive or personal information and were collected without explicit consent. However, the extent to which this also applies to DMs of images is not yet fully understood. Finally, deep learning modules tend to reproduce or exacerbate biases that are already present in the data. While diffusion models achieve better coverage of the data distribution than e.g. GAN-based approaches, the extent to which our two-stage approach that combines adversarial training and a likelihood-based objective misrepresents the data remains an important research question. For a more general, detailed discussion of the ethical considerations of deep generative models, see e.g. Denton et al.

6. Conclusion

We have presented latent diffusion models, a simple and efficient way to significantly improve both the training and sampling efficiency of denoising diffusion models without degrading their quality. Based on this and our cross-attention conditioning mechanism, our experiments demonstrate favorable results compared to state-of-the-art methods across a wide range of conditional image synthesis tasks without task-specific architectures.
