
Large Language Diffusion Models

Abstract

The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs.

Introduction

Large language models (LLMs) fall entirely within the framework of generative modeling. Specifically, LLMs aim to capture the true but unknown language distribution $p_{\text{data}}(\cdot)$ by optimizing a model distribution $p_\theta(\cdot)$ through maximum likelihood estimation, or equivalently KL divergence minimization between the two distributions:

(1)  $\max_\theta\, \mathbb{E}_{p_{\text{data}}(x)}\!\left[\log p_\theta(x)\right] \;\Leftrightarrow\; \min_\theta\, \mathrm{KL}\!\left(p_{\text{data}}(x)\,\|\,p_\theta(x)\right)$  (generative modeling principles)

The predominant approach relies on the autoregressive modeling (ARM)—commonly referred to as the "next-token prediction" paradigm—to define the model distribution:

(2)  $p_\theta(x) = p_\theta(x^1)\prod_{i=2}^{L} p_\theta(x^i \mid x^1,\ldots,x^{i-1})$  (autoregressive formulation)

where $x$ is a sequence of length $L$, and $x^i$ is the $i$-th token. This paradigm has proven remarkably effective and has become the foundation of current LLMs. Despite its widespread adoption, a fundamental question remains unanswered: Is the autoregressive paradigm the only path to achieving the core capabilities of LLMs, such as scalability, in-context learning, and instruction-following?
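The factorization in Eq. (2) can be made concrete with a toy example: the log-likelihood of a sequence is the sum of next-token log-probabilities. The bigram table below is invented purely for illustration; a real LLM parameterizes these conditionals with a Transformer.

```python
import math

# Invented bigram table standing in for p_theta(x^i | x^<i).
BIGRAM = {
    ("<s>", "the"): 0.5,
    ("the", "cat"): 0.4,
    ("cat", "sat"): 0.6,
}

def autoregressive_log_prob(tokens):
    """log p(x) = sum_i log p(x^i | previous token) under the toy bigram model."""
    logp, prev = 0.0, "<s>"
    for tok in tokens:
        logp += math.log(BIGRAM[(prev, tok)])
        prev = tok
    return logp

# p("the cat sat") = 0.5 * 0.4 * 0.6 = 0.12
print(autoregressive_log_prob(["the", "cat", "sat"]))
```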

We argue that the answer is not a simple "yes". The key insight overlooked previously is: It is the generative modeling principles (i.e., Eq. (1)), rather than the autoregressive formulation (i.e., Eq. (2)) itself, that fundamentally underpin the essential properties of LLMs.

In particular, we argue that scalability is primarily a consequence of the interplay between Transformers, model size, data size, and Fisher consistency induced by the generative principles in Eq. (1), rather than a unique result of the ARMs in Eq. (2). The success of diffusion transformers on visual data supports this claim. Furthermore, the instruction-following and in-context learning capabilities appear to be intrinsic properties of all conditional generative models on structurally consistent linguistic tasks, rather than exclusive advantages of ARMs. In addition, while ARMs can be interpreted as a lossless data compressor, any sufficiently expressive probabilistic model can achieve similar capabilities.

However, certain inherent limitations of LLMs can be directly attributed to their autoregressive nature. For instance, the left-to-right generation process restricts their ability to handle reversal reasoning tasks, highlighting a representative failure in the generalization capabilities of current models.

Motivated by these insights, we introduce LLaDA (Large Language Diffusion with mAsking) to investigate whether the capabilities exhibited by LLMs can emerge from generative modeling principles beyond ARMs, thereby addressing the fundamental question posed earlier. In contrast to traditional ARMs, LLaDA leverages a masked diffusion model (MDM), which incorporates a forward data masking process and trains a mask predictor to approximate its reverse process. This design enables LLaDA to construct a model distribution with bidirectional dependencies and optimize a variational lower bound of its log-likelihood, offering a principled and previously unexplored perspective on the core capabilities of LLMs discussed above.

We adopt the standard pipeline of data preparation, pre-training, supervised fine-tuning (SFT), and evaluation, scaling LLaDA to 8B parameters, an unprecedented size for a language diffusion model. In particular, LLaDA 8B was pre-trained from scratch on 2.3 trillion tokens using 0.13 million H800 GPU hours, followed by SFT on 4.5 million pairs. Across diverse tasks, including language understanding, math, code, and Chinese, LLaDA demonstrates the following contributions:

  • LLaDA scales effectively to a compute budget of $10^{23}$ FLOPs, achieving comparable results to ARM baselines trained on the same data across six tasks, e.g., MMLU and GSM8K.
  • The pre-trained LLaDA 8B Base surpasses LLaMA2 7B Base on nearly all 15 standard zero/few-shot learning tasks while performing on par with LLaMA3 8B Base, showcasing effective in-context learning capability.
  • LLaDA significantly enhances the ability to follow instructions after SFT, as demonstrated in case studies such as multi-turn dialogue.
  • LLaDA effectively breaks the reversal curse with consistent performance across forward and reversal tasks. Notably, it outperforms GPT-4o in a reversal poem completion task.

Approach

In this section, we introduce the probabilistic formulation, along with the pre-training, supervised fine-tuning, and inference procedures for LLaDA, as illustrated in Fig. 2.

2.1 Probabilistic Formulation

Unlike ARMs in Eq. (2), LLaDA defines a model distribution $p_\theta(x_0)$ through a forward process and a reverse process. The forward process gradually masks tokens independently in $x_0$ until the sequence is fully masked at $t=1$. For $t \in (0,1)$, the sequence $x_t$ is partially masked, with each token masked with probability $t$ or remaining unmasked with probability $1-t$. The reverse process recovers the data distribution by iteratively predicting masked tokens as $t$ moves from 1 to 0.
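The forward process can be sketched in a few lines (an illustrative sketch using string tokens; the `MASK` string and the sample sentence are invented for the example — real models operate on token ids):

```python
import random

MASK = "[MASK]"  # hypothetical mask token for this sketch

def forward_mask(x0, t, rng):
    """Forward process: mask each token of x0 independently with probability t.

    At t=1 every token is masked; at t=0 the sequence is returned unchanged.
    """
    return [MASK if rng.random() < t else tok for tok in x0]

rng = random.Random(0)
x0 = ["the", "quick", "brown", "fox"]
print(forward_mask(x0, 1.0, rng))  # ['[MASK]', '[MASK]', '[MASK]', '[MASK]']
print(forward_mask(x0, 0.0, rng))  # ['the', 'quick', 'brown', 'fox']
```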

The core of LLaDA is a mask predictor, a parametric model $p_\theta(\cdot \mid x_t)$ that takes $x_t$ as input and predicts all masked tokens (denoted M) simultaneously. It is trained using a cross-entropy loss computed only on the masked tokens:

(3)  $\mathcal{L}(\theta) \triangleq -\mathbb{E}_{t,\,x_0,\,x_t}\!\left[\dfrac{1}{t}\sum_{i=1}^{L} \mathbf{1}[x_t^i = \mathrm{M}]\,\log p_\theta(x_0^i \mid x_t)\right]$

where $x_0$ is a training sample, $t$ is a continuous random variable drawn uniformly from $[0,1]$, $x_t$ is sampled from the forward process, and $L$ is the sequence length. The indicator function $\mathbf{1}[\cdot]$ ensures that the loss is computed only for masked tokens.
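A single-sample Monte Carlo estimate of Eq. (3) can be sketched as follows. The uniform "mask predictor" and the toy vocabulary are placeholders for illustration only; in LLaDA the predictor is a Transformer.

```python
import math
import random

MASK = "[MASK]"
VOCAB = ["a", "b", "c", "d"]  # toy vocabulary for this sketch

def toy_mask_predictor(x_t, i):
    """Placeholder for p_theta(x0^i | x_t): a uniform distribution over VOCAB."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def mc_loss_estimate(x0, rng):
    """One-sample estimate of Eq. (3): draw t, mask x0, then accumulate
    -(1/t) * log p_theta(x0^i | x_t) over the masked positions."""
    t = rng.uniform(1e-3, 1.0)  # t ~ U(0, 1]; small floor avoids dividing by ~0
    x_t = [MASK if rng.random() < t else tok for tok in x0]
    loss = 0.0
    for i, tok in enumerate(x_t):
        if tok == MASK:
            loss -= math.log(toy_mask_predictor(x_t, i)[x0[i]])
    return loss / t

rng = random.Random(0)
print(mc_loss_estimate(["a", "b", "c", "d"], rng))
```

Averaging this estimate over a batch gives the stochastic gradient descent objective used during pre-training.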

Once trained, we can simulate a reverse process (see Sec. 2.4 for details) parameterized by the mask predictor and define the model distribution pθ(x0) as the marginal distribution induced at t=0. The loss function in Eq. (3) has been proven to be an upper bound on the negative log-likelihood of the model distribution, making it a principled objective for generative modeling:

(4)  $-\mathbb{E}_{p_{\text{data}}(x_0)}\!\left[\log p_\theta(x_0)\right] \le \mathcal{L}(\theta)$

Notably, LLaDA employs a masking ratio that varies randomly between 0 and 1 while BERT uses a fixed ratio. This subtle difference has significant implications, especially at scale: as shown in Eq. (4), LLaDA is a principled generative model with the potential to perform in-context learning and instruction-following naturally, akin to LLMs. Moreover, its generative perspective implies strong scalability with large data and models as discussed in Sec. 1. In addition, MaskGIT adopts a heuristic training objective, which misses the $1/t$ term compared to Eq. (3), and lacks a theoretical link to maximum likelihood. We emphasize that it is precisely the theoretical foundation of maximum likelihood estimation that motivated us to scale discrete diffusion models for language modeling.

2.2 Pre-training

LLaDA employs a Transformer as the mask predictor, similar to existing LLMs. However, LLaDA does not use a causal mask, as its formulation allows it to see the entire input for predictions.

We trained two variants of LLaDA with different sizes: 1B and 8B. We summarize the model architecture of LLaDA 8B and LLaMA3 8B here, and details are provided in Appendix B.2. We have ensured consistency in most hyperparameters while making several necessary modifications. We use vanilla multi-head attention instead of grouped query attention for simplicity, as LLaDA is incompatible with KV caching, resulting in a different number of key and value heads. Consequently, the attention layer has more parameters, and we reduce the FFN dimension to maintain a comparable model size. Additionally, the vocabulary size differs due to a tokenizer adapted on our data.

The LLaDA model is pre-trained on a dataset comprising 2.3 trillion (T) tokens, adhering to a data protocol that aligns closely with existing LLMs, without the incorporation of any special techniques. The data are derived from online corpora, with low-quality content filtered through manually designed rules and LLM-based approaches. Beyond general text, the dataset encompasses high-quality code, math, and multilingual data. Please refer to Appendix B.1 for more details about datasets. The mixing of data sources and domains is guided by scaled-down ARMs. The pre-training process utilizes a fixed sequence length of 4096 tokens, incurring a total computational cost of 0.13 million H800 GPU hours, similar to ARMs of the same scale and dataset size.

For a training sequence $x_0$, we randomly sample $t \in [0,1]$, mask each token independently with the same probability $t$ to obtain $x_t$ (see Fig. 2 (a)), and estimate Eq. (3) via the Monte Carlo method for stochastic gradient descent training. In addition, following Nie et al., to enhance the ability of LLaDA to handle variable-length data, we set 1% of the pre-training data to a random length that is uniformly sampled from the range $[1, 4096]$.

We adopted the Warmup-Stable-Decay learning rate scheduler to monitor the training progress without interrupting continuous training. Specifically, we linearly increased the learning rate from 0 to $4\times10^{-4}$ over the first 2000 iterations and maintained it at $4\times10^{-4}$. After processing 1.2T tokens, we decayed the learning rate to $1\times10^{-4}$ and held it constant for the next 0.8T tokens to ensure stable training. Finally, we linearly reduced the learning rate from $1\times10^{-4}$ to $1\times10^{-5}$ for the last 0.3T tokens. Furthermore, we utilized the AdamW optimizer with a weight decay of 0.1, a batch size of 1280, and a local batch size of 4 per GPU. The 8B experiment was executed once, without any hyperparameter tuning.
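The schedule above can be written down directly. This is a sketch: the stage boundaries and rates are the ones quoted in the text, while treating the drop to $1\times10^{-4}$ as an instantaneous step is our assumption.

```python
def wsd_lr(iteration, tokens_seen):
    """Warmup-Stable-Decay schedule as described in the text (a sketch).

    - linear warmup 0 -> 4e-4 over the first 2000 iterations
    - hold 4e-4 until 1.2T tokens
    - drop to 1e-4 and hold for the next 0.8T tokens (i.e., until 2.0T)
    - linear decay 1e-4 -> 1e-5 over the final 0.3T tokens
    """
    T = 1e12
    if iteration < 2000:
        return 4e-4 * iteration / 2000       # warmup
    if tokens_seen < 1.2 * T:
        return 4e-4                          # first stable stage
    if tokens_seen < 2.0 * T:
        return 1e-4                          # second stable stage
    frac = min((tokens_seen - 2.0 * T) / (0.3 * T), 1.0)
    return 1e-4 + (1e-5 - 1e-4) * frac       # final linear decay

print(wsd_lr(1000, 5e9))       # mid-warmup
print(wsd_lr(50000, 1.5e12))   # second stable stage
```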

2.3 Supervised Fine-Tuning

We enhance the capability of LLaDA to follow instructions by supervised fine-tuning (SFT) with paired data $(p_0, r_0)$, where $p_0$ is the prompt and $r_0$ denotes the response. This is the simplest and most basic post-training method for LLMs. Technically, this requires modeling the conditional distribution $p_\theta(r_0 \mid p_0)$ instead of $p_\theta(x_0)$ in pre-training.

The implementation is similar to pre-training. As shown in Fig. 2 (b), we leave the prompt unchanged and mask the tokens in the response independently, as done for x0. Then, we feed both the prompt and the masked response rt to the pre-trained mask predictor to compute the loss for SFT:

(5)  $-\mathbb{E}_{t,\,p_0,\,r_0,\,r_t}\!\left[\dfrac{1}{t}\sum_{i=1}^{L} \mathbf{1}[r_t^i = \mathrm{M}]\,\log p_\theta(r_0^i \mid p_0, r_t)\right]$

where L denotes a dynamic length specified later, and all other notations remain the same as before.

Note that this approach is fully compatible with pre-training. Essentially, the concatenation of p0 and r0 can be treated as clean pre-training data x0, while the concatenation of p0 and rt serves as the masked version xt. The process is identical to pre-training, with the only difference being that all masked tokens happen to appear in the r0 portion.
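This compatibility is easy to see in code: a minimal sketch (with an invented `MASK` string) in which the prompt passes through untouched and only response tokens are masked, so the concatenation plays the role of the masked sequence from pre-training.

```python
import random

MASK = "[MASK]"  # hypothetical mask token for this sketch

def mask_for_sft(prompt, response, t, rng):
    """SFT model input: the prompt stays clean; response tokens are masked
    i.i.d. with probability t, exactly as in the pre-training forward process."""
    r_t = [MASK if rng.random() < t else tok for tok in response]
    return prompt + r_t  # concatenation acts as the masked sequence x_t

rng = random.Random(0)
prompt = ["What", "is", "2+2", "?"]
response = ["The", "answer", "is", "4", "|EOS|"]
x_t = mask_for_sft(prompt, response, 1.0, rng)
print(x_t[:4])  # prompt survives unmasked
print(x_t[4:])  # fully masked response at t = 1
```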

The LLaDA 8B model undergoes SFT on a dataset comprising 4.5 million pairs. Consistent with the pre-training process, both data preparation and training follow the SFT protocols utilized in existing LLMs, without introducing any additional techniques to optimize LLaDA's performance. The dataset spans multiple domains, including code, mathematics, and instruction-following. We append |EOS| tokens to the end of short pairs in each mini-batch to ensure equal lengths across all data. We treat |EOS| as a normal token during training and remove it during sampling, enabling LLaDA to control the response length automatically. Please refer to Appendix B.1 for more details.

We train for 3 epochs on the SFT data using a similar schedule to the pre-training phase. The learning rate is linearly increased from 0 to $2.5\times10^{-5}$ over the first 50 iterations and then kept constant. During the final 10% of iterations, it is linearly reduced to $2.5\times10^{-6}$. Additionally, we set the weight decay to 0.1, the global batch size to 256, and the local batch size to 2 per GPU. The SFT experiment was executed once, without any hyperparameter tuning.

2.4 Inference

As a generative model, LLaDA can sample new text and evaluate the likelihood of candidate text in a diffusion manner instead of the left-to-right autoregressive fashion.

We begin with the reverse generation process. As illustrated in Fig. 2 (c), given a prompt p0, we discretize the reverse process to sample from the model distribution pθ(r0|p0), starting from a fully masked response. The total number of sampling steps is a hyperparameter, which naturally provides LLaDA with a trade-off between efficiency and sample quality, as analyzed in Sec. 3.3. We employ uniformly distributed timesteps by default. In addition, the generation length is also treated as a hyperparameter, specifying the length of the fully masked sentence at the beginning of the sampling process. After generation, tokens appearing after the |EOS| token are discarded. As detailed in Appendix B.5, since both pre-training and SFT are conducted using datasets with variable lengths, the final results are insensitive to this length hyperparameter.

At an intermediate step from time $t \in (0,1]$ to $s \in [0,t)$, we feed both $p_0$ and $r_t$ into the mask predictor and predict all masked tokens simultaneously. Subsequently, we remask $s/t$ of the predicted tokens in expectation to obtain $r_s$, ensuring that the transition of the reverse process aligns with the forward process for accurate sampling. In principle, the remasking strategy should be purely random. However, inspired by the annealing tricks of sampling in LLMs, we adopt a low-confidence remasking strategy, in which the $s/t$ fraction of predicted tokens with the lowest confidence is remasked based on the predictions, same as the approach of Chang et al.
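One discretized reverse step can be sketched as follows. The `predict` callable, returning a (token, confidence) pair per masked position, is a stand-in for the mask predictor, and rounding the expected $s/t$ fraction to an integer count is our assumption.

```python
MASK = "[MASK]"  # hypothetical mask token for this sketch

def reverse_step(r_t, t, s, predict):
    """One reverse step from time t to s with low-confidence remasking.

    All masked positions are predicted simultaneously from r_t; then the
    s/t fraction of predictions with the lowest confidence is remasked.
    """
    masked = [i for i, tok in enumerate(r_t) if tok == MASK]
    preds = {i: predict(r_t, i) for i in masked}  # (token, confidence) pairs
    r_s = list(r_t)
    for i, (tok, _) in preds.items():
        r_s[i] = tok                              # fill every masked position
    n_remask = round(len(masked) * s / t)         # expected s/t fraction
    for i in sorted(masked, key=lambda i: preds[i][1])[:n_remask]:
        r_s[i] = MASK                             # remask the least confident
    return r_s

# Dummy predictor: always emits "x", more confident at later positions.
predict = lambda seq, i: ("x", float(i))
print(reverse_step([MASK] * 4, t=1.0, s=0.5, predict=predict))
# ['[MASK]', '[MASK]', 'x', 'x']
```

Iterating this step over a decreasing grid of timesteps, starting from a fully masked response, yields the full sampling procedure; at $s=0$ nothing is remasked and the sequence is complete.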

We mention that LLaDA enables flexible sampling. In particular, it supports autoregressive and block diffusion sampling directly after the pre-training or SFT processes described above, without requiring any further modifications or training. We provide a detailed analysis in Appendix B.4. Nevertheless, the diffusion sampling (i.e., the reverse generation process) yields the best performance and is adopted as the default throughout this paper, especially for all experiments presented in Sec. 3.

For conditional likelihood evaluation, we can naturally utilize the upper bound in Eq. (5). However, we find that the following equivalent form exhibits lower variance and is more stable:

(6)  $-\mathbb{E}_{l,\,r_0,\,r_l}\!\left[\dfrac{L}{l}\sum_{i=1}^{L} \mathbf{1}[r_l^i = \mathrm{M}]\,\log p_\theta(r_0^i \mid p_0, r_l)\right]$

where $L$ is the sequence length of $r_0$, $l$ is uniformly sampled from $\{1, 2, \ldots, L\}$, and $r_l$ is obtained by uniformly sampling $l$ tokens from $r_0$ without replacement for masking.
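A Monte Carlo estimator of Eq. (6) can be sketched as below (`predict_prob`, returning $p_\theta(r_0^i \mid p_0, r_l)$, is a stand-in for the mask predictor). Because exactly $l$ tokens are masked, the $L/l$ factor cancels the number of summands whenever per-token probabilities are comparable, which is the intuition behind the lower variance.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token for this sketch

def cond_nll_estimate(p0, r0, predict_prob, rng, n_samples=64):
    """Monte Carlo estimate of the Eq. (6) bound on -log p_theta(r0 | p0):
    draw l ~ Uniform{1..L}, mask exactly l tokens of r0 without replacement,
    and average (L/l) * sum of -log p over the masked positions."""
    L = len(r0)
    total = 0.0
    for _ in range(n_samples):
        l = rng.randint(1, L)
        idx = set(rng.sample(range(L), l))  # exactly l positions, no replacement
        r_l = [MASK if i in idx else tok for i, tok in enumerate(r0)]
        total += (L / l) * sum(-math.log(predict_prob(p0, r_l, i))
                               for i in idx)
    return total / n_samples

# With a constant per-token probability of 0.5, every sample evaluates to
# exactly L * log(2), so this estimator has zero variance here.
const_half = lambda p0, r_l, i: 0.5
rng = random.Random(0)
print(cond_nll_estimate(["q"], ["a", "b", "c"], const_half, rng))
```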

We present the training and inference algorithms, along with theoretical details, in Appendix A.

Diffusion models have achieved remarkable success in visual domains but remain unverified for large-scale language modeling, despite growing interest and extensive research efforts.

A simple approach is to continuousize text data and apply continuous diffusion models directly. Alternatively, some methods model continuous parameters of discrete distributions instead. However, scalability remains a significant challenge for these approaches. For instance, a 1B model may require 64 times the compute of an ARM to achieve comparable performance.

Another approach replaces continuous diffusion with discrete processes featuring new forward and reverse dynamics, leading to numerous variants. The original diffusion model paper introduced both continuous-state and discrete-state transition kernels under a unified diffusion framework. Austin et al. was among the pioneering works that introduced discrete diffusion models into language modeling, demonstrating the feasibility of this approach. Lou et al. showed that masked diffusion, as a special case of discrete diffusion, achieves perplexity comparable to or surpassing ARMs at GPT-2 scale. Shi et al., Sahoo et al., Ou et al. established fundamental theoretical results, which motivated our model design, training, and inference. Nie et al. introduced the scaling laws for MDMs in language modeling and explored how MDMs can be leveraged for language tasks such as question answering at the GPT-2 scale. Gong et al. demonstrated the potential of fine-tuning an ARM within the MDM framework. However, the improvements observed by Gong et al. are limited to specific metrics, and their approach does not address the performance achievable through pure diffusion-based training. Concurrent work demonstrates the potential of diffusion language models in code generation and highlights their advantages in inference efficiency. Nonetheless, as it is a closed-source product, specific details such as training procedures and sampling methods remain unknown.

In comparison, this study scales MDM to an unprecedented size of 8B parameters from scratch, achieving performance comparable to leading LLMs such as LLaMA 3.

Additionally, a parallel line of work on image generation aligns well with the application of MDMs to text data. Moreover, MDMs have also shown promise in other domains such as protein generation. Notably, a series of studies have explored techniques such as architectural optimization, distillation, and sampling algorithm design to accelerate MDM sampling.

Conclusion and Discussion

We introduce LLaDA, a diffusion language model trained from scratch with an unprecedented scale of 8B parameters. LLaDA demonstrates strong capabilities in scalability, in-context learning, and instruction-following, achieving performance comparable to strong LLMs such as LLaMA3. In addition, LLaDA offers unique advantages, such as bidirectional modeling and enhanced robustness, effectively addressing several limitations of existing LLMs. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that these essential capabilities are inherently tied to ARMs. These results point toward a new paradigm for language modeling and uncover novel insights into its foundations.

Limitations. While promising, the full potential of diffusion models remains to be fully explored. Several limitations of this work present significant opportunities for future research. The generation length is a user-specified hyperparameter. Although LLaDA is insensitive to this hyperparameter as detailed in Appendix B.5, we believe that adopting an adaptive generation length would offer a more efficient solution. Due to computational constraints, direct comparisons between LLaDA and ARMs, such as training on identical datasets, were restricted to a computational budget of less than $10^{23}$ FLOPs. To allocate resources for training the largest possible LLaDA model and showcasing its potential, we were unable to scale the ARM baseline to the same extent. Moreover, no specialized attention mechanisms or position embeddings were designed for LLaDA, nor were any system-level architectural optimizations such as KV cache applied. On the inference side, more efficient and controllable sampling algorithms remain preliminary. Furthermore, LLaDA has yet to undergo alignment with reinforcement learning, which is crucial for improving its performance and alignment with human intent.

Looking ahead, both the model scale and the amount of training data for LLaDA remain smaller than those of leading ARM counterparts, highlighting the need for further scaling to fully evaluate its capabilities. In addition, LLaDA's ability to process multi-modal data remains unexplored. Its impact on prompt tuning techniques and integration into agent-based systems is still not fully understood. Finally, a systematic investigation into post-training for LLaDA (e.g., O1-like systems) is needed to further unlock the potential of diffusion language models.
