Large Language Diffusion Models
Abstract
The capabilities of large language models (LLMs) are widely regarded as relying on autoregressive models (ARMs). We challenge this notion by introducing LLaDA, a diffusion model trained from scratch under the pre-training and supervised fine-tuning (SFT) paradigm. LLaDA employs a forward data masking process and a reverse generation process, parameterized by a Transformer to predict masked tokens. It provides a principled generative approach for probabilistic inference by optimizing a likelihood lower bound. Across extensive benchmarks on general tasks, math, code, and so on, LLaDA demonstrates strong scalability and performs comparably to our self-constructed ARM baselines. Remarkably, LLaDA 8B is competitive with strong LLMs like LLaMA3 8B in in-context learning and, after SFT, exhibits impressive instruction-following abilities in case studies such as multi-turn dialogue. Moreover, LLaDA addresses the reversal curse, surpassing GPT-4o in a reversal poem completion task. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that core LLM capabilities discussed above inherently depend on ARMs.
Introduction
Large language models (LLMs) fall entirely within the framework of generative modeling. Specifically, LLMs aim to capture the true but unknown language distribution $p_{\text{data}}(x)$ by optimizing a model distribution $p_\theta(x)$ through maximum likelihood estimation, or equivalently KL divergence minimization between the two distributions:

$$\max_\theta \, \mathbb{E}_{p_{\text{data}}(x)}\left[\log p_\theta(x)\right] \;\Leftrightarrow\; \min_\theta \, \mathrm{KL}\!\left(p_{\text{data}}(x) \,\|\, p_\theta(x)\right). \quad (1)$$
The predominant approach relies on autoregressive modeling (ARM), commonly referred to as the "next-token prediction" paradigm, to define the model distribution:

$$p_\theta(x) = p_\theta(x^1) \prod_{i=2}^{L} p_\theta(x^i \mid x^1, \ldots, x^{i-1}), \quad (2)$$

where $x$ is a sequence of length $L$ and $x^i$ denotes its $i$-th token.
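As a toy illustration of Eq. (2), with made-up conditional probabilities rather than model outputs, the autoregressive factorization turns the joint probability into a product of next-token conditionals, so the sequence log-likelihood is the sum of next-token log-probabilities:

```python
import math

# Hypothetical next-token conditionals for a 4-token sequence:
# p(x^1), p(x^2 | x^1), p(x^3 | x^1, x^2), p(x^4 | x^1, x^2, x^3).
cond_probs = [0.2, 0.5, 0.9, 0.7]

# Eq. (2): the joint factorizes into the product of conditionals,
# so log p(x) is the sum of the per-token log-probabilities.
log_likelihood = sum(math.log(p) for p in cond_probs)
sequence_prob = math.exp(log_likelihood)

assert abs(sequence_prob - 0.2 * 0.5 * 0.9 * 0.7) < 1e-12
```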
We argue that the answer is not a simple "yes". The key insight overlooked previously is: It is the generative modeling principles (i.e., Eq. (1)), rather than the autoregressive formulation (i.e., Eq. (2)) itself, that fundamentally underpin the essential properties of LLMs.
In particular, we argue that scalability is primarily a consequence of the interplay between Transformers, model size, data size, and Fisher consistency induced by the generative principles in Eq. (1), rather than a unique result of the ARMs in Eq. (2). The success of diffusion transformers on visual data supports this claim. Furthermore, the instruction-following and in-context learning capabilities appear to be intrinsic properties of all conditional generative models on structurally consistent linguistic tasks, rather than exclusive advantages of ARMs. In addition, while ARMs can be interpreted as a lossless data compressor, any sufficiently expressive probabilistic model can achieve similar capabilities.
However, certain inherent limitations of LLMs can be directly attributed to their autoregressive nature. For instance, the left-to-right generation process restricts their ability to handle reversal reasoning tasks, highlighting a representative failure in the generalization capabilities of current models.
Motivated by these insights, we introduce LLaDA (Large Language Diffusion with mAsking) to investigate whether the capabilities exhibited by LLMs can emerge from generative modeling principles beyond ARMs, thereby addressing the fundamental question posed earlier. In contrast to traditional ARMs, LLaDA leverages a masked diffusion model (MDM), which incorporates a forward data masking process and trains a mask predictor to approximate its reverse process. This design enables LLaDA to construct a model distribution with bidirectional dependencies and optimize a variational lower bound of its log-likelihood, offering a principled and previously unexplored perspective on the core capabilities of LLMs discussed above.
We adopt the standard pipeline of data preparation, pre-training, supervised fine-tuning (SFT), and evaluation, scaling LLaDA to an unprecedented language diffusion of size 8B. In particular, LLaDA 8B was pre-trained from scratch on 2.3 trillion tokens using 0.13 million H800 GPU hours, followed by SFT on 4.5 million pairs. Across diverse tasks, including language understanding, math, code, and Chinese, LLaDA demonstrates the following contributions:
- LLaDA scales effectively to a compute budget of $10^{23}$ FLOPs, achieving comparable results to ARM baselines trained on the same data across six tasks, e.g., MMLU and GSM8K.
- The pre-trained LLaDA 8B Base surpasses LLaMA2 7B Base on nearly all 15 standard zero/few-shot learning tasks while performing on par with LLaMA3 8B Base, showcasing effective in-context learning capability.
- LLaDA significantly enhances the ability to follow instructions after SFT, as demonstrated in case studies such as multi-turn dialogue.
- LLaDA effectively breaks the reversal curse with consistent performance across forward and reversal tasks. Notably, it outperforms GPT-4o in a reversal poem completion task.
Approach
In this section, we introduce the probabilistic formulation, along with the pre-training, supervised fine-tuning, and inference procedures for LLaDA, as illustrated in Fig. 2.
2.1 Probabilistic Formulation
Unlike ARMs in Eq. (2), LLaDA defines a model distribution $p_\theta(x_0)$ through a forward process and a reverse process. The forward process gradually and independently masks tokens in $x_0$ until the sequence is fully masked at $t = 1$, while the reverse process recovers the data distribution by iteratively predicting masked tokens as $t$ moves from $1$ to $0$.
The core of LLaDA is a mask predictor, a parametric model $p_\theta(\cdot \mid x_t)$ that takes the partially masked sequence $x_t$ as input and predicts all masked tokens (denoted M) simultaneously. It is trained using a cross-entropy loss computed only on the masked tokens:

$$\mathcal{L}(\theta) \triangleq -\mathbb{E}_{t,\, x_0,\, x_t}\left[\frac{1}{t}\sum_{i=1}^{L} \mathbf{1}[x_t^i = \text{M}] \log p_\theta(x_0^i \mid x_t)\right], \quad (3)$$

where $x_0$ is sampled from the training data, $t$ is sampled uniformly from $[0, 1]$, and $x_t$ is sampled from the forward process. The indicator function $\mathbf{1}[\cdot]$ ensures that the loss is computed only for masked tokens.
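To make the training objective above concrete, here is a minimal single-sample sketch of its Monte Carlo estimate; `MASK`, the token ids, and the uniform predictor are illustrative placeholders, not LLaDA's actual implementation:

```python
import numpy as np

MASK = -1  # placeholder id for the mask token M


def masked_diffusion_loss(x0, xt, log_probs, t):
    """Single-sample estimate of the masked cross-entropy objective.

    x0:        (L,) clean token ids
    xt:        (L,) partially masked ids (MASK at masked positions)
    log_probs: (L, V) mask-predictor log-probabilities per position
    t:         masking level sampled uniformly from [0, 1]
    """
    is_masked = xt == MASK
    # Log-probability the predictor assigns to each true token.
    token_ll = log_probs[np.arange(len(x0)), x0]
    # Cross-entropy on masked positions only, weighted by 1/t.
    return -(token_ll * is_masked).sum() / t


# Tiny example: 3 tokens, vocabulary of 4, a uniform predictor,
# and the middle token masked at level t = 0.5.
x0 = np.array([2, 0, 1])
xt = np.array([2, MASK, 1])
log_probs = np.log(np.full((3, 4), 0.25))
loss = masked_diffusion_loss(x0, xt, log_probs, t=0.5)
```

The 1/t weighting is what distinguishes this principled bound from a plain fixed-ratio masked cross-entropy.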
Once trained, we can simulate a reverse process (see Sec. 2.4 for details) parameterized by the mask predictor and define the model distribution $p_\theta(x_0)$ as the marginal distribution induced at $t = 0$. Notably, the loss in Eq. (3) has been proven to be an upper bound on the negative log-likelihood of the model distribution:

$$-\mathbb{E}_{p_{\text{data}}(x_0)}\left[\log p_\theta(x_0)\right] \leq \mathcal{L}(\theta). \quad (4)$$
Notably, LLaDA employs a masking ratio that varies randomly between 0 and 1, while BERT uses a fixed ratio. This subtle difference has significant implications, especially at scale: as shown in Eq. (4), LLaDA is a principled generative model with the potential to perform in-context learning and instruction-following naturally, akin to LLMs. Moreover, its generative perspective implies strong scalability with large data and models, as discussed in Sec. 1. In addition, MaskGIT adopts a heuristic training objective, which misses the $\frac{1}{t}$ weighting term in Eq. (3) and therefore lacks the corresponding likelihood bound.
2.2 Pre-training
LLaDA employs a Transformer as the mask predictor, similar to existing LLMs. However, LLaDA does not use a causal mask, as its formulation allows it to see the entire input for predictions.
We trained two variants of LLaDA with different sizes: 1B and 8B. We summarize the model architecture of LLaDA 8B and LLaMA3 8B here, and details are provided in Appendix B.2. We have ensured consistency in most hyperparameters while making several necessary modifications. We use vanilla multi-head attention instead of grouped query attention for simplicity, as LLaDA is incompatible with KV caching, resulting in a different number of key and value heads. Consequently, the attention layer has more parameters, and we reduce the FFN dimension to maintain a comparable model size. Additionally, the vocabulary size differs due to a tokenizer adapted on our data.
The LLaDA model is pre-trained on a dataset comprising 2.3 trillion (T) tokens, adhering to a data protocol that aligns closely with existing LLMs, without the incorporation of any special techniques. The data are derived from online corpora, with low-quality content filtered through manually designed rules and LLM-based approaches. Beyond general text, the dataset encompasses high-quality code, math, and multilingual data. Please refer to Appendix B.1 for more details about datasets. The mixing of data sources and domains is guided by scaled-down ARMs. The pre-training process utilizes a fixed sequence length of 4096 tokens, incurring a total computational cost of 0.13 million H800 GPU hours, similar to ARMs of the same scale and dataset size.
For a training sequence $x_0$, we randomly sample $t \in [0, 1]$, mask each token independently with probability $t$ to obtain $x_t$, and estimate Eq. (3) via the Monte Carlo method for stochastic gradient descent training.
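The sampling of $x_t$ for one training step can be sketched as follows; `MASK` and the toy token ids are placeholders:

```python
import numpy as np

MASK = -1  # placeholder id for the mask token M


def forward_mask(x0, rng):
    """One draw of the forward process: sample t ~ U[0, 1], then mask
    each token of x0 independently with probability t."""
    t = rng.uniform(0.0, 1.0)
    xt = np.where(rng.uniform(size=len(x0)) < t, MASK, x0)
    return xt, t


rng = np.random.default_rng(0)
x0 = np.arange(10)  # a toy "training sequence"
xt, t = forward_mask(x0, rng)
# Each position is either untouched or replaced by MASK; on average a
# fraction t of the tokens is masked.
```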
We adopted the Warmup-Stable-Decay learning rate scheduler to monitor the training progress without interrupting continuous training. Specifically, we linearly increased the learning rate from 0 to its peak value over the first 2000 iterations, then held it stable for the majority of training before the final decay phase.
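A generic Warmup-Stable-Decay schedule has the following shape; the phase lengths and peak value used below are placeholders, not the exact settings used for LLaDA:

```python
def wsd_lr(step, peak_lr, warmup_steps, stable_steps, decay_steps):
    """Generic Warmup-Stable-Decay schedule: linear warmup from 0 to
    peak_lr, a long constant phase, then a linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    progressed = step - warmup_steps - stable_steps
    return max(peak_lr * (1.0 - progressed / decay_steps), 0.0)
```

Because the stable phase dominates, intermediate checkpoints share the same learning rate, which is what makes it possible to monitor progress without interrupting training.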
2.3 Supervised Fine-Tuning
We enhance the capability of LLaDA to follow instructions by supervised fine-tuning (SFT) with paired data $(p_0, r_0)$, where $p_0$ denotes the prompt and $r_0$ denotes the response.
The implementation is similar to pre-training. As shown in Fig. 2 (b), we leave the prompt unchanged and mask the tokens in the response independently, as done for $x_0$ in pre-training. Then, we feed both the prompt $p_0$ and the masked response $r_t$ to the pre-trained mask predictor and compute the loss only on the masked response tokens:

$$-\mathbb{E}_{t,\, p_0,\, r_0,\, r_t}\left[\frac{1}{t}\sum_{i=1}^{L'} \mathbf{1}[r_t^i = \text{M}] \log p_\theta(r_0^i \mid p_0, r_t)\right], \quad (5)$$

where $L'$ denotes the (dynamic) length of the response, and all other notations remain the same as in Eq. (3).
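Relative to the pre-training objective, the only change is which tokens are eligible for masking; a minimal sketch, with `MASK` as a placeholder id:

```python
import numpy as np

MASK = -1  # placeholder id for the mask token M


def sft_forward_mask(prompt, response, rng):
    """SFT variant of the forward process: the prompt is left clean and
    only response tokens are masked, each independently with prob. t."""
    t = rng.uniform(0.0, 1.0)
    rt = np.where(rng.uniform(size=len(response)) < t, MASK, response)
    # The mask predictor conditions on the clean prompt concatenated
    # with the partially masked response.
    return np.concatenate([prompt, rt]), t
```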
Note that this approach is fully compatible with pre-training. Essentially, the concatenation of $p_0$ and $r_0$ plays the role of the clean pre-training data $x_0$, while the concatenation of $p_0$ and the masked response $r_t$ serves as its masked version $x_t$. The process is identical to pre-training, with the only difference being that all masked tokens happen to appear in the $r_0$ portion.
The LLaDA 8B model undergoes SFT on a dataset comprising 4.5 million pairs. Consistent with the pre-training process, both data preparation and training follow the SFT protocols utilized in existing LLMs, without introducing any additional techniques to optimize LLaDA's performance. The dataset spans multiple domains, including code, mathematics, and instruction-following. We append |EOS| tokens to the end of short pairs in each mini-batch to ensure equal lengths across all data. We treat |EOS| as a normal token during training and remove it during sampling, enabling LLaDA to control the response length automatically. Please refer to Appendix B.1 for more details.
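The |EOS| padding step can be sketched as follows; the token string and helper name are illustrative:

```python
EOS = "<EOS>"  # placeholder string for the |EOS| token


def pad_minibatch(pairs):
    """Append EOS tokens to the responses of short (prompt, response)
    pairs so every example in the mini-batch has the same total length.
    EOS is treated as a normal token during training and stripped at
    sampling time, letting the model control response length itself."""
    max_len = max(len(p) + len(r) for p, r in pairs)
    return [(p, r + [EOS] * (max_len - len(p) - len(r))) for p, r in pairs]
```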
We train for 3 epochs on the SFT data using a similar schedule to the pre-training phase. The learning rate is linearly increased from 0 to its peak value over the first 50 iterations.
2.4 Inference
As a generative model, LLaDA can sample new text and evaluate the likelihood of candidate text in a diffusion manner instead of the left-to-right autoregressive fashion.
We begin with the reverse generation process. As illustrated in Fig. 2 (c), given a prompt $p_0$, we discretize the reverse process to sample from the model distribution $p_\theta(r_0 \mid p_0)$, starting from a fully masked response. The total number of sampling steps is a hyperparameter, which naturally provides a trade-off between efficiency and sample quality.
At an intermediate step from time $t \in (0, 1]$ to $s \in [0, t)$, we feed both $p_0$ and $r_t$ into the mask predictor and predict all masked tokens simultaneously. Subsequently, we remask $\frac{s}{t}$ of the predicted tokens in expectation to obtain $r_s$, ensuring that the transition of the reverse process aligns with the forward process for accurate sampling.
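Putting these steps together, the discretized reverse process can be sketched as below; `mask_predictor` is a stand-in for the trained network, and the random remasking follows the s/t rule described above:

```python
import numpy as np

MASK = -1  # placeholder id for the mask token M


def sample_response(mask_predictor, prompt, length, steps, rng):
    """Discretized reverse process: start from a fully masked response
    and, at each step from time t to s, predict all masked tokens, then
    remask each fresh prediction with probability s / t."""
    r = np.full(length, MASK)
    for step in range(steps, 0, -1):
        t, s = step / steps, (step - 1) / steps
        was_masked = r == MASK
        pred = mask_predictor(prompt, r)   # (length,) predicted token ids
        r = np.where(was_masked, pred, r)  # fill in the masked positions
        if s > 0:
            # Remask predictions with probability s / t so the state at
            # time s matches the forward process in distribution.
            remask = was_masked & (rng.uniform(size=length) < s / t)
            r = np.where(remask, MASK, r)
    return r  # fully unmasked at t = 0
```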
We mention that LLaDA enables flexible sampling. In particular, it supports autoregressive and block diffusion sampling directly after the pre-training or SFT processes described above, without requiring any further modifications or training. We provide a detailed analysis in Appendix B.4. Nevertheless, the diffusion sampling (i.e., the reverse generation process) yields the best performance and is adopted as the default throughout this paper, especially for all experiments presented in Sec. 3.
For conditional likelihood evaluation, we can naturally utilize the upper bound in Eq. (5). However, we find that the following equivalent form exhibits lower variance and is more stable:

$$-\mathbb{E}_{l,\, r_0,\, r_l}\left[\frac{L}{l}\sum_{i=1}^{L} \mathbf{1}[r_l^i = \text{M}] \log p_\theta(r_0^i \mid p_0, r_l)\right], \quad (6)$$

where $l$ is sampled uniformly from $\{1, 2, \ldots, L\}$, and $r_l$ is obtained by uniformly masking $l$ tokens in $r_0$ without replacement.
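A sketch of this estimator; `log_prob_fn` is a hypothetical stand-in for querying the mask predictor for the log-probability of the true token at a masked position:

```python
import numpy as np

MASK = -1  # placeholder id for the mask token M


def conditional_loglik_bound(log_prob_fn, prompt, r0, num_mc, rng):
    """Monte Carlo estimate of the low-variance form: sample l uniformly
    from {1, ..., L}, mask exactly l response tokens without replacement,
    and average (L / l) times the masked-token log-probabilities."""
    L = len(r0)
    total = 0.0
    for _ in range(num_mc):
        l = int(rng.integers(1, L + 1))
        idx = rng.choice(L, size=l, replace=False)
        rl = r0.copy()
        rl[idx] = MASK
        total += (L / l) * sum(log_prob_fn(prompt, rl, i) for i in idx)
    return total / num_mc
```

Masking exactly l tokens, rather than each token independently, removes one source of randomness, which is why this form has lower variance.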
We present the training and inference algorithms, along with theoretical details, in Appendix A.
Related Work
Diffusion models have achieved remarkable success in visual domains but remain unverified for large-scale language modeling, despite growing interest and extensive research efforts.
A simple approach is to map text into a continuous space and apply continuous diffusion models directly. Alternatively, some methods model continuous parameters of discrete distributions instead. However, scalability remains a significant challenge for these approaches. For instance, a 1B model may require 64 times the compute of an ARM to achieve comparable performance.
Another approach replaces continuous diffusion with discrete processes featuring new forward and reverse dynamics, leading to numerous variants. The original diffusion model paper introduced both continuous-state and discrete-state transition kernels under a unified diffusion framework. Austin et al. was among the pioneering works that introduced discrete diffusion models into language modeling, demonstrating the feasibility of this approach. Lou et al. showed that masked diffusion, as a special case of discrete diffusion, achieves perplexity comparable to or surpassing ARMs at GPT-2 scale. Shi et al., Sahoo et al., Ou et al. established fundamental theoretical results, which motivated our model design, training, and inference. Nie et al. introduced the scaling laws for MDMs in language modeling and explored how MDMs can be leveraged for language tasks such as question answering at the GPT-2 scale. Gong et al. demonstrated the potential of fine-tuning an ARM within the MDM framework. However, the improvements observed by Gong et al. are limited to specific metrics, and their approach does not address the performance achievable through pure diffusion-based training. Concurrent work demonstrates the potential of diffusion language models in code generation and highlights their advantages in inference efficiency. Nonetheless, as it is a closed-source product, specific details such as training procedures and sampling methods remain unknown.
In comparison, this study scales MDM to an unprecedented size of 8B parameters from scratch, achieving performance comparable to leading LLMs such as LLaMA 3.
Additionally, a parallel line of work on image generation aligns well with the application of MDMs to text data. Moreover, MDMs have also shown promise in other domains such as protein generation, where they have achieved promising results. Notably, a series of studies have explored techniques such as architectural optimization, distillation, and sampling algorithm design to accelerate MDMs sampling.
Conclusion and Discussion
We introduce LLaDA, a diffusion language model trained from scratch with an unprecedented scale of 8B parameters. LLaDA demonstrates strong capabilities in scalability, in-context learning, and instruction-following, achieving performance comparable to strong LLMs such as LLaMA3. In addition, LLaDA offers unique advantages, such as bidirectional modeling and enhanced robustness, effectively addressing several limitations of existing LLMs. Our findings show the promise of diffusion models for language modeling at scale and challenge the common assumption that these essential capabilities are inherently tied to ARMs. These results represent a new paradigm for language modeling and uncover novel insights into the origins of LLM capabilities.
Limitations. While promising, the full potential of diffusion models remains to be fully explored, and several limitations of this work present significant opportunities for future research. The generation length is a user-specified hyperparameter. Although LLaDA is insensitive to this hyperparameter, as detailed in Appendix B.5, we believe that adopting an adaptive generation length would offer a more efficient solution. Due to computational constraints, direct comparisons between LLaDA and ARMs, such as training on identical datasets, were restricted to a computational budget of less than $10^{23}$ FLOPs.
Looking ahead, both the model scale and the amount of training data for LLaDA remain smaller than those of leading ARM counterparts, highlighting the need for further scaling to fully evaluate its capabilities. In addition, LLaDA's ability to process multi-modal data remains unexplored. Its impact on prompt tuning techniques and integration into agent-based systems is still not fully understood. Finally, a systematic investigation into post-training for LLaDA (e.g., O1-like systems) is needed to further unlock the potential of diffusion language models.