LoRA: Low-Rank Adaptation of Large Language Models
Abstract
The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on par with or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA.
Introduction
Many applications in natural language processing rely on adapting one large scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that it requires storing as many parameters as in the original model. As larger models are trained every few months, this changes from a mere "inconvenience" for GPT-2 or BERT-large to a critical deployment challenge for GPT-3 with 175 billion trainable parameters.
Many researchers sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to load a small number of task-specific parameters to the pre-trained model for each task, which greatly boosts the deployment efficiency. However, in practice, existing techniques either introduce inference latency by extending model depth or reduce the model's usable sequence length. More importantly, these prior attempts sometimes fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.
We take inspiration from [1, 20] which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the update matrices in language model adaptation also have a low "intrinsic rank", leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train every dense layer in a neural network indirectly by injecting and optimizing rank decomposition matrices of the dense layer's update instead, while keeping the original matrices frozen, as shown in Fig. 1. Using GPT-3 175B as an example, we show that a very low rank (i.e., $r$ in Fig. 1 can be one or two) suffices even when the full rank (i.e., $d$) is as high as 12,288, making LoRA both storage- and compute-efficient.
LoRA possesses several key advantages as follows.
- A single pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can keep the shared original model in VRAM and efficiently switch tasks by replacing the matrices $A$ and $B$ in Fig. 1, which significantly reduces the storage requirement and task-switching overhead.
- It makes training more efficient and lowers the hardware barrier to entry by 3 times, since we do not need to calculate the gradients or maintain the optimizer states for most model parameters when using adaptive optimizers. Instead, we only optimize the injected low-rank matrices, which have far fewer parameters.
- Its simple linear design allows us to merge the update matrices with the original weights during deployment, introducing no inference latency.
- LoRA is orthogonal to prior techniques and can be combined with many of them (such as prefix-tuning). We provide an example in Appendix D.
Terminologies. We make frequent references to the Transformer architecture and use the conventional terminologies for its dimensions. We call its hidden size, or the size of its activations, $d_{model}$. We use $W_q$, $W_k$, $W_v$, and $W_o$ to refer to the query, key, value, and output projection matrices in the self-attention module, $W_0$ to denote a pre-trained weight matrix and $\Delta W$ its accumulated update during adaptation, and $r$ to denote the rank of a LoRA module.
Problem Statement
While our proposal is agnostic to the training objective, we focus on language modeling as the primary use case. Below is a brief description of the language modeling problem and, in particular, the maximization of conditional probabilities given a task-specific prompt.
Suppose we are given a pre-trained autoregressive language model $P_{\Phi}(y \mid x)$ parametrized by $\Phi$. Each downstream task is represented by a training dataset of context-target pairs $\mathcal{Z} = \{(x_i, y_i)\}_{i=1,\dots,N}$, where both $x_i$ and $y_i$ are sequences of tokens.
During fine-tuning, the model is initialized with pre-trained parameters $\Phi_0$ and updated to $\Phi_0 + \Delta\Phi$ by repeatedly following the gradient to maximize the conditional language modeling objective $\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi}(y_t \mid x, y_{<t})$, where $\mathcal{Z}$ denotes the training set of context-target pairs for the task.
One of the main drawbacks for full fine-tuning is that for each downstream task, we learn a different set of parameters $\Delta\Phi$ whose dimension $|\Delta\Phi|$ equals $|\Phi_0|$. Thus, if the pre-trained model is large (such as GPT-3 with $|\Phi_0| \approx 175$ billion), storing and deploying many independent instances of fine-tuned models is challenging, if feasible at all.
In this paper, we adopt a parameter-efficient approach, where the task-specific parameter increment $\Delta\Phi = \Delta\Phi(\Theta)$ is further encoded by a much smaller-sized set of parameters $\Theta$ with $|\Theta| \ll |\Phi_0|$. The task of finding $\Delta\Phi$ thus becomes optimizing over $\Theta$.
As we shall see in the subsequent sections, we propose to use a low-rank representation to encode $\Delta\Phi$ that is both compute- and memory-efficient. When the pre-trained model is GPT-3 175B, the number of trainable parameters $|\Theta|$ can be as small as 0.01% of $|\Phi_0|$.
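To make the savings concrete, a back-of-the-envelope count (with illustrative GPT-3-scale dimensions; the exact shapes vary per layer) compares a full update for one $d \times k$ weight matrix with its rank-$r$ factorization $BA$:

```python
# Illustrative parameter count: a full update dW for one d x k weight matrix
# versus its rank-r factorization B (d x r) @ A (r x k).
d, k, r = 12288, 12288, 4  # d, k roughly GPT-3-scale; r is the LoRA rank

full_update_params = d * k       # |dW| for full fine-tuning of one matrix
lora_params = d * r + r * k      # |B| + |A| for the low-rank update

print(full_update_params)                # 150994944
print(lora_params)                       # 98304
print(full_update_params / lora_params)  # 1536.0, i.e. ~1500x fewer trainables
```

Summed over all adapted matrices, this per-matrix ratio is what drives the overall reduction in $|\Theta|$ relative to $|\Phi_0|$.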
Our Method
We describe the simple design of LoRA and its practical implications. The principles outlined here apply to any dense layers in deep learning models, though we only focus on certain weights in Transformers in our experiments for practical reasons.
Low-Rank Constraint on Update Matrices. A typical neural network contains numerous dense layers that perform matrix multiplication. The weight matrices in these layers are allowed to have full rank. When adapting to a specific task, however, Aghajanyan et al. show that the pre-trained language models have a low "intrinsic dimension" and can still learn efficiently despite a low-dimensional reparametrization. Inspired by this observation, we wonder if the updates to the weights also have a low "intrinsic rank" when adapting to downstream tasks. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain trainable parameters. Note that both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, our modified forward pass yields $h = W_0 x + \Delta W x = W_0 x + BAx$.
We illustrate our reparametrization in Fig. 1. We use a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training. We then scale $\Delta W x$ by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate if we scale the initialization appropriately.
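A minimal NumPy sketch of this reparametrization (hypothetical sizes; a real implementation would live in a deep learning framework and train $A$ and $B$ by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8  # hypothetical sizes; alpha is the scaling constant

W0 = rng.standard_normal((d, k))        # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01  # random Gaussian init
B = np.zeros((d, r))                    # zero init, so BA = 0 at start

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# At initialization the adapted model matches the pre-trained model exactly.
assert np.allclose(lora_forward(x), W0 @ x)

# Whatever values A and B take during training, the update BA has rank <= r.
B = rng.standard_normal((d, r))
delta_W = B @ A
assert np.linalg.matrix_rank(delta_W) <= r
```

The zero initialization of $B$ is what makes the injection safe: training starts from the pre-trained model's behavior and only gradually deviates from it.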
Weight Decay to Pre-trained Weights. We note that weight decay behaves differently with LoRA than with full fine-tuning. Specifically, performing the usual weight decay on $A$ and $B$ decays the update $\Delta W = BA$ toward zero, which amounts to decaying the adapted weight $W_0 + BA$ toward the pre-trained $W_0$, rather than toward zero as in full fine-tuning.
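A small numeric check of this behavior (illustrative sizes; a single multiplicative shrink step stands in for the optimizer's weight decay):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 16, 16, 2
W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

decay = 0.9  # one multiplicative decay step applied to the trainable matrices
A2, B2 = decay * A, decay * B

# Decaying A and B shrinks the update BA (by decay**2 here)...
assert np.allclose(B2 @ A2, decay**2 * (B @ A))
# ...so the merged weight W0 + BA moves toward W0, not toward zero.
dist_before = np.linalg.norm((W0 + B @ A) - W0)
dist_after = np.linalg.norm((W0 + B2 @ A2) - W0)
assert dist_after < dist_before
```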
No Additional Inference Latency. During deployment, we can explicitly compute and store $W = W_0 + BA$ and perform inference as usual. When we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and then add a different $B'A'$, a quick operation with very little memory overhead. Critically, this guarantees by construction that we do not introduce any additional latency during inference compared to a fine-tuned model.
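The merge-and-switch procedure can be sketched as follows (NumPy, hypothetical shapes; $B_1 A_1$ and $B_2 A_2$ stand for the learned updates of two tasks):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 32, 32, 4
W0 = rng.standard_normal((d, k))
B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, k))  # task 1
B2, A2 = rng.standard_normal((d, r)), rng.standard_normal((r, k))  # task 2

# Merge task 1's update for deployment: inference is one matmul per layer,
# exactly as with the original model, so no extra latency.
W = W0 + B1 @ A1
x = rng.standard_normal(k)
assert np.allclose(W @ x, W0 @ x + B1 @ (A1 @ x))

# Switch to task 2 by subtracting one low-rank update and adding the other.
W = W - B1 @ A1 + B2 @ A2
assert np.allclose(W, W0 + B2 @ A2)
```

Only the small factors need to be stored per task; the merged weight can be rebuilt from $W_0$ and the factors at any time.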
Applying LoRA to Transformer
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module ($W_q$, $W_k$, $W_v$, $W_o$) and two in the MLP module. We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules, both for simplicity and parameter-efficiency.
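A toy sketch of this selective adaptation, attaching low-rank updates only to the query and value projections while all four attention matrices stay frozen (hypothetical sizes; $B$ is zero-initialized as described earlier):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 64, 4  # hypothetical model width and LoRA rank

# Frozen self-attention projections (standing in for W_q, W_k, W_v, W_o).
W = {name: rng.standard_normal((d, d)) for name in ("q", "k", "v", "o")}

# Attach low-rank updates (B, A) only to the query and value projections.
lora = {name: (np.zeros((d, r)), rng.standard_normal((r, d)) * 0.01)
        for name in ("q", "v")}

def project(name, x):
    h = W[name] @ x
    if name in lora:            # LoRA path only exists for adapted matrices
        B, A = lora[name]
        h = h + B @ (A @ x)
    return h

trainable = sum(B.size + A.size for B, A in lora.values())
frozen = sum(m.size for m in W.values())
print(trainable, frozen)  # 1024 16384
```

Even in this toy setting, the trainable parameters are a small fraction of the frozen attention weights, and the gap widens at realistic model widths.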
Practical Benefits and Limitations. The most significant benefit comes from the reduction in memory and storage usage. For a Transformer trained with Adam, we reduce the VRAM consumption by up to 2/3 if $r \ll d_{model}$, since we do not need to store the optimizer states for the frozen parameters. On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to 350GB. With $r = 4$ and only the query and value projection matrices being adapted, the checkpoint size is reduced by roughly 10,000 times (from 350GB to 35MB).
On the other hand, LoRA has its limitations. For example, it is not straightforward to batch inputs to different tasks with different $A$ and $B$ in a single forward pass, if one chooses to absorb $A$ and $B$ into $W$ to eliminate additional inference latency. Though it is possible to not merge the weights and dynamically choose the LoRA modules to use for samples in a batch in scenarios where latency is not critical.
Related Works
Transformer Language Models. Transformer is a sequence-to-sequence architecture that makes heavy use of self-attention. [31] applied it to autoregressive language modeling by using a stack of Transformer decoders. Since then, Transformer-based language models have dominated NLP, achieving the state-of-the-art in many tasks. A new paradigm emerged with BERT and GPT-2 – both are large Transformer language models trained on a large amount of text – where fine-tuning on task-specific data after pre-training on general domain data provides a significant performance gain compared to training on task-specific data directly. Training larger Transformers generally results in better performance and remains an active research direction. GPT-3 is the largest single Transformer language model trained to-date with 175B parameters.
Prompt Engineering and Fine-Tuning. While GPT-3 175B can adapt its behavior with just a few additional training examples, the result depends heavily on the input prompt. This necessitates an empirical art of composing and formatting the prompt to maximize a model's performance on a desired task, which is known as prompt engineering or prompt hacking. Fine-tuning retrains a model pre-trained on general domains to a specific task. Variants of it include learning just a subset of the parameters, yet practitioners often retrain all of them to maximize the downstream performance. However, the enormity of GPT-3 175B makes it challenging to perform fine-tuning in the usual way due to the large checkpoint it produces and the high hardware barrier to entry since it has the same memory footprint as pre-training.
Parameter-Efficient Adaptation. [15, 34] propose inserting adapter layers between existing layers in a neural network. Our method uses a bottleneck structure similar to [15] to impose a low-rank constraint on the weight updates. The key functional difference is that our learned weights can be merged with the main weights during inference, thus not introducing any latency, which is not the case for the adapter layers. More recently, [14, 19, 21, 25] proposed optimizing the input word embeddings in lieu of fine-tuning, akin to a continuous and differentiable generalization of prompt engineering. We include comparisons with [21] in our experiment section. However, this line of work can only scale up by using more special tokens in the prompt, which take up available sequence length for task tokens when positional embeddings are learned.
Low-Rank Structures in Deep Learning. Low-rank structure is very common in machine learning; many machine learning problems have a certain intrinsic low-rank structure. Moreover, it is known that for many deep learning tasks, especially those with a heavily over-parametrized neural network, the learned neural network will enjoy low-rank properties after training. Some prior works even explicitly impose the low-rank constraint when training the original neural network; however, to the best of our knowledge, none of these works considers low-rank updates for adaptation to downstream tasks. In the theory literature, it is known that neural networks outperform other classical learning methods, including the corresponding (finite-width) neural tangent kernels, when the underlying concept class has a certain low-rank structure. Another theoretical result in [4] suggests that low-rank adaptations can be useful for adversarial training. In sum, we believe that our proposed low-rank adaptation update is well-motivated by the literature.
Conclusion and Future Work
Fine-tuning enormous language models is prohibitively expensive in terms of both the hardware requirement and the storage/switching cost for hosting multiple instances. We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining model quality. Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters. While we focused on Transformer, the proposed principles are generally applicable to any neural networks with dense layers.
LoRA can potentially work in tandem with other fine-tuning techniques. In the future, we hope to explore only tuning some layers or adding adversarial training. Finally, the rank-deficiency of $\Delta W$ suggests that the pre-trained weight matrices could be rank-deficient as well, which can also inspire future work.