LoRA: Low-Rank Adaptation of Large Language Models
Abstract
The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on par with or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA.
Introduction
Many applications in natural language processing rely on adapting one large scale, pre-trained language model to multiple downstream applications. Such adaptation is usually done via fine-tuning, which updates all the parameters of the pre-trained model. The major downside of fine-tuning is that it requires storing as many parameters as in the original model. As larger models are trained every few months, this changes from a mere "inconvenience" for GPT-2 or BERT-large to a critical deployment challenge for GPT-3 with 175 billion trainable parameters.
Many researchers sought to mitigate this by adapting only some parameters or learning external modules for new tasks. This way, we only need to load a small number of task-specific parameters to the pre-trained model for each task, which greatly boosts the deployment efficiency. However, in practice, existing techniques either introduce inference latency by extending model depth or reduce the model's usable sequence length. More importantly, these prior attempts sometimes fail to match the fine-tuning baselines, posing a trade-off between efficiency and model quality.
We take inspiration from [1, 20] which show that the learned over-parametrized models in fact reside on a low intrinsic dimension. We hypothesize that the update matrices in language model adaptation also have a low "intrinsic rank", leading to our proposed Low-Rank Adaptation (LoRA) approach. LoRA allows us to train every dense layer in a neural network indirectly by injecting and optimizing rank decomposition matrices of the dense layer's update instead, while keeping the original matrices frozen, as shown in Fig. 1. Using GPT-3 175B as an example, we show that a very low rank (i.e., $r$ in Fig. 1 can be one or two) suffices even when the full rank (i.e., $d$) is as high as 12,288, making LoRA both storage- and compute-efficient.
LoRA possesses several key advantages as follows.
- A single pre-trained model can be shared and used to build many small LoRA modules for different tasks. We can keep the shared original model in VRAM and efficiently switch tasks by replacing the matrices $A$ and $B$ in Fig. 1, which significantly reduces the storage requirement and task-switching overhead.
- It makes training more efficient and lowers the hardware barrier to entry by 3 times, since we do not need to calculate the gradients or maintain the optimizer states for most model parameters when using adaptive optimizers. Instead, we only optimize the injected low-rank matrices, which have far fewer parameters.
- Its simple linear design allows us to merge the update matrices with the original weights during deployment, introducing no inference latency.
- LoRA is orthogonal to prior techniques and can be combined with many of them (such as prefix-tuning). We provide an example in Appendix D.
Terminologies. We make frequent references to the Transformer architecture and use the conventional terminologies for its dimensions. We call its hidden size, or the size of its activations, $d_{model}$. We use $W_q$, $W_k$, $W_v$, and $W_o$ to refer to the query, key, value, and output projection matrices in the self-attention module, $W_0$ to denote a pre-trained weight matrix and $\Delta W$ its accumulated update during adaptation, and $r$ to denote the rank of a LoRA module.
Problem Statement
While our proposal is agnostic to the training objective, we focus on language modeling as the primary use case. Below is a brief description of the language modeling problem and, in particular, the maximization of conditional probabilities given a task-specific prompt.
Suppose we are given a pre-trained autoregressive language model $P_{\Phi}(y \mid x)$ parametrized by $\Phi$. Each downstream task is represented by a training dataset of context-target pairs $\mathcal{Z} = \{(x_i, y_i)\}_{i=1,\dots,N}$, where both $x_i$ and $y_i$ are sequences of tokens.
During fine-tuning, the model is initialized with pre-trained parameters $\Phi_0$ and updated to $\Phi_0 + \Delta\Phi$ by repeatedly following the gradient to maximize the conditional language modeling objective $\max_{\Phi} \sum_{(x,y)\in\mathcal{Z}} \sum_{t=1}^{|y|} \log P_{\Phi}(y_t \mid x, y_{<t})$, where $\mathcal{Z}$ denotes the training set of context-target pairs for the task.
One of the main drawbacks for full fine-tuning is that for each downstream task, we learn a different set of parameters $\Delta\Phi$ whose dimension $|\Delta\Phi|$ equals $|\Phi_0|$. Thus, if the pre-trained model is large (such as GPT-3 with $|\Phi_0| \approx 175$ billion), storing and deploying many independent instances of fine-tuned models is challenging, if feasible at all.
In this paper, we adopt a parameter-efficient approach, where the task-specific parameter increment $\Delta\Phi = \Delta\Phi(\Theta)$ is further encoded by a much smaller-sized set of parameters $\Theta$ with $|\Theta| \ll |\Phi_0|$. The task of finding $\Delta\Phi$ thus becomes optimizing over $\Theta$.
As we shall see in the subsequent sections, we propose to use a low-rank representation to encode $\Delta\Phi$ that is both compute- and memory-efficient. When the pre-trained model is GPT-3 175B, the number of trainable parameters $|\Theta|$ can be as small as 0.01% of $|\Phi_0|$.
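To make the savings concrete, a back-of-the-envelope count (with illustrative GPT-3-scale dimensions; the exact shapes vary per layer) compares a full update for one $d \times k$ weight matrix with its rank-$r$ factorization $BA$:

```python
# Illustrative parameter count: a full update dW for one d x k weight matrix
# versus its rank-r factorization B (d x r) @ A (r x k).
d, k, r = 12288, 12288, 4  # d, k roughly GPT-3-scale; r is the LoRA rank

full_update_params = d * k       # |dW| for full fine-tuning of one matrix
lora_params = d * r + r * k      # |B| + |A| for the low-rank update

print(full_update_params)                # 150994944
print(lora_params)                       # 98304
print(full_update_params / lora_params)  # 1536.0, i.e. ~1500x fewer trainables
```

Summed over all adapted matrices, this per-matrix ratio is what drives the overall reduction in $|\Theta|$ relative to $|\Phi_0|$.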
Our Method
We describe the simple design of LoRA and its practical implications. The principles outlined here apply to any dense layers in deep learning models, though we only focus on certain weights in Transformers in our experiments for practical reasons.
Low-Rank Constraint on Update Matrices. A typical neural network contains numerous dense layers that perform matrix multiplication. The weight matrices in these layers are allowed to have full rank. When adapting to a specific task, however, Aghajanyan et al. show that the pre-trained language models have a low "intrinsic dimension" and can still learn efficiently despite a low-dimensional reparametrization. Inspired by this observation, we wonder if the updates to the weights also have a low "intrinsic rank" when adapting to downstream tasks. For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, we constrain its update by representing the latter with a low-rank decomposition $W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and the rank $r \ll \min(d, k)$. During training, $W_0$ is frozen and does not receive gradient updates, while $A$ and $B$ contain trainable parameters. Note that both $W_0$ and $\Delta W = BA$ are multiplied with the same input, and their respective output vectors are summed coordinate-wise. For $h = W_0 x$, our modified forward pass yields $h = W_0 x + \Delta W x = W_0 x + BAx$.
We illustrate our reparametrization in Fig. 1. We use a random Gaussian initialization for $A$ and zero for $B$, so $\Delta W = BA$ is zero at the beginning of training. We then scale $\Delta W x$ by $\frac{\alpha}{r}$, where $\alpha$ is a constant in $r$. When optimizing with Adam, tuning $\alpha$ is roughly the same as tuning the learning rate if we scale the initialization appropriately.
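A minimal NumPy sketch of this reparametrization (hypothetical sizes; a real implementation would live in a deep learning framework and train $A$ and $B$ by gradient descent):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 32, 4, 8  # hypothetical sizes; alpha is the scaling constant

W0 = rng.standard_normal((d, k))        # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01  # random Gaussian init
B = np.zeros((d, r))                    # zero init, so BA = 0 at start

def lora_forward(x):
    # h = W0 x + (alpha / r) * B A x
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# At initialization the adapted model matches the pre-trained model exactly.
assert np.allclose(lora_forward(x), W0 @ x)

# Whatever values A and B take during training, the update BA has rank <= r.
B = rng.standard_normal((d, r))
delta_W = B @ A
assert np.linalg.matrix_rank(delta_W) <= r
```

The zero initialization of $B$ is what makes the injection safe: training starts from the pre-trained model's behavior and only gradually deviates from it.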
Weight Decay to Pre-trained Weights. We note that weight decay behaves differently with LoRA than with full fine-tuning. Specifically, performing the usual weight decay on $A$ and $B$ decays the update $\Delta W = BA$ toward zero, which amounts to decaying the adapted weight $W_0 + BA$ toward the pre-trained $W_0$, rather than toward zero as in full fine-tuning.
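A small numeric check of this behavior (illustrative sizes; a single multiplicative shrink step stands in for the optimizer's weight decay):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 16, 16, 2
W0 = rng.standard_normal((d, k))
A = rng.standard_normal((r, k))
B = rng.standard_normal((d, r))

decay = 0.9  # one multiplicative decay step applied to the trainable matrices
A2, B2 = decay * A, decay * B

# Decaying A and B shrinks the update BA (by decay**2 here)...
assert np.allclose(B2 @ A2, decay**2 * (B @ A))
# ...so the merged weight W0 + BA moves toward W0, not toward zero.
dist_before = np.linalg.norm((W0 + B @ A) - W0)
dist_after = np.linalg.norm((W0 + B2 @ A2) - W0)
assert dist_after < dist_before
```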
No Additional Inference Latency. During deployment, we can explicitly compute and store $W = W_0 + BA$ and perform inference as usual. When we need to switch to another downstream task, we can recover $W_0$ by subtracting $BA$ and then add a different $B'A'$, a quick operation with very little memory overhead. Critically, this guarantees by construction that we do not introduce any additional latency during inference compared to a fine-tuned model.
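The merge-and-switch procedure can be sketched as follows (NumPy, hypothetical shapes; $B_1 A_1$ and $B_2 A_2$ stand for the learned updates of two tasks):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 32, 32, 4
W0 = rng.standard_normal((d, k))
B1, A1 = rng.standard_normal((d, r)), rng.standard_normal((r, k))  # task 1
B2, A2 = rng.standard_normal((d, r)), rng.standard_normal((r, k))  # task 2

# Merge task 1's update for deployment: inference is one matmul per layer,
# exactly as with the original model, so no extra latency.
W = W0 + B1 @ A1
x = rng.standard_normal(k)
assert np.allclose(W @ x, W0 @ x + B1 @ (A1 @ x))

# Switch to task 2 by subtracting one low-rank update and adding the other.
W = W - B1 @ A1 + B2 @ A2
assert np.allclose(W, W0 + B2 @ A2)
```

Only the small factors need to be stored per task; the merged weight can be rebuilt from $W_0$ and the factors at any time.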
Applying LoRA to Transformer
In principle, we can apply LoRA to any subset of weight matrices in a neural network to reduce the number of trainable parameters. In the Transformer architecture, there are four weight matrices in the self-attention module ($W_q$, $W_k$, $W_v$, $W_o$) and two in the MLP module. We limit our study to only adapting the attention weights for downstream tasks and freeze the MLP modules, both for simplicity and parameter-efficiency.
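A toy sketch of this selective adaptation, attaching low-rank updates only to the query and value projections while all four attention matrices stay frozen (hypothetical sizes; $B$ is zero-initialized as described earlier):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r = 64, 4  # hypothetical model width and LoRA rank

# Frozen self-attention projections (standing in for W_q, W_k, W_v, W_o).
W = {name: rng.standard_normal((d, d)) for name in ("q", "k", "v", "o")}

# Attach low-rank updates (B, A) only to the query and value projections.
lora = {name: (np.zeros((d, r)), rng.standard_normal((r, d)) * 0.01)
        for name in ("q", "v")}

def project(name, x):
    h = W[name] @ x
    if name in lora:            # LoRA path only exists for adapted matrices
        B, A = lora[name]
        h = h + B @ (A @ x)
    return h

trainable = sum(B.size + A.size for B, A in lora.values())
frozen = sum(m.size for m in W.values())
print(trainable, frozen)  # 1024 16384
```

Even in this toy setting, the trainable parameters are a small fraction of the frozen attention weights, and the gap widens at realistic model widths.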
Practical Benefits and Limitations. The most significant benefit comes from the reduction in memory and storage usage. For a Transformer trained with Adam, we reduce the VRAM consumption by up to 2/3 if $r \ll d_{model}$, since we do not need to store the optimizer states for the frozen parameters. On GPT-3 175B, we reduce the VRAM consumption during training from 1.2TB to 350GB. With $r = 4$ and only the query and value projection matrices being adapted, the checkpoint size is reduced by roughly 10,000 times (from 350GB to 35MB).
On the other hand, LoRA has its limitations. For example, it is not straightforward to batch inputs to different tasks with different $A$ and $B$ in a single forward pass, if one chooses to absorb $A$ and $B$ into $W$ to eliminate additional inference latency. Though it is possible to not merge the weights and dynamically choose the LoRA modules to use for samples in a batch in scenarios where latency is not critical.
Related Works
Transformer Language Models. Transformer is a sequence-to-sequence architecture that makes heavy use of self-attention. [31] applied it to autoregressive language modeling by using a stack of Transformer decoders. Since then, Transformer-based language models have dominated NLP, achieving the state-of-the-art in many tasks. A new paradigm emerged with BERT and GPT-2 – both are large Transformer language models trained on a large amount of text – where fine-tuning on task-specific data after pre-training on general domain data provides a significant performance gain compared to training on task-specific data directly. Training larger Transformers generally results in better performance and remains an active research direction. GPT-3 is the largest single Transformer language model trained to-date with 175B parameters.
Prompt Engineering and Fine-Tuning. While GPT-3 175B can adapt its behavior with just a few additional training examples, the result depends heavily on the input prompt. This necessitates an empirical art of composing and formatting the prompt to maximize a model's performance on a desired task, which is known as prompt engineering or prompt hacking. Fine-tuning retrains a model pre-trained on general domains to a specific task. Variants of it include learning just a subset of the parameters, yet practitioners often retrain all of them to maximize the downstream performance. However, the enormity of GPT-3 175B makes it challenging to perform fine-tuning in the usual way due to the large checkpoint it produces and the high hardware barrier to entry since it has the same memory footprint as pre-training.
Parameter-Efficient Adaptation. [15, 34] propose inserting adapter layers between existing layers in a neural network. Our method uses a bottleneck structure similar to [15] to impose a low-rank constraint on the weight updates. The key functional difference is that our learned weights can be merged with the main weights during inference, thus not introducing any latency, which is not the case for the adapter layers. More recently, [14, 19, 21, 25] proposed optimizing the input word embeddings in lieu of fine-tuning, akin to a continuous and differentiable generalization of prompt engineering. We include comparisons with [21] in our experiment section. However, this line of work can only scale up by using more special tokens in the prompt, which take up available sequence length for task tokens when positional embeddings are learned.
Low-Rank Structures in Deep Learning. Low-rank structure is very common in machine learning; many machine learning problems have a certain intrinsic low-rank structure. Moreover, it is known that for many deep learning tasks, especially those with a heavily over-parametrized neural network, the learned neural network will enjoy low-rank properties after training. Some prior works even explicitly impose the low-rank constraint when training the original neural network; however, to the best of our knowledge, none of these works considers low-rank updates for adaptation to downstream tasks. In the theory literature, it is known that neural networks outperform other classical learning methods, including the corresponding (finite-width) neural tangent kernels, when the underlying concept class has a certain low-rank structure. Another theoretical result in [4] suggests that low-rank adaptations can be useful for adversarial training. In sum, we believe that our proposed low-rank adaptation update is well-motivated by the literature.
Conclusion and Future Work
Fine-tuning enormous language models is prohibitively expensive in terms of both the hardware requirement and the storage/switching cost for hosting multiple instances. We propose LoRA, an efficient adaptation strategy that neither introduces inference latency nor reduces input sequence length while retaining model quality. Importantly, it allows for quick task-switching when deployed as a service by sharing the vast majority of the model parameters. While we focused on Transformer, the proposed principles are generally applicable to any neural networks with dense layers.
LoRA can potentially work in tandem with other fine-tuning techniques. In the future, we hope to explore only tuning some layers or adding adversarial training. Finally, the rank-deficiency of $\Delta W$ suggests that the pre-trained weight matrices could be rank-deficient as well, which can also inspire future work.