Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Abstract
Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory have so far been only investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) — models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state of the art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.
Introduction
Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowledge from data. They can do so without any access to an external memory, as a parameterized implicit knowledge base. While this development is exciting, such models do have downsides: They cannot easily expand or revise their memory, can't straightforwardly provide insight into their predictions, and may produce "hallucinations". Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories can address some of these issues because knowledge can be directly revised and expanded, and accessed knowledge can be inspected and interpreted. REALM and ORQA, two recently introduced models that combine masked language models with a differentiable retriever, have shown promising results, but have only explored open-domain extractive question answering. Here, we bring hybrid parametric and non-parametric memory to the "workhorse of NLP," i.e. sequence-to-sequence (seq2seq) models.
We endow pre-trained, parametric-memory generation models with a non-parametric memory through a general-purpose fine-tuning approach which we refer to as retrieval-augmented generation (RAG). We build RAG models where the parametric memory is a pre-trained seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We combine these components in a probabilistic model trained end-to-end (Fig. 1). The retriever (Dense Passage Retriever, henceforth DPR) provides latent documents conditioned on the input, and the seq2seq model (BART) then conditions on these latent documents together with the input to generate the output. We marginalize the latent documents with a top-K approximation, either on a per-output basis (assuming the same document is responsible for all tokens) or a per-token basis (where different documents are responsible for different tokens). Like T5 or BART, RAG can be fine-tuned on any seq2seq task, whereby both the generator and retriever are jointly learned.
There has been extensive previous work proposing architectures to enrich systems with non-parametric memory which are trained from scratch for specific tasks, e.g. memory networks, stack-augmented networks and memory layers. In contrast, we explore a setting where both parametric and non-parametric memory components are pre-trained and pre-loaded with extensive knowledge. Crucially, by using pre-trained access mechanisms, the ability to access knowledge is present without additional training.
Our results highlight the benefits of combining parametric and non-parametric memory with generation for knowledge-intensive tasks—tasks that humans could not reasonably be expected to perform without access to an external knowledge source. Our RAG models achieve state-of-the-art results on open Natural Questions, WebQuestions and CuratedTrec and strongly outperform recent approaches that use specialised pre-training objectives on TriviaQA. Despite these being extractive tasks, we find that unconstrained generation outperforms previous extractive approaches. For knowledge-intensive generation, we experiment with MS-MARCO and Jeopardy question generation, and we find that our models generate responses that are more factual, specific, and diverse than a BART baseline. For FEVER fact verification, we achieve results within 4.3% of state-of-the-art pipeline models which use strong retrieval supervision. Finally, we demonstrate that the non-parametric memory can be replaced to update the models' knowledge as the world changes.
Methods
We explore RAG models, which use the input sequence x to retrieve text documents z and use them as additional context when generating the target sequence y.
To train the retriever and generator end-to-end, we treat the retrieved document as a latent variable. We propose two models that marginalize over the latent documents in different ways to produce a distribution over generated text. In one approach, RAG-Sequence, the model uses the same document to predict each target token. The second approach, RAG-Token, can predict each target token based on a different document. In the following, we formally introduce both models and then describe the retriever and generator components, along with training and decoding.
2.1 Models
RAG-Sequence Model The RAG-Sequence model uses the same retrieved document to generate the complete sequence. Technically, it treats the retrieved document as a single latent variable that is marginalized, via a top-K approximation, to get the seq2seq probability p(y|x):

$p_{\text{RAG-Sequence}}(y|x) \approx \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x)\, p_\theta(y|x,z) = \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x) \prod_{i}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})$
RAG-Token Model In the RAG-Token model we can draw a different latent document for each target token and marginalize accordingly. This allows the generator to choose content from several documents when producing an answer. Concretely, the top K documents are retrieved using the retriever, and then the generator produces a distribution over the next output token for each document, before marginalizing and repeating the process with the following output token. Formally, we define:

$p_{\text{RAG-Token}}(y|x) \approx \prod_{i}^{N} \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$
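The two marginalizations can be made concrete with a small numerical sketch. Assuming we already have, for each of K retrieved documents, a retrieval probability p_eta(z|x) and per-token generator probabilities p_theta(y_i|x,z,y_{1:i-1}), the two models combine them as follows (all probabilities are toy numbers, not real model outputs):

```python
import math

# Toy setup: K = 2 retrieved documents, target sequence of N = 3 tokens.
# doc_prior[k]      = p_eta(z_k | x), the retriever's probability of document k.
# token_probs[k][i] = p_theta(y_i | x, z_k, y_{1:i-1}), the generator's
#                     probability of the i-th target token given document k.
doc_prior = [0.7, 0.3]
token_probs = [
    [0.9, 0.8, 0.7],   # document 0 explains the whole target fairly well
    [0.2, 0.9, 0.9],   # document 1 explains the later tokens better
]

def rag_sequence_likelihood(doc_prior, token_probs):
    """p(y|x) = sum_z p_eta(z|x) * prod_i p_theta(y_i | x, z, y_{1:i-1})."""
    return sum(
        p_z * math.prod(probs)
        for p_z, probs in zip(doc_prior, token_probs)
    )

def rag_token_likelihood(doc_prior, token_probs):
    """p(y|x) = prod_i sum_z p_eta(z|x) * p_theta(y_i | x, z, y_{1:i-1})."""
    n_tokens = len(token_probs[0])
    likelihood = 1.0
    for i in range(n_tokens):
        likelihood *= sum(
            p_z * probs[i] for p_z, probs in zip(doc_prior, token_probs)
        )
    return likelihood

seq_ll = rag_sequence_likelihood(doc_prior, token_probs)
tok_ll = rag_token_likelihood(doc_prior, token_probs)
```

With these numbers the RAG-Token likelihood is higher, reflecting its ability to mix documents per token: it can credit document 0 for the first token and document 1 for the later ones, whereas RAG-Sequence must commit to one document for the whole sequence.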
Finally, we note that RAG can be used for sequence classification tasks by considering the target class as a target sequence of length one, in which case RAG-Sequence and RAG-Token are equivalent.
2.2 Retriever: DPR
The retrieval component $p_\eta(z|x)$ is based on DPR. DPR follows a bi-encoder architecture:

$p_\eta(z|x) \propto \exp\!\left(\mathbf{d}(z)^\top \mathbf{q}(x)\right), \quad \mathbf{d}(z) = \text{BERT}_d(z), \quad \mathbf{q}(x) = \text{BERT}_q(x)$

where d(z) is a dense representation of a document produced by a BERT document encoder, and q(x) is a query representation produced by a query encoder, also based on BERT. Calculating top-k(p_η(·|x)), the list of k documents z with highest prior probability p_η(z|x), is a Maximum Inner Product Search (MIPS) problem, which can be solved approximately in sub-linear time.
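A minimal sketch of this bi-encoder scoring, with random vectors standing in for the BERT document and query encoders (everything here is illustrative; a real system would use a trained DPR checkpoint and an approximate MIPS index such as FAISS instead of the exhaustive scan below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for d(z) = BERT_d(z): a precomputed dense index of document embeddings.
doc_embeddings = rng.normal(size=(1000, 128))   # 1000 documents, 128-dim vectors

def retrieve_top_k(query_embedding, doc_embeddings, k=5):
    """Score every document by the inner product d(z)^T q(x), then keep the top K.

    Since p_eta(z|x) is proportional to exp(d(z)^T q(x)), ranking by inner
    product is equivalent to ranking by retrieval probability.
    """
    scores = doc_embeddings @ query_embedding      # inner-product scores
    top_k = np.argsort(scores)[::-1][:k]           # indices of the K best documents
    # Softmax over the top-K scores gives the truncated posterior p_eta(z|x).
    top_scores = scores[top_k]
    probs = np.exp(top_scores - top_scores.max())
    probs /= probs.sum()
    return top_k, probs

query_embedding = rng.normal(size=128)             # stand-in for q(x) = BERT_q(x)
top_k, probs = retrieve_top_k(query_embedding, doc_embeddings, k=5)
```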
2.3 Generator: BART
The generator component $p_\theta(y_i \mid x, z, y_{1:i-1})$ could be modelled using any encoder-decoder. We use BART-large, a pre-trained seq2seq transformer with 400M parameters. To combine the input x with the retrieved content z when generating from BART, we simply concatenate them.
2.4 Training
We jointly train the retriever and generator components without any direct supervision on what document should be retrieved. Given a fine-tuning training corpus of input/output pairs $(x_j, y_j)$, we minimize the negative marginal log-likelihood of each target, $\sum_j -\log p(y_j|x_j)$, using stochastic gradient descent. Updating the document encoder during training is costly, as it requires the document index to be periodically re-built; we do not find this step necessary for strong performance, and instead keep the document encoder and index fixed, fine-tuning only the query encoder BERT_q and the BART generator.
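The objective can be sketched as follows, using the RAG-Sequence factorization of the marginal likelihood; the retriever posteriors and per-token generator probabilities are fixed toy numbers here, whereas in the real model they come from BERT_q plus the index and from BART, and the loss is minimized with stochastic gradient descent:

```python
import math

def marginal_log_likelihood(doc_probs, per_doc_token_probs):
    """log p(y|x) with p(y|x) = sum_z p_eta(z|x) * prod_i p_theta(y_i|x,z,y_<i)."""
    marginal = sum(
        p_z * math.prod(tok_probs)
        for p_z, tok_probs in zip(doc_probs, per_doc_token_probs)
    )
    return math.log(marginal)

def nll_loss(batch):
    """Negative marginal log-likelihood summed over (x_j, y_j) training pairs.

    Each batch element carries the retriever's top-K posterior p_eta(z|x_j)
    and the generator's per-token probabilities of y_j under each document.
    """
    return -sum(marginal_log_likelihood(p_z, toks) for p_z, toks in batch)

batch = [
    ([0.6, 0.4], [[0.9, 0.8], [0.3, 0.5]]),   # example 1: K=2 docs, 2 target tokens
    ([0.5, 0.5], [[0.7, 0.7], [0.6, 0.2]]),   # example 2
]
loss = nll_loss(batch)
```

Because the marginalization is a differentiable sum, gradients flow through both the generator probabilities and the retrieval posterior, which is what lets the query encoder be trained end-to-end without retrieval supervision.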
2.5 Decoding
At test time, RAG-Sequence and RAG-Token require different ways to approximate $\arg\max_y p(y|x)$.
RAG-Token The RAG-Token model can be seen as a standard, autoregressive seq2seq generator with transition probability:

$p'_\theta(y_i \mid x, y_{1:i-1}) = \sum_{z \in \text{top-}k(p(\cdot|x))} p_\eta(z|x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$

To decode, we can plug $p'_\theta(y_i \mid x, y_{1:i-1})$ into a standard beam decoder.
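Because this marginalized transition probability has the standard autoregressive form, it plugs directly into any decoder. A greedy-decoding sketch over toy per-document next-token distributions (hypothetical numbers over a 3-token vocabulary; a real system would run beam search over BART's full vocabulary and condition on the decoded prefix):

```python
# Vocabulary of 3 token ids {0, 1, 2}; K = 2 documents with prior p_eta(z|x).
doc_prior = [0.6, 0.4]

def next_token_dists(step):
    """Stand-in for p_theta(y_i | x, z, y_{1:i-1}) under each document.

    Returns one next-token distribution per document; a real generator
    would condition on the prefix decoded so far.
    """
    per_doc = [
        [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],   # document 0, steps 0 and 1
        [[0.2, 0.2, 0.6], [0.1, 0.2, 0.7]],   # document 1, steps 0 and 1
    ]
    return [dists[step] for dists in per_doc]

def greedy_decode(num_steps):
    """At each step, marginalize over documents, then emit the argmax token."""
    output = []
    for step in range(num_steps):
        per_doc = next_token_dists(step)
        marginal = [
            sum(p_z * dist[tok] for p_z, dist in zip(doc_prior, per_doc))
            for tok in range(3)
        ]
        output.append(max(range(3), key=lambda tok: marginal[tok]))
    return output

decoded = greedy_decode(num_steps=2)
```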
RAG-Sequence For RAG-Sequence, the likelihood p(y|x) does not break into a conventional per-token likelihood, so it cannot be solved with a single beam search. Instead, we run beam search for each document z, scoring each hypothesis using $p_\theta(y_i \mid x, z, y_{1:i-1})$. This yields a set of candidate hypotheses, some of which may not have appeared in the beams of all documents. To score a hypothesis y that did not appear in some document's beam, we can either run an additional forward pass for that document ("Thorough Decoding") or approximate its probability under that document as zero ("Fast Decoding"). The final score of each hypothesis is its likelihood marginalized over the retrieved documents.
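For RAG-Sequence, one decoding strategy runs a beam search per retrieved document, pools the candidates, and scores each candidate by its likelihood marginalized over documents. A sketch of the exhaustive variant, with a fixed candidate pool standing in for the per-document beam searches and toy sequence probabilities standing in for the generator:

```python
doc_prior = {"z1": 0.7, "z2": 0.3}

# Stand-in for beam search under each document: each document proposes
# candidate output sequences (a real system would decode these with BART).
candidates = {"z1": ["paris", "london"], "z2": ["paris", "rome"]}

# Stand-in for p_theta(y | x, z): sequence likelihood of every candidate
# under every document (toy numbers).
seq_prob = {
    ("paris", "z1"): 0.6, ("london", "z1"): 0.2, ("rome", "z1"): 0.05,
    ("paris", "z2"): 0.5, ("london", "z2"): 0.1, ("rome", "z2"): 0.3,
}

def thorough_decode():
    """Union the candidates, then score each by sum_z p_eta(z|x) p_theta(y|x,z)."""
    pool = sorted({y for ys in candidates.values() for y in ys})
    scores = {
        y: sum(p_z * seq_prob[(y, z)] for z, p_z in doc_prior.items())
        for y in pool
    }
    return max(scores, key=scores.get), scores

best, scores = thorough_decode()
```

The faster variant would skip scoring a candidate under a document whose beam did not produce it, treating that probability as zero rather than paying for an extra forward pass.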
Related Work
Single-Task Retrieval Prior work has shown that retrieval improves performance across a variety of NLP tasks when considered in isolation. Such tasks include open-domain question answering, fact checking, fact completion, long-form question answering, Wikipedia article generation, dialogue, translation, and language modeling. Our work unifies previous successes in incorporating retrieval into individual tasks, showing that a single retrieval-based architecture is capable of achieving strong performance across several tasks.
General-Purpose Architectures for NLP Prior work on general-purpose architectures for NLP tasks has shown great success without the use of retrieval. A single, pre-trained language model has been shown to achieve strong performance on various classification tasks in the GLUE benchmark after fine-tuning. GPT-2 later showed that a single, left-to-right, pre-trained language model could achieve strong performance across both discriminative and generative tasks. For further improvement, BART and T5 propose single, pre-trained encoder-decoder models that leverage bi-directional attention to achieve stronger performance on discriminative and generative tasks. Our work aims to expand the space of possible tasks with a single, unified architecture, by learning a retrieval module to augment pre-trained, generative language models.
Learned Retrieval There is significant work on learning to retrieve documents in information retrieval, more recently with pre-trained, neural language models similar to ours. Some work optimizes the retrieval module to aid in a specific, downstream task such as question answering, using search, reinforcement learning, or a latent variable approach as in our work. These successes leverage different retrieval-based architectures and optimization techniques to achieve strong performance on a single task, while we show that a single retrieval-based architecture can be fine-tuned for strong performance on a variety of tasks.
Memory-based Architectures Our document index can be seen as a large external memory for neural networks to attend to, analogous to memory networks. Concurrent work learns to retrieve a trained embedding for each entity in the input, rather than to retrieve raw text as in our work. Other work improves the ability of dialog models to generate factual text by attending over fact embeddings. A key feature of our memory is that it is comprised of raw text rather than distributed representations, which makes the memory both (i) human-readable, lending a form of interpretability to our model, and (ii) human-writable, enabling us to dynamically update the model's memory by editing the document index. This approach has also been used in knowledge-intensive dialog, where generators have been conditioned on retrieved text directly, albeit obtained via TF-IDF rather than end-to-end learnt retrieval.
Retrieve-and-Edit approaches Our method shares some similarities with retrieve-and-edit style approaches, where a similar training input-output pair is retrieved for a given input and then edited to provide a final output. These approaches have proved successful in a number of domains, including Machine Translation and Semantic Parsing. Our approach has several differences, including less emphasis on lightly editing a retrieved item and more on aggregating content from several retrieved pieces, as well as learning latent retrieval and retrieving evidence documents rather than related training pairs. That said, RAG techniques may work well in these settings and could represent promising future work.
Discussion
In this work, we presented hybrid generation models with access to parametric and non-parametric memory. We showed that our RAG models obtain state-of-the-art results on open-domain QA. We found that people prefer RAG's generation over purely parametric BART, finding RAG more factual and specific. We conducted a thorough investigation of the learned retrieval component, validating its effectiveness, and we illustrated how the retrieval index can be hot-swapped to update the model without requiring any retraining. In future work, it may be fruitful to investigate whether the two components can be jointly pre-trained from scratch, either with a denoising objective similar to BART or with some other objective. Our work opens up new research directions on how parametric and non-parametric memories interact and how to most effectively combine them, showing promise in being applied to a wide variety of NLP tasks.
Broader Impact
This work offers several positive societal benefits over previous work: the fact that it is more strongly grounded in real factual knowledge (in this case Wikipedia) makes it "hallucinate" less with generations that are more factual, and offers more control and interpretability. RAG could be employed in a wide variety of scenarios with direct benefit to society, for example by endowing it with a medical index and asking it open-domain questions on that topic, or by helping people be more effective at their jobs.
With these advantages also come potential downsides: Wikipedia, or any potential external knowledge source, will probably never be entirely factual and completely devoid of bias. Since RAG can be employed as a language model, similar concerns as for GPT-2 are valid here, although arguably to a lesser extent, including that it might be used to generate abuse, faked or misleading content in the news or on social media; to impersonate others; or to automate the production of spam/phishing content. Advanced language models may also lead to the automation of various jobs in the coming decades. In order to mitigate these risks, AI systems could be employed to fight against misleading content and automated spam/phishing.