Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection
Despite their remarkable capabilities, large language models (LLMs) often produce responses containing factual inaccuracies due to their sole reliance on the parametric knowledge they encapsulate. Retrieval-Augmented Generation (RAG), an ad hoc approach that augments LMs with retrieval of relevant knowledge, mitigates such issues. However, indiscriminately retrieving and incorporating a fixed number of retrieved passages, regardless of whether retrieval is necessary or the passages are relevant, diminishes LM versatility and can lead to unhelpful response generation. We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (SELF-RAG) that enhances an LM's quality and factuality through retrieval and self-reflection. Our framework trains a single arbitrary LM that adaptively retrieves passages on demand, and generates and reflects on retrieved passages and its own generations using special tokens, called reflection tokens. Generating reflection tokens makes the LM controllable during the inference phase, enabling it to tailor its behavior to diverse task requirements. Experiments show that SELF-RAG (7B and 13B parameters) significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks. Specifically, SELF-RAG outperforms ChatGPT and retrieval-augmented Llama2-chat on open-domain QA, reasoning, and fact verification tasks, and it shows significant gains in factuality and citation accuracy for long-form generations relative to these models.
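To make the inference-time behavior concrete, here is a minimal, hypothetical sketch of a SELF-RAG-style decision loop. The real system is a single trained LM that emits special reflection tokens (e.g., whether to retrieve, whether a passage is relevant, whether the output is supported); the stub functions below (`needs_retrieval`, `retrieve`, `critique`) are illustrative stand-ins, not the paper's actual components.

```python
# Hypothetical sketch of a SELF-RAG-style inference loop.
# In the real framework, a trained LM emits reflection tokens; here,
# simple heuristics stand in for the LM and the retriever.

def needs_retrieval(query: str) -> bool:
    # Stand-in for the LM's "retrieve on demand" decision
    # (a [Retrieve] reflection token in the paper's formulation).
    return "capital" in query.lower()  # toy heuristic only

def retrieve(query: str, k: int = 2) -> list[str]:
    # Stand-in retriever over a tiny toy corpus.
    corpus = [
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ]
    return corpus[:k]

def critique(query: str, passage: str) -> int:
    # Stand-in for relevance/support reflection tokens: a crude
    # word-overlap score between the query and the passage.
    q_words = {w.strip("?.,").lower() for w in query.split()}
    p_words = {w.strip("?.,") for w in passage.lower().split()}
    return len(q_words & p_words)

def self_rag_answer(query: str) -> str:
    # If no retrieval is needed, answer from parametric knowledge alone.
    if not needs_retrieval(query):
        return f"(parametric) answer to: {query}"
    # Otherwise, generate one candidate per passage and keep the
    # candidate whose critique score is highest.
    scored = []
    for passage in retrieve(query):
        answer = f"Based on '{passage}': answer to {query}"
        scored.append((critique(query, passage), answer))
    return max(scored)[1]

print(self_rag_answer("What is the capital of France?"))
print(self_rag_answer("Write a short poem about autumn."))
```

The key contrast with standard RAG is visible in the structure: retrieval is conditional rather than unconditional, and retrieved passages are scored and filtered by a critique step rather than concatenated wholesale into the prompt.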