DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Abstract

The ML community is rapidly exploring techniques for prompting language models (LMs) and for stacking them into pipelines that solve complex tasks. Unfortunately, existing LM pipelines are typically implemented using hard-coded "prompt templates", i.e. lengthy strings discovered via trial and error. Toward a more systematic approach for developing and optimizing LM pipelines, we introduce DSPy, a programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computation graphs where LMs are invoked through declarative modules. DSPy modules are parameterized, meaning they can learn (by creating and collecting demonstrations) how to apply compositions of prompting, finetuning, augmentation, and reasoning techniques. We design a compiler that will optimize any DSPy pipeline to maximize a given metric. We conduct two case studies, showing that succinct DSPy programs can express and optimize sophisticated LM pipelines that reason about math word problems, tackle multi-hop retrieval, answer complex questions, and control agent loops. Within minutes of compiling, a few lines of DSPy allow GPT-3.5 and llama2-13b-chat to self-bootstrap pipelines that outperform standard few-shot prompting (generally by over 25% and 65%, respectively) and pipelines with expert-created demonstrations (by up to 5–46% and 16–40%, respectively). On top of that, DSPy programs compiled to open and relatively small LMs like 770M-parameter T5 and llama2-13b-chat are competitive with approaches that rely on expert-written prompt chains for proprietary GPT-3.5.

Introduction

Language models (LMs) are enabling researchers to build NLP systems at higher levels of abstraction and with lower data requirements than ever before. This is fueling an exploding space of “prompting” techniques—and lightweight finetuning techniques—for adapting LMs to new tasks, eliciting systematic reasoning from them, and augmenting them with retrieved sources or with tools. Most of these techniques are explored in isolation, but interest has been growing in building multi-stage pipelines and agents that decompose complex tasks into more manageable calls to LMs in an effort to improve performance.

Unfortunately, LMs are known to be sensitive to how they are prompted for each task, and this is exacerbated in pipelines where multiple LM calls have to interact effectively. As a result, the LM calls in existing LM pipelines and in popular developer frameworks are generally implemented using hard-coded ‘prompt templates’, that is, long strings of instructions and demonstrations that are hand crafted through manual trial and error. We argue that this approach, while pervasive, can be brittle and unscalable—conceptually akin to hand-tuning the weights for a classifier. A given string prompt might not generalize to different pipelines or across different LMs, data domains, or even inputs.

Toward a more systematic approach to designing AI pipelines, we introduce the DSPy programming model. DSPy pushes building new LM pipelines away from manipulating free-form strings and closer to programming (composing modular operators to build text transformation graphs) where a compiler automatically generates optimized LM invocation strategies and prompts from a program. We draw inspiration from the consensus that emerged around neural network abstractions, where (1) many general-purpose layers can be modularly composed in any complex architecture and (2) the model weights can be trained using optimizers instead of being hand-tuned.

To this end, we propose the DSPy programming model. We first translate string-based prompting techniques, including complex and task-dependent ones like Chain of Thought and ReAct, into declarative modules that carry natural-language typed signatures. DSPy modules are task-adaptive components—akin to neural network layers—that abstract any particular text transformation, like answering a question or summarizing a paper. We then parameterize each module so that it can learn its desired behavior by iteratively bootstrapping useful demonstrations within the pipeline. Inspired directly by PyTorch abstractions, DSPy modules are used via expressive define-by-run computational graphs. Pipelines are expressed by (1) declaring the modules needed and (2) using these modules in any logical control flow (e.g., if statements, for loops, exceptions, etc.) to logically connect the modules.

We then develop the DSPy compiler, which optimizes any DSPy program to improve quality or cost. The compiler inputs are the program, a few training inputs with optional labels, and a validation metric. The compiler simulates versions of the program on the inputs and bootstraps example traces of each module for self-improvement, using them to construct effective few-shot prompts or to finetune small LMs for steps of the pipeline. Optimization in DSPy is highly modular: it is conducted by teleprompters, which are general-purpose optimization strategies that determine how the modules should learn from data. In this way, the compiler automatically maps the declarative modules to high-quality compositions of prompting, finetuning, reasoning, and augmentation.

Programming models like DSPy could be assessed along many dimensions, but we focus on the role of expert-crafted prompts in shaping system performance. We are seeking to reduce or even remove their role through DSPy modules (e.g., versions of popular techniques like Chain of Thought) and teleprompters. We report on two expansive case studies: math word problems and multi-hop question answering with explorations of chain of thought, multi-chain reflection, multi-hop retrieval, retrieval-augmented question answering, and agent loops. Our evaluations use a number of different compiling strategies effectively and show that straightforward DSPy programs outperform systems using hand-crafted prompts, while also allowing our programs to use much smaller and hence more efficient LMs effectively.

Overall, this work proposes the first programming model that translates prompting techniques into parameterized declarative modules and introduces an effective compiler with general optimization strategies (teleprompters) to optimize arbitrary pipelines of these modules. Our main contributions are empirical and algorithmic: with DSPy, we have found that we can implement very short programs that can bootstrap self-improving multi-stage NLP systems using LMs as small as llama2-13b-chat and T5-Large. Without hand-crafted prompts and within minutes to tens of minutes of compiling, compositions of DSPy modules can raise the quality of simple programs from 33% to 82% and from 32% to 46% for GPT-3.5 and, similarly, from 9% to 47% and from 22% to 41% for llama2-13b-chat.

This work is inspired by the role that Torch, Theano, Chainer, and others played in the development of deep learning by providing powerful abstractions. A similar transformation is emerging with higher-level pipelines of LMs, and we are seeking to offer a solid conceptual framework and programming abstractions for what we call foundation model programming. We draw on differentiable programming but apply it to LM calls rather than neural networks, and borrow syntactic elements from PyTorch.

In-context learning is a key mechanism for foundation model programming. A growing body of work has revealed that, especially with instruction tuning, we can elicit sophisticated behavior via prompting. Similarly, forms of weak supervision that would normally require task-specific or hand-built heuristics are now done by LMs.

In-context learning methods now routinely invoke tools, leading to LM pipelines that use retrieval models, multimodal foundation models, and more traditional tools like APIs and calculators. A number of toolkits have been developed to facilitate this, including LangChain, Semantic Kernel, LlamaIndex, and many other retrieval and agent libraries. These toolkits provide pre-packaged chains and agents that connect LMs with numerous accessible tools. However, they suffer from the pervasive prompt engineering challenges we address in DSPy: they express task-specific behavior through hand-written prompt templates.

Researchers are starting to apply discrete optimization and RL to find effective prompts, generally for a single logical LM call. DSPy seeks to generalize this space: it offers a rich framework for optimizing arbitrary pipelines from high-level declarative signatures, by bootstrapping high-quality multi-stage demonstrations with constraints. In this framework, DSPy teleprompters may apply optimization using model selection techniques like cross-validation or, in principle, with sophisticated techniques involving RL and LM feedback or learned or Bayesian hyperparameter optimization methods.

The present paper seeks to motivate DSPy as a programming model and to report new empirical findings from applying the DSPy compiler. This is inspired by formative work by Bergstra et al., Paszke et al., and Wolf et al., who support their respective programming models with a mix of benchmark numbers and some qualitative measures. For the current paper, we focus on showing that DSPy and its compiler allow us to build outstanding LM systems without hand-crafted prompt strings, but instead from truly modular units, and that this opens up doors for systematically exploring a rich design space at a very high programmatic level of abstraction.

The DSPy Programming Model

We present DSPy, which treats LMs as abstract devices for text generation, and optimizes their usage in arbitrary computational graphs. DSPy programs are expressed in Python: each program takes the task input (e.g., a question to answer or a paper to summarize) and returns the output (e.g., an answer or a summary) after a series of steps. DSPy contributes three abstractions toward automatic optimization: signatures, modules, and teleprompters. Signatures abstract the input/output behavior of a module; modules replace existing hand-prompting techniques and can be composed in arbitrary pipelines; and teleprompters optimize all modules in the pipeline to maximize a metric.

3.1 Natural Language Signatures can abstract prompting & finetuning

Instead of free-form string prompts, DSPy programs use natural language signatures to assign work to the LM. A DSPy signature is a natural-language typed declaration of a function: a short declarative spec that tells DSPy what a text transformation needs to do (e.g., “consume questions and return answers”), rather than how a specific LM should be prompted to implement that behavior. More formally, a DSPy signature is a tuple of input fields and output fields (and an optional instruction). A field consists of a field name and optional metadata. In typical usage, the roles of fields are inferred by DSPy as a function of field names. For instance, the DSPy compiler will use in-context learning to interpret question differently from answer and will iteratively refine its usage of these fields.

Signatures offer two benefits over prompts: they can be compiled into self-improving and pipeline-adaptive prompts or finetunes. This is primarily done by bootstrapping (Sec 4) useful demonstration examples for each signature. Additionally, they handle structured formatting and parsing logic to reduce (or, ideally, avoid) brittle string manipulation in user programs.

In practice, DSPy signatures can be expressed with a shorthand notation like question -> answer, so that line 1 in the following is a complete DSPy program for a basic question-answering system (with line 2 illustrating usage and line 3 the response when GPT-3.5 is the LM):

```python
qa = dspy.Predict("question -> answer")
qa(question="Where is Guarani spoken?")
# Out: Prediction(answer='Guarani is spoken mainly in South America.')
```

In the shorthand notation, each field's name indicates the semantic role that the input (or output) field plays in the transformation. DSPy will parse this notation and expand the field names into meaningful instructions for the LM, so that english_document -> french_translation would prompt for English to French translation. When needed, DSPy offers more advanced programming interfaces for expressing more explicit constraints on signatures.
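
This field-name-driven expansion can be pictured with a small sketch. The parser below is a hypothetical illustration, not DSPy's internal implementation: it only splits the shorthand into named input and output fields, which is the information the compiler then interprets.

```python
def parse_signature(shorthand: str):
    """Split an 'inputs -> outputs' shorthand into lists of field names (illustrative)."""
    inputs, outputs = shorthand.split("->")
    parse_side = lambda side: [name.strip() for name in side.split(",") if name.strip()]
    return parse_side(inputs), parse_side(outputs)

# Each field name doubles as a hint about its semantic role in the transformation.
ins, outs = parse_signature("context, question -> answer")
```

Here `parse_signature("english_document -> french_translation")` would yield one input field and one output field whose names alone signal the intended translation task.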

3.2 Parameterized & templated modules can abstract prompting techniques

Akin to type signatures in programming languages, DSPy signatures simply define an interface and provide type-like hints on the expected behavior. To use a signature, we must declare a module with that signature, as we instantiated a Predict module above. A module declaration like this returns a function having that signature.

The Predict Module The core module for working with signatures in DSPy is Predict. Internally, Predict stores the supplied signature, an optional LM to use (initially None, but otherwise overrides the default LM for this module), and a list of demonstrations for prompting (initially empty). Like layers in PyTorch, the instantiated module behaves as a callable function: it takes in keyword arguments corresponding to the signature input fields (e.g., question), formats a prompt to implement the signature and includes the appropriate demonstrations, calls the LM, and parses the output fields. When Predict detects it's being used in compile mode, it will also internally track input/output traces to assist the teleprompter in bootstrapping the demonstrations.
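
The format-call-parse flow can be sketched with a toy stand-in for Predict. The prompt layout and single-output parsing below are simplifying assumptions (DSPy's real prompt format is richer), and `lm` here is just any function from a prompt string to a completion string:

```python
class ToyPredict:
    """Illustrative stand-in: store a signature and demos, format a prompt, parse output."""
    def __init__(self, inputs, outputs, lm):
        self.inputs, self.outputs, self.lm = inputs, outputs, lm
        self.demos = []  # few-shot demonstrations, to be filled in by an optimizer

    def __call__(self, **kwargs):
        lines = []
        for demo in self.demos:  # demonstrations come first in the prompt
            lines += [f"{field}: {demo[field]}" for field in self.inputs + self.outputs]
        lines += [f"{field}: {kwargs[field]}" for field in self.inputs]
        lines += [f"{self.outputs[0]}:"]  # cue the LM to emit the output field
        completion = self.lm("\n".join(lines))
        return {self.outputs[0]: completion.strip()}

# A fake LM that "answers" any prompt deterministically, standing in for a real model.
fake_lm = lambda prompt: " Paris"
qa = ToyPredict(["question"], ["answer"], fake_lm)
result = qa(question="What is the capital of France?")
```

A teleprompter would then populate `demos`, which changes the formatted prompt without touching the module's code.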

Other Built-in Modules DSPy modules translate prompting techniques into modular functions that support any signature, contrasting with the standard approach of prompting LMs with task-specific details (e.g., hand-written few-shot examples). To this end, DSPy includes a number of more sophisticated modules like ChainOfThought, ProgramOfThought, MultiChainComparison, and ReAct. These can all be used interchangeably to implement a DSPy signature. For instance, simply changing Predict to ChainOfThought in the above program leads to a system that thinks step by step before committing to its output field.

Importantly, all of these modules are implemented in a few lines of code by expanding the user-defined signature and calling Predict one or more times on new signatures as appropriate. For instance, we show a simplified implementation of the built-in ChainOfThought below.

```python
class ChainOfThought(dspy.Module):
    def __init__(self, signature):
        # Modify the signature from 'inputs -> outputs' to 'inputs -> rationale, outputs'.
        rationale_field = dspy.OutputField(prefix="Reasoning: Let's think step by step.")
        signature = dspy.Signature(signature).prepend_output_field(rationale_field)

        # Declare a sub-module with the modified signature.
        self.predict = dspy.Predict(signature)

    def forward(self, **kwargs):
        # Pass the inputs straight through to the sub-module.
        return self.predict(**kwargs)
```

This is a fully-fledged module capable of learning effective few-shot prompting for any LM or task. We contrast that with Appendix C, which copies long reasoning prompts hand-written by sources ranging from recent research to popular prompting libraries.

Parameterization Uniquely, DSPy parameterizes these prompting techniques. To understand this parameterization, observe that any LM call seeking to implement a particular signature needs to specify parameters that include: (1) the specific LM to call, (2) the prompt instructions and the string prefix of each signature field and, most importantly, (3) the demonstrations used as few-shot prompts (for frozen LMs) or as training data (for finetuning). We focus primarily on automatically generating and selecting useful demonstrations. In our case studies, we find that bootstrapping good demonstrations gives us a powerful way to teach sophisticated pipelines of LMs new behaviors systematically.

Tools DSPy programs may use tools, which are modules that execute computation. We support retrieval models through a dspy.Retrieve module. At the time of writing, DSPy has built-in support for ColBERTv2, Pyserini, and Pinecone retrievers, and we have explored experimental dspy.SQL for executing SQL queries and dspy.PythonInterpreter for executing Python code in a sandbox.
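
The tool interface is minimal: a tool is a module that computes rather than prompts. As a sketch, with a toy in-memory keyword scorer standing in for a real ColBERTv2 or Pyserini index, a retriever only needs to accept a query and return a ranked list of passages:

```python
class ToyRetrieve:
    """Keyword-overlap retriever standing in for dspy.Retrieve over a real index."""
    def __init__(self, corpus, k=3):
        self.corpus, self.k = corpus, k

    def __call__(self, query):
        # Rank passages by how many query words they share (a crude stand-in for ColBERTv2).
        query_words = set(query.lower().split())
        scored = sorted(self.corpus,
                        key=lambda p: -len(query_words & set(p.lower().split())))
        return scored[:self.k]

corpus = [
    "Guarani is an official language of Paraguay.",
    "Paris is the capital of France.",
    "The T5 model has a 770M-parameter variant.",
]
retrieve = ToyRetrieve(corpus, k=1)
passages = retrieve("Where is Guarani spoken?")
```

Because a pipeline only depends on this call-and-return shape, swapping the toy scorer for a real retrieval backend leaves the surrounding program unchanged.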

Programs DSPy modules can be composed in arbitrary pipelines in a define-by-run interface. Inspired directly by PyTorch and Chainer, one first declares the modules needed at initialization, allowing DSPy to keep track of them for optimization, and then one expresses the pipeline with arbitrary code that calls the modules in a forward method. As a simple illustration, we offer the following simple but complete retrieval-augmented generation (RAG) system.

```python
class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        # 'Retrieve' will use the user's default retrieval settings unless overridden.
        self.retrieve = dspy.Retrieve(k=num_passages)
        # 'ChainOfThought' with signature: given context and a question, generate an answer.
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate_answer(context=context, question=question)
```

To highlight modularity, we use ChainOfThought as a drop-in replacement for the basic Predict. One can now simply write RAG()("Where is Guarani spoken?") to use it. Notice that, if we use a signature "context, question -> search_query", we get a system that generates search queries rather than answers.

3.3 Teleprompters can automate prompting for arbitrary pipelines

When compiling a DSPy program, we generally invoke a teleprompter, which is an optimizer that takes the program, a training set, and a metric—and returns a new optimized program. Different teleprompters (Sec 4) apply different strategies for optimization.

In DSPy, training sets may be small, potentially a handful of examples, though larger data enables more powerful optimization. Training examples may be incomplete, i.e., only input values are necessary. Labels for the pipeline steps are not required, unless they need to be used in the metric. In practice, we typically assume labels only for (at most) the program's final output, not the intermediate steps. This label-efficiency is critical for modularity: building a new pipeline in DSPy requires simply recompiling the new pipeline's code, not annotating data specific to the new pipeline.

Metrics can be simple notions like exact match (EM) or F1, but they can be entire DSPy programs that balance multiple concerns. For example, we may compile the RAG module above against a dataset of question–answer pairs qa_trainset and the metric EM. The goal of optimization here is to effectively bootstrap few-shot demonstrations. The following code achieves this:

```python
# Small training set with only questions and final answers.
qa_trainset = [dspy.Example(question="What is the capital of France?", answer="Paris")]

# The teleprompter will bootstrap the missing labels: reasoning chains and retrieval contexts.
teleprompter = dspy.BootstrapFewShot(metric=dspy.evaluate.answer_exact_match)
compiled_rag = teleprompter.compile(RAG(), trainset=qa_trainset)
```

In this example, the BootstrapFewShot teleprompter (Sec 4, Appendix E.1) simulates RAG on the training example(s). It will collect demonstrations of each module (i.e., examples of its input–output behavior) that collectively lead to valid output (i.e., respecting the signatures and the metric).

If one wanted to push the compiled program to be extractive given its retrieved contexts, one could define a custom metric to use in place of dspy.evaluate.answer_exact_match:

python
def answer_and_context_match(example, pred, trace=None):
	answer_match = dspy.evaluate.answer_exact_match(example, pred)

	# 预测是否是某段文本的子字符串?
	context_match = any((pred.answer.lower() in c) for c in pred.context)

	return answer_match and context_match

Notice that behavior like this might be more accurately checked by another DSPy program that checks for faithful grounding of answers. Such metrics are fully supported and encouraged in DSPy.

Teleprompters can be composed by specifying a teacher program. DSPy will sample demonstrations from this program for prompt optimization. This composition can enable very rich pipelines, where expensive programs (e.g., complex expensive ensembles using large LMs) supervise cheap programs (e.g., simple pipelines using smaller LMs). One may start with compiled_rag from above (say, compiled to use a large Llama2-13b-chat LM) but now fine-tune Flan-T5-large to create an efficient program:

```python
# Larger set of questions with no labels. Labels for all steps will be bootstrapped.
unlabeled_questions = [dspy.Example(question="What is the capital of Germany?"), ...]

# As we lack gold answers, 'answer_passage_match' filters out ungrounded answers.
finetuning_teleprompter = BootstrapFinetune(metric=dspy.evaluate.answer_passage_match)

# Setting 'teacher=compiled_rag' composes the teleprompters: bootstrapping now uses 'compiled_rag'.
compiled_rag_via_finetune = finetuning_teleprompter.compile(RAG(), teacher=compiled_rag,
    trainset=unlabeled_questions, target='google/flan-t5-large')
```

The DSPy Compiler

A key source of DSPy's expressive power is its ability to compile—or automatically optimize—any program in this programming model. Compiling relies on a teleprompter, which is an optimizer for DSPy programs that improves the quality (or cost) of modules via prompting or finetuning, which are unified in DSPy. While DSPy does not enforce this when creating new teleprompters, typical teleprompters go through three stages.

Stage 1: Candidate Generation The compiler first (recursively) finds all unique Predict modules (predictors) in a program, including those nested under other modules. For each unique predictor p, the teleprompter may generate candidate values for the parameters of p: the instructions, field descriptions, or—most importantly—demonstrations (i.e., example input–output pairs). In this iteration of DSPy, we focus on demonstrations and find that simple rejection-sampling-like approaches can help bootstrap highly effective multi-stage systems.

Consider the simplest non-trivial teleprompter in DSPy, BootstrapFewShot (simplified pseudocode in Appendix E.1). This teleprompter will simulate a teacher program (or, if unset, the zero-shot version of the program being compiled) on some training inputs, possibly one or more times with a high temperature. When running in compile mode, multi-stage traces are tracked transparently and in a thread-safe fashion throughout execution. The program's metric is used to filter for multi-stage traces that together help the pipeline pass the metric. We thus obtain potential labels for all signatures in the program by discarding the bad examples and using the good examples as potential demonstrations, though these design decisions are under user control.
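
The rejection-sampling idea can be sketched in isolation. Assume, as a simplification, a `program` that returns a final prediction plus a trace of per-module input/output pairs, and a `metric` over (example, prediction); the sketch keeps demonstrations only from traces that pass:

```python
def bootstrap_demos(program, trainset, metric, max_rounds=1):
    """Collect per-module demonstrations from metric-passing traces (simplified sketch)."""
    demos = {}  # module name -> list of bootstrapped input/output examples
    for example in trainset:
        for _ in range(max_rounds):  # a real teleprompter may retry at higher temperature
            prediction, trace = program(example)
            if metric(example, prediction):  # rejection sampling: keep passing traces only
                for module_name, io_pair in trace:
                    demos.setdefault(module_name, []).append(io_pair)
                break  # a good trace was found for this example
    return demos

# A stub two-stage program ('retrieve' then 'answer') with a hand-wired trace.
def stub_program(example):
    trace = [("retrieve", {"query": example["q"], "passages": ["..."]}),
             ("answer", {"question": example["q"], "answer": example["q"].upper()})]
    return example["q"].upper(), trace

demos = bootstrap_demos(stub_program, [{"q": "ok"}, {"q": "no"}],
                        metric=lambda ex, pred: pred == "OK")
```

Note how the failing example contributes nothing, while the passing one yields a demonstration for every stage, even stages that had no labels of their own.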

While LMs can be highly unreliable, we find they can be rather efficient at searching the space of solutions for multi-stage designs. A well-decomposed program can typically find at least a few training examples where the LM can pass the constraints enforced by the signatures and metrics, allowing us to bootstrap iteratively if needed.

Stage 2: Parameter Optimization Now each parameter has a discrete set of candidates: demonstrations, instructions, etc. Many hyperparameter tuning algorithms (e.g., random search or Tree-structured Parzen Estimators as in HyperOpt and Optuna) can be applied for selection among candidates. We report simplified implementations of DSPy's BootstrapFewShotWithRandomSearch and BootstrapFewShotWithOptuna in Appendix E.2 and Appendix E.3.
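
In its simplest form, this selection step just scores compiled variants and keeps the winner. The following self-contained sketch (with `evaluate` standing in for averaging the metric over a validation set) evaluates every candidate by default and can subsample trials for larger spaces:

```python
import random

def random_search(candidates, evaluate, num_trials=None, seed=0):
    """Pick the candidate (e.g., a demonstration set) whose variant scores best."""
    rng = random.Random(seed)
    # With num_trials set, sample candidates instead of evaluating all of them.
    pool = candidates if num_trials is None else [rng.choice(candidates)
                                                  for _ in range(num_trials)]
    best = max(pool, key=evaluate)
    return best, evaluate(best)

# Toy stand-in: each candidate is a set of bootstrapped demos; a real 'evaluate'
# would compile the program with those demos and average the metric on validation data.
candidates = [["d1"], ["d1", "d2"], ["d1", "d2", "d3"]]
best, score = random_search(candidates, evaluate=len)
```

More sophisticated selection (e.g., Optuna's Tree-structured Parzen Estimators) plugs into the same shape: propose a candidate, score it, update the proposal strategy.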

Another type of optimization is finetuning with BootstrapFinetune, where the demonstrations are used to update the LM's weights for each predictor. When this is applied, the LM parameter of each module is updated to the new LM weights. Typically, we are optimizing average quality using the metric with cross-validation over the training set or a validation set. This is applicable even with no labels for any stages, depending on the nature of the metric.

Stage 3: Higher-Order Program Optimization A different type of optimization that the DSPy compiler supports is modifying the control flow of the program. One of the simplest forms of these is ensembles, which we use in the case studies in this work. An ensemble will bootstrap multiple copies of the same program, and then replace the program with a new one that runs them all in parallel and reduces their predictions into one with a custom function (e.g., majority voting). In future work, this stage can easily accommodate techniques for more dynamic (i.e., test-time) bootstrapping as well as automatic backtracking-like logic.
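
The ensemble reduction step can be sketched independently of DSPy: wrap several program variants into one callable that runs them all and reduces their predictions, here by majority vote (the voter functions are toy stand-ins for compiled program copies):

```python
from collections import Counter

def ensemble(programs, reduce_fn):
    """Replace several programs with one that runs all of them and reduces the outputs."""
    def combined(*args, **kwargs):
        predictions = [program(*args, **kwargs) for program in programs]
        return reduce_fn(predictions)
    return combined

def majority(predictions):
    # most_common keeps insertion order among ties, so ties go to the earliest answer.
    return Counter(predictions).most_common(1)[0][0]

# Three toy 'compiled copies' of a program that disagree on the answer.
voters = [lambda q: "Paris", lambda q: "Paris", lambda q: "Lyon"]
vote = ensemble(voters, majority)
```

Any other reduction, such as picking the longest-grounded answer, slots into `reduce_fn` without changing the wrapped programs.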

Goals Of Evaluation

Programming frameworks can be evaluated along many dimensions: computational efficiency, developer efficiency, intuitiveness of the code and concepts, and so forth. In this paper, we focus on perhaps the most pressing issue for current LM pipelines: the role of hand-written, task-specific prompts in achieving performant systems. Our evaluations seek to test the following hypotheses:

  • H1 With DSPy, we can replace hand-crafted prompt strings with concise and well-defined modules, without reducing quality or expressive power.
  • H2 Parameterizing the modules and treating prompting as an optimization problem makes DSPy better at adapting to different LMs, and it may outperform expert-written prompts.
  • H3 The resulting modularity makes it possible to more thoroughly explore complex pipelines that have useful performance characteristics or that fit nuanced metrics.

Our evaluation will explore these hypotheses using diverse task–program pairs. We hope this begins a shift from underspecified questions like “how do different LMs compare on GSM8K” toward “how they compare on GSM8K with program P when compiled with strategy S”, which is a well-defined and reproducible run. Ultimately, our goal is to reduce the role of artful prompt construction in modern AI in favor of the development of new modular, composable programs and optimizers.

Conclusion

This paper introduced DSPy, a new programming model for designing AI systems using pipelines of pretrained LMs and other tools. We presented three new concepts introduced in this abstraction (DSPy signatures, modules, and teleprompters), and showed in two very different case studies that it supports rapid development of highly effective systems that use relatively small LMs. We have maintained open-source versions of this framework for close to a year. In this period, we have seen and created a large number of programs that were compiled to high-quality systems by DSPy, spanning tasks from information extraction to low-resource synthetic data generation. In the interest of space and to maintain reasonable scope in this paper, we leave reporting on such tasks under controlled experimental conditions to future work. While in-context learning has proved transformative over the past 2–3 years of LM research, we argue that the true expressive power in this emerging paradigm is in building sophisticated text transformation graphs in which composable modules and optimizers (teleprompters) come together to leverage LMs in more systematic and reliable ways.
