DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Abstract
We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
Introduction
In the past few years, Large Language Models (LLMs) have undergone rapid development, offering a glimpse into the dawn of Artificial General Intelligence (AGI). In general, the intelligence of an LLM tends to improve as the number of parameters increases, allowing it to exhibit emergent capabilities across various tasks. However, the improvement comes at the cost of larger computing resources for training and a potential decrease in inference throughput. These constraints present significant challenges that impede the widespread adoption and utilization of LLMs. In order to tackle this problem, we introduce DeepSeek-V2, a strong open-source Mixture-of-Experts (MoE) language model, characterized by economical training and efficient inference through an innovative Transformer architecture. It is equipped with a total of 236B parameters, of which 21B are activated for each token, and supports a context length of 128K tokens.
We optimize the attention modules and Feed-Forward Networks (FFNs) within the Transformer framework with our proposed Multi-head Latent Attention (MLA) and DeepSeekMoE. (1) In the context of attention mechanisms, the Key-Value (KV) cache of the Multi-Head Attention (MHA) poses a significant obstacle to the inference efficiency of LLMs. Various approaches have been explored to address this issue, including Grouped-Query Attention (GQA) and Multi-Query Attention (MQA). However, these methods often compromise performance in their attempt to reduce the KV cache. In order to achieve the best of both worlds, we introduce MLA, an attention mechanism equipped with low-rank key-value joint compression. Empirically, MLA achieves superior performance compared with MHA, and meanwhile significantly reduces the KV cache during inference, thus boosting the inference efficiency. (2) For Feed-Forward Networks (FFNs), we follow the DeepSeekMoE architecture, which adopts fine-grained expert segmentation and shared expert isolation for higher potential in expert specialization. The DeepSeekMoE architecture demonstrates great advantages compared with conventional MoE architectures like GShard, enabling us to train strong models at an economical cost. As we employ expert parallelism during training, we also devise supplementary mechanisms to control communication overheads and ensure load balance. By combining these two techniques, DeepSeek-V2 features strong performance, economical training costs, and efficient inference throughput, simultaneously.
We construct a high-quality and multi-source pre-training corpus consisting of 8.1T tokens. Compared with the corpus used in DeepSeek 67B, this corpus features an extended amount of data, especially Chinese data, and higher data quality. We first pretrain DeepSeek-V2 on the full pre-training corpus. Then, we collect 1.5M conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 Chat (SFT). Finally, we follow DeepSeekMath to employ Group Relative Policy Optimization (GRPO) to further align the model with human preference and produce DeepSeek-V2 Chat (RL).
We evaluate DeepSeek-V2 on a wide range of benchmarks in English and Chinese, and compare it with representative open-source models. Evaluation results show that even with only 21B activated parameters, DeepSeek-V2 still achieves top-tier performance among open-source models and becomes the strongest open-source MoE language model. Figure 1(a) highlights that, on MMLU, DeepSeek-V2 achieves top-ranking performance with only a small number of activated parameters. In addition, as shown in Figure 1(b), compared with DeepSeek 67B, DeepSeek-V2 saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We also evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on open-ended benchmarks. Notably, DeepSeek-V2 Chat (RL) achieves a 38.9 length-controlled win rate on AlpacaEval 2.0, an 8.97 overall score on MT-Bench, and a 7.91 overall score on AlignBench. The English open-ended conversation evaluations demonstrate that DeepSeek-V2 Chat (RL) has top-tier performance among open-source chat models. In addition, the evaluation on AlignBench indicates that in Chinese, DeepSeek-V2 Chat (RL) outperforms all open-source models, and even beats most closed-source models.
In order to facilitate further research and development on MLA and DeepSeekMoE, we also release DeepSeek-V2-Lite, a smaller model equipped with MLA and DeepSeekMoE, for the open-source community. It has a total of 15.7B parameters, where 2.4B are activated for each token. Detailed descriptions about DeepSeek-V2-Lite can be found in Appendix B.
In the rest of this paper, we first provide a detailed description of the model architecture of DeepSeek-V2 (Section 2). Subsequently, we introduce our pre-training endeavors, including the training data construction, hyper-parameter settings, infrastructures, long context extension, and the evaluation of model performance and efficiency (Section 3). Following this, we demonstrate our efforts in alignment, encompassing Supervised Fine-Tuning (SFT), Reinforcement Learning (RL), the evaluation results, and other discussion (Section 4). Finally, we summarize the conclusion, deliberate on the current limitations of DeepSeek-V2, and outline our future work (Section 5).
Architecture
By and large, DeepSeek-V2 is still in the Transformer architecture, where each Transformer block consists of an attention module and a Feed-Forward Network (FFN). However, for both the attention module and the FFN, we design and employ innovative architectures. For attention, we design MLA, which utilizes low-rank key-value joint compression to eliminate the bottleneck of inference-time key-value cache, thus supporting efficient inference. For FFNs, we adopt the DeepSeekMoE architecture, a high-performance MoE architecture that enables training strong models at an economical cost. An illustration of the architecture of DeepSeek-V2 is presented in Figure 2, and we will introduce the details of MLA and DeepSeekMoE in this section. For other tiny details (e.g., layer normalization and the activation function in FFNs), unless specifically stated, DeepSeek-V2 follows the settings of DeepSeek 67B.
2.1. Multi-Head Latent Attention: Boosting Inference Efficiency
Conventional Transformer models usually adopt Multi-Head Attention (MHA), but during generation, the heavy Key-Value (KV) cache becomes a bottleneck that limits the inference efficiency. In order to reduce the KV cache, Multi-Query Attention (MQA) and Grouped-Query Attention (GQA) have been proposed. They require a smaller magnitude of KV cache, but their performance does not match MHA (we provide the ablation of MHA, GQA, and MQA in Appendix D.1).
For DeepSeek-V2, we design an innovative attention mechanism called Multi-head Latent Attention (MLA). Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA, but requires a significantly smaller amount of KV cache. We introduce its architecture in the following, and also provide a comparison between MLA and MHA in Appendix D.2.
2.1.1. Preliminaries: Standard Multi-Head Attention
We first introduce the standard MHA mechanism as background. Let $d$ denote the embedding dimension, $n_h$ the number of attention heads, $d_h$ the dimension per head, and $\mathbf{h}_t \in \mathbb{R}^{d}$ the attention input of the $t$-th token at an attention layer. Standard MHA first produces $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t \in \mathbb{R}^{d_h n_h}$ through three projection matrices $W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d_h n_h \times d}$:
\[ \mathbf{q}_t = W^{Q} \mathbf{h}_t, \qquad \mathbf{k}_t = W^{K} \mathbf{h}_t, \qquad \mathbf{v}_t = W^{V} \mathbf{h}_t. \]
Then, $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$ are sliced into $n_h$ heads for the multi-head attention computation:
\[ \mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left( \frac{\mathbf{q}_{t,i}^{\top} \mathbf{k}_{j,i}}{\sqrt{d_h}} \right) \mathbf{v}_{j,i}, \qquad \mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}], \]
where $\mathbf{q}_{t,i}, \mathbf{k}_{t,i}, \mathbf{v}_{t,i} \in \mathbb{R}^{d_h}$ denote the query, key, and value of the $i$-th attention head, respectively; $[\cdot\,;\cdot]$ denotes concatenation; and $W^{O} \in \mathbb{R}^{d \times d_h n_h}$ denotes the output projection matrix. During inference, all keys and values need to be cached to accelerate generation, so MHA needs to cache $2 n_h d_h l$ elements for each token, where $l$ denotes the number of layers. This heavy KV cache is a large bottleneck that limits the maximum batch size and sequence length.
2.1.2. Low-Rank Key-Value Joint Compression
The core of MLA is the low-rank joint compression for keys and values to reduce the KV cache:
\[ \mathbf{c}_t^{KV} = W^{DKV} \mathbf{h}_t, \qquad \mathbf{k}_t^{C} = W^{UK} \mathbf{c}_t^{KV}, \qquad \mathbf{v}_t^{C} = W^{UV} \mathbf{c}_t^{KV}, \]
where $\mathbf{c}_t^{KV} \in \mathbb{R}^{d_c}$ is the compressed latent vector for keys and values; $d_c\,(\ll d_h n_h)$ denotes the KV compression dimension; $W^{DKV} \in \mathbb{R}^{d_c \times d}$ is the down-projection matrix; and $W^{UK}, W^{UV} \in \mathbb{R}^{d_h n_h \times d_c}$ are the up-projection matrices for keys and values, respectively. During inference, MLA only needs to cache $\mathbf{c}_t^{KV}$, so its KV cache has only $d_c l$ elements, where $l$ denotes the number of layers. Moreover, during inference, since $W^{UK}$ can be absorbed into $W^{Q}$ and $W^{UV}$ can be absorbed into $W^{O}$, we do not even need to compute keys and values explicitly for attention.
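As a concrete illustration, the compression path above can be sketched in plain Python. This is a minimal sketch with toy dimensions (not the model's actual sizes); `matmul` and `rand_mat` are hypothetical helpers introduced only for this example.

```python
import random

def matmul(A, x):
    """Multiply an (m x n) matrix A, stored as nested lists, by a vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def rand_mat(m, n):
    return [[random.gauss(0.0, 0.1) for _ in range(n)] for _ in range(m)]

random.seed(0)
d, n_h, d_h, d_c = 32, 4, 8, 4          # toy sizes; note d_c << n_h * d_h

W_DKV = rand_mat(d_c, d)                # down-projection W^{DKV}
W_UK  = rand_mat(n_h * d_h, d_c)        # up-projection for keys W^{UK}
W_UV  = rand_mat(n_h * d_h, d_c)        # up-projection for values W^{UV}

h_t  = [random.gauss(0.0, 1.0) for _ in range(d)]  # attention input of one token
c_KV = matmul(W_DKV, h_t)               # compressed latent: the only tensor cached
k_C  = matmul(W_UK, c_KV)               # keys reconstructed from the latent
v_C  = matmul(W_UV, c_KV)               # values reconstructed from the latent

# MHA would cache keys + values (2 * n_h * d_h floats per token per layer);
# MLA caches only c_KV (d_c floats per token per layer).
print(len(c_KV), 2 * n_h * d_h)         # 4 64
```

The up-projections operate only on the small latent, which is what makes absorbing them into the query and output projections at inference time attractive.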
Moreover, in order to reduce the activation memory during training, we also perform low-rank compression for the queries, even though it cannot reduce the KV cache:
\[ \mathbf{c}_t^{Q} = W^{DQ} \mathbf{h}_t, \qquad \mathbf{q}_t^{C} = W^{UQ} \mathbf{c}_t^{Q}, \]
where $\mathbf{c}_t^{Q} \in \mathbb{R}^{d_c'}$ is the compressed latent vector for queries; $d_c'\,(\ll d_h n_h)$ denotes the query compression dimension; and $W^{DQ} \in \mathbb{R}^{d_c' \times d}$ and $W^{UQ} \in \mathbb{R}^{d_h n_h \times d_c'}$ are the down-projection and up-projection matrices for queries, respectively.
2.1.3. Decoupled Rotary Position Embedding
Following DeepSeek 67B, we intend to use the Rotary Position Embedding (RoPE) for DeepSeek-V2. However, RoPE is incompatible with low-rank KV compression. To be specific, RoPE is position-sensitive for both keys and queries. If we apply RoPE to the keys $\mathbf{k}_t^{C}$, the up-projection matrix $W^{UK}$ will be coupled with a position-sensitive RoPE matrix. In this way, $W^{UK}$ can no longer be absorbed into $W^{Q}$ during inference, since a RoPE matrix related to the currently generating token will lie between $W^{Q}$ and $W^{UK}$, and matrix multiplication does not obey a commutative law. As a result, we would have to recompute the keys for all prefix tokens during inference, which would significantly hinder the inference efficiency.
As a solution, we propose the decoupled RoPE strategy that uses additional multi-head queries $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d_h^{R}}$ and a shared key $\mathbf{k}_t^{R} \in \mathbb{R}^{d_h^{R}}$ to carry RoPE, where $d_h^{R}$ denotes the per-head dimension of the decoupled queries and key. Equipped with the decoupled RoPE strategy, MLA performs the following computation:
\[ \mathbf{q}_t^{R} = \mathrm{RoPE}(W^{QR} \mathbf{c}_t^{Q}), \qquad \mathbf{k}_t^{R} = \mathrm{RoPE}(W^{KR} \mathbf{h}_t), \]
\[ \mathbf{q}_{t,i} = [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}], \qquad \mathbf{k}_{t,i} = [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}], \]
\[ \mathbf{o}_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left( \frac{\mathbf{q}_{t,i}^{\top} \mathbf{k}_{j,i}}{\sqrt{d_h + d_h^{R}}} \right) \mathbf{v}_{j,i}^{C}, \qquad \mathbf{u}_t = W^{O} [\mathbf{o}_{t,1}; \mathbf{o}_{t,2}; \ldots; \mathbf{o}_{t,n_h}], \]
where $W^{QR} \in \mathbb{R}^{d_h^{R} n_h \times d_c'}$ and $W^{KR} \in \mathbb{R}^{d_h^{R} \times d}$ are the matrices that produce the decoupled queries and the shared key, respectively, and $\mathrm{RoPE}(\cdot)$ denotes the operation that applies RoPE matrices. During inference, the decoupled key should also be cached, so the total KV cache of MLA contains $(d_c + d_h^{R}) l$ elements per token.
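The decoupling can be illustrated with a small sketch. It assumes one common pairwise-rotation convention for RoPE (adjacent element pairs rotated by position-dependent angles) and toy dimensions; only the decoupled parts carry position information, while the compressed query part stays position-independent.

```python
import math

def rope(vec, pos, base=10000.0):
    """Apply a rotary position embedding to a vector of even length.

    Convention assumed here: element pairs (2i, 2i+1) are rotated by an angle
    that decays with i — one common RoPE layout, used for illustration only.
    """
    out = []
    for i in range(len(vec) // 2):
        theta = pos / (base ** (2 * i / len(vec)))
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

d_h, d_h_R = 8, 4                          # toy: compressed head dim, decoupled RoPE dim
q_C = [0.1 * i for i in range(d_h)]        # position-independent part (from W^{UQ} c^Q)
q_R = rope([1.0, 0.0] * (d_h_R // 2), pos=5)   # position-dependent decoupled queries
k_R = rope([0.0, 1.0] * (d_h_R // 2), pos=5)   # shared decoupled key (cached alongside c^{KV})

q = q_C + q_R                              # concatenation [q^C ; q^R]
print(len(q))                              # d_h + d_h_R = 12
```

Because rotation is norm-preserving, the decoupled parts add position sensitivity without changing the magnitude of the signal they carry.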
In order to demonstrate the complete computation process of MLA, we also organize and provide its full formulas in Appendix C.
2.1.4. Comparison of Key-Value Cache
We demonstrate a comparison of the KV cache per token among different attention mechanisms in Table 1. MLA requires only a small amount of KV cache, equal to GQA with only 2.25 groups, but can achieve stronger performance than MHA.
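The "GQA with only 2.25 groups" equivalence can be checked arithmetically from the per-token cache formulas behind Table 1 (MHA: 2·n_h·d_h·l; GQA with n_g groups: 2·n_g·d_h·l; MQA: 2·d_h·l; MLA: (d_c + d_h^R)·l). The sketch below assumes the settings d_c = 4·d_h and d_h^R = d_h/2; the concrete values of n_h, d_h, and l are illustrative.

```python
# Per-token KV cache sizes (in element counts) for different attention mechanisms.
def mha_cache(n_h, d_h, l):   return 2 * n_h * d_h * l
def gqa_cache(n_g, d_h, l):   return 2 * n_g * d_h * l
def mqa_cache(d_h, l):        return 2 * d_h * l
def mla_cache(d_c, d_h_R, l): return (d_c + d_h_R) * l

d_h, l, n_h = 128, 60, 128          # illustrative head dim / layer count / head count
d_c, d_h_R = 4 * d_h, d_h // 2      # assumed MLA settings: d_c = 4 d_h, d_h^R = d_h / 2

# MLA's per-token cache equals that of GQA with 2.25 groups:
print(mla_cache(d_c, d_h_R, l) == gqa_cache(2.25, d_h, l))   # True
# ... and is a small fraction of full MHA's cache:
print(mla_cache(d_c, d_h_R, l) / mha_cache(n_h, d_h, l))     # 0.017578125
```

Under these settings, (d_c + d_h^R) = 4.5·d_h = 2 · 2.25 · d_h, which is exactly the GQA formula with 2.25 groups.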
2.2. DeepSeekMoE: Training Strong Models at Economical Costs
2.2.1. Basic Architecture
For FFNs, we employ the DeepSeekMoE architecture. DeepSeekMoE has two key ideas: segmenting experts into finer granularity for higher expert specialization and more accurate knowledge acquisition, and isolating some shared experts for mitigating knowledge redundancy among routed experts. With the same number of activated and total expert parameters, DeepSeekMoE can outperform conventional MoE architectures like GShard by a large margin.
Let $\mathbf{u}_t$ denote the FFN input of the $t$-th token. We compute the FFN output $\mathbf{h}_t'$ as:
\[ \mathbf{h}_t' = \mathbf{u}_t + \sum_{i=1}^{N_s} \mathrm{FFN}_i^{(s)}(\mathbf{u}_t) + \sum_{i=1}^{N_r} g_{i,t}\, \mathrm{FFN}_i^{(r)}(\mathbf{u}_t), \]
\[ g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}(\{ s_{j,t} \mid 1 \le j \le N_r \}, K_r), \\ 0, & \text{otherwise}, \end{cases} \qquad s_{i,t} = \mathrm{Softmax}_i(\mathbf{u}_t^{\top} \mathbf{e}_i), \]
where $N_s$ and $N_r$ denote the numbers of shared experts and routed experts, respectively; $\mathrm{FFN}_i^{(s)}(\cdot)$ and $\mathrm{FFN}_i^{(r)}(\cdot)$ denote the $i$-th shared expert and the $i$-th routed expert, respectively; $K_r$ denotes the number of activated routed experts; $g_{i,t}$ is the gate value for the $i$-th expert; $s_{i,t}$ is the token-to-expert affinity; $\mathrm{Topk}(\cdot, K_r)$ denotes the set of the $K_r$ highest affinity scores among all routed experts for the $t$-th token; and $\mathbf{e}_i$ is the centroid of the $i$-th routed expert in this layer.
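The gate computation can be sketched as follows. Shared experts are omitted (they are always active and need no gating), and the expert count, top-K value, and random affinities are toy choices for illustration.

```python
import math, random

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

random.seed(1)
N_r, K_r = 8, 2                     # toy: 8 routed experts, activate the top 2

# token-to-expert affinities s_{i,t} = Softmax_i(u_t^T e_i); here the logits
# (the dot products with expert centroids) are random stand-ins
logits = [random.gauss(0.0, 1.0) for _ in range(N_r)]
s = softmax(logits)

# gate g_{i,t}: keep the affinity for the top-K_r experts, zero elsewhere
topk = sorted(range(N_r), key=lambda i: s[i], reverse=True)[:K_r]
g = [s[i] if i in topk else 0.0 for i in range(N_r)]

print(sum(1 for x in g if x > 0))   # exactly K_r experts are activated
```

Only the experts with nonzero gates run their FFNs for this token, which is where the sparse-computation savings come from.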
2.2.2. Device-Limited Routing
We design a device-limited routing mechanism to bound MoE-related communication costs. When expert parallelism is employed, the routed experts will be distributed across multiple devices. For each token, its MoE-related communication frequency is proportional to the number of devices covered by its target experts. Due to the fine-grained expert segmentation in DeepSeekMoE, the number of activated experts can be large, so the MoE-related communication will be more costly if we apply expert parallelism.
For DeepSeek-V2, beyond the naive top-K selection of routed experts, we additionally ensure that the target experts of each token will be distributed on at most $M$ devices. To be specific, for each token, we first select the $M$ devices that host experts with the highest affinity scores, and then perform the top-K selection only among the experts on these $M$ devices. In practice, we find that when $M \geq 3$, the device-limited routing can achieve good performance roughly aligned with the unrestricted top-K routing.
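The two-step selection can be sketched as follows, under a toy assignment of experts to devices and hand-picked affinity scores (all values below are illustrative, not the paper's configuration):

```python
# Device-limited routing: restrict each token's routed experts to M devices.
experts_per_device = {0: [0, 1], 1: [2, 3], 2: [4, 5], 3: [6, 7]}
affinity = [0.05, 0.30, 0.10, 0.02, 0.25, 0.08, 0.15, 0.05]   # s_{i,t} for one token
K_r, M = 3, 2

# Step 1: pick the M devices hosting the highest-affinity experts.
device_score = {d: max(affinity[e] for e in es) for d, es in experts_per_device.items()}
top_devices = sorted(device_score, key=device_score.get, reverse=True)[:M]

# Step 2: run the top-K selection only over experts on those devices.
candidates = [e for d in top_devices for e in experts_per_device[d]]
selected = sorted(candidates, key=lambda e: affinity[e], reverse=True)[:K_r]

print(sorted(selected))   # [1, 4, 5] — drawn from at most M devices
```

Note that expert 6 (affinity 0.15) loses to expert 5 (0.08) here: its device was not among the top M, which is exactly the communication bound the mechanism enforces.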
2.2.3. Auxiliary Loss for Load Balance
We take load balance into consideration for the automatically learned routing strategy. Firstly, an unbalanced load will raise the risk of routing collapse, preventing some experts from being fully trained and utilized. Secondly, when expert parallelism is employed, an unbalanced load will diminish computation efficiency. During the training of DeepSeek-V2, we design three kinds of auxiliary losses, for controlling expert-level load balance ($\mathcal{L}_{\mathrm{ExpBal}}$), device-level load balance ($\mathcal{L}_{\mathrm{DevBal}}$), and communication balance ($\mathcal{L}_{\mathrm{CommBal}}$), respectively.
Expert-Level Balance Loss. We use an expert-level balance loss to mitigate the risk of routing collapse:
\[ \mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N_r} f_i P_i, \qquad f_i = \frac{N_r}{K_r T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ selects Expert } i), \qquad P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}, \]
where $\alpha_1$ is a hyper-parameter called the expert-level balance factor, $\mathbb{1}(\cdot)$ denotes the indicator function, and $T$ denotes the number of tokens in a sequence.
Device-Level Balance Loss. In addition to the expert-level balance loss, we additionally design a device-level balance loss to ensure balanced computation across different devices. In the training process of DeepSeek-V2, we partition all routed experts into $D$ groups $\{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_D\}$, and deploy each group on a single device. The device-level balance loss is computed as follows:
\[ \mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i', \qquad f_i' = \frac{1}{|\mathcal{E}_i|} \sum_{j \in \mathcal{E}_i} f_j, \qquad P_i' = \sum_{j \in \mathcal{E}_i} P_j, \]
where $\alpha_2$ is a hyper-parameter called the device-level balance factor.
Communication Balance Loss. Finally, we introduce a communication balance loss to ensure that the communication of each device is balanced. Although the device-limited routing mechanism guarantees that the sending communication of each device is bounded, if a certain device receives more tokens than the others, the practical communication efficiency will also be affected. In order to mitigate this issue, we design a communication balance loss as follows:
\[ \mathcal{L}_{\mathrm{CommBal}} = \alpha_3 \sum_{i=1}^{D} f_i'' P_i'', \qquad f_i'' = \frac{D}{M T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } t \text{ is sent to Device } i), \qquad P_i'' = \sum_{j \in \mathcal{E}_i} P_j, \]
where $\alpha_3$ is a hyper-parameter called the communication balance factor. The device-limited routing mechanism ensures that each device sends at most $M T$ hidden states to other devices, while the communication balance loss encourages each device to receive around $M T$ hidden states from other devices.
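As a worked example, the expert-level balance loss can be computed on a toy routing trace; the affinities, selections, and balance factor below are illustrative only.

```python
# Expert-level balance loss on a toy trace: L = alpha_1 * sum_i f_i * P_i.
N_r, K_r, T = 4, 2, 3          # 4 routed experts, top-2 routing, 3 tokens
alpha_1 = 0.01                 # illustrative balance factor

# s[t][i]: affinity of token t for expert i; chosen[t]: experts selected for token t
s = [[0.4, 0.3, 0.2, 0.1],
     [0.1, 0.5, 0.3, 0.1],
     [0.3, 0.1, 0.4, 0.2]]
chosen = [{0, 1}, {1, 2}, {0, 2}]

# f_i: normalized selection frequency; P_i: mean affinity of expert i
f = [N_r / (K_r * T) * sum(1 for t in range(T) if i in chosen[t]) for i in range(N_r)]
P = [sum(s[t][i] for t in range(T)) / T for i in range(N_r)]
loss = alpha_1 * sum(fi * pi for fi, pi in zip(f, P))
print(round(loss, 6))          # ≈ 0.011556
```

Note that the normalization makes the f_i sum to N_r regardless of the routing pattern, so the loss is minimized when selections (and affinities) are spread evenly across experts.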
2.2.4. Token-Dropping Strategy
While balance losses aim to encourage a balanced load, it is important to acknowledge that they cannot guarantee a strict load balance. In order to further mitigate the computation wastage caused by unbalanced load, we introduce a device-level token-dropping strategy during training. This approach first computes the average computational budget for each device, which means that the capacity factor for each device is equivalent to 1.0. Then, inspired by Riquelme et al., we drop tokens with the lowest affinity scores on each device until reaching the computational budget. In addition, we ensure that the tokens belonging to approximately 10% of the training sequences will never be dropped. In this way, we can flexibly decide whether to drop tokens during inference according to the efficiency requirements, and always ensure consistency between training and inference.
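A minimal sketch of the per-device drop rule (capacity factor 1.0): each device keeps the tokens with the highest affinity scores up to its computational budget. The token names, scores, and budget are toy values, and the never-drop exemption for roughly 10% of training sequences is omitted for brevity.

```python
# Device-level token dropping: keep the highest-affinity tokens up to the budget.
tokens = [("t0", 0.9), ("t1", 0.2), ("t2", 0.7), ("t3", 0.4), ("t4", 0.6)]
budget = 3                                   # average per-device computational budget

kept = sorted(tokens, key=lambda kv: kv[1], reverse=True)[:budget]
dropped = [name for name, _ in tokens if name not in {n for n, _ in kept}]
print(sorted(n for n, _ in kept), dropped)   # ['t0', 't2', 't4'] ['t1', 't3']
```

Because dropping is rank-based rather than random, the tokens the router is most confident about are always processed, and the same rule can optionally be re-applied at inference time.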
Alignment
4.1. Supervised Fine-Tuning
Building upon our prior research, we curate our instruction tuning datasets to include 1.5M instances, comprising 1.2M instances for helpfulness and 0.3M instances for safety. In comparison to the initial version, we improve the data quality to mitigate hallucinatory responses and enhance writing proficiency. We fine-tune DeepSeek-V2 with 2 epochs, and the learning rate is set to
4.2. Reinforcement Learning
In order to further unlock the potential of DeepSeek-V2 and align it with human preference, we conduct Reinforcement Learning (RL) to adjust its preference.
Reinforcement Learning Algorithm. In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO), which foregoes the critic model that is typically of the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_1, o_2, \ldots, o_G\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:
\[ \mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[ q \sim P(Q),\, \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q) \right] \frac{1}{G} \sum_{i=1}^{G} \left( \min\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)} A_i,\ \mathrm{clip}\!\left( \frac{\pi_{\theta}(o_i \mid q)}{\pi_{\theta_{old}}(o_i \mid q)},\, 1 - \varepsilon,\, 1 + \varepsilon \right) A_i \right) - \beta\, \mathbb{D}_{KL}\!\left( \pi_{\theta} \,\|\, \pi_{ref} \right) \right), \]
where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_i$ is the advantage, computed using the group of rewards $\{r_1, r_2, \ldots, r_G\}$ corresponding to the outputs within each group:
\[ A_i = \frac{r_i - \mathrm{mean}(\{r_1, r_2, \ldots, r_G\})}{\mathrm{std}(\{r_1, r_2, \ldots, r_G\})}. \]
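The group-relative baseline can be sketched in a few lines. One caveat: whether the population or sample standard deviation is used is a convention choice not pinned down here; this sketch uses the population form.

```python
import statistics

def group_advantages(rewards):
    """Group-relative advantages: A_i = (r_i - mean(r)) / std(r)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)   # population std; a convention choice
    return [(r - mu) / sigma for r in rewards]

# One question, a group of G = 4 sampled outputs with scalar rewards:
rewards = [1.0, 0.0, 0.5, 0.5]
adv = group_advantages(rewards)
print([round(a, 3) for a in adv])        # [1.414, -1.414, 0.0, 0.0]
```

Because the baseline is the group mean, the advantages always sum to zero within a group: outputs better than their siblings are reinforced and worse ones are suppressed, with no critic network needed.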
Training Strategy. In our preliminary experiments, we find that RL training on reasoning data, such as code and math prompts, exhibits unique characteristics that are distinct from training on general data. For example, the mathematical and coding abilities of our model can keep improving over a longer period of training steps. Therefore, we employ a two-stage RL training strategy, which first performs reasoning alignment, and then performs human preference alignment. In the first reasoning alignment stage, we train a reward model $RM_{reasoning}$ for code and math reasoning tasks, and optimize the policy model with the feedback of $RM_{reasoning}$:
\[ r_i = RM_{reasoning}(o_i). \]
In the second human preference alignment stage, we adopt a multi-reward framework, which acquires rewards from a helpful reward model $RM_{helpful}$, a safety reward model $RM_{safety}$, and a rule-based reward model $RM_{rule}$. The final reward of a response $o_i$ is
\[ r_i = c_1 \cdot RM_{helpful}(o_i) + c_2 \cdot RM_{safety}(o_i) + c_3 \cdot RM_{rule}(o_i), \]
where $c_1$, $c_2$, and $c_3$ are the corresponding coefficients.
In order to obtain reliable reward models that play crucial roles in the RL training, we carefully collect preference data, and meticulously conduct quality filtering and proportion adjustments. We obtain code preference data based on compiler-feedback, and mathematical preference data based on the ground-truth labels. For reward model training, we initialize the reward models with DeepSeek-V2 Chat (SFT) and train them with either a point-wise or a pair-wise loss. In our experiments, we observe that the RL training can fully tap into and activate the potential of our model, enabling it to select the correct and satisfactory answer from possible responses.
Optimizations for Training Efficiency. Conducting RL training on extremely large models places high demands on the training framework. It requires careful engineering optimization to manage the GPU memory and RAM pressure, and meanwhile maintain a fast training speed. For this goal, we implement the following engineering optimizations. (1) Firstly, we propose a hybrid engine that adopts different parallel strategies for training and inference respectively to achieve higher GPU utilization. (2) Secondly, we leverage vLLM with large batch sizes as our inference backend to accelerate the inference speed. (3) Thirdly, we carefully design a scheduling strategy for offloading models to CPUs and loading models back to GPUs, which achieves a near-optimal balance between the training speed and memory consumption.
4.3. Evaluation Results
Evaluations on Standard Benchmarks. Initially, we evaluate DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) on standard benchmarks. Notably, DeepSeek-V2 Chat (SFT) demonstrates substantial improvements in GSM8K, MATH, and HumanEval evaluations compared with its base version. This progress can be attributed to the inclusion of our SFT data, which comprises a considerable volume of math and code related content. In addition, DeepSeek-V2 Chat (RL) further boosts the performance on math and code benchmarks. We show more code and math evaluations in Appendix F.
As for the comparisons with other models, we first compare DeepSeek-V2 Chat (SFT) with Qwen1.5 72B Chat, and find that DeepSeek-V2 Chat (SFT) surpasses Qwen1.5 72B Chat on almost all of English, math, and code benchmarks. On Chinese benchmarks, DeepSeek-V2 Chat (SFT) demonstrates slightly lower scores than Qwen1.5 72B Chat on multi-subject multiple-choice tasks, consistent with the performance observed from their base versions. When compared with the state-of-the-art open-source MoE model, Mixtral 8x22B Instruct, DeepSeek-V2 Chat (SFT) exhibits better performance on most benchmarks, except for NaturalQuestions and IFEval. Furthermore, in comparison to the state-of-the-art open-source model LLaMA3 70B Chat, DeepSeek-V2 Chat (SFT) shows similar performance in code and math related benchmarks. LLaMA3 70B Chat exhibits better performance on MMLU and IFEval, while DeepSeek-V2 Chat (SFT) showcases stronger performance on Chinese tasks. Ultimately, DeepSeek-V2 Chat (RL) demonstrates further enhanced performance in both mathematical and coding tasks compared with DeepSeek-V2 Chat (SFT). These comparisons highlight the strengths of DeepSeek-V2 Chat in relation to other language models in various domains and languages.
Evaluations on Open-Ended Generation. We proceed with additional evaluations of our models on open-ended conversation benchmarks. For English open-ended conversation generation, we utilize MT-Bench and AlpacaEval 2.0 as the benchmarks. Evaluation results presented in Table 4 demonstrate a significant performance advantage of DeepSeek-V2 Chat (RL) over DeepSeek-V2 Chat (SFT). This outcome showcases the effectiveness of our RL training in achieving improved alignment. In comparison to other open-source models, DeepSeek-V2 Chat (RL) demonstrates superior performance over Mixtral 8x22B Instruct and Qwen1.5 72B Chat on both benchmarks. When compared with LLaMA3 70B Instruct, DeepSeek-V2 Chat (RL) showcases competitive performance on MT-Bench and notably outperforms it on AlpacaEval 2.0. These results highlight the strong performance of DeepSeek-V2 Chat (RL) in generating high-quality and contextually relevant responses, particularly in instruction-based conversation tasks.
In addition, we evaluate the Chinese open-ended generation capability based on AlignBench. As presented in Table 5, DeepSeek-V2 Chat (RL) exhibits a slight advantage over DeepSeek-V2 Chat (SFT). Notably, DeepSeek-V2 Chat (SFT) surpasses all open-source Chinese models by a significant margin. It significantly outperforms the second-best open-source model, Qwen1.5 72B Chat, on both Chinese reasoning and language. Moreover, both DeepSeek-V2 Chat (SFT) and DeepSeek-V2 Chat (RL) outperform GPT-4-0613 and ERNIEBot 4.0, solidifying the position of our models among the top-tier LLMs that support Chinese. Specifically, DeepSeek-V2 Chat (RL) shows remarkable performance in Chinese language understanding, outperforming all models including GPT-4-Turbo-1106-Preview. On the other hand, the reasoning capability of DeepSeek-V2 Chat (RL) still lags behind giant models such as ERNIEBot 4.0 and the GPT-4 models.
4.4. Discussion
Amount of SFT Data. The discussion surrounding the necessity of a large SFT corpus has been a topic of intense debate. Previous works argue that fewer than 10K instances of SFT data are enough to produce satisfactory results. However, in our experiments, we observe a significant performance decline on the IFEval benchmark if we use fewer than 10K instances. A possible explanation is that, a language model necessitates a certain amount of data to develop specific skills. Although the requisite data amount may diminish with the model size increasing, it cannot be entirely eliminated. Our observation underscores the critical need for sufficient data to equip an LLM with desired capabilities. Moreover, the quality of SFT data is also crucial, especially for tasks involving writing or open-ended questions.
Alignment Tax of Reinforcement Learning. During human preference alignment, we observe a significant performance enhancement on the open-ended generation benchmarks, in terms of the scores rated by both AI and human evaluators. However, we also notice a phenomenon of "alignment tax", i.e., the alignment process can negatively impact the performance on some standard benchmarks such as BBH. In order to alleviate the alignment tax, during the RL stage, we make significant efforts in data processing and improving training strategies, finally achieving a tolerable trade-off between the performance on standard and open-ended benchmarks. Exploring how to align a model with human preferences without compromising its general performance presents a valuable direction for future research.
Online Reinforcement Learning. In our preference alignment experiments, we find that the online approach significantly outperforms the offline approach. Therefore, we invest tremendous efforts in implementing an online RL framework for aligning DeepSeek-V2. The conclusion about online or offline preference alignment can vary in different contexts, and we reserve a more thorough comparison and analysis between them for future work.
Conclusion, Limitation, and Future Work
In this paper, we introduce DeepSeek-V2, a large MoE language model that supports 128K context length. In addition to strong performance, it is also characterized by economical training and efficient inference, benefiting from its innovative architecture including MLA and DeepSeekMoE. In practice, compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. Evaluation results further demonstrate that with only 21B activated parameters, DeepSeek-V2 achieves top-tier performance among open-source models and becomes the strongest open-source MoE model.
DeepSeek-V2 and its chat versions share the acknowledged limitations commonly found in other LLMs, including the lack of ongoing knowledge updates after pre-training, the possibility of generating non-factual information such as unverified advice, and a chance to produce hallucinations. In addition, since our data primarily consist of Chinese and English content, our model may exhibit limited proficiency in other languages. In scenarios beyond Chinese and English, it should be used with caution.
DeepSeek will continuously invest in open-source large models with longtermism, aiming to progressively approach the goal of artificial general intelligence.
- In our ongoing exploration, we are dedicated to devising methods that enable further scaling up MoE models while maintaining economical training and inference costs. The goal of our next step is to achieve performance on par with GPT-4 in our upcoming release.
- Our alignment team continuously strives to enhance our models, aiming to develop a model that is not only helpful but also honest and safe for worldwide users. Our ultimate objective is to align the values of our model with human values, while minimizing the need for human supervision. By prioritizing ethical considerations and responsible development, we are dedicated to creating a positive and beneficial impact on society.
- Currently, DeepSeek-V2 is designed to support the text modality exclusively. In our forward-looking agenda, we intend to enable our model to support multiple modalities, enhancing its versatility and utility in a wider range of scenarios.