Bag of Tricks and A Strong Baseline for Deep Person Re-identification

Luo H, Gu Y, Liao X, Lai S, Jiang W. Bag of Tricks and A Strong Baseline for Deep Person Re-identification. CVPR Workshops 2019.
https://github.com/michuanhaohao/reid-strong-baseline

Person ReID | Baseline | Training Tricks | 1980+ | CVPR Workshop 2019 | ZJU | CAS | XJTU

Abstract

This paper explores a simple and efficient baseline for person re-identification (ReID). Person ReID with deep neural networks has progressed rapidly and reached high performance in recent years. However, many state-of-the-art methods design complex network structures and concatenate multi-branch features. In the literature, some effective training tricks appear only briefly in a few papers or in source code. This paper collects and evaluates these training tricks for person ReID. Combining them, the model achieves 94.5% rank-1 accuracy and 85.9% mAP on Market1501 using only global features.

1. Introduction

Person re-identification (ReID) with deep neural networks has progressed rapidly and reached high performance in recent years. However, many state-of-the-art methods design complex network structures and concatenate multi-branch features. In the literature, some effective training tricks or refinements appear only briefly in a few papers or in source code. This paper collects and evaluates such training tricks for person ReID. With all of the training tricks applied, ResNet50 reaches 94.5% rank-1 accuracy and 85.9% mAP on Market1501. It is worth mentioning that it achieves this performance using only the global features of the model.

Figure 1: Baseline comparison on (a) Market1501 and (b) DukeMTMC-reID.

For comparison, we surveyed articles published at ECCV 2018 and CVPR 2018. As shown in Figure 1, most previous works were built on poor baselines. On Market1501, only two of 23 baselines surpassed 90% rank-1 accuracy, and the rank-1 accuracies of four baselines were even lower than 80%. On DukeMTMC-reID, no baseline surpassed 80% rank-1 accuracy or 65% mAP. We think a strong baseline is very important for promoting research. Therefore, we modified the standard baseline with some training tricks to obtain a strong baseline. The code of our strong baseline has been open-sourced.

In addition, we found that some works were compared unfairly with other state-of-the-art methods. Specifically, the improvements came mainly from training tricks rather than from the methods themselves. But the training tricks were understated in the papers, so readers overlooked them, which exaggerates the effectiveness of the methods. We suggest that reviewers take these tricks into account when commenting on academic papers.

Apart from the aforementioned reasons, another consideration is that industry prefers simple and effective models to concatenating many local features in the inference stage. In pursuit of high accuracy, academic researchers often combine several local features or use semantic information from pose estimation or segmentation models. Such methods bring too much extra consumption. Large features also greatly reduce the speed of the retrieval process. Thus, we hope to use some tricks to improve the ability of the ReID model and to achieve high performance with global features only. The purposes of this paper are summarized as follows:

We surveyed many works published at top conferences and found that most of them were built on poor baselines.

For academia, we hope to provide a strong baseline that helps researchers achieve higher accuracy in person ReID.

For the community, we hope to give reviewers some reference on which tricks affect the performance of a ReID model. We suggest that reviewers take these tricks into account when comparing the performance of different methods.

For industry, we hope to provide some effective tricks for obtaining better models without too much extra consumption.

Fortunately, many effective training tricks have already appeared in papers or open-source projects. We collect many tricks and evaluate each of them on ReID datasets. After a lot of experiments, we choose six tricks to introduce in this paper. Some of them were designed or modified by us. We add these tricks to a widely used baseline to get our modified baseline, which achieves 94.5% rank-1 accuracy and 85.9% mAP on Market1501. Moreover, we found that different works choose different image sizes and batch sizes, so as a supplement we also explore their impact on model performance. In summary, the contributions of this paper are as follows:

We collect some effective training tricks for person ReID. Among them, we design a new neck structure named BNNeck. In addition, we evaluate the improvement from each trick on two widely used datasets.

We provide a strong ReID baseline, which achieves 94.5% rank-1 accuracy and 85.9% mAP on Market1501. It is worth mentioning that the results are obtained with global features provided by a ResNet50 backbone. To the best of our knowledge, this is the best performance achieved with global features in person ReID.

As a supplement, we evaluate the influence of the image size and the batch size on the performance of ReID models.

Figure 2: Pipelines of (a) the standard baseline and (b) our modified baseline.

2. Standard Baseline

We follow a widely used open-source implementation as our standard baseline. The backbone of the standard baseline is ResNet50. During the training stage, the pipeline includes the following steps:

1. We initialize ResNet50 with parameters pre-trained on ImageNet and change the dimension of the fully connected layer to $N$, where $N$ denotes the number of identities in the training dataset.

2. We randomly sample $P$ identities and $K$ images per person to constitute a training batch, so the batch size is $B = P \times K$. In this paper, we set $P = 16$ and $K = 4$.

3. We resize each image to $256 \times 128$ pixels and pad the resized image by 10 pixels with zero values. Then we randomly crop it to a $256 \times 128$ rectangular image.

4. Each image is flipped horizontally with probability 0.5.

5. Each image is decoded into 32-bit floating-point raw pixel values in $[0, 1]$. Then we normalize the RGB channels by subtracting 0.485, 0.456, 0.406 and dividing by 0.229, 0.224, 0.225, respectively.

6. The model outputs ReID features $f$ and ID prediction logits $p$.

7. The ReID features $f$ are used to calculate the triplet loss. The ID prediction logits $p$ are used to calculate the cross-entropy loss. The margin $m$ of the triplet loss (written $α$ in Section 3.6) is set to 0.3.

8. Adam is adopted to optimize the model. The initial learning rate is set to 0.00035 and is multiplied by 0.1 at the 40th epoch and again at the 70th epoch. There are 120 training epochs in total.
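Step 2 of the pipeline (sampling $P$ identities with $K$ images each) can be sketched in plain Python as follows. This is a minimal illustration, not the sampler from the released code: the function name `sample_pk_batch` is hypothetical, and sampling with replacement when an identity has fewer than $K$ images is an assumption the text does not specify.

```python
import random
from collections import defaultdict


def sample_pk_batch(labels, p=16, k=4, rng=None):
    """Return B = P * K dataset indices: P identities, K images per identity.

    labels[i] is the identity of image i. Identities with fewer than K images
    are sampled with replacement (an assumption made for this sketch).
    """
    rng = rng or random.Random()
    by_id = defaultdict(list)
    for index, pid in enumerate(labels):
        by_id[pid].append(index)
    if len(by_id) < p:
        raise ValueError("not enough identities to fill a batch")

    batch = []
    for pid in rng.sample(sorted(by_id), p):
        pool = by_id[pid]
        if len(pool) >= k:
            batch.extend(rng.sample(pool, k))
        else:
            batch.extend(rng.choice(pool) for _ in range(k))
    return batch
```

With $P = 16$ and $K = 4$ this yields the 64-image batch used in the paper.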

3. Training Tricks

This section introduces some effective training tricks for person ReID. Most of them can be added to the standard baseline without changing the model architecture. Figure 2(b) shows the training strategies and the model architecture discussed in this section.

3.1. Warmup Learning Rate

The learning rate has a great impact on the performance of a ReID model. The standard baseline is initially trained with a large and constant learning rate. In SphereReID, a warmup strategy is applied to bootstrap the network for better performance. In practice, as shown in Figure 3, we spend 10 epochs linearly increasing the learning rate from $3.5 \times 10^{-5}$ to $3.5 \times 10^{-4}$. Then the learning rate is decayed to $3.5 \times 10^{-5}$ and $3.5 \times 10^{-6}$ at the 40th epoch and the 70th epoch, respectively. The learning rate $lr(t)$ at epoch $t$ is computed as:

lr(t) = \begin{cases} 3.5 \times 10^{-4} \times \frac{t}{10}, & \text{if } t \leq 10, \\ 3.5 \times 10^{-4}, & \text{if } 10 < t \leq 40, \\ 3.5 \times 10^{-5}, & \text{if } 40 < t \leq 70, \\ 3.5 \times 10^{-6}, & \text{if } 70 < t \leq 120. \end{cases}

Figure 3: Comparison of learning rate schedules. With the warmup strategy, the learning rate is linearly increased in the first 10 epochs.
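The schedule can be written as a small function. The sketch below follows the prose description (a linear ramp from $3.5 \times 10^{-5}$ at epoch 1 to $3.5 \times 10^{-4}$ at epoch 10, then two decays by a factor of 0.1); the function name `warmup_step_lr` and the 1-indexed epoch convention are assumptions made for illustration.

```python
def warmup_step_lr(epoch, base_lr=3.5e-4, warmup_epochs=10):
    """Learning rate for a 1-indexed epoch of a 120-epoch run."""
    if not 1 <= epoch <= 120:
        raise ValueError("epoch must be in [1, 120]")
    if epoch <= warmup_epochs:
        # Linear warmup: 3.5e-5 at epoch 1, 3.5e-4 at epoch 10.
        return base_lr * epoch / warmup_epochs
    if epoch <= 40:
        return base_lr
    if epoch <= 70:
        return base_lr * 0.1
    return base_lr * 0.01
```

In a training loop, the returned value would be assigned to the optimizer's learning rate at the start of each epoch.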

3.2. Random Erasing Augmentation

In person ReID, persons in the images are sometimes occluded by other objects. To address the occlusion problem and improve the generalization ability of ReID models, Zhong et al. proposed a data augmentation approach named Random Erasing Augmentation (REA). In practice, for an image $I$ in a mini-batch, the probability of it undergoing random erasing is $p_{e}$, and the probability of it being kept unchanged is $1 - p_{e}$. REA then randomly selects a rectangular region $I_{e}$ of size $(W_{e}, H_{e})$ in image $I$ and erases its pixels with random values. Assuming the areas of image $I$ and region $I_{e}$ are $S = W \times H$ and $S_{e} = W_{e} \times H_{e}$, respectively, we denote $r_{e} = \frac{S_{e}}{S}$ as the area ratio of the erased rectangular region. In addition, the aspect ratio of region $I_{e}$ is randomly initialized between $r_{1}$ and $r_{2}$. To determine a unique region, REA randomly initializes a point $P = (x_{e}, y_{e})$. If $x_{e} + W_{e} \leq W$ and $y_{e} + H_{e} \leq H$, we set the region $I_{e} = (x_{e}, y_{e}, x_{e} + W_{e}, y_{e} + H_{e})$ as the selected rectangular region. Otherwise, we repeat the above process until an appropriate $I_{e}$ is selected. With the selected erasing region $I_{e}$, each pixel in $I_{e}$ is assigned the mean value of image $I$.

In this study, we set the hyper-parameters to $p_{e} = 0.5$, $0.02 < r_{e} < 0.4$, $r_{1} = 0.3$ and $r_{2} = 3.33$. Some examples are shown in Figure 4.

Figure 4: Sampled examples of random erasing augmentation. The first row shows five original training images. The processed images are presented in the second row.
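The region selection and erasing steps can be sketched on a single-channel image stored as a list of rows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the names `select_erase_region` and `random_erase` are hypothetical, the aspect ratio is taken as $H_{e} / W_{e}$, and the retry loop is capped at `max_tries` instead of repeating indefinitely.

```python
import math
import random


def select_erase_region(width, height, rng, area_range=(0.02, 0.4),
                        r1=0.3, r2=3.33, max_tries=100):
    """Return (x_e, y_e, w_e, h_e) for a rectangle inside the image, or None."""
    for _ in range(max_tries):
        s_e = rng.uniform(*area_range) * width * height  # erased area S_e
        ratio = rng.uniform(r1, r2)                      # assumed to be H_e / W_e
        h_e = int(round(math.sqrt(s_e * ratio)))
        w_e = int(round(math.sqrt(s_e / ratio)))
        if h_e < 1 or w_e < 1:
            continue
        x_e = rng.randrange(width)
        y_e = rng.randrange(height)
        if x_e + w_e <= width and y_e + h_e <= height:
            return x_e, y_e, w_e, h_e
    return None


def random_erase(image, rng, p_e=0.5):
    """Return a copy of the image, erased with probability p_e."""
    out = [row[:] for row in image]
    if rng.random() >= p_e:
        return out  # kept unchanged with probability 1 - p_e
    height, width = len(image), len(image[0])
    region = select_erase_region(width, height, rng)
    if region is None:
        return out
    x_e, y_e, w_e, h_e = region
    mean = sum(map(sum, image)) / (width * height)
    for y in range(y_e, y_e + h_e):
        for x in range(x_e, x_e + w_e):
            out[y][x] = mean  # fill with the image mean, as described above
    return out
```

For an RGB image the same rectangle would be filled in every channel; whether a per-channel or a global mean is used is left open here.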

3.3. Label Smoothing

The ID Embedding (IDE) network is a basic baseline in person ReID. The last layer of IDE, which outputs the ID prediction logits of images, is a fully connected layer whose output size equals the number of persons $N$. Given an image, we denote $y$ as the ground-truth ID label and $p_{i}$ as the ID prediction logits of class $i$. The cross-entropy loss is computed as:

\begin{aligned} L_{ID} & = \sum_{i = 1}^{N} - q_{i} \log (p_{i}), \\ q_{i} & = \begin{cases} 0, & y \neq i, \\ 1, & y = i. \end{cases} \end{aligned}

Because the classification category is determined by the person ID, we call this loss function ID loss in this paper. Nevertheless, person ReID can be regarded as a one-shot learning task because the person IDs of the testing set do not appear in the training set. So it is important to prevent the ReID model from overfitting the training IDs. Label smoothing (LS), proposed in Rethinking the Inception Architecture for Computer Vision, is a widely used method for preventing overfitting in classification tasks. It changes the construction of $q_{i}$ to:

q_{i} = \begin{cases} 1 - \frac{N - 1}{N} ε, & \text{if } i = y, \\ ε / N, & \text{otherwise,} \end{cases}

where $ε$ is a small constant that encourages the model to be less confident on the training set. In this study, $ε$ is set to $0.1$. When the training set is not very large, LS can significantly improve the performance of the model.
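As a worked example, with $N = 4$ and $ε = 0.1$ the target for the true class is $1 - \frac{3}{4} \times 0.1 = 0.925$ and each other class receives $0.1 / 4 = 0.025$, so the targets still sum to 1. The sketch below computes the smoothed targets and the resulting ID loss. It assumes that the $p_{i}$ inside the logarithm are softmax probabilities obtained from the logits; the function names are hypothetical.

```python
import math


def smoothed_targets(label, num_classes, eps=0.1):
    """Label-smoothed target distribution q for one sample."""
    q = [eps / num_classes] * num_classes
    q[label] = 1.0 - (num_classes - 1) / num_classes * eps
    return q


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def id_loss(logits, label, eps=0.1):
    """Cross-entropy between smoothed targets and softmax probabilities."""
    q = smoothed_targets(label, len(logits), eps)
    p = softmax(logits)
    return -sum(q_i * math.log(p_i) for q_i, p_i in zip(q, p))
```

Setting `eps=0.0` recovers the plain one-hot cross-entropy loss defined earlier.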

3.4. Last Stride

Higher spatial resolution enriches the granularity of features. In Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline), Sun et al. removed the last spatial down-sampling operation in the backbone network to increase the size of the feature map. For convenience, we refer to the last spatial down-sampling operation in the backbone network as the last stride. The last stride of ResNet50 is 2. When fed an image of size $256 \times 128$, the ResNet50 backbone outputs a feature map with a spatial size of $8 \times 4$. If we change the last stride from 2 to 1, we obtain a feature map with a larger spatial size ($16 \times 8$). This change adds only a small computation cost and introduces no extra trainable parameters, yet the higher spatial resolution brings a significant improvement.
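The sizes quoted above follow from the overall down-sampling factor of ResNet50, which halves the resolution five times (stem convolution, max-pooling, and the first block of each of the last three stages). A small sketch of the arithmetic, assuming the input height and width are divisible by the total stride (the function name is hypothetical):

```python
def feature_map_size(height, width, last_stride=2):
    """Spatial size of the final ResNet50 feature map for a given last stride."""
    # stem conv (2) * max-pool (2) * stage 3 (2) * stage 4 (2) * last stage
    total_stride = 2 * 2 * 2 * 2 * last_stride
    return height // total_stride, width // total_stride
```

`feature_map_size(256, 128)` gives `(8, 4)`, and `feature_map_size(256, 128, last_stride=1)` gives `(16, 8)`.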

3.5. BNNeck

Figure 5: (a) The neck of the standard baseline. (b) Our designed BNNeck.

Figure 6: Two-dimensional visualization of the sample distribution in the embedding space supervised by (a) ID loss, (b) triplet loss, (c) ID + triplet loss and (d) ID + triplet loss + BNNeck. Points of different colors represent embedding features from different classes. The yellow dotted lines stand for the supposed classification hyperplanes.

Most works combine ID loss and triplet loss to train ReID models. As shown in Figure 5(a), in the standard baseline, ID loss and triplet loss constrain the same feature $f$. However, the targets of these two losses are inconsistent in the embedding space.

As shown in Figure 6(a), ID loss constructs several hyperplanes that separate the embedding space into different subspaces, and the features of each class are distributed in different subspaces. In this case, cosine distance is more suitable than Euclidean distance in the inference stage for a model optimized by ID loss. On the other hand, as shown in Figure 6(b), triplet loss enhances intra-class compactness and inter-class separability in Euclidean space. Because triplet loss cannot provide a globally optimal constraint, the inter-class distance is sometimes smaller than the intra-class distance. A widely used method is to combine ID loss and triplet loss to train the model. This approach lets the model learn more discriminative features. Nevertheless, for image pairs in the embedding space, ID loss mainly optimizes cosine distances while triplet loss focuses on Euclidean distances. If we use these two losses to optimize the same feature vector simultaneously, their goals may be inconsistent. During training, one loss may decrease while the other oscillates or even increases.

To overcome the aforementioned problem, we design a structure named BNNeck, shown in Figure 5(b). BNNeck only adds a batch normalization (BN) layer after the features (and before the classifier FC layer). The feature before the BN layer is denoted as $f_{t}$. We let $f_{t}$ pass through the BN layer to acquire the normalized feature $f_{i}$. In the training stage, $f_{t}$ and $f_{i}$ are used to compute the triplet loss and the ID loss, respectively. First, normalization balances each dimension of $f_{i}$: the features follow a Gaussian-like distribution near the surface of a hypersphere, which makes the ID loss easier to converge. Second, BNNeck reduces the constraint of the ID loss on $f_{t}$, and less constraint from the ID loss makes the triplet loss easier to converge at the same time. Third, normalization keeps the distribution of features that belong to the same person compact.

Because the hypersphere is almost symmetric about the origin of the coordinate axes, another trick of BNNeck is to remove the bias of the classifier FC layer. This constrains the classification hyperplanes to pass through the origin. We initialize the FC layer with the Kaiming initialization proposed in Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.

In the inference stage, we choose $f_{i}$ to perform the person ReID task, since the cosine distance metric achieves better performance than the Euclidean distance metric. The experimental results in Table 1 show that BNNeck improves the performance of the ReID model by a large margin.
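The data flow through BNNeck can be illustrated without a deep learning framework. The sketch below is a simplification under stated assumptions: batch normalization uses the statistics of the current batch only (no learnable scale or shift and no running averages for inference), and `bnneck_forward` is a hypothetical name. It returns $f_{t}$ for the triplet loss, $f_{i}$ for retrieval, and bias-free logits for the ID loss.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature dimension over the batch (batch statistics only)."""
    n, dim = len(batch), len(batch[0])
    out = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        column = [row[d] for row in batch]
        mean = sum(column) / n
        var = sum((v - mean) ** 2 for v in column) / n
        scale = 1.0 / math.sqrt(var + eps)
        for i in range(n):
            out[i][d] = (batch[i][d] - mean) * scale
    return out


def bnneck_forward(f_t, weight):
    """f_t: B x D features, weight: N x D classifier matrix without bias.

    Returns (f_t, f_i, logits): f_t feeds the triplet loss, the logits
    computed from f_i feed the ID loss, and f_i is used at inference.
    """
    f_i = batch_norm(f_t)
    logits = [[sum(w * x for w, x in zip(w_row, feat)) for w_row in weight]
              for feat in f_i]
    return f_t, f_i, logits
```

Because the classifier has no bias term, every decision boundary between two classes passes through the origin of the $f_{i}$ space, which matches the symmetry argument above.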

3.6. Center Loss

Triplet loss is computed as:

L_{Tri} = [d_{p} - d_{n} + α]_{+},

where $d_{p}$ and $d_{n}$ are the feature distances of the positive pair and the negative pair, $α$ is the margin of the triplet loss, and $[z]_{+}$ equals $\max(z, 0)$. In this paper, $α$ is set to $0.3$. However, triplet loss only considers the difference between $d_{p}$ and $d_{n}$ and ignores their absolute values. For instance, when $d_{p} = 0.3$ and $d_{n} = 0.5$, the triplet loss is $0.1$. When $d_{p} = 1.3$ and $d_{n} = 1.5$, the triplet loss is also $0.1$. Triplet loss is determined by two person IDs sampled randomly, so it is difficult to ensure that $d_{p} < d_{n}$ over the whole training dataset.
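The two numerical cases can be checked directly. A minimal sketch of the formula with the margin of 0.3 used in the paper (the function name is hypothetical):

```python
def triplet_loss(d_p, d_n, margin=0.3):
    """[d_p - d_n + margin]_+ for one (anchor, positive, negative) triplet."""
    return max(d_p - d_n + margin, 0.0)
```

Both `triplet_loss(0.3, 0.5)` and `triplet_loss(1.3, 1.5)` evaluate to approximately 0.1 (up to floating-point rounding), although the positive pair is much farther apart in the second case. This is the insensitivity to absolute distances that motivates adding center loss.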

Center loss simultaneously learns a center for the deep features of each class and penalizes the distances between the deep features and their corresponding class centers, which makes up for the drawbacks of triplet loss. The center loss function is formulated as:

L_{C} = \frac{1}{2} \sum_{j = 1}^{B} \left\| f_{t_{j}} - c_{y_{j}} \right\|_{2}^{2},

where $y_{j}$ is the label of the $j$th image in a mini-batch, $c_{y_{j}}$ denotes the $y_{j}$th class center of deep features, and $B$ is the batch size. The formulation effectively characterizes the intra-class variations, and minimizing center loss increases intra-class compactness. Our model includes three losses in total:

L = L_{ID} + L_{Tri} + β L_{C},

where $β$ is the balancing weight of center loss. In our experiments, $β$ is set to $0.0005$.
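A minimal sketch of the center loss and of the weighted sum of the three losses follows. The class centers are passed in as fixed values here, whereas in training they are learned together with the network; the function names and the dictionary layout of `centers` are assumptions made for illustration.

```python
def center_loss(features, labels, centers):
    """0.5 * sum over the batch of squared distances to the class centers."""
    total = 0.0
    for feat, label in zip(features, labels):
        center = centers[label]
        total += sum((f - c) ** 2 for f, c in zip(feat, center))
    return 0.5 * total


def total_loss(l_id, l_tri, l_center, beta=0.0005):
    """Combined objective: ID loss + triplet loss + beta * center loss."""
    return l_id + l_tri + beta * l_center
```

For example, features `[[1.0, 2.0], [0.0, 0.0]]` with labels `[0, 1]` and centers `{0: [0.0, 0.0], 1: [0.0, 1.0]}` give squared distances 5 and 1, so the center loss is 3.0.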

4. Experimental Results

In this section, we evaluate our models on the Market1501 and DukeMTMC-reID datasets. Rank-1 accuracy and mean Average Precision (mAP) are reported as evaluation metrics. We add the tricks to the standard baseline successively and do not change any other training settings. The ablation results show the performance boost from each trick. To avoid being misled by overfitting, we also show the results of cross-domain experiments.
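For readers unfamiliar with the two metrics, the sketch below computes them from ranked match lists. It is a simplified illustration: each query is represented by a list of booleans over the gallery sorted by increasing distance, and the camera-based filtering of gallery images used in the standard ReID evaluation protocols is omitted. The function names are hypothetical.

```python
def rank1_and_ap(ranked_matches):
    """Rank-1 hit (0.0 or 1.0) and average precision for one query.

    ranked_matches[r] is True when the gallery image at rank r + 1 has the
    same identity as the query.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("query has no matching gallery image")
    return float(ranked_matches[0]), precision_sum / hits


def evaluate(all_ranked_matches):
    """Rank-1 accuracy and mAP averaged over all queries."""
    results = [rank1_and_ap(matches) for matches in all_ranked_matches]
    n = len(results)
    return sum(r for r, _ in results) / n, sum(a for _, a in results) / n
```

A query ranked as `[True, False, True]` has a rank-1 hit and an average precision of $(1/1 + 2/3) / 2 = 5/6$.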

4.1. Influences of Each Trick (Same domain)

Table 1: Performance of different models on the Market1501 and DukeMTMC-reID datasets. Baseline-S stands for the standard baseline introduced in Section 2.

Model	Market1501 rank-1	Market1501 mAP	DukeMTMC-reID rank-1	DukeMTMC-reID mAP
Baseline-S	87.7	74.0	79.7	63.7
+warmup	88.7	75.2	80.6	65.1
+REA	91.3	79.3	81.5	68.3
+LS	91.4	80.3	82.4	69.3
+stride=1	92.0	81.7	82.6	70.6
+BNNeck	94.1	85.7	86.2	75.9
+center loss	94.5	85.9	86.4	76.4

The standard baseline introduced in Section 2 achieves 87.7% and 79.7% rank-1 accuracy on Market1501 and DukeMTMC-reID, respectively, which is similar to most baselines reported in other papers. We then add the warmup strategy, random erasing augmentation, label smoothing, the stride change, BNNeck and center loss to the training process, one by one. Our BNNeck brings a larger boost than the other tricks, especially on DukeMTMC-reID. In the end, these tricks bring the baseline to 94.5% rank-1 accuracy and 85.9% mAP on Market1501, and to 86.4% rank-1 accuracy and 76.4% mAP on DukeMTMC-reID. In other words, these training tricks boost the performance of the standard baseline by more than 10% mAP. In addition, to get this improvement we only add an extra BN layer and do not increase training time.

4.2. Analysis of BNNeck

Table 2: Ablation study of BNNeck. f (w/o BNNeck) is the baseline without BNNeck. BNNeck includes two features, f_t and f_i. We evaluate each of them with Euclidean distance and cosine distance.

Feature	Metric	Market1501 rank-1	Market1501 mAP	DukeMTMC-reID rank-1	DukeMTMC-reID mAP
f (w/o BNNeck)	Euclidean	92.0	81.7	82.6	70.6
f_t	Euclidean	94.2	85.5	85.7	74.4
f_t	Cosine	94.2	85.7	85.5	74.6
f_i	Euclidean	93.8	83.7	86.6	73.0
f_i	Cosine	94.1	85.7	86.2	75.9

In this section, we evaluate the performance of two different features ($f_{t}$ and $f_{i}$) with the Euclidean distance metric and the cosine distance metric. All models in Table 2 are trained without center loss. We observe that the cosine distance metric performs better than the Euclidean distance metric for $f_{i}$. Because the ID loss directly constrains the features that follow the BN layer, $f_{i}$ can be clearly separated by several hyperplanes. Cosine distance measures the angle between two feature vectors, so it is more suitable than Euclidean distance for $f_{i}$. However, $f_{t}$ is close to the triplet loss and is constrained by the ID loss at the same time, so the two metrics achieve similar performance for $f_{t}$.

Overall, BNNeck significantly improves the performance of ReID models. We choose $f_{i}$ with the cosine distance metric for retrieval in the inference stage.
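The difference between the two metrics can be made concrete with a small sketch. Cosine distance is taken here as one minus the cosine similarity, which is a common convention but an assumption with respect to the paper; the function names are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

For `a = [1.0, 0.0]` and `b = [3.0, 0.0]`, the Euclidean distance is 2 while the cosine distance is 0. Cosine distance depends only on the angle between the vectors, which is the quantity that hyperplanes through the origin separate.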

4.3. Influences of Each Trick (Cross domain)

Table 3: Performance of different models in cross-domain experiments. M→D means that we train the model on Market1501 and evaluate it on DukeMTMC-reID.

Model	M→D rank-1	M→D mAP	D→M rank-1	D→M mAP
Baseline	24.4	12.9	34.2	14.5
+warmup	26.3	14.1	39.7	17.4
+REA	21.5	10.2	32.5	13.5
+LS	23.2	11.3	36.5	14.9
+stride=1	23.1	11.8	37.1	15.4
+BNNeck	26.7	15.2	47.7	21.6
+center loss	27.5	15.0	47.4	21.4
-REA	41.4	25.7	54.3	25.5

To further explore effectiveness, we also present the results of cross-domain experiments in Table 3. Overall, three tricks (the warmup strategy, label smoothing and BNNeck) significantly boost the cross-domain performance of ReID models. The stride change and center loss seem to have no large impact on performance. However, REA harms models in the cross-domain ReID task. In particular, when our modified baseline is trained without REA, it achieves 41.4% and 54.3% rank-1 accuracy on M→D and D→M, respectively, surpassing the standard baseline by a large margin. We infer that masking regions of the training images with REA lets the model learn more knowledge in the training domain, which causes it to perform worse in the testing domain.

4.4. Comparison with State-of-the-Art Methods

Table 4: Comparison with state-of-the-art methods. N_f is the number of features used in the inference stage. RK stands for the k-reciprocal re-ranking method.

Type	Method	N_f	Market1501 rank-1	Market1501 mAP	DukeMTMC-reID rank-1	DukeMTMC-reID mAP
Pose-guided	GLAD	4	89.9	73.9	-	-
Pose-guided	PIE	3	87.7	69.0	79.8	62.0
Pose-guided	PSE	3	78.7	56.0	-	-
Mask-guided	SPReID	5	92.5	81.3	84.4	71.0
Mask-guided	MaskReID	3	90.0	75.3	78.8	61.9
Stripe-based	AlignedReID	1	90.6	77.7	81.2	67.4
Stripe-based	SCPNet	1	91.2	75.2	80.3	62.6
Stripe-based	PCB	6	93.8	81.6	83.3	69.2
Stripe-based	Pyramid	1	92.8	82.1	-	-
Stripe-based	Pyramid	21	95.7	88.2	89.0	79.0
Stripe-based	BFE	2	94.5	85.0	88.7	75.8
Attention-based	Mancs	1	93.1	82.3	84.9	71.8
Attention-based	DuATM	1	91.4	76.6	81.2	62.3
Attention-based	HA-CNN	4	91.2	75.7	80.5	63.8
GAN-based	Camstyle	1	88.1	68.7	75.3	53.5
GAN-based	PN-GAN	9	89.4	72.6	73.6	53.2
Global feature	IDE	1	79.5	59.9	-	-
Global feature	SVDNet	1	82.3	62.1	76.7	56.8
Global feature	TriNet	1	84.9	69.1	-	-
Global feature	AWTL	1	89.5	75.7	79.8	63.4
Global feature	Ours	1	94.5	85.9	86.4	76.4
Global feature	Ours(RK)	1	95.4	94.2	90.3	89.1

We compare our strong baseline with state-of-the-art methods in Table 4. All methods are divided into different types. Pyramid achieves surprising performance on both datasets. However, it concatenates 21 local features of different scales. If only the global feature is used, it obtains 92.8% rank-1 accuracy and 82.1% mAP on Market1501, whereas our strong baseline reaches 94.5% rank-1 accuracy and 85.9% mAP on Market1501. BFE obtains performance similar to our strong baseline, but it combines the features of two branches. Among the methods that use only global features, our strong baseline beats AWTL by more than 10% mAP on both Market1501 and DukeMTMC-reID. With the $k$-reciprocal re-ranking method to boost performance, our method reaches 94.1% mAP and 89.1% mAP on Market1501 and DukeMTMC-reID, respectively. To the best of our knowledge, our baseline achieves the best performance among methods that use only global features.

5. Supplementary Experiments

We observed that some previous works used different batch sizes or image sizes. In this section, as a supplement, we explore their effects on model performance.

5.1. Influences of Batch Size

Table 5: Performance of ReID models with different batch sizes.

Batch size (P × K)	Market1501 rank-1	Market1501 mAP	DukeMTMC-reID rank-1	DukeMTMC-reID mAP
8×3	92.6	79.2	84.4	68.1
8×4	92.9	80.0	84.7	69.4
8×6	93.5	81.6	85.1	70.7
8×8	93.9	82.0	85.8	71.5
16×3	93.8	83.1	86.8	72.1
16×4	93.8	83.7	86.6	73.0
16×6	94.0	82.8	85.1	69.9
16×8	93.1	81.6	86.7	72.1
32×3	94.5	84.1	86.0	71.4
32×4	93.2	82.8	86.5	73.1

The mini-batch for triplet loss includes $B = P \times K$ images, where $P$ and $K$ denote the number of different persons and the number of different images per person, respectively. One GPU can hold a mini-batch of at most 128 images, so we cannot run the experiments with $P = 32, K = 6$ or $P = 32, K = 8$. We removed center loss to isolate the relation between triplet loss and batch size. The results are presented in Table 5. However, they do not support a specific conclusion about the effect of $B$ on performance. A slight trend we observed is that a larger batch size benefits model performance. We infer that a large $K$ helps to mine hard positive pairs while a large $P$ helps to mine hard negative pairs.

triplet loss 的 mini-batch 包含 $B = P \times K$ 张图像。 $P$ 和 $K$ 分别表示不同行人的数量以及每个行人的不同图像数量。一个 mini-batch 在一张 GPU 中最多只能包含 128 张图像，因此作者无法进行 $P = 32, K = 6$ 或 $P = 32, K = 8$ 的实验。作者移除了 center loss，以清楚找到 triplet loss 与 batch size 之间的关系。结果展示在表5 中。然而，没有具体结论显示 $B$ 对性能的影响。作者观察到的一个轻微趋势是，更大的 batch size 有利于模型性能。作者推断，较大的 $K$ 有助于挖掘 hard positive pairs，而较大的 $P$ 有助于挖掘 hard negative pairs。
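The $P \times K$ batch construction described above can be sketched as follows. This is a minimal illustration, not the authors' sampler: the function names `sample_pk_batch` and `candidates_per_anchor` are hypothetical, the toy labels are made up, and identities with fewer than $K$ images are simply skipped, which is a simplifying assumption. The second helper counts how many positive and negative candidates each anchor sees inside one batch, which is the arithmetic behind the inference that $K$ controls hard positive mining and $P$ controls hard negative mining.

```python
import random
from collections import defaultdict


def sample_pk_batch(labels, P, K, rng=None):
    """Return dataset indices for one mini-batch of P identities, K images each."""
    rng = rng or random.Random()
    by_id = defaultdict(list)
    for idx, pid in enumerate(labels):
        by_id[pid].append(idx)
    # Assumption: only identities with at least K images are eligible.
    eligible = [pid for pid, idxs in by_id.items() if len(idxs) >= K]
    if len(eligible) < P:
        raise ValueError("not enough identities with at least K images")
    batch = []
    for pid in rng.sample(eligible, P):
        batch.extend(rng.sample(by_id[pid], K))
    return batch


def candidates_per_anchor(P, K):
    """Positive and negative candidates available to each anchor in a P x K batch."""
    return K - 1, (P - 1) * K


# Toy dataset: 20 identities with 5 images each (hypothetical labels).
labels = [pid for pid in range(20) for _ in range(5)]
batch = sample_pk_batch(labels, P=16, K=4, rng=random.Random(0))
print(len(batch))                    # B = P * K = 64
print(candidates_per_anchor(16, 4))  # (3, 60)
```

With $P = 16, K = 4$, each anchor has 3 candidate positives and 60 candidate negatives; raising $K$ enlarges the first pool and raising $P$ enlarges the second.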

5.2. Influences of Image Size

Table 6: Performance of ReID models with different image sizes.

Image Size	Market1501 r = 1	Market1501 mAP	DukeMTMC r = 1	DukeMTMC mAP
256×128	93.8	83.7	86.6	73.0
224×224	94.2	83.3	86.1	72.2
384×128	94.0	82.7	86.4	73.2
384×192	93.8	83.1	87.1	72.9

We trained models without center loss and set $P = 16, K = 4$. As shown in Table 6, the four models achieve similar performance on both datasets. In our opinion, image size is not a very important factor for the performance of ReID models.

作者在没有 center loss 的情况下训练模型，并设置 $P = 16, K = 4$ 。如表6 所示，四个模型在两个数据集上取得相似性能。作者认为，图像尺寸不是影响 ReID 模型性能的非常重要因素。
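To make "similar performance" concrete, the short sketch below computes the spread (maximum minus minimum) of each metric across the four image sizes. The numbers are copied from Table 6; the variable and function names are illustrative only.

```python
# Rows copied from Table 6: rank-1 and mAP on Market1501, then on DukeMTMC.
table6 = {
    "256x128": (93.8, 83.7, 86.6, 73.0),
    "224x224": (94.2, 83.3, 86.1, 72.2),
    "384x128": (94.0, 82.7, 86.4, 73.2),
    "384x192": (93.8, 83.1, 87.1, 72.9),
}
columns = ["Market1501 r=1", "Market1501 mAP", "DukeMTMC r=1", "DukeMTMC mAP"]


def column_spread(table):
    """Return max - min of each metric across all image sizes."""
    spreads = {}
    for j, name in enumerate(columns):
        values = [row[j] for row in table.values()]
        spreads[name] = round(max(values) - min(values), 1)
    return spreads


spreads = column_spread(table6)
for name, spread in spreads.items():
    print(f"{name}: spread = {spread}")
```

Every metric varies by at most 1.0 point across the four image sizes, and no single size is best on all four columns.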

6. Conclusions and Outlooks

In this paper, we collect some effective training tricks and design a strong baseline for person ReID. To demonstrate the influence of each trick on the performance of ReID models, we run extensive experiments on both same-domain and cross-domain ReID tasks. Using only global features, our strong baseline achieves 94.5% rank-1 accuracy and 85.9% mAP on Market1501. We hope that this work can promote ReID research in academia and industry.

在本文中，作者收集了一些有效训练技巧，并为行人 ReID 设计了一个强基线。为了展示每个技巧对 ReID 模型性能的影响，作者在同域和跨域 ReID 任务上做了大量实验。最终，仅使用全局特征，作者的强基线就在 Market1501 上达到 94.5% 的 rank-1 准确率和 85.9% 的 mAP。作者希望这项工作能够推动学术界和工业界的 ReID 研究。

However, the purpose of our work is not to improve performance crudely. Compared with face recognition, person ReID still has a long way to go. We think some training tricks can speed up the exploration, and many effective tricks remain undiscovered. We welcome researchers to share other effective tricks with us, and we will evaluate them based on this work.

然而，作者工作的目的并不是粗暴提升性能。与人脸识别相比，行人 ReID 仍有很长的路要探索。作者认为，一些训练技巧可以加快这种探索，并且仍有许多有效技巧尚未被发现。作者欢迎研究者与作者分享其他有效技巧。作者将基于这项工作对它们进行评估。

In the future, we will continue to design more experiments to analyze the principles behind these tricks. For example, when we replace the BNNeck with L2 normalization, how does the performance of the network change? In addition, can some state-of-the-art methods, such as PCB, MGN and AlignedReID, be built on top of our strong baseline? More visualization would also help others understand this work.

未来，作者将继续设计更多实验来分析这些技巧的原理。例如，当作者用 L2 归一化替换 BNNeck 时，这个网络的性能会变成什么样？此外，PCB、MGN 和 AlignedReID 等一些最先进方法是否可以扩展到作者的强基线上？更多可视化也有助于他人理解这项工作。
