Mask R-CNN

Abstract

We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.


1. Introduction

Figure 1: The Mask R-CNN framework for instance segmentation.

Figure 2: Mask R-CNN results on the COCO test set. These results are based on ResNet-101, achieving a mask AP of 35.7 and running at 5 fps. Masks are shown in color, and bounding box, category, and confidences are also shown.

The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN and Fully Convolutional Network (FCN) frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.


Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection. Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.


Our method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.


In principle, Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification and, based on our experiments, works poorly for instance segmentation.


Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task, including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.


Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.


Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.


We have released code to facilitate future research.


2. Related Work

R-CNN: The Region-based CNN (R-CNN) approach to bounding-box object detection is to attend to a manageable number of candidate object regions and evaluate convolutional networks independently on each RoI. R-CNN was extended to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements, and is the current leading framework in several benchmarks.


Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods resorted to bottom-up segments. DeepMask and following works learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.


Most recently, Li et al. combined the segment proposal system and object detection system for "fully convolutional instance segmentation" (FCIS). The common idea is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances.


Another family of solutions to instance segmentation is driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future.


3. Mask R-CNN

Figure 3: RoIAlign: The dashed grid represents a feature map, the solid lines an RoI (with 2×2 bins in this example), and the dots the 4 sampling points in each bin. RoIAlign computes the value of each sampling point by bilinear interpolation from the nearby grid points on the feature map. No quantization is performed on any coordinates involved in the RoI, its bins, or the sampling points.

Figure 4: Head Architecture: We extend two existing Faster R-CNN heads. Left/Right panels show the heads for the ResNet C4 and FPN backbones, respectively, to which a mask branch is added. Numbers denote spatial resolution and channels. Arrows denote either conv, deconv, or fc layers as can be inferred from context.

Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.


Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to Huang et al. for latest, comprehensive comparisons between Faster R-CNN and other frameworks.


Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions. Our approach follows the spirit of Fast R-CNN that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN).


Formally, during training, we define a multi-task loss on each sampled RoI as $L = L_{cls} + L_{box} + L_{mask}$. The classification loss $L_{cls}$ and bounding-box loss $L_{box}$ are identical to those defined in Fast R-CNN. The mask branch has a $Km^2$-dimensional output for each RoI, which encodes $K$ binary masks of resolution $m \times m$, one for each of the $K$ classes. To this we apply a per-pixel sigmoid, and define $L_{mask}$ as the average binary cross-entropy loss. For an RoI associated with ground-truth class $k$, $L_{mask}$ is only defined on the $k$-th mask (other mask outputs do not contribute to the loss).

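The mask loss described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the nested-list layout of the mask logits and the function name `mask_loss` are hypothetical, and a real implementation would operate on batched tensors.

```python
import math

def mask_loss(logits, target, k):
    """Average binary cross-entropy on the k-th mask only.

    logits: K masks, each an m x m nested list of raw scores (assumed layout).
    target: m x m nested list of 0/1 ground-truth values for class k.
    """
    total, count = 0.0, 0
    for row_logits, row_target in zip(logits[k], target):
        for z, y in zip(row_logits, row_target):
            p = 1.0 / (1.0 + math.exp(-z))  # per-pixel sigmoid
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count

# K = 2 classes, m = 2. Only the mask for the ground-truth class k = 1 counts.
logits = [
    [[9.0, 9.0], [9.0, 9.0]],  # class 0: ignored by the loss
    [[0.0, 0.0], [0.0, 0.0]],  # class 1: sigmoid(0) = 0.5 at every pixel
]
target = [[1, 0], [0, 1]]
loss = mask_loss(logits, target, k=1)  # equals log(2), since every p is 0.5
```

Changing the class-0 logits leaves `loss` unchanged, which is the "other mask outputs do not contribute" property.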

Our definition of $L_{mask}$ allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.

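To make the "competition among classes" point concrete, the following sketch compares a per-pixel softmax with per-pixel sigmoids at a single pixel. The scores are made-up values chosen only for illustration.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(score):
    return 1.0 / (1.0 + math.exp(-score))

# Scores for K = 3 classes at one pixel (made-up values).
scores = [2.0, 2.0, -1.0]
coupled = softmax(scores)                   # probabilities sum to 1
independent = [sigmoid(s) for s in scores]  # each class is judged on its own

# Raise the score of class 1 only and look at what happens to class 0.
boosted = [2.0, 5.0, -1.0]
softmax_drop = coupled[0] - softmax(boosted)[0]      # positive: class 0 loses mass
sigmoid_drop = independent[0] - sigmoid(boosted[0])  # zero: class 0 is unaffected
```

With softmax, a stronger response for one class necessarily suppresses the others; with independent sigmoids it does not.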

Mask Representation: A mask encodes an input object's spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.


Specifically, we predict an m×m mask from each RoI using an FCN. This allows each layer in the mask branch to maintain the explicit m×m object spatial layout without collapsing it into a vector representation that lacks spatial dimensions. Unlike previous methods that resort to fc layers for mask prediction, our fully convolutional representation requires fewer parameters, and is more accurate as demonstrated by experiments.

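A rough weight count illustrates why the fully convolutional head needs fewer parameters. The layer widths come from Table 2e; the kernel sizes (3×3 for the 256→256 convs, 1×1 for the output conv) are assumptions made for this sketch, biases are ignored, and only the final fc layer of the MLP head is counted.

```python
K, m = 80, 28  # number of classes and mask resolution, as listed in Table 2e

# MLP head: weights of the last fc layer alone (1024 -> 80 * 28^2).
mlp_last_layer = 1024 * (K * m * m)

# FCN head: four 256 -> 256 convs and one 256 -> 80 output conv.
# Kernel sizes are assumptions for this sketch, not values from the text.
hidden_convs = 4 * (3 * 3 * 256 * 256)
output_conv = 1 * 1 * 256 * K
fcn_total = hidden_convs + output_conv

ratio = mlp_last_layer / fcn_total
```

Under these assumptions the single fc output layer holds about 64 million weights, while the whole convolutional branch holds about 2.4 million.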

This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.


RoIAlign: RoIPool is a standard operation for extracting a small feature map (e.g., 7×7) from each RoI. RoIPool first quantizes a floating-number RoI to the discrete granularity of the feature map; this quantized RoI is then subdivided into spatial bins, which are themselves quantized; finally, the feature values covered by each bin are aggregated (usually by max pooling). Quantization is performed, e.g., on a continuous coordinate x by computing [x/16], where 16 is a feature map stride and [·] is rounding; likewise, quantization is performed when dividing into bins (e.g., 7×7). These quantizations introduce misalignments between the RoI and the extracted features. While this may not impact classification, which is robust to small translations, it has a large negative effect on predicting pixel-accurate masks.

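The size of the misalignment is easy to work out for a single coordinate. The sketch below uses the stride of 16 from the text; the RoI boundary value is made up for illustration.

```python
STRIDE = 16  # feature map stride used in the text

def quantized(x, stride=STRIDE):
    """RoIPool-style coordinate: [x / stride], with rounding."""
    return round(x / stride)

def continuous(x, stride=STRIDE):
    """Unquantized coordinate: x / stride."""
    return x / stride

x = 215.0  # made-up RoI boundary, in image pixels
q = quantized(x)    # 215 / 16 = 13.4375 is rounded to 13
c = continuous(x)   # 13.4375
shift_in_pixels = abs(c - q) * STRIDE  # 0.4375 feature cells = 7 image pixels
```

A shift of a few image pixels rarely changes a class label, but it is large relative to the boundary of a small object's mask.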

To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use x/16 instead of [x/16]). We use bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average); see Figure 3 for details. We note that the results are not sensitive to the exact sampling locations, or how many points are sampled, as long as no quantization is performed.

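A minimal single-channel sketch of the RoIAlign idea follows, written in plain Python for readability. It is not the authors' implementation: the function names, the nested-list feature map, the convention that feature values sit at integer coordinates, and the clamping at the border are all assumptions of this sketch.

```python
import math

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map (nested lists) at (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the map (sketch assumption)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(feat, roi, out_size=2, samples=2, stride=16):
    """roi = (x1, y1, x2, y2) in image pixels; returns out_size x out_size averages."""
    x1, y1, x2, y2 = (v / stride for v in roi)  # x / 16, never rounded
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # samples x samples regularly spaced points inside the bin
                    y = y1 + (by + (sy + 0.5) / samples) * bin_h
                    x = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            row.append(sum(vals) / len(vals))  # average aggregation
        out.append(row)
    return out

# Feature map whose value equals its column index, so outputs are easy to check.
feat = [[float(col) for col in range(8)] for _ in range(8)]
pooled = roi_align(feat, roi=(20.0, 20.0, 84.0, 84.0))
```

Because the test feature map is linear in the column coordinate, each output equals the mean column position of its four sample points, with no rounding anywhere in the pipeline.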

RoIAlign leads to large improvements as we show in Section 4.2. We also compare to the RoIWarp operation proposed in MNC. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in MNC as quantizing RoI just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by spatial transformer networks, it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.


Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI.


We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet and ResNeXt networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in previous work.


We also explore another more effective backbone recently proposed by Lin et al., called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to Lin et al.


For the network head we closely follow architectures presented in previous work to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet and FPN papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer res5), which is compute-intensive. For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters.


We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.


3.1 Implementation Details

Figure 5: More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).

Table 1: Instance segmentation mask AP on COCO test-dev. MNC and FCIS are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM. All entries are single-model results.

| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| MNC | ResNet-101-C4 | 24.6 | 44.3 | 24.8 | 4.7 | 25.9 | 43.6 |
| FCIS +OHEM | ResNet-101-C5-dilated | 29.2 | 49.5 | - | 7.1 | 31.3 | 50.0 |
| FCIS+++ +OHEM | ResNet-101-C5-dilated | 33.6 | 54.5 | - | - | - | - |
| Mask R-CNN | ResNet-101-C4 | 33.1 | 54.9 | 34.8 | 12.1 | 35.6 | 51.1 |
| Mask R-CNN | ResNet-101-FPN | 35.7 | 58.0 | 37.8 | 15.5 | 38.1 | 52.4 |
| Mask R-CNN | ResNeXt-101-FPN | 37.1 | 60.0 | 39.4 | 16.9 | 39.9 | 53.5 |

We set hyper-parameters following existing Fast/Faster R-CNN work. Although these decisions were made for object detection in the original papers, we found our instance segmentation system is robust to them.


Training: As in Fast R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. The mask loss $L_{mask}$ is defined only on positive RoIs. The mask target is the intersection between an RoI and its associated ground-truth mask.

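The positive/negative rule can be sketched as follows. The box format `(x1, y1, x2, y2)` and the helper names are assumptions for illustration; the 0.5 threshold is the one stated in the text.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def is_positive(roi, gt_boxes, threshold=0.5):
    """An RoI is positive if it overlaps some ground-truth box with IoU >= threshold."""
    return any(box_iou(roi, gt) >= threshold for gt in gt_boxes)

gt_boxes = [(0.0, 0.0, 10.0, 10.0)]
rois = [
    (0.0, 0.0, 10.0, 8.0),    # IoU = 80 / 100 = 0.8   -> positive
    (5.0, 5.0, 15.0, 15.0),   # IoU = 25 / 175 ~ 0.14  -> negative
]
labels = [is_positive(roi, gt_boxes) for roi in rois]
```

Only the RoIs labeled positive would contribute to the mask loss.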

We adopt image-centric training. Images are resized such that their scale (shorter edge) is 800 pixels. Each mini-batch has 2 images per GPU and each image has N sampled RoIs, with a 1:3 ratio of positives to negatives. N is 64 for the C4 backbone and 512 for FPN. We train on 8 GPUs (so the effective mini-batch size is 16) for 160k iterations, with a learning rate of 0.02 which is decreased by a factor of 10 at iteration 120k. We use a weight decay of 0.0001 and momentum of 0.9. With ResNeXt, we train with 1 image per GPU and the same number of iterations, with a starting learning rate of 0.01.

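The schedule and sampling numbers above can be written down directly. The function names are hypothetical, and treating iteration 120,000 itself as the first iteration at the lower rate is an assumption of this sketch.

```python
def learning_rate(iteration, base_lr=0.02, drop_at=120_000, factor=10.0):
    """Step schedule: base_lr before drop_at, base_lr / factor from then on."""
    return base_lr if iteration < drop_at else base_lr / factor

def rois_per_image(n, pos_to_neg=(1, 3)):
    """Split N sampled RoIs into (positives, negatives) for a 1:3 ratio."""
    pos, neg = pos_to_neg
    positives = n * pos // (pos + neg)
    return positives, n - positives

effective_batch = 2 * 8          # images per GPU x number of GPUs
c4_split = rois_per_image(64)    # C4 backbone: N = 64
fpn_split = rois_per_image(512)  # FPN backbone: N = 512
```

For example, `fpn_split` gives 128 positives and 384 negatives per image when enough positive RoIs are available.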

The RPN anchors span 5 scales and 3 aspect ratios, following FPN. For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified. For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable.


Inference: At test time, the proposal number is 300 for the C4 backbone and 1000 for FPN. We run the box prediction branch on these proposals, followed by non-maximum suppression. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). The mask branch can predict K masks per RoI, but we only use the k-th mask, where k is the predicted class by the classification branch. The m×m floating-number mask output is then resized to the RoI size, and binarized at a threshold of 0.5.

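The last inference steps (select the k-th mask, resize to the RoI, binarize at 0.5) can be sketched as below. Nearest-neighbor resizing is a simplification chosen for this sketch, since the text does not state the interpolation used, and the function name `paste_mask` is hypothetical.

```python
def paste_mask(masks, k, roi_h, roi_w, threshold=0.5):
    """Pick the k-th m x m mask, resize it to the RoI size, and binarize.

    Nearest-neighbor resizing is an assumption made for this sketch.
    """
    mask = masks[k]
    m = len(mask)
    out = []
    for y in range(roi_h):
        src_y = min(int(y * m / roi_h), m - 1)
        row = []
        for x in range(roi_w):
            src_x = min(int(x * m / roi_w), m - 1)
            row.append(1 if mask[src_y][src_x] >= threshold else 0)
        out.append(row)
    return out

# K = 2 masks of resolution m = 2 holding post-sigmoid probabilities.
masks = [
    [[0.9, 0.9], [0.9, 0.9]],  # class 0: not the predicted class, ignored
    [[0.8, 0.2], [0.4, 0.6]],  # class 1: predicted by the classification branch
]
binary = paste_mask(masks, k=1, roi_h=4, roi_w=4)
```

Here `binary` is a 4×4 mask in which the top-left and bottom-right quadrants are foreground.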

Note that since we only compute masks on the top 100 detection boxes, Mask R-CNN adds a small overhead to its Faster R-CNN counterpart (e.g., ~20% on typical models).


4. Experiments: Instance Segmentation

Figure 6: FCIS+++ (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.

Table 2: Ablations. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.

(a) Backbone Architecture. Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.

| net-depth-features | AP | AP50 | AP75 |
|---|---|---|---|
| ResNet-50-C4 | 30.3 | 51.2 | 31.5 |
| ResNet-101-C4 | 32.7 | 54.2 | 34.3 |
| ResNet-50-FPN | 33.6 | 55.2 | 35.3 |
| ResNet-101-FPN | 35.4 | 57.3 | 37.5 |
| ResNeXt-101-FPN | 36.7 | 59.5 | 38.9 |

(b) Multinomial vs Independent Masks. Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).

| | AP | AP50 | AP75 |
|---|---|---|---|
| softmax | 24.8 | 44.1 | 25.1 |
| sigmoid | 30.3 | 51.2 | 31.5 |
| | +5.5 | +7.1 | +6.4 |

(c) RoIAlign (ResNet-50-C4). Mask results with various RoI layers. Our RoIAlign layer improves AP by ~3 points and AP75 by ~5 points.

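| Layer | align? | bilinear? | agg. | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| RoIPool | no | no | max | 26.9 | 48.8 | 26.4 |
| RoIWarp | no | yes | max | 27.2 | 49.2 | 27.1 |
| RoIWarp | no | yes | ave | 27.1 | 48.9 | 27.1 |
| RoIAlign | yes | yes | max | 30.2 | 51.0 | 31.8 |
| RoIAlign | yes | yes | ave | 30.3 | 51.2 | 31.5 |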

(d) RoIAlign (ResNet-50-C5, stride 32). Mask-level and box-level AP using large-stride features.

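| Layer | AP | AP50 | AP75 | APbb | APbb50 | APbb75 |
|---|---|---|---|---|---|---|
| RoIPool | 23.6 | 46.5 | 21.6 | 28.2 | 52.7 | 26.9 |
| RoIAlign | 30.9 | 51.8 | 32.1 | 34.0 | 55.3 | 36.4 |
| | +7.3 | +5.3 | +10.5 | +5.8 | +2.6 | +9.5 |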

(e) Mask Branch (ResNet-50-FPN). Fully convolutional networks (FCN) vs multi-layer perceptrons (MLP, fully-connected) for mask prediction.

| Type | mask branch | AP | AP50 | AP75 |
|---|---|---|---|---|
| MLP | fc: 1024→1024→80·28² | 31.5 | 53.7 | 32.8 |
| MLP | fc: 1024→1024→1024→80·28² | 31.5 | 54.0 | 32.6 |
| FCN | conv: 256→256→256→256→256→80 | 33.6 | 55.2 | 35.3 |

We perform a thorough comparison of Mask R-CNN to the state of the art along with comprehensive ablations on the COCO dataset. We report the standard COCO metrics including AP (averaged over IoU thresholds), AP50, AP75, and APS, APM, APL (AP at different scales). Unless noted, AP is evaluated using mask IoU. As in previous work, we train using the union of 80k train images and a 35k subset of val images (trainval35k), and report ablations on the remaining 5k val images (minival). We also report results on test-dev.

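Mask IoU, the overlap measure behind these metrics, can be sketched for two binary masks as follows. The threshold list reflects the standard COCO convention of averaging over IoU thresholds from 0.50 to 0.95 in steps of 0.05; the tiny masks are made up, and the full AP computation (ranking detections, precision-recall integration) is deliberately left out.

```python
def mask_iou(pred, gt):
    """IoU between two binary masks given as nested lists of 0/1."""
    inter = union = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 0.0

# COCO-style IoU thresholds: 0.50, 0.55, ..., 0.95.
thresholds = [t / 100 for t in range(50, 100, 5)]

pred = [[1, 1], [0, 0]]
gt = [[1, 1], [1, 0]]
iou = mask_iou(pred, gt)  # 2 shared pixels / 3 pixels in the union
passed = sum(1 for t in thresholds if iou >= t)
```

This prediction would count as a match for AP50 but not for AP75, which is why stricter thresholds reward better-aligned masks.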

4.1 Main Results

We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of our model outperform baseline variants of previous state-of-the-art models. This includes MNC and FCIS, the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN with ResNet-101-FPN backbone outperforms FCIS+++, which includes multi-scale train/test, horizontal flip test, and online hard example mining (OHEM). While outside the scope of this work, we expect many such improvements to be applicable to ours.


Mask R-CNN outputs are visualized in Figures 2 and 5. Mask R-CNN achieves good results even under challenging conditions. In Figure 6 we compare our Mask R-CNN baseline and FCIS+++. FCIS+++ exhibits systematic artifacts on overlapping instances, suggesting that it is challenged by the fundamental difficulty of instance segmentation. Mask R-CNN shows no such artifacts.


4.2 Ablation Experiments

We run a number of ablations to analyze Mask R-CNN. Results are shown in Table 2 and discussed in detail next.


Architecture: Table 2a shows Mask R-CNN with various backbones. It benefits from deeper networks (50 vs. 101) and advanced designs including FPN and ResNeXt. We note that not all frameworks automatically benefit from deeper or advanced networks (see benchmarking in Huang et al.).


Multinomial vs Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN). This alternative couples the tasks of mask and class prediction, and results in a severe loss in mask AP (5.5 points). This suggests that once the instance has been classified as a whole (by the box branch), it is sufficient to predict a binary mask without concern for the categories, which makes the model easier to train.


Class-Specific vs Class-Agnostic Masks: Our default instantiation predicts class-specific masks, i.e., one m×m mask per class. Interestingly, Mask R-CNN with class-agnostic masks (i.e., predicting a single m×m output regardless of class) is nearly as effective: it has 29.7 mask AP vs. 30.3 for the class-specific counterpart on ResNet-50-C4. This further highlights the division of labor in our approach which largely decouples classification and segmentation.

类别特定掩码与类别无关掩码: 我们的默认实例化版本预测类别特定掩码,即每个类别一个 m×m 掩码。 有趣的是,使用类别无关掩码的 Mask R-CNN(即不考虑类别,只预测单个 m×m 输出)几乎同样有效:在 ResNet-50-C4 上它取得 29.7 的掩码 AP,而类别特定版本为 30.3。 这进一步凸显了我们方法中的分工,它在很大程度上解耦了分类和分割。

表3:Object detection single-model results (bounding box AP), vs state-of-the-art on test-dev. Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models. The gains of Mask R-CNN over Faster R-CNN w FPN come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).
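| Method | Backbone | APbb | APbb50 | APbb75 | APbbS | APbbM | APbbL |
|---|---|---|---|---|---|---|---|
| Faster R-CNN+++ | ResNet-101-C4 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9 |
| Faster R-CNN w FPN | ResNet-101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Faster R-CNN by G-RMI | Inception-ResNet-v2 | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0 |
| Faster R-CNN w TDM | Inception-ResNet-v2-TDM | 36.8 | 57.7 | 39.2 | 16.2 | 39.8 | 52.1 |
| Faster R-CNN, RoIAlign | ResNet-101-FPN | 37.3 | 59.6 | 40.3 | 19.8 | 40.2 | 48.8 |
| Mask R-CNN | ResNet-101-FPN | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 |
| Mask R-CNN | ResNeXt-101-FPN | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 |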

RoIAlign: An evaluation of our proposed RoIAlign layer is shown in Table 2. For this experiment we use the ResNet-50-C4 backbone, which has stride 16. RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to the choice of max or average pooling; we use average pooling in the rest of the paper.

RoIAlign: 我们提出的 RoIAlign 层的评估结果见 表2。 该实验使用步幅为 16 的 ResNet-50-C4 主干。 与 RoIPool 相比,RoIAlign 将 AP 提升约 3 个点,其中很大部分收益来自高 IoU(AP75)。 RoIAlign 对最大池化/平均池化不敏感;我们在论文其余部分使用平均池化。
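To make the alignment issue concrete, the sketch below contrasts a quantized bin with a bilinearly sampled bin on a toy feature map. It is an illustration under stated assumptions, not the paper's code: the helper names, the 2×2 sampling grid, the convention that the value at index i sits at continuous coordinate i, and the use of averaging in the quantized variant (the original RoIPool uses max) are all choices made here.

```python
import math


def bilinear(feat, y, x):
    """Bilinearly interpolate a 2D feature map (list of rows) at real-valued (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_bin(feat, y_start, x_start, bin_h, bin_w, samples=2):
    """Average samples x samples regularly spaced bilinear samples inside one bin."""
    total = 0.0
    for iy in range(samples):
        for ix in range(samples):
            y = y_start + (iy + 0.5) * bin_h / samples
            x = x_start + (ix + 0.5) * bin_w / samples
            total += bilinear(feat, y, x)
    return total / (samples * samples)


def quantized_bin(feat, y_start, x_start, bin_h, bin_w):
    """RoIPool-style bin: snap the bin boundaries to integer cells, then average."""
    y0, y1 = int(math.floor(y_start)), int(math.ceil(y_start + bin_h))
    x0, x1 = int(math.floor(x_start)), int(math.ceil(x_start + bin_w))
    cells = [feat[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(cells) / len(cells)


# Toy 8x8 feature map whose value equals its x coordinate (a linear ramp).
ramp = [[float(x) for x in range(8)] for _ in range(8)]

aligned = roi_align_bin(ramp, 1.3, 2.6, 1.5, 1.5)
quantized = quantized_bin(ramp, 1.3, 2.6, 1.5, 1.5)
print(aligned, quantized)
```

On the ramp, the aligned bin recovers the true bin center in x (2.6 + 1.5 / 2 = 3.35), while the quantized bin returns 3.0. At stride 16, a 0.35-cell offset corresponds to about 5.6 input pixels.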

Additionally, we compare with RoIWarp, proposed in MNC, which also adopts bilinear sampling. As discussed in Section 3, RoIWarp still quantizes the RoI, losing alignment with the input. As can be seen in Table 2, RoIWarp performs on par with RoIPool and much worse than RoIAlign. This highlights that proper alignment is key.

此外,我们还与 MNC 中提出、同样采用双线性采样的 RoIWarp 进行比较。 如第 3 节所讨论,RoIWarp 仍然会量化 RoI,从而丢失与输入的对齐。 如 表2 所示,RoIWarp 的表现与 RoIPool 相当,并且远差于 RoIAlign。 这凸显了正确对齐是关键。

We also evaluate RoIAlign with a ResNet-50-C5 backbone, which has an even larger stride of 32 pixels. We use the same head as in Figure 4 (right), as the res5 head is not applicable. Table 2 shows that RoIAlign improves mask AP by a massive 7.3 points, and mask AP75 by 10.5 points (50% relative improvement). Moreover, we note that with RoIAlign, using stride-32 C5 features (30.9 AP) is more accurate than using stride-16 C4 features (30.3 AP, Table 2). RoIAlign largely resolves the long-standing challenge of using large-stride features for detection and segmentation.

我们还在 ResNet-50-C5 主干上评估 RoIAlign,该主干具有更大的 32 像素步幅。 由于 res5 头部并不适用,我们使用与 图4(右)相同的头部。 表2 显示,RoIAlign 将掩码 AP 大幅提升 7.3 个点,并将掩码 AP75 提升 10.5 个点(50% 相对提升)。 此外,我们指出,在使用 RoIAlign 时,采用步幅 32 的 C5 特征(30.9 AP)比采用步幅 16 的 C4 特征(30.3 AP,表2)更准确。 RoIAlign 在很大程度上解决了使用大步幅特征进行检测和分割这一长期挑战。

Finally, RoIAlign shows a gain of 1.5 mask AP and 0.5 box AP when used with FPN, which has finer multi-level strides. For keypoint detection that requires finer alignment, RoIAlign shows large gains even with FPN (Table 6).

最后,当与具有更细多层步幅的 FPN 一起使用时,RoIAlign 带来 1.5 掩码 AP 和 0.5 边界框 AP 的增益。 对于需要更精细对齐的关键点检测,即使使用 FPN,RoIAlign 也展现出很大收益(表6)。

Mask Branch: Segmentation is a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Table 2, we compare multi-layer perceptrons (MLP) and FCNs, using a ResNet-50-FPN backbone. Using FCNs gives a 2.1 mask AP gain over MLPs. We note that we choose this backbone so that the conv layers of the FCN head are not pre-trained, for a fair comparison with MLP.

掩码分支: 分割是像素到像素的任务,我们通过使用 FCN 来利用掩码的空间布局。 在 表2 中,我们使用 ResNet-50-FPN 主干比较多层感知机(MLP)和 FCN。 使用 FCN 相比 MLP 带来 2.1 掩码 AP 的提升。 我们指出,之所以选择这个主干,是为了让 FCN 头部的卷积层没有经过预训练,从而与 MLP 进行公平比较。

4.3 Bounding Box Detection Results

We compare Mask R-CNN to the state of the art in COCO bounding-box object detection in Table 3. For this result, even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference (the mask output is ignored). Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of G-RMI, the winner of the COCO 2016 Detection Challenge. Using ResNeXt-101-FPN, Mask R-CNN further improves results, with a margin of 3.0 points box AP over the best previous single-model entry from Faster R-CNN w TDM (which used Inception-ResNet-v2-TDM).

我们在 表3 中将 Mask R-CNN 与 COCO 边界框目标检测的最先进方法进行比较。 对于这个结果,尽管训练的是完整 Mask R-CNN 模型,但推理时只使用分类和边界框输出(忽略掩码输出)。 使用 ResNet-101-FPN 的 Mask R-CNN 超过了此前所有最先进模型的基础变体,包括 COCO 2016 检测挑战赛冠军 G-RMI 的单模型变体。 使用 ResNeXt-101-FPN 时,Mask R-CNN 进一步提升结果,相比此前最佳单模型 Faster R-CNN w TDM(使用 Inception-ResNet-v2-TDM)高出 3.0 个边界框 AP 点。

As a further comparison, we trained a version of Mask R-CNN but without the mask branch, denoted by "Faster R-CNN, RoIAlign" in Table 3. This model performs better than the model presented in FPN due to RoIAlign. On the other hand, it is 0.9 points box AP lower than Mask R-CNN. This gap of Mask R-CNN on box detection is therefore due solely to the benefits of multi-task training.

作为进一步比较,我们还训练了一个没有掩码分支的 Mask R-CNN 版本,在 表3 中记为 “Faster R-CNN, RoIAlign”。 由于 RoIAlign,这个模型优于 FPN 中提出的模型。 另一方面,它比 Mask R-CNN 低 0.9 个边界框 AP 点。 因此,Mask R-CNN 在边界框检测上的这一差距完全来自多任务训练的收益。

Lastly, we note that Mask R-CNN attains a small gap between its mask and box AP: e.g., 2.7 points between 37.1 (mask, Table 1) and 39.8 (box, Table 3). This indicates that our approach largely closes the gap between object detection and the more challenging instance segmentation task.

最后,我们指出,Mask R-CNN 的掩码 AP 与边界框 AP 之间差距很小:例如 37.1(掩码,表1)与 39.8(边界框,表3)之间只差 2.7 个点。 这表明我们的方法在很大程度上缩小了目标检测与更具挑战性的实例分割任务之间的差距。
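The gaps quoted in this section follow directly from the box AP column of Table 3 and the 37.1 mask AP cited from Table 1. The short check below recomputes them. The dictionary keys and the `gap` helper are names introduced here for illustration only.

```python
# Box AP values copied from Table 3 (test-dev, single model).
box_ap = {
    "Faster R-CNN w FPN": 36.2,
    "Faster R-CNN w TDM": 36.8,
    "Faster R-CNN, RoIAlign": 37.3,
    "Mask R-CNN, ResNet-101-FPN": 38.2,
    "Mask R-CNN, ResNeXt-101-FPN": 39.8,
}
mask_ap_resnext = 37.1  # mask AP quoted in the text from Table 1


def gap(a, b):
    """Difference in AP points, rounded to one decimal as reported in the paper."""
    return round(a - b, 1)


roialign_gain = gap(box_ap["Faster R-CNN, RoIAlign"], box_ap["Faster R-CNN w FPN"])
multitask_gain = gap(box_ap["Mask R-CNN, ResNet-101-FPN"], box_ap["Faster R-CNN, RoIAlign"])
resnext_gain = gap(box_ap["Mask R-CNN, ResNeXt-101-FPN"], box_ap["Mask R-CNN, ResNet-101-FPN"])
margin_over_tdm = gap(box_ap["Mask R-CNN, ResNeXt-101-FPN"], box_ap["Faster R-CNN w TDM"])
mask_box_gap = gap(box_ap["Mask R-CNN, ResNeXt-101-FPN"], mask_ap_resnext)

print(roialign_gain, multitask_gain, resnext_gain, margin_over_tdm, mask_box_gap)
```

The three component gains add up to the 3.6-point difference between Faster R-CNN w FPN and the ResNeXt-101-FPN Mask R-CNN.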

4.4 Timing

Inference: We train a ResNet-101-FPN model that shares features between the RPN and Mask R-CNN stages, following the 4-step training of Faster R-CNN. This model runs at 195ms per image on an Nvidia Tesla M40 GPU (plus 15ms CPU time resizing the outputs to the original resolution), and achieves statistically the same mask AP as the unshared one. We also report that the ResNet-101-C4 variant takes ~400ms as it has a heavier box head (Figure 4), so we do not recommend using the C4 variant in practice.

推理: 我们按照 Faster R-CNN 的 4 步训练方式,训练了一个在 RPN 与 Mask R-CNN 阶段之间共享特征的 ResNet-101-FPN 模型。 这个模型在 Nvidia Tesla M40 GPU 上每张图像运行 195ms(另外需要 15ms CPU 时间把输出调整到原始分辨率),并取得与不共享版本在统计上相同的掩码 AP。 我们还报告,由于 ResNet-101-C4 变体具有更重的边界框头部(图4),它需要约 400ms,因此我们不建议在实践中使用 C4 变体。
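As a quick sanity check, the per-image latencies above can be converted to frames per second. The variable names are illustrative, and treating GPU and CPU time as strictly sequential is an assumption of this sketch.

```python
gpu_ms = 195.0   # ResNet-101-FPN, GPU time per image
cpu_ms = 15.0    # CPU time to resize outputs to the original resolution
c4_ms = 400.0    # approximate time for the ResNet-101-C4 variant


def fps(latency_ms):
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / latency_ms


fps_gpu_only = fps(gpu_ms)
fps_with_cpu = fps(gpu_ms + cpu_ms)  # assumes CPU work is not overlapped
fps_c4 = fps(c4_ms)

print(round(fps_gpu_only, 1), round(fps_with_cpu, 1), round(fps_c4, 1))
```

The GPU-only figure is consistent with the roughly 5 fps quoted for Mask R-CNN, and the C4 variant runs at about half that rate.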

Although Mask R-CNN is fast, we note that our design is not optimized for speed, and better speed/accuracy trade-offs could be achieved, e.g., by varying image sizes and proposal numbers, which is beyond the scope of this paper.

虽然 Mask R-CNN 很快,但我们指出,其设计并没有针对速度进行优化,更好的速度/准确率权衡可以通过改变图像尺寸和提议数量等方式实现,这超出了本文范围。

Training: Mask R-CNN is also fast to train. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. In fact, fast prototyping can be completed in less than one day when training on the train set. We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic.

训练: Mask R-CNN 的训练也很快。 在我们同步的 8-GPU 实现中,用 ResNet-50-FPN 在 COCO trainval35k 上训练需要 32 小时(每个 16 图像小批量 0.72s),用 ResNet-101-FPN 则需要 44 小时。 事实上,在 train 集上训练时,快速原型开发可以在不到一天内完成。 我们希望这种快速训练能够移除该领域的一大障碍,并鼓励更多人研究这一具有挑战性的话题。
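The training figures can be cross-checked with simple arithmetic. The sketch assumes that the full 32 hours is spent on mini-batch iterations at 0.72 s each, and that the ResNet-101-FPN run uses the same number of iterations. The second assumption is not stated in this section.

```python
hours_r50 = 32.0
hours_r101 = 44.0
seconds_per_iter_r50 = 0.72   # per 16-image mini-batch
images_per_batch = 16

# Implied number of iterations, assuming all wall-clock time is iteration time.
implied_iters = round(hours_r50 * 3600 / seconds_per_iter_r50)
images_seen = implied_iters * images_per_batch

# Implied ResNet-101-FPN iteration time under the same-iteration-count assumption.
implied_seconds_per_iter_r101 = hours_r101 * 3600 / implied_iters

print(implied_iters, images_seen, round(implied_seconds_per_iter_r101, 2))
```

Under these assumptions the schedule corresponds to 160,000 iterations, and the ResNet-101-FPN model would take roughly one second per mini-batch.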

5. Mask R-CNN for Human Pose Estimation

Keypoint detection results on COCO test using Mask R-CNN
图7:Keypoint detection results on COCO test using Mask R-CNN (ResNet-50-FPN), with person segmentation masks predicted from the same model. This model has a keypoint AP of 63.1 and runs at 5 fps.

Our framework can easily be extended to human pose estimation. We model a keypoint's location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow). This task helps demonstrate the flexibility of Mask R-CNN.

我们的框架可以很容易扩展到人体姿态估计。 我们把关键点位置建模为 one-hot 掩码,并采用 Mask R-CNN 预测 K 个掩码,K 种关键点类型各对应一个掩码(例如左肩、右肘)。 这项任务有助于展示 Mask R-CNN 的灵活性。

We note that minimal domain knowledge for human pose is exploited by our system, as the experiments are mainly to demonstrate the generality of the Mask R-CNN framework. We expect that domain knowledge (e.g., modeling structures) will be complementary to our simple approach.

我们指出,系统只利用了极少量人体姿态领域知识,因为实验主要是为了展示 Mask R-CNN 框架的通用性。 我们预计,领域知识(例如结构建模)将与这个简单方法互补。

Implementation Details: We make minor modifications to the segmentation system when adapting it for keypoints. For each of the K keypoints of an instance, the training target is a one-hot m×m binary mask where only a single pixel is labeled as foreground. During training, for each visible ground-truth keypoint, we minimize the cross-entropy loss over an m²-way softmax output (which encourages a single point to be detected). We note that as in instance segmentation, the K keypoints are still treated independently.

实现细节: 在把分割系统适配到关键点任务时,我们进行了少量修改。 对于一个实例的 K 个关键点中的每一个,训练目标是一个 one-hot 的 m×m 二值掩码,其中只有单个像素被标记为前景。 训练期间,对于每个可见真实关键点,我们在 m² 路 softmax 输出上最小化交叉熵损失(这鼓励检测单个点)。 我们指出,与实例分割中一样,K 个关键点仍然被独立处理。
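The keypoint loss described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the `(row, col, visible)` tuple format are hypothetical, and summing (rather than averaging) over visible keypoints is an assumption, since the text does not specify the reduction.

```python
import math


def keypoint_loss(logits, kp_row, kp_col):
    """Cross-entropy of an (m*m)-way softmax over one keypoint's m x m logit map.

    The target is one-hot: only the pixel (kp_row, kp_col) is foreground.
    """
    m = len(logits)
    flat = [value for row in logits for value in row]
    target_index = kp_row * m + kp_col
    top = max(flat)
    log_z = top + math.log(sum(math.exp(v - top) for v in flat))
    return log_z - flat[target_index]


def instance_keypoint_loss(logit_maps, keypoints):
    """Sum the loss over visible keypoints; each keypoint is handled independently.

    logit_maps: list of K maps, each an m x m list of lists.
    keypoints: list of K tuples (row, col, visible).
    """
    total = 0.0
    for logits, (row, col, visible) in zip(logit_maps, keypoints):
        if visible:
            total += keypoint_loss(logits, row, col)
    return total


# Toy example with m = 3 and K = 2; uniform logits give a loss of log(m*m).
uniform = [[0.0] * 3 for _ in range(3)]
peaked = [[0.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]

loss_uniform = keypoint_loss(uniform, 1, 1)
loss_peaked = keypoint_loss(peaked, 1, 1)
loss_instance = instance_keypoint_loss([uniform, peaked], [(1, 1, True), (0, 0, False)])
print(loss_uniform, loss_peaked, loss_instance)
```

Because the softmax normalizes over all m² positions, raising the logit at the correct pixel lowers the loss while implicitly suppressing every other position. This matches the stated goal of detecting a single point.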

We adopt the ResNet-FPN variant, and the keypoint head architecture is similar to that in Figure 4 (right). The keypoint head consists of a stack of eight 3×3 512-d conv layers, followed by a deconv layer and 2× bilinear upscaling, producing an output resolution of 56×56. We found that a relatively high resolution output (compared to masks) is required for keypoint-level localization accuracy.

我们采用 ResNet-FPN 变体,并且关键点头部架构与 图4(右)类似。 关键点头部由八个 3×3、512 维卷积层堆叠而成,随后接一个反卷积层和 2× 双线性上采样,生成 56×56 的输出分辨率。 我们发现,为了达到关键点级定位准确率,需要相对较高的输出分辨率(相对于掩码而言)。
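The 56×56 output resolution can be traced through the head with standard convolution size formulas. Several values in the sketch are assumptions not stated in this section: a 14×14 RoI feature as input, padding 1 for the 3×3 convs, and a 2×2 stride-2 deconv.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial size after a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=2, stride=2, padding=0):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def keypoint_head_resolution(roi_size=14, num_convs=8):
    """Trace the spatial size through the conv stack, the deconv, and the 2x upscaling."""
    size = roi_size
    for _ in range(num_convs):
        size = conv_out(size)   # 3x3 convs with padding 1 keep the size
    size = deconv_out(size)     # deconv doubles the resolution
    size *= 2                   # 2x bilinear upscaling
    return size


print(keypoint_head_resolution())
```

Under these assumptions the trace is 14 → 14 → 28 → 56. Any stride-2 deconv configuration that exactly doubles the resolution gives the same result.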

Models are trained on all COCO trainval35k images that contain annotated keypoints. To reduce overfitting, as this training set is smaller, we train using image scales randomly sampled from [640, 800] pixels; inference is on a single scale of 800 pixels. We train for 90k iterations, starting from a learning rate of 0.02 and reducing it by a factor of 10 at 60k and 80k iterations. We use bounding-box NMS with a threshold of 0.5. Other details are identical to those in Section 3.1.

模型在所有包含关键点标注的 COCO trainval35k 图像上训练。 由于该训练集较小,为减少过拟合,我们使用从 [640, 800] 像素随机采样的图像尺度进行训练;推理时使用单一的 800 像素尺度。 我们训练 90k 次迭代,初始学习率为 0.02,并在 60k 和 80k 次迭代时将其降低 10 倍。 我们使用阈值为 0.5 的边界框 NMS。 其他细节与第 3.1 节相同。
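The schedule above can be written as a small step function. The function names are hypothetical, the drop is assumed to take effect at exactly iterations 60k and 80k, and the training scale is assumed to be drawn uniformly over integers in [640, 800]. The text does not specify the sampling distribution.

```python
import random


def learning_rate(iteration, base_lr=0.02, steps=(60_000, 80_000), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each step boundary reached."""
    lr = base_lr
    for step in steps:
        if iteration >= step:
            lr *= gamma
    return lr


def sample_train_scale(rng, low=640, high=800):
    """Draw one training image scale in pixels (uniform integer, an assumption)."""
    return rng.randint(low, high)


rng = random.Random(0)
scales = [sample_train_scale(rng) for _ in range(5)]
rates = [learning_rate(i) for i in (0, 59_999, 60_000, 80_000, 89_999)]
print(scales, rates)
```

With these defaults the rate is 0.02 until iteration 60k, 0.002 until 80k, and 0.0002 for the remaining iterations up to 90k.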

表4:Keypoint detection AP on COCO test-dev. Ours is a single model (ResNet-50-FPN) that runs at 5 fps. CMU-Pose+++ is the 2016 competition winner that uses multi-scale testing, post-processing with CPM, and filtering with an object detector, adding a cumulative ~5 points. †: G-RMI was trained on COCO plus MPII (25k images), using two models.
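| Method | APkp | APkp50 | APkp75 | APkpM | APkpL |
|---|---|---|---|---|---|
| CMU-Pose+++ | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 |
| G-RMI† | 62.4 | 84.0 | 68.5 | 59.1 | 68.1 |
| Mask R-CNN, keypoint-only | 62.7 | 87.0 | 68.4 | 57.4 | 71.1 |
| Mask R-CNN, keypoint & mask | 63.1 | 87.3 | 68.7 | 57.8 | 71.4 |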

Main Results and Ablations: We evaluate the person keypoint AP (APkp) and experiment with a ResNet-50-FPN backbone; more backbones will be studied in the appendix. Table 4 shows that our result (62.7 APkp) is 0.9 points higher than the COCO 2016 keypoint detection winner that uses a multi-stage processing pipeline (see caption of Table 4). Our method is considerably simpler and faster.

主结果与消融: 我们评估人体关键点 AP(APkp),并使用 ResNet-50-FPN 主干进行实验;更多主干将在附录中研究。 表4 显示,我们的结果(62.7 APkp)比采用多阶段处理流程的 COCO 2016 关键点检测冠军高 0.9 个点(见 表4 的说明)。 我们的方法明显更简单且更快。

表5:Multi-task learning of box, mask, and keypoint about the person category, evaluated on minival. All entries are trained on the same data for fair comparisons. The backbone is ResNet-50-FPN.
| Method | APbb (person) | APmask (person) | APkp |
|---|---|---|---|
| Faster R-CNN | 52.5 | - | - |
| Mask R-CNN, mask-only | 53.6 | 45.8 | - |
| Mask R-CNN, keypoint-only | 50.7 | - | 64.2 |
| Mask R-CNN, keypoint & mask | 52.0 | 45.1 | 64.7 |

More importantly, we have a unified model that can simultaneously predict boxes, segments, and keypoints while running at 5 fps. Adding a segment branch (for the person category) improves the APkp to 63.1 (Table 4) on test-dev. More ablations of multi-task learning on minival are in Table 5. Adding the mask branch to the box-only (i.e., Faster R-CNN) or keypoint-only versions consistently improves these tasks. However, adding the keypoint branch reduces the box/mask AP slightly, suggesting that while keypoint detection benefits from multitask training, it does not in turn help the other tasks. Nevertheless, learning all three tasks jointly enables a unified system to efficiently predict all outputs simultaneously (Figure 7).

更重要的是,我们拥有一个能够同时预测边界框、分割和关键点的统一模型,并且运行速度为 5 fps。 在 test-dev 上,增加一个(针对 person 类别的)分割分支将 APkp 提升到 63.1(表4)。 关于 minival 上多任务学习的更多消融见 表5。 在仅有边界框(即 Faster R-CNN)或仅有关键点的版本中增加掩码分支,都会持续改进这些任务。 然而,增加关键点分支会略微降低边界框/掩码 AP,这表明关键点检测虽然受益于多任务训练,但反过来并不会帮助其他任务。 尽管如此,联合学习全部三个任务使一个统一系统能够同时高效预测所有输出(图7)。

表6:RoIAlign vs RoIPool for keypoint detection on minival. The backbone is ResNet-50-FPN.
| Layer | APkp | APkp50 | APkp75 | APkpM | APkpL |
|---|---|---|---|---|---|
| RoIPool | 59.8 | 86.2 | 66.7 | 55.1 | 67.4 |
| RoIAlign | 64.2 | 86.6 | 69.7 | 58.7 | 73.0 |

We also investigate the effect of RoIAlign on keypoint detection (Table 6). Though this ResNet-50-FPN backbone has finer strides (e.g., 4 pixels on the finest level), RoIAlign still shows significant improvement over RoIPool and increases APkp by 4.4 points. This is because keypoint detections are more sensitive to localization accuracy. This again indicates that alignment is essential for pixel-level localization, including masks and keypoints.

我们还研究了 RoIAlign 对关键点检测的影响(表6)。 虽然这个 ResNet-50-FPN 主干具有更细的步幅(例如最细层级上为 4 像素),RoIAlign 仍然相比 RoIPool 显示出显著提升,并将 APkp 提升 4.4 个点。 这是因为关键点检测对定位准确率更敏感。 这再次表明,对齐对于像素级定位至关重要,包括掩码和关键点。

Given the effectiveness of Mask R-CNN for extracting object bounding boxes, masks, and keypoints, we expect it to be an effective framework for other instance-level tasks.

鉴于 Mask R-CNN 在提取目标边界框、掩码和关键点方面的有效性,我们预计它也会成为其他实例级任务的有效框架。