Mask R-CNN
Instance Segmentation · 51270+ · ICCV 2017 · FAIR
He K, Gkioxari G, Dollár P, Girshick R. Mask R-CNN. ICCV 2017.
Abstract
We present a conceptually simple, flexible, and general framework for object instance segmentation. Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance. The method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps. Moreover, Mask R-CNN is easy to generalize to other tasks, e.g., allowing us to estimate human poses in the same framework. We show top results in all three tracks of the COCO suite of challenges, including instance segmentation, bounding-box object detection, and person keypoint detection. Without bells and whistles, Mask R-CNN outperforms all existing, single-model entries on every task, including the COCO 2016 challenge winners. We hope our simple and effective approach will serve as a solid baseline and help ease future research in instance-level recognition.
1. Introduction


The vision community has rapidly improved object detection and semantic segmentation results over a short period of time. In large part, these advances have been driven by powerful baseline systems, such as the Fast/Faster R-CNN and Fully Convolutional Network (FCN) frameworks for object detection and semantic segmentation, respectively. These methods are conceptually intuitive and offer flexibility and robustness, together with fast training and inference time. Our goal in this work is to develop a comparably enabling framework for instance segmentation.
Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance. It therefore combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances. Following common terminology, we use object detection to denote detection via bounding boxes, not masks, and semantic segmentation to denote per-pixel classification without differentiating instances. Yet we note that instance segmentation is both semantic and a form of detection. Given this, one might expect a complex method is required to achieve good results. However, we show that a surprisingly simple, flexible, and fast system can surpass prior state-of-the-art instance segmentation results.
Our method, called Mask R-CNN, extends Faster R-CNN by adding a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding box regression (Figure 1). The mask branch is a small FCN applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner. Mask R-CNN is simple to implement and train given the Faster R-CNN framework, which facilitates a wide range of flexible architecture designs. Additionally, the mask branch only adds a small computational overhead, enabling a fast system and rapid experimentation.
In principle Mask R-CNN is an intuitive extension of Faster R-CNN, yet constructing the mask branch properly is critical for good results. Most importantly, Faster R-CNN was not designed for pixel-to-pixel alignment between network inputs and outputs. This is most evident in how RoIPool, the de facto core operation for attending to instances, performs coarse spatial quantization for feature extraction. To fix the misalignment, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations. Despite being a seemingly minor change, RoIAlign has a large impact: it improves mask accuracy by relative 10% to 50%, showing bigger gains under stricter localization metrics. Second, we found it essential to decouple mask and class prediction: we predict a binary mask for each class independently, without competition among classes, and rely on the network's RoI classification branch to predict the category. In contrast, FCNs usually perform per-pixel multi-class categorization, which couples segmentation and classification, and based on our experiments works poorly for instance segmentation.
Without bells and whistles, Mask R-CNN surpasses all previous state-of-the-art single-model results on the COCO instance segmentation task, including the heavily-engineered entries from the 2016 competition winner. As a by-product, our method also excels on the COCO object detection task. In ablation experiments, we evaluate multiple basic instantiations, which allows us to demonstrate its robustness and analyze the effects of core factors.
Our models can run at about 200ms per frame on a GPU, and training on COCO takes one to two days on a single 8-GPU machine. We believe the fast train and test speeds, together with the framework's flexibility and accuracy, will benefit and ease future research on instance segmentation.
Finally, we showcase the generality of our framework via the task of human pose estimation on the COCO keypoint dataset. By viewing each keypoint as a one-hot binary mask, with minimal modification Mask R-CNN can be applied to detect instance-specific poses. Mask R-CNN surpasses the winner of the 2016 COCO keypoint competition, and at the same time runs at 5 fps. Mask R-CNN, therefore, can be seen more broadly as a flexible framework for instance-level recognition and can be readily extended to more complex tasks.
We have released code to facilitate future research.
2. Related Work
R-CNN: The Region-based CNN (R-CNN) approach to bounding-box object detection is to attend to a manageable number of candidate object regions and evaluate convolutional networks independently on each RoI. R-CNN was extended to allow attending to RoIs on feature maps using RoIPool, leading to fast speed and better accuracy. Faster R-CNN advanced this stream by learning the attention mechanism with a Region Proposal Network (RPN). Faster R-CNN is flexible and robust to many follow-up improvements, and is the current leading framework in several benchmarks.
Instance Segmentation: Driven by the effectiveness of R-CNN, many approaches to instance segmentation are based on segment proposals. Earlier methods resorted to bottom-up segments. DeepMask and following works learn to propose segment candidates, which are then classified by Fast R-CNN. In these methods, segmentation precedes recognition, which is slow and less accurate. Likewise, Dai et al. proposed a complex multiple-stage cascade that predicts segment proposals from bounding-box proposals, followed by classification. Instead, our method is based on parallel prediction of masks and class labels, which is simpler and more flexible.
Most recently, Li et al. combined the segment proposal system and object detection system for "fully convolutional instance segmentation" (FCIS). The common idea is to predict a set of position-sensitive output channels fully convolutionally. These channels simultaneously address object classes, boxes, and masks, making the system fast. But FCIS exhibits systematic errors on overlapping instances and creates spurious edges (Figure 6), showing that it is challenged by the fundamental difficulties of segmenting instances.
Another family of solutions to instance segmentation is driven by the success of semantic segmentation. Starting from per-pixel classification results (e.g., FCN outputs), these methods attempt to cut the pixels of the same category into different instances. In contrast to the segmentation-first strategy of these methods, Mask R-CNN is based on an instance-first strategy. We expect a deeper incorporation of both strategies will be studied in the future.
3. Mask R-CNN


Mask R-CNN is conceptually simple: Faster R-CNN has two outputs for each candidate object, a class label and a bounding-box offset; to this we add a third branch that outputs the object mask. Mask R-CNN is thus a natural and intuitive idea. But the additional mask output is distinct from the class and box outputs, requiring extraction of much finer spatial layout of an object. Next, we introduce the key elements of Mask R-CNN, including pixel-to-pixel alignment, which is the main missing piece of Fast/Faster R-CNN.
Faster R-CNN: We begin by briefly reviewing the Faster R-CNN detector. Faster R-CNN consists of two stages. The first stage, called a Region Proposal Network (RPN), proposes candidate object bounding boxes. The second stage, which is in essence Fast R-CNN, extracts features using RoIPool from each candidate box and performs classification and bounding-box regression. The features used by both stages can be shared for faster inference. We refer readers to Huang et al. for the latest comprehensive comparisons between Faster R-CNN and other frameworks.
Mask R-CNN: Mask R-CNN adopts the same two-stage procedure, with an identical first stage (which is RPN). In the second stage, in parallel to predicting the class and box offset, Mask R-CNN also outputs a binary mask for each RoI. This is in contrast to most recent systems, where classification depends on mask predictions. Our approach follows the spirit of Fast R-CNN that applies bounding-box classification and regression in parallel (which turned out to largely simplify the multi-stage pipeline of original R-CNN).
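Formally, during training, we define a multi-task loss on each sampled RoI as $L = L_{cls} + L_{box} + L_{mask}$, where $L_{cls}$ is the classification loss, $L_{box}$ is the bounding-box loss, and $L_{mask}$ is the mask loss.
Our definition of $L_{mask}$ allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction.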
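The multi-task loss on one positive RoI (a classification term, a box term, and a mask term added together) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names and array shapes are hypothetical, the smooth-L1 box loss is assumed to follow Fast R-CNN, and for simplicity the mask logits are indexed with the same class index as the classification logits. The mask term is a per-pixel sigmoid with binary cross-entropy, evaluated only on the channel of the ground-truth class.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Classification term for one RoI; logits has one entry per class."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def smooth_l1(pred, target):
    """Box regression term (smooth L1, assumed as in Fast R-CNN)."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()

def mask_bce(mask_logits, gt_mask, gt_class):
    """Mask term: mean per-pixel binary cross-entropy on the
    ground-truth class channel only; other channels are ignored."""
    z = mask_logits[gt_class]  # [m, m] logits for the ground-truth class
    # Numerically stable BCE with logits: max(z, 0) - z*y + log(1 + exp(-|z|))
    loss = np.maximum(z, 0) - z * gt_mask + np.log1p(np.exp(-np.abs(z)))
    return loss.mean()

def multi_task_loss(cls_logits, box_pred, box_target,
                    mask_logits, gt_mask, gt_class):
    """Sum of the classification, box, and mask terms for one positive RoI."""
    return (softmax_cross_entropy(cls_logits, gt_class)
            + smooth_l1(box_pred, box_target)
            + mask_bce(mask_logits, gt_mask, gt_class))
```

Because only the ground-truth channel enters the mask term, changing the mask logits of any other class leaves the loss unchanged, which is the "no competition among classes" property.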
Mask Representation: A mask encodes an input object's spatial layout. Thus, unlike class labels or box offsets that are inevitably collapsed into short output vectors by fully-connected (fc) layers, extracting the spatial structure of masks can be addressed naturally by the pixel-to-pixel correspondence provided by convolutions.
Specifically, we predict an $m \times m$ mask from each RoI using an FCN.
This pixel-to-pixel behavior requires our RoI features, which themselves are small feature maps, to be well aligned to faithfully preserve the explicit per-pixel spatial correspondence. This motivated us to develop the following RoIAlign layer that plays a key role in mask prediction.
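RoIAlign: RoIPool is a standard operation for extracting a small feature map (e.g., $7 \times 7$) from each RoI. As noted in Section 1, RoIPool performs coarse spatial quantization, which introduces misalignments between the RoI and the extracted features.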
To address this, we propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input. Our proposed change is simple: we avoid any quantization of the RoI boundaries or bins (i.e., we use $x/16$ instead of $[x/16]$, where 16 is the feature-map stride and $[\cdot]$ denotes rounding). Feature values at the resulting continuous sampling locations are computed by bilinear interpolation and then aggregated (using max or average).
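A quantization-free RoI layer of this kind can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' code: the function names, the 2x2 sampling grid per bin, the average aggregation, and the coordinate convention (feature index i sits at continuous location i) are all assumptions made for the example.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate a 2-D feature map at continuous (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    return (feat[y0, x0] * (1 - dy) * (1 - dx) + feat[y0, x1] * (1 - dy) * dx
            + feat[y1, x0] * dy * (1 - dx) + feat[y1, x1] * dy * dx)

def roi_align(feat, box, out_size=2, samples=2, stride=16):
    """Average-aggregated RoIAlign for one RoI (x1, y1, x2, y2) in image pixels."""
    # Map to feature coordinates by plain division: no rounding anywhere.
    x1, y1, x2, y2 = [c / stride for c in box]
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    # Regularly spaced sample points inside bin (i, j).
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            out[i, j] = np.mean(vals)
    return out
```

On a feature map that increases linearly along x, shifting the RoI by 4 image pixels (0.25 feature cells at stride 16) shifts every output value by exactly 0.25. A layer that rounds the RoI to the feature grid cannot represent such a sub-cell shift.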
RoIAlign leads to large improvements as we show in Section 4.2. We also compare to the RoIWarp operation proposed in MNC. Unlike RoIAlign, RoIWarp overlooked the alignment issue and was implemented in MNC with RoI quantization, just like RoIPool. So even though RoIWarp also adopts bilinear resampling motivated by spatial transformer networks, it performs on par with RoIPool as shown by experiments (more details in Table 2c), demonstrating the crucial role of alignment.
Network Architecture: To demonstrate the generality of our approach, we instantiate Mask R-CNN with multiple architectures. For clarity, we differentiate between: (i) the convolutional backbone architecture used for feature extraction over an entire image, and (ii) the network head for bounding-box recognition (classification and regression) and mask prediction that is applied separately to each RoI.
We denote the backbone architecture using the nomenclature network-depth-features. We evaluate ResNet and ResNeXt networks of depth 50 or 101 layers. The original implementation of Faster R-CNN with ResNets extracted features from the final convolutional layer of the 4-th stage, which we call C4. This backbone with ResNet-50, for example, is denoted by ResNet-50-C4. This is a common choice used in previous work.
We also explore another more effective backbone recently proposed by Lin et al., called a Feature Pyramid Network (FPN). FPN uses a top-down architecture with lateral connections to build an in-network feature pyramid from a single-scale input. Faster R-CNN with an FPN backbone extracts RoI features from different levels of the feature pyramid according to their scale, but otherwise the rest of the approach is similar to vanilla ResNet. Using a ResNet-FPN backbone for feature extraction with Mask R-CNN gives excellent gains in both accuracy and speed. For further details on FPN, we refer readers to Lin et al.
For the network head we closely follow architectures presented in previous work, to which we add a fully convolutional mask prediction branch. Specifically, we extend the Faster R-CNN box heads from the ResNet and FPN papers. Details are shown in Figure 4. The head on the ResNet-C4 backbone includes the 5-th stage of ResNet (namely, the 9-layer res5), which is compute-intensive. For FPN, the backbone already includes res5 and thus allows for a more efficient head that uses fewer filters.
We note that our mask branches have a straightforward structure. More complex designs have the potential to improve performance but are not the focus of this work.
3.1 Implementation Details

Table 1. Instance segmentation mask AP on COCO.
| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| MNC | ResNet-101-C4 | 24.6 | 44.3 | 24.8 | 4.7 | 25.9 | 43.6 |
| FCIS +OHEM | ResNet-101-C5-dilated | 29.2 | 49.5 | - | 7.1 | 31.3 | 50.0 |
| FCIS+++ +OHEM | ResNet-101-C5-dilated | 33.6 | 54.5 | - | - | - | - |
| Mask R-CNN | ResNet-101-C4 | 33.1 | 54.9 | 34.8 | 12.1 | 35.6 | 51.1 |
| Mask R-CNN | ResNet-101-FPN | 35.7 | 58.0 | 37.8 | 15.5 | 38.1 | 52.4 |
| Mask R-CNN | ResNeXt-101-FPN | 37.1 | 60.0 | 39.4 | 16.9 | 39.9 | 53.5 |
We set hyper-parameters following existing Fast/Faster R-CNN work. Although these decisions were made for object detection in the original papers, we found our instance segmentation system is robust to them.
Training: As in Fast R-CNN, an RoI is considered positive if it has IoU with a ground-truth box of at least 0.5 and negative otherwise. The mask loss $L_{mask}$ is defined only on positive RoIs.
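The positive/negative assignment rule (IoU with a ground-truth box of at least 0.5) can be written out as a short NumPy sketch. The helper names and the `(x1, y1, x2, y2)` box layout are assumptions for illustration only.

```python
import numpy as np

def iou_matrix(rois, gts):
    """Pairwise IoU between rois [R, 4] and gts [G, 4], boxes as (x1, y1, x2, y2)."""
    x1 = np.maximum(rois[:, None, 0], gts[None, :, 0])
    y1 = np.maximum(rois[:, None, 1], gts[None, :, 1])
    x2 = np.minimum(rois[:, None, 2], gts[None, :, 2])
    y2 = np.minimum(rois[:, None, 3], gts[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_r = (rois[:, 2] - rois[:, 0]) * (rois[:, 3] - rois[:, 1])
    area_g = (gts[:, 2] - gts[:, 0]) * (gts[:, 3] - gts[:, 1])
    return inter / (area_r[:, None] + area_g[None, :] - inter)

def label_rois(rois, gts, thresh=0.5):
    """Return (is_positive, matched_gt_index) for every RoI."""
    ious = iou_matrix(rois, gts)
    return ious.max(axis=1) >= thresh, ious.argmax(axis=1)
```

Only RoIs flagged positive here would contribute a mask loss; negatives contribute to classification only.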
We adopt image-centric training. Images are resized such that their scale (shorter edge) is 800 pixels. Each mini-batch has 2 images per GPU and each image has $N$ sampled RoIs, with a ratio of 1:3 of positives to negatives. $N$ is 64 for the C4 backbone and 512 for FPN.
The RPN anchors span 5 scales and 3 aspect ratios, following FPN. For convenient ablation, RPN is trained separately and does not share features with Mask R-CNN, unless specified. For every entry in this paper, RPN and Mask R-CNN have the same backbones and so they are shareable.
Inference: At test time, the proposal number is 300 for the C4 backbone and 1000 for FPN. We run the box prediction branch on these proposals, followed by non-maximum suppression. The mask branch is then applied to the highest scoring 100 detection boxes. Although this differs from the parallel computation used in training, it speeds up inference and improves accuracy (due to the use of fewer, more accurate RoIs). The mask branch can predict $K$ masks per RoI, but we only use the $k$-th mask, where $k$ is the class predicted by the classification branch.
Note that since we only compute masks on the top 100 detection boxes, Mask R-CNN adds a small overhead to its Faster R-CNN counterpart (e.g., ~20% on typical models).
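The inference order described above (box branch, then non-maximum suppression, then the mask branch on at most 100 surviving boxes) can be sketched as follows. This is an assumption-laden illustration: `mask_branch` is a hypothetical callable returning per-class mask probabilities for one box, NMS is class-agnostic for brevity, and the 0.5 NMS and binarization thresholds are example defaults rather than values taken from this section.

```python
import numpy as np

def box_iou(box, others):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], others[:, 0])
    y1 = np.maximum(box[1], others[:, 1])
    x2 = np.minimum(box[2], others[:, 2])
    y2 = np.minimum(box[3], others[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; returns kept indices ordered by descending score."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Drop every remaining box that overlaps the kept one too much.
        order = rest[box_iou(boxes[best], boxes[rest]) <= iou_thresh]
    return keep

def infer_masks(boxes, scores, classes, mask_branch, top_k=100, binarize_at=0.5):
    """Run the (hypothetical) mask branch only on the top_k boxes left after NMS."""
    keep = nms(boxes, scores)[:top_k]
    results = []
    for r in keep:
        probs = mask_branch(boxes[r])  # [K, m, m] per-class probabilities
        # Use only the mask of the class chosen by the classification branch.
        results.append((r, probs[classes[r]] >= binarize_at))
    return results
```

Since the mask branch is called once per kept detection, its cost scales with the number of final boxes rather than with the number of proposals.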
4. Experiments: Instance Segmentation

Table 2. Ablations for Mask R-CNN, reported on COCO minival.
(a) Backbone Architecture. Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.
| net-depth-features | AP | AP50 | AP75 |
|---|---|---|---|
| ResNet-50-C4 | 30.3 | 51.2 | 31.5 |
| ResNet-101-C4 | 32.7 | 54.2 | 34.3 |
| ResNet-50-FPN | 33.6 | 55.2 | 35.3 |
| ResNet-101-FPN | 35.4 | 57.3 | 37.5 |
| ResNeXt-101-FPN | 36.7 | 59.5 | 38.9 |
(b) Multinomial vs Independent Masks. Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
| | AP | AP50 | AP75 |
|---|---|---|---|
| softmax | 24.8 | 44.1 | 25.1 |
| sigmoid | 30.3 | 51.2 | 31.5 |
| | +5.5 | +7.1 | +6.4 |
(c) RoIAlign (ResNet-50-C4). Mask results with various RoI layers. Our RoIAlign layer improves AP by ~3 points and AP75 by ~5 points.
| Layer | align? | bilinear? | agg. | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| RoIPool | | | max | 26.9 | 48.8 | 26.4 |
| RoIWarp | | ✓ | max | 27.2 | 49.2 | 27.1 |
| RoIWarp | | ✓ | ave | 27.1 | 48.9 | 27.1 |
| RoIAlign | ✓ | ✓ | max | 30.2 | 51.0 | 31.8 |
| RoIAlign | ✓ | ✓ | ave | 30.3 | 51.2 | 31.5 |
(d) RoIAlign (ResNet-50-C5, stride 32). Mask-level and box-level AP using large-stride features.
| Layer | AP | AP50 | AP75 | APbb | APbb50 | APbb75 |
|---|---|---|---|---|---|---|
| RoIPool | 23.6 | 46.5 | 21.6 | 28.2 | 52.7 | 26.9 |
| RoIAlign | 30.9 | 51.8 | 32.1 | 34.0 | 55.3 | 36.4 |
| | +7.3 | +5.3 | +10.5 | +5.8 | +2.6 | +9.5 |
(e) Mask Branch (ResNet-50-FPN). Fully convolutional networks (FCN) vs multi-layer perceptrons (MLP, fully-connected) for mask prediction.
| Type | mask branch | AP | AP50 | AP75 |
|---|---|---|---|---|
| MLP | fc: 1024→1024→80·28² | 31.5 | 53.7 | 32.8 |
| MLP | fc: 1024→1024→1024→80·28² | 31.5 | 54.0 | 32.6 |
| FCN | conv: 256→256→256→256→256→80 | 33.6 | 55.2 | 35.3 |
We perform a thorough comparison of Mask R-CNN to the state of the art along with comprehensive ablations on the COCO dataset. We report the standard COCO metrics including AP (averaged over IoU thresholds), AP50, AP75, and APS, APM, APL (AP at different scales). We train on the trainval35k split and report ablations on the remaining 5k val images (minival). We also report results on test-dev.
4.1 Main Results
We compare Mask R-CNN to the state-of-the-art methods in instance segmentation in Table 1. All instantiations of our model outperform baseline variants of previous state-of-the-art models. This includes MNC and FCIS, the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN with ResNet-101-FPN backbone outperforms FCIS+++, which includes multi-scale train/test, horizontal flip test, and online hard example mining (OHEM). While outside the scope of this work, we expect many such improvements to be applicable to ours.
Mask R-CNN outputs are visualized in Figures 2 and 5. Mask R-CNN achieves good results even under challenging conditions. In Figure 6 we compare our Mask R-CNN baseline and FCIS+++. FCIS+++ exhibits systematic artifacts on overlapping instances, suggesting that it is challenged by the fundamental difficulty of instance segmentation. Mask R-CNN shows no such artifacts.
4.2 Ablation Experiments
We run a number of ablations to analyze Mask R-CNN. Results are shown in Table 2 and discussed in detail next.
Architecture: Table 2a shows Mask R-CNN with various backbones. It benefits from deeper networks (50 vs. 101) and advanced designs including FPN and ResNeXt. We note that not all frameworks automatically benefit from deeper or advanced networks (see benchmarking in Huang et al.).
Multinomial vs Independent Masks: Mask R-CNN decouples mask and class prediction: as the existing box branch predicts the class label, we generate a mask for each class without competition among classes (by a per-pixel sigmoid and a binary loss). In Table 2b, we compare this to using a per-pixel softmax and a multinomial loss (as commonly used in FCN). This alternative couples the tasks of mask and class prediction, and results in a severe loss in mask AP (5.5 points). This suggests that once the instance has been classified as a whole (by the box branch), it is sufficient to predict a binary mask without concern for the categories, which makes the model easier to train.
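The difference between the two output parameterizations can be made concrete with a small NumPy sketch. The function names and the `[K, m, m]` logit layout are assumptions for illustration; the sketch only shows how the per-pixel probabilities behave, not how either variant is trained.

```python
import numpy as np

def sigmoid_masks(logits):
    """Independent masks: each class gets its own per-pixel probability."""
    return 1.0 / (1.0 + np.exp(-logits))

def softmax_masks(logits):
    """Multinomial masks: classes compete, so per-pixel probabilities sum to 1."""
    shifted = logits - logits.max(axis=0, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=0, keepdims=True)
```

Raising the logits of one class leaves the sigmoid probabilities of every other class untouched, while under softmax the other classes' probabilities necessarily drop. That coupling is what the per-class binary formulation avoids.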
Class-Specific vs Class-Agnostic Masks: Our default instantiation predicts class-specific masks, i.e., one $m \times m$ mask per class.
Table 3. Object detection bounding-box AP (APbb) on COCO.
| Method | Backbone | APbb | APbb50 | APbb75 | APbbS | APbbM | APbbL |
|---|---|---|---|---|---|---|---|
| Faster R-CNN+++ | ResNet-101-C4 | 34.9 | 55.7 | 37.4 | 15.6 | 38.7 | 50.9 |
| Faster R-CNN w FPN | ResNet-101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Faster R-CNN by G-RMI | Inception-ResNet-v2 | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0 |
| Faster R-CNN w TDM | Inception-ResNet-v2-TDM | 36.8 | 57.7 | 39.2 | 16.2 | 39.8 | 52.1 |
| Faster R-CNN, RoIAlign | ResNet-101-FPN | 37.3 | 59.6 | 40.3 | 19.8 | 40.2 | 48.8 |
| Mask R-CNN | ResNet-101-FPN | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 |
| Mask R-CNN | ResNeXt-101-FPN | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 |
RoIAlign: An evaluation of our proposed RoIAlign layer is shown in Table 2c. For this experiment we use the ResNet-50-C4 backbone, which has stride 16. RoIAlign improves AP by about 3 points over RoIPool, with much of the gain coming at high IoU (AP75). RoIAlign is insensitive to the choice of max or average aggregation (30.2 vs. 30.3 AP).
Additionally, we compare with RoIWarp proposed in MNC, which also adopts bilinear sampling. As discussed in Section 3, RoIWarp still quantizes the RoI, losing alignment with the input. As can be seen in Table 2c, RoIWarp performs on par with RoIPool and much worse than RoIAlign. This highlights that proper alignment is key.
We also evaluate RoIAlign with a ResNet-50-C5 backbone, which has an even larger stride of 32 pixels. We use the same head as in Figure 4 (right), as the res5 head is not applicable. Table 2d shows that RoIAlign improves mask AP by a massive 7.3 points, and mask AP75 by 10.5 points (roughly a 50% relative improvement).
Finally, RoIAlign shows a gain of 1.5 mask AP and 0.5 box AP when used with FPN, which has finer multi-level strides. For keypoint detection that requires finer alignment, RoIAlign shows large gains even with FPN (Table 6).
Mask Branch: Segmentation is a pixel-to-pixel task and we exploit the spatial layout of masks by using an FCN. In Table 2e, we compare multi-layer perceptrons (MLP) and FCNs, using a ResNet-50-FPN backbone. Using FCNs gives a 2.1 mask AP gain over MLPs. We note that we choose this backbone so that the conv layers of the FCN head are not pre-trained, for a fair comparison with MLP.
4.3 Bounding Box Detection Results
We compare Mask R-CNN to state-of-the-art COCO bounding-box object detection methods in Table 3. For this result, even though the full Mask R-CNN model is trained, only the classification and box outputs are used at inference (the mask output is ignored). Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models, including the single-model variant of G-RMI, the winner of the COCO 2016 Detection Challenge. Using ResNeXt-101-FPN, Mask R-CNN further improves results, with a margin of 3.0 points box AP over the best previous single-model entry from Faster R-CNN w TDM (which used Inception-ResNet-v2-TDM).
As a further comparison, we trained a version of Mask R-CNN but without the mask branch, denoted by "Faster R-CNN, RoIAlign" in Table 3. This model performs better than the model presented in FPN due to RoIAlign. On the other hand, it is 0.9 points box AP lower than Mask R-CNN. This gap of Mask R-CNN on box detection is therefore due solely to the benefits of multi-task training.
Lastly, we note that Mask R-CNN attains a small gap between its mask and box AP: e.g., 2.7 points between 37.1 (mask, Table 1) and 39.8 (box, Table 3). This indicates that our approach largely closes the gap between object detection and the more challenging instance segmentation task.
4.4 Timing
Inference: We train a ResNet-101-FPN model that shares features between the RPN and Mask R-CNN stages, following the 4-step training of Faster R-CNN. This model runs at 195ms per image on an Nvidia Tesla M40 GPU (plus 15ms CPU time resizing the outputs to the original resolution), and achieves statistically the same mask AP as the unshared one. We also report that the ResNet-101-C4 variant takes ~400ms as it has a heavier box head (Figure 4), so we do not recommend using the C4 variant in practice.
Although Mask R-CNN is fast, we note that our design is not optimized for speed, and better speed/accuracy trade-offs could be achieved, e.g., by varying image sizes and proposal numbers, which is beyond the scope of this paper.
Training: Mask R-CNN is also fast to train. Training with ResNet-50-FPN on COCO trainval35k takes 32 hours in our synchronized 8-GPU implementation (0.72s per 16-image mini-batch), and 44 hours with ResNet-101-FPN. In fact, fast prototyping can be completed in less than one day when training on the train set. We hope such rapid training will remove a major hurdle in this area and encourage more people to perform research on this challenging topic.
训练: Mask R-CNN 的训练也很快。 在我们同步的 8-GPU 实现中,用 ResNet-50-FPN 在 COCO trainval35k 上训练需要 32 小时(每个 16 图像小批量 0.72s),用 ResNet-101-FPN 则需要 44 小时。 事实上,在 train 集上训练时,快速原型开发可以在不到一天内完成。 我们希望这种快速训练能够移除该领域的一大障碍,并鼓励更多人研究这一具有挑战性的话题。
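The training figures above imply an iteration count. The sketch below derives it from the 32-hour and 0.72 s numbers in the text; the variable names are illustrative. The last line additionally assumes that the ResNet-101-FPN run uses the same number of iterations, which the text does not state.

```python
HOURS_R50 = 32            # ResNet-50-FPN training time, from the text
HOURS_R101 = 44           # ResNet-101-FPN training time, from the text
SECONDS_PER_ITER = 0.72   # per 16-image mini-batch, from the text
IMAGES_PER_BATCH = 16

# Total wall-clock seconds divided by seconds per iteration.
iterations = round(HOURS_R50 * 3600 / SECONDS_PER_ITER)
images_seen = iterations * IMAGES_PER_BATCH

# Assumption: same iteration count for the larger backbone.
seconds_per_iter_r101 = HOURS_R101 * 3600 / iterations

print(f"iterations: {iterations}")
print(f"images processed: {images_seen}")
print(f"implied ResNet-101-FPN time per iteration: {seconds_per_iter_r101:.2f} s")
```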
5. Mask R-CNN for Human Pose Estimation

Our framework can easily be extended to human pose estimation. We model a keypoint's location as a one-hot mask, and adopt Mask R-CNN to predict K masks, one for each of K keypoint types (e.g., left shoulder, right elbow).
我们的框架可以很容易扩展到人体姿态估计。 我们把关键点位置建模为 one-hot 掩码,并采用 Mask R-CNN 预测 K 个掩码,每个掩码对应 K 种关键点类型之一(例如左肩、右肘)。
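To make the "keypoint as a one-hot mask" idea concrete, the sketch below decodes one predicted mask back into an image location by taking its highest-scoring cell. The function name, the square-heatmap layout, and the cell-center convention are assumptions made for illustration, not details taken from the paper.

```python
from typing import List, Tuple


def decode_keypoint(heatmap: List[List[float]],
                    box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return the (x, y) image location of the highest-scoring heatmap cell.

    heatmap: m x m scores for ONE keypoint type inside one RoI.
    box: (x0, y0, x1, y1) of the RoI in image coordinates.
    """
    m = len(heatmap)
    best_row, best_col = max(
        ((r, c) for r in range(m) for c in range(m)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    x0, y0, x1, y1 = box
    # Map the center of the winning cell back into the box (assumed convention).
    x = x0 + (best_col + 0.5) * (x1 - x0) / m
    y = y0 + (best_row + 0.5) * (y1 - y0) / m
    return x, y


# A 4 x 4 mask with a single foreground cell at row 1, column 2.
heatmap = [[0.0] * 4 for _ in range(4)]
heatmap[1][2] = 1.0
location = decode_keypoint(heatmap, (100.0, 200.0, 140.0, 280.0))
print(location)
```

Running one such decode per keypoint type yields a full pose for the instance.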
We note that minimal domain knowledge for human pose is exploited by our system, as the experiments are mainly to demonstrate the generality of the Mask R-CNN framework. We expect that domain knowledge (e.g., modeling structures) will be complementary to our simple approach.
我们指出,系统只利用了极少量人体姿态领域知识,因为实验主要是为了展示 Mask R-CNN 框架的通用性。 我们预计,领域知识(例如结构建模)将与这个简单方法互补。
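Implementation Details: We make minor modifications to the segmentation system when adapting it for keypoints. For each of the K keypoints of an instance, the training target is a one-hot m × m binary mask where only a single pixel is labeled as foreground.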
实现细节: 在把分割系统适配到关键点任务时,我们进行了少量修改。 对于一个实例的每个关键点(共 K 个),训练目标是一个 one-hot 的 m × m 二值掩码,其中只有一个像素被标记为前景。
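A one-hot target of this kind pairs naturally with a softmax over all cells of the mask. The sketch below builds the target index for one keypoint and evaluates a softmax cross-entropy against it. The helper names, the floor-based cell assignment, and the use of a softmax cross-entropy loss are assumptions of this sketch, not a reproduction of the authors' code.

```python
import math
from typing import List, Optional, Tuple


def keypoint_target_index(kp_xy: Tuple[float, float],
                          box: Tuple[float, float, float, float],
                          m: int) -> Optional[int]:
    """Flat index (row * m + col) of the single foreground cell.

    Returns None when the keypoint falls outside the RoI box.
    """
    x, y = kp_xy
    x0, y0, x1, y1 = box
    if not (x0 <= x < x1 and y0 <= y < y1):
        return None
    col = int((x - x0) / (x1 - x0) * m)  # assumed floor-based assignment
    row = int((y - y0) / (y1 - y0) * m)
    return row * m + col


def softmax_cross_entropy(logits: List[float], target: int) -> float:
    """Cross-entropy of a softmax over all cells against a one-hot target."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


m = 4
target = keypoint_target_index((25.0, 15.0), (0.0, 0.0, 40.0, 40.0), m)

uniform_logits = [0.0] * (m * m)
uniform_loss = softmax_cross_entropy(uniform_logits, target)

confident_logits = list(uniform_logits)
confident_logits[target] = 5.0
confident_loss = softmax_cross_entropy(confident_logits, target)

print(target, round(uniform_loss, 4), round(confident_loss, 4))
```

Because the softmax normalizes over the whole mask, raising the score of one cell lowers the probability of every other cell, so a single peak is favored.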
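We adopt the ResNet-FPN variant, and the keypoint head architecture is similar to that in Figure 4 (right). The keypoint head consists of a stack of eight 3×3 512-d conv layers, followed by a deconv layer and 2× bilinear upscaling, producing an output resolution of 56×56.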
我们采用 ResNet-FPN 变体,并且关键点头部架构与 图4(右)类似。 关键点头部由八个 3×3、512 维的卷积层堆叠而成,其后接一个反卷积层和 2× 双线性上采样,输出分辨率为 56×56。
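As a shape check on the keypoint head, the sketch below traces the spatial resolution through the layer stack reported in the original paper: eight 3×3 convolutions, one deconvolution, then 2× bilinear upsampling. The 14×14 input resolution, the padding of 1 on each convolution, and the 2×2 stride-2 deconvolution kernel are assumptions chosen to match the mask head, not values stated in this section.

```python
def conv_out(size: int, kernel: int = 3, stride: int = 1, padding: int = 1) -> int:
    """Spatial size after a convolution (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size: int, kernel: int = 2, stride: int = 2, padding: int = 0) -> int:
    """Spatial size after a transposed convolution without output padding."""
    return (size - 1) * stride - 2 * padding + kernel


size = 14  # assumed RoI feature resolution entering the head
trace = [size]

for _ in range(8):          # eight 3x3 convs; padding 1 keeps the size unchanged
    size = conv_out(size)
trace.append(size)

size = deconv_out(size)     # stride-2 deconv doubles the resolution
trace.append(size)

size *= 2                   # 2x bilinear upsampling
trace.append(size)

print(" -> ".join(str(s) for s in trace))
```

Under these assumptions the output is 56×56, twice the resolution of a 28×28 mask, which fits the need for finer localization of a single point.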
Models are trained on all COCO trainval35k images that contain annotated keypoints. To reduce overfitting, as this training set is smaller, we train using image scales randomly sampled from [640, 800] pixels; inference is on a single scale of 800 pixels. We train for 90k iterations, starting from a learning rate of 0.02 and reducing it by a factor of 10 at 60k and 80k iterations. We use bounding-box NMS with a threshold of 0.5. Other details are identical to those in Section 3.1.
模型在所有包含关键点标注的 COCO trainval35k 图像上训练。 由于该训练集较小,为减少过拟合,我们使用从 [640, 800] 像素随机采样的图像尺度进行训练;推理时使用单一的 800 像素尺度。 我们训练 90k 次迭代,初始学习率为 0.02,并在 60k 和 80k 次迭代时将其降低 10 倍。 我们使用阈值为 0.5 的边界框 NMS。 其他细节与第 3.1 节相同。
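The schedule described above can be written down in a few lines. This is a minimal sketch of the step learning-rate schedule and the random scale sampling. The function names are illustrative; drawing integer scales uniformly and applying each drop exactly at the listed iteration are assumptions, since the text only gives the range and the step points.

```python
import random

BASE_LR = 0.02
DECAY_STEPS = (60_000, 80_000)  # iterations at which the rate is divided by 10
MAX_ITER = 90_000


def learning_rate(iteration: int) -> float:
    """Step schedule: divide the base rate by 10 for every decay step reached."""
    drops = sum(iteration >= step for step in DECAY_STEPS)
    return BASE_LR / (10 ** drops)


def sample_train_scale(rng: random.Random, low: int = 640, high: int = 800) -> int:
    """Draw a training image scale from [low, high] pixels (uniform, assumed)."""
    return rng.randint(low, high)


INFERENCE_SCALE = 800  # single scale used at test time, from the text

rng = random.Random(0)
for it in (0, 59_999, 60_000, 80_000, MAX_ITER - 1):
    print(it, learning_rate(it), sample_train_scale(rng))
```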
| Method | AP^kp | AP^kp_50 | AP^kp_75 | AP^kp_M | AP^kp_L |
|---|---|---|---|---|---|
| CMU-Pose+++ | 61.8 | 84.9 | 67.5 | 57.1 | 68.2 |
| G-RMI† | 62.4 | 84.0 | 68.5 | 59.1 | 68.1 |
| Mask R-CNN, keypoint-only | 62.7 | 87.0 | 68.4 | 57.4 | 71.1 |
| Mask R-CNN, keypoint & mask | 63.1 | 87.3 | 68.7 | 57.8 | 71.4 |
Main Results and Ablations: We evaluate the person keypoint AP (AP^kp) and experiment with a ResNet-50-FPN backbone.
主结果与消融: 我们评估人体关键点 AP(AP^kp),并使用 ResNet-50-FPN 主干进行实验。
| Method | AP^bb_person | AP^mask_person | AP^kp |
|---|---|---|---|
| Faster R-CNN | 52.5 | - | - |
| Mask R-CNN, mask-only | 53.6 | 45.8 | - |
| Mask R-CNN, keypoint-only | 50.7 | - | 64.2 |
| Mask R-CNN, keypoint & mask | 52.0 | 45.1 | 64.7 |
More importantly, we have a unified model that can simultaneously predict boxes, segments, and keypoints while running at 5 fps. Adding a segment branch (for the person category) improves AP^kp from 62.7 to 63.1 (Table 4) on test-dev. More ablations of multi-task learning on minival are in Table 5. Adding the mask branch to the box-only (i.e., Faster R-CNN) or keypoint-only versions consistently improves these tasks. However, adding the keypoint branch reduces the box/mask AP slightly, suggesting that while keypoint detection benefits from multitask training, it does not in turn help the other tasks. Nevertheless, learning all three tasks jointly enables a unified system to efficiently predict all outputs simultaneously (Figure 7).
更重要的是,我们拥有一个能够同时预测边界框、分割和关键点的统一模型,并且运行速度为 5 fps。 在 test-dev 上,增加一个(针对 person 类别的)分割分支将 AP^kp 从 62.7 提升到 63.1(表4)。 minival 上多任务学习的更多消融见 表5。 在仅有边界框(即 Faster R-CNN)或仅有关键点的版本中增加掩码分支,都会持续改进这些任务。 然而,增加关键点分支会略微降低边界框/掩码 AP,这表明关键点检测虽然受益于多任务训练,但反过来并不会帮助其他任务。 尽管如此,联合学习全部三个任务使一个统一系统能够同时高效预测所有输出(图7)。
| Layer | AP^kp | AP^kp_50 | AP^kp_75 | AP^kp_M | AP^kp_L |
|---|---|---|---|---|---|
| RoIPool | 59.8 | 86.2 | 66.7 | 55.1 | 67.4 |
| RoIAlign | 64.2 | 86.6 | 69.7 | 58.7 | 73.0 |
We also investigate the effect of RoIAlign on keypoint detection (Table 6). Though this ResNet-50-FPN backbone has finer strides (e.g., 4 pixels on the finest level), RoIAlign still shows significant improvement over RoIPool and increases AP^kp by 4.4 points (from 59.8 to 64.2).
我们还研究了 RoIAlign 对关键点检测的影响(表6)。 虽然这个 ResNet-50-FPN 主干具有更细的步幅(例如最细层级上为 4 像素),RoIAlign 仍然相比 RoIPool 显示出显著提升,并将 AP^kp 提高 4.4 个点(从 59.8 到 64.2)。
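One way to see why alignment still matters at a stride of 4 pixels is to compare sampling a feature map at a fractional coordinate with snapping that coordinate to an integer cell first. The sketch below is a simplified illustration of that difference only; it is not a full RoIAlign or RoIPool implementation. The function names and the toy feature map are hypothetical, and flooring is used as the quantization rule for simplicity.

```python
import math
from typing import List


def bilinear_sample(feat: List[List[float]], y: float, x: float) -> float:
    """Bilinearly interpolate feat at a fractional (y, x) location."""
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1 = min(y0 + 1, len(feat) - 1)
    x1 = min(x0 + 1, len(feat[0]) - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom


def quantized_sample(feat: List[List[float]], y: float, x: float) -> float:
    """Snap the location to an integer cell before reading the feature."""
    return feat[int(math.floor(y))][int(math.floor(x))]


STRIDE = 4  # finest-level stride mentioned in the text
# Toy feature map: every cell stores the image-space x coordinate of its column.
feat = [[float(col * STRIDE) for col in range(8)] for _ in range(8)]

image_x = 10.0
feat_x = image_x / STRIDE  # 2.5, a fractional feature-map coordinate

aligned = bilinear_sample(feat, 3.0, feat_x)
snapped = quantized_sample(feat, 3.0, feat_x)
print(f"aligned: {aligned}, snapped: {snapped}, error: {abs(image_x - snapped)} px")
```

In this toy case the snapped read is off by 2 pixels in image space while the interpolated read recovers the coordinate exactly. An error of that size is small for a box but large for a single keypoint.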
Given the effectiveness of Mask R-CNN for extracting object bounding boxes, masks, and keypoints, we expect it to be an effective framework for other instance-level tasks.
鉴于 Mask R-CNN 在提取目标边界框、掩码和关键点方面的有效性,我们预计它也会成为其他实例级任务的有效框架。