SSD: Single Shot MultiBox Detector
Abstract
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For 300×300 input, SSD achieves 74.3% mAP on VOC2007 test at 59 FPS on an Nvidia Titan X, and for 512×512 input, SSD achieves 76.9% mAP, outperforming a comparable state-of-the-art Faster R-CNN model. Code is available at: https://github.com/weiliu89/caffe/tree/ssd
Introduction
Current state-of-the-art object detection systems are variants of the following approach: hypothesize bounding boxes, resample pixels or features for each box, and apply a high-quality classifier. This pipeline has prevailed on detection benchmarks since the Selective Search work, through the current leading results on PASCAL VOC, COCO, and ILSVRC detection, all based on Faster R-CNN, albeit with deeper features such as those from residual networks. While accurate, these approaches have been too computationally intensive for embedded systems and, even with high-end hardware, too slow for real-time applications. Often detection speed for these approaches is measured in seconds per frame (SPF), and even the fastest high-accuracy detector, Faster R-CNN, operates at only 7 frames per second (FPS). There have been many attempts to build faster detectors by attacking each stage of the detection pipeline, but so far, significantly increased speed comes only at the cost of significantly decreased detection accuracy.
This paper presents the first deep network based object detector that does not resample pixels or features for bounding box hypotheses and is as accurate as approaches that do. This results in a significant improvement in speed for high-accuracy detection (59 FPS with mAP 74.3% on VOC2007 test, vs Faster R-CNN 7 FPS with mAP 73.2% or YOLO 45 FPS with mAP 63.4%). The fundamental improvement in speed comes from eliminating bounding box proposals and the subsequent pixel or feature resampling stage. We are not the first to do this, but by adding a series of improvements, we manage to increase the accuracy significantly over previous attempts. Our improvements include using a small convolutional filter to predict object categories and offsets in bounding box locations, using separate predictors (filters) for different aspect ratio detections, and applying these filters to multiple feature maps from the later stages of a network in order to perform detection at multiple scales. With these modifications—especially using multiple layers for prediction at different scales—we can achieve high accuracy using relatively low-resolution input, further increasing detection speed. While these contributions may seem small independently, we note that the resulting system improves accuracy on real-time detection for PASCAL VOC from 63.4% mAP for YOLO to 74.3% mAP for our SSD. This is a larger relative improvement in detection accuracy than that from the recent, very high-profile work on residual networks. Furthermore, significantly improving the speed of high-quality detection can broaden the range of settings where computer vision is useful.
We summarize our contributions as follows:
- We introduce SSD, a single-shot detector for multiple categories that is faster than the previous state-of-the-art for single shot detectors (YOLO), and significantly more accurate, in fact as accurate as slower techniques that perform explicit region proposals and pooling (including Faster R-CNN).
- The core of SSD is predicting category scores and box offsets for a fixed set of default bounding boxes using small convolutional filters applied to feature maps.
- To achieve high detection accuracy we produce predictions of different scales from feature maps of different scales, and explicitly separate predictions by aspect ratio.
- These design features lead to simple end-to-end training and high accuracy, even on low resolution input images, further improving the speed vs accuracy trade-off.
- Experiments include timing and accuracy analysis on models with varying input size evaluated on PASCAL VOC, COCO, and ILSVRC and are compared to a range of recent state-of-the-art approaches.
The Single Shot Detector (SSD)
This section describes our proposed SSD framework for detection (Sect. 2.1) and the associated training methodology (Sect. 2.2). Afterwards, Sect. 3 presents dataset-specific model details and experimental results.
2.1 Model
The SSD approach is based on a feed-forward convolutional network that produces a fixed-size collection of bounding boxes and scores for the presence of object class instances in those boxes, followed by a non-maximum suppression step to produce the final detections. The early network layers are based on a standard architecture used for high quality image classification (truncated before any classification layers), which we will call the base network. We then add auxiliary structure to the network to produce detections with the following key features:
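The final non-maximum suppression step is standard; as a concrete illustration (a minimal greedy sketch, not the paper's actual implementation), it can be written as:

```python
def iou(a, b):
    """Jaccard (IoU) overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.45):
    """Greedy non-maximum suppression: visit boxes in order of decreasing
    score and keep a box only if it does not overlap an already-kept box
    by iou_threshold or more. Returns the indices of the kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep
```

The 0.45 threshold here is an assumed default for illustration; in practice NMS is applied per class after confidence thresholding.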
Multi-scale feature maps for detection. We add convolutional feature layers to the end of the truncated base network. These layers decrease in size progressively and allow predictions of detections at multiple scales. The convolutional model for predicting detections is different for each feature layer (cf. Overfeat and YOLO, which operate on a single-scale feature map).
Convolutional predictors for detection. Each added feature layer (or optionally an existing feature layer from the base network) can produce a fixed set of detection predictions using a set of convolutional filters. These are indicated on top of the SSD network architecture in Fig. 2. For a feature layer of size m × n with p channels, the basic element for predicting parameters of a potential detection is a 3 × 3 × p small kernel that produces either a score for a category or a shape offset relative to the default box coordinates. At each of the m × n locations where the kernel is applied, it produces an output value. The bounding box offset outputs are measured relative to a default box position at each feature map location (cf. YOLO, whose architecture uses an intermediate fully connected layer instead of a convolutional filter for this step).
Default boxes and aspect ratios. We associate a set of default bounding boxes with each feature map cell, for multiple feature maps at the top of the network. The default boxes tile the feature map in a convolutional manner, so that the position of each box relative to its corresponding cell is fixed. At each feature map cell, we predict the offsets relative to the default box shapes in the cell, as well as the per-class scores that indicate the presence of a class instance in each of those boxes. Specifically, for each box out of k at a given location, we compute c class scores and the 4 offsets relative to the original default box shape. This results in a total of (c + 4)k filters applied around each location in the feature map, yielding (c + 4)kmn outputs for an m × n feature map. Our default boxes are similar to the anchor boxes used in Faster R-CNN; however, we apply them to several feature maps of different resolutions. Allowing different default box shapes in several feature maps lets us efficiently discretize the space of possible output box shapes.
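To make the bookkeeping concrete: with k default boxes per feature map cell, c class scores and 4 offsets per box, a feature map of spatial size m × n needs (c + 4)k convolutional filters and produces (c + 4)kmn outputs. A small sketch (the 38 × 38 map and the counts k = 4, c = 21 are illustrative values, not prescribed by the text):

```python
def predictor_outputs(m, n, k, c):
    """Output counts for one SSD prediction layer: k default boxes per cell,
    each box scored for c classes plus 4 box offsets."""
    filters = (c + 4) * k      # number of small 3x3 conv filters needed
    outputs = filters * m * n  # one value per filter at each of the m*n cells
    return filters, outputs

# Illustrative: a 38x38 feature map, 4 default boxes per cell,
# 21 classes (20 PASCAL VOC categories plus background).
filters, outputs = predictor_outputs(38, 38, 4, 21)
```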
2.2 Training
The key difference between training SSD and training a typical detector that uses region proposals is that ground truth information needs to be assigned to specific outputs in the fixed set of detector outputs. Some version of this is also required for training in YOLO and for the region proposal stage of Faster R-CNN and MultiBox. Once this assignment is determined, the loss function and backpropagation are applied end-to-end. Training also involves choosing the set of default boxes and scales for detection, as well as the hard negative mining and data augmentation strategies.
Matching Strategy. During training we need to determine which default boxes correspond to a ground truth detection and train the network accordingly. For each ground truth box we select from default boxes that vary over location, aspect ratio, and scale. We begin by matching each ground truth box to the default box with the best Jaccard overlap (as in MultiBox). Unlike MultiBox, we then also match default boxes to any ground truth with Jaccard overlap higher than a threshold (0.5). This simplifies the learning problem, allowing the network to predict high scores for multiple overlapping default boxes rather than requiring it to pick only the one with maximum overlap.
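The two-stage matching described above (best overlap per ground truth, then any default box above the 0.5 threshold) can be sketched as follows; this is an illustrative pure-Python version, not the training code:

```python
def jaccard(a, b):
    """Jaccard (IoU) overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match(defaults, truths, threshold=0.5):
    """Maps default-box index -> ground-truth index. Step 1 guarantees every
    ground truth gets its best-overlap default box; step 2 additionally
    matches any remaining default box whose best overlap exceeds threshold."""
    matches = {}
    for j, gt in enumerate(truths):                       # step 1: best match
        best = max(range(len(defaults)),
                   key=lambda i: jaccard(defaults[i], gt))
        matches[best] = j
    for i, d in enumerate(defaults):                      # step 2: threshold
        if i in matches:
            continue
        overlaps = [jaccard(d, gt) for gt in truths]
        j = max(range(len(truths)), key=lambda j: overlaps[j])
        if overlaps[j] > threshold:
            matches[i] = j
    return matches
```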
Training Objective. The SSD training objective is derived from the MultiBox objective but is extended to handle multiple object categories. Let x_ij^p ∈ {1, 0} be an indicator for matching the i-th default box to the j-th ground truth box of category p. Under the matching strategy above we can have Σ_i x_ij^p ≥ 1. The overall objective loss function is a weighted sum of the localization loss (loc) and the confidence loss (conf):

L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g))

where N is the number of matched default boxes (if N = 0, we set the loss to 0). The localization loss is a Smooth L1 loss between the predicted box (l) and the ground truth box (g) parameters; as in Faster R-CNN, we regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h). The confidence loss is the softmax loss over the class confidences (c), and the weight term α is set to 1 by cross validation.
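As a minimal sketch of this objective (assuming per-matched-box inputs and treating the confidence loss as a plain softmax cross-entropy; the selection of negatives is handled separately by hard negative mining):

```python
import math

def smooth_l1(x):
    """Smooth L1 penalty on one residual: quadratic near zero, linear beyond."""
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def multibox_loss(loc_preds, loc_targets, conf_logits, labels, alpha=1.0):
    """L(x, c, l, g) = (1/N) * (L_conf + alpha * L_loc) over N matched boxes.
    loc_preds/loc_targets: per-box 4-vectors of offsets; conf_logits: per-box
    class scores; labels: per-box true class index. If N = 0 the loss is 0."""
    n = len(labels)
    if n == 0:
        return 0.0
    # Localization: Smooth L1 over the 4 offset residuals of each matched box.
    l_loc = sum(smooth_l1(p - t)
                for pred, tgt in zip(loc_preds, loc_targets)
                for p, t in zip(pred, tgt))
    # Confidence: softmax cross-entropy against the true class of each box.
    l_conf = 0.0
    for logits, y in zip(conf_logits, labels):
        z = max(logits)  # subtract the max for numerical stability
        log_sum = z + math.log(sum(math.exp(v - z) for v in logits))
        l_conf += log_sum - logits[y]
    return (l_conf + alpha * l_loc) / n
```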
Choosing Scales and Aspect Ratios for Default Boxes. To handle different object scales, some methods suggest processing the image at different sizes and combining the results afterwards. However, by utilizing feature maps from several different layers in a single network for prediction we can mimic the same effect, while also sharing parameters across all object scales. Previous works have shown that using feature maps from the lower layers can improve semantic segmentation quality, because the lower layers capture more fine details of the input objects. Similarly, other work showed that adding global context pooled from a feature map can help smooth the segmentation results. Motivated by these methods, we use both the lower and upper feature maps for detection. Figure 1 shows two exemplar feature maps (8×8 and 4×4) used in the framework. In practice, we can use many more, with small computational overhead.
We design the tiling of default boxes so that specific feature maps learn to be responsive to particular scales of the objects. Suppose we want to use m feature maps for prediction. The scale of the default boxes for each feature map is computed as:

s_k = s_min + ((s_max − s_min) / (m − 1)) (k − 1),  k ∈ [1, m]

where s_min is 0.2 and s_max is 0.9, meaning the lowest layer has a scale of 0.2 and the highest layer has a scale of 0.9, with all layers in between regularly spaced. We impose different aspect ratios on the default boxes, denoted a_r ∈ {1, 2, 3, 1/2, 1/3}, and compute the width (w_k^a = s_k √a_r) and height (h_k^a = s_k / √a_r) of each default box. For the aspect ratio of 1, we also add a default box whose scale is s'_k = √(s_k s_{k+1}), resulting in 6 default boxes per feature map location. We set the center of each default box to ((i + 0.5)/|f_k|, (j + 0.5)/|f_k|), where |f_k| is the size of the k-th square feature map and i, j ∈ [0, |f_k|). In practice, one can also design a distribution of default boxes to best fit a specific dataset.
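A sketch of this default-box geometry, using s_min = 0.2, s_max = 0.9 and aspect ratios {1, 2, 3, 1/2, 1/3} as in the SSD paper, with m = 6 feature maps as an illustrative choice:

```python
import math

def default_box_scales(m, s_min=0.2, s_max=0.9):
    """s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1) for k = 1..m:
    m scales evenly spaced between s_min and s_max, one per feature map."""
    return [s_min + (s_max - s_min) / (m - 1) * (k - 1)
            for k in range(1, m + 1)]

def default_box_shapes(s_k, s_next, aspect_ratios=(1, 2, 3, 1/2, 1/3)):
    """(width, height) of each default box at scale s_k: w = s_k * sqrt(a_r),
    h = s_k / sqrt(a_r); for ratio 1 an extra box of scale sqrt(s_k * s_next)
    is added, giving 6 boxes per feature map location."""
    shapes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
    s_prime = math.sqrt(s_k * s_next)
    shapes.append((s_prime, s_prime))
    return shapes
```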
By combining predictions for all default boxes with different scales and aspect ratios from all locations of many feature maps, we have a diverse set of predictions, covering various input object sizes and shapes. For example, in Fig. 1, the dog is matched to a default box in the 4×4 feature map, but not to any default boxes in the 8×8 feature map. This is because those boxes have different scales and do not match the dog box, and therefore are considered as negatives during training.
Hard Negative Mining. After the matching step, most of the default boxes are negatives, especially when the number of possible default boxes is large. This introduces a significant imbalance between the positive and negative training examples. Instead of using all the negative examples, we sort them by the highest confidence loss for each default box and pick the top ones so that the ratio between negatives and positives is at most 3:1. We found that this leads to faster optimization and more stable training.
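The selection rule above (sort negatives by confidence loss, keep at most 3 negatives per positive) can be sketched as:

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Keep every positive default box, plus only the highest-confidence-loss
    negatives, capped at neg_pos_ratio negatives per positive. Returns the
    kept indices in their original order."""
    num_pos = sum(is_positive)
    negatives = sorted((i for i, pos in enumerate(is_positive) if not pos),
                       key=lambda i: conf_losses[i], reverse=True)
    keep_neg = set(negatives[:neg_pos_ratio * num_pos])
    return [i for i in range(len(is_positive))
            if is_positive[i] or i in keep_neg]
```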
Data Augmentation. To make the model more robust to various input object sizes and shapes, each training image is randomly sampled by one of the following options:
- Use the entire original input image.
- Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9.
- Randomly sample a patch.
The size of each sampled patch is [0.1, 1] of the original image size, and the aspect ratio is between 1/2 and 2. We keep the overlapped part of a ground truth box if its center is in the sampled patch. After the aforementioned sampling step, each sampled patch is resized to a fixed size and is horizontally flipped with probability 0.5, in addition to applying some photo-metric distortions similar to those described in prior work.
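A sketch of one augmentation draw follows. The option list mirrors the text; the patch-size range [0.1, 1], the aspect-ratio bounds [1/2, 2], the trial count, and the fall-back to the whole image are illustrative assumptions of this sketch, not a transcription of the training code:

```python
import random

def jaccard(a, b):
    """Jaccard (IoU) overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def sample_patch(image_wh, objects, rng=None, max_trials=50):
    """One training-time sampling draw: pick an option (whole image, a
    minimum-Jaccard constraint, or an unconstrained random patch), then
    rejection-sample a patch until the constraint holds."""
    rng = rng or random.Random(0)
    W, H = image_wh
    option = rng.choice([None, 0.1, 0.3, 0.5, 0.7, 0.9, -1])
    if option is None:                       # use the entire original image
        return (0.0, 0.0, float(W), float(H))
    for _ in range(max_trials):
        w = rng.uniform(0.1, 1.0) * W        # patch size in [0.1, 1] of image
        h = rng.uniform(0.1, 1.0) * H
        if not 0.5 <= w / h <= 2.0:          # aspect ratio between 1/2 and 2
            continue
        x, y = rng.uniform(0, W - w), rng.uniform(0, H - h)
        patch = (x, y, x + w, y + h)
        if option == -1 or any(jaccard(patch, o) >= option for o in objects):
            return patch
    return (0.0, 0.0, float(W), float(H))    # fallback: whole image
```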
Related Work
There are two established classes of methods for object detection in images, one based on sliding windows and the other based on region proposal classification. Before the advent of convolutional neural networks, the state of the art for those two approaches – Deformable Part Model (DPM) and Selective Search – had comparable performance. However, after the dramatic improvement brought on by R-CNN, which combines selective search region proposals and convolutional network based post-classification, region proposal object detection methods became prevalent.
The original R-CNN approach has been improved in a variety of ways. The first set of approaches improve the quality and speed of post-classification, since it requires the classification of thousands of image crops, which is expensive and time-consuming. SPPnet speeds up the original R-CNN approach significantly. It introduces a spatial pyramid pooling layer that is more robust to region size and scale and allows the classification layers to reuse features computed over feature maps generated at several image resolutions. Fast R-CNN extends SPPnet so that it can fine-tune all layers end-to-end by minimizing a loss for both confidences and bounding box regression, which was first introduced in MultiBox for learning objectness.
The second set of approaches improve the quality of proposal generation using deep neural networks. In the most recent works like MultiBox, the Selective Search region proposals, which are based on low-level image features, are replaced by proposals generated directly from a separate deep neural network. This further improves the detection accuracy but results in a somewhat complex setup, requiring the training of two neural networks with a dependency between them. Faster R-CNN replaces selective search proposals by ones learned from a region proposal network (RPN), and introduces a method to integrate the RPN with Fast R-CNN by alternating between fine-tuning shared convolutional layers and prediction layers for these two networks. This way region proposals are used to pool mid-level features and the final classification step is less expensive. Our SSD is very similar to the region proposal network (RPN) in Faster R-CNN in that we also use a fixed set of (default) boxes for prediction, similar to the anchor boxes in the RPN. But instead of using these to pool features and evaluate another classifier, we simultaneously produce a score for each object category in each box. Thus, our approach avoids the complication of merging RPN with Fast R-CNN and is easier to train, faster, and straightforward to integrate in other tasks.
Another set of methods, which are directly related to our approach, skip the proposal step altogether and predict bounding boxes and confidences for multiple categories directly. OverFeat, a deep version of the sliding window method, predicts a bounding box directly from each location of the topmost feature map after knowing the confidences of the underlying object categories. YOLO uses the whole topmost feature map to predict both confidences for multiple categories and bounding boxes (which are shared for these categories). Our SSD method falls in this category because we do not have the proposal step but use the default boxes. However, our approach is more flexible than the existing methods because we can use default boxes of different aspect ratios on each feature location from multiple feature maps at different scales. If we only use one default box per location from the topmost feature map, our SSD would have similar architecture to OverFeat; if we use the whole topmost feature map and add a fully connected layer for predictions instead of our convolutional predictors, and do not explicitly consider multiple aspect ratios, we can approximately reproduce YOLO.
Conclusions
This paper introduces SSD, a fast single-shot object detector for multiple categories. A key feature of our model is the use of multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network. This representation allows us to efficiently model the space of possible box shapes. We experimentally validate that, given appropriate training strategies, a larger number of carefully chosen default bounding boxes results in improved performance. We build SSD models that, by sampling location, scale, and aspect ratio, make at least an order of magnitude more box predictions than existing methods.
We demonstrate that given the same VGG-16 base architecture, SSD compares favorably to its state-of-the-art object detector counterparts in terms of both accuracy and speed. Our SSD512 model significantly outperforms the state-of-the-art Faster R-CNN in terms of accuracy on PASCAL VOC and COCO, while being 3× faster. Our real time SSD300 model runs at 59 FPS, which is faster than the current real time YOLO alternative, while producing markedly superior detection accuracy.
Apart from its standalone utility, we believe that our monolithic and relatively simple SSD model provides a useful building block for larger systems that employ an object detection component. A promising future direction is to explore its use as part of a system using recurrent neural networks to detect and track objects in video simultaneously.