ImageNet: A Large-Scale Hierarchical Image Database

Abstract

The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.

Introduction

The digital era has brought with it an enormous explosion of data. The latest estimates put the number of photos on Flickr at more than 3 billion, with a similar number of video clips on YouTube and an even larger number of images in the Google Image Search database. More sophisticated and robust models and algorithms can be proposed by exploiting these images, resulting in better applications for users to index, retrieve, organize and interact with these data. But exactly how such data can be utilized and organized is a problem yet to be solved. In this paper, we introduce a new image database called "ImageNet", a large-scale ontology of images. We believe that a large-scale ontology of images is a critical resource for developing advanced, large-scale content-based image search and image understanding algorithms, as well as for providing critical training and benchmarking data for such algorithms.

ImageNet uses the hierarchical structure of WordNet. Each meaningful concept in WordNet, possibly described by multiple words or word phrases, is called a "synonym set" or "synset". There are around 80,000 noun synsets in WordNet. In ImageNet, we aim to provide on average 500-1000 images to illustrate each synset. Images of each concept are quality-controlled and human-annotated as described in Sec. 3.2. ImageNet, therefore, will offer tens of millions of cleanly sorted images. In this paper, we report the current version of ImageNet, consisting of 12 "subtrees": mammal, bird, fish, reptile, amphibian, vehicle, furniture, musical instrument, geological formation, tool, flower, fruit. These subtrees contain 5247 synsets and 3.2 million images. Fig. 1 shows a snapshot of two branches of the mammal and vehicle subtrees.

The rest of the paper is organized as follows: We first show that ImageNet is a large-scale, accurate and diverse image database (Section 2). In Section 4, we present a few simple application examples by exploiting the current ImageNet, mostly the mammal and vehicle subtrees. Our goal is to show that ImageNet can serve as a useful resource for visual recognition applications such as object recognition, image classification and object localization. In addition, the construction of such a large-scale and high-quality database can no longer rely on traditional data collection methods. Sec. 3 describes how ImageNet is constructed by leveraging Amazon Mechanical Turk.

Properties of ImageNet

ImageNet is built upon the hierarchical structure provided by WordNet. Upon completion, ImageNet aims to contain on the order of 50 million cleanly labeled full resolution images (500-1000 per synset). At the time this paper is written, ImageNet consists of 12 subtrees. Most analysis will be based on the mammal and vehicle subtrees.

Scale ImageNet aims to provide the most comprehensive and diverse coverage of the image world. The current 12 subtrees consist of a total of 3.2 million cleanly annotated images spread over 5247 categories (Fig. 2). On average over 600 images are collected for each synset. Fig. 2 shows the distributions of the number of images per synset for the current ImageNet. To our knowledge this is already the largest clean image dataset available to the vision research community, in terms of the total number of images, number of images per category as well as the number of categories.

Hierarchy ImageNet organizes the different classes of images in a densely populated semantic hierarchy. The main asset of WordNet lies in its semantic structure, i.e. its ontology of concepts. Similarly to WordNet, synsets of images in ImageNet are interlinked by several types of relations, the "IS-A" relation being the most comprehensive and useful. Although one can map any dataset with category labels into a semantic hierarchy by using WordNet, the density of ImageNet is unmatched by others. For example, to our knowledge no existing vision dataset offers images of 147 dog categories. Fig. 3 compares the "cat" and "cattle" subtrees of ImageNet and the ESP dataset. We observe that ImageNet offers much denser and larger trees.

Accuracy We would like to offer a clean dataset at all levels of the WordNet hierarchy. Fig. 4 demonstrates the labeling precision on a total of 80 synsets randomly sampled at different tree depths. An average precision of 99.7% is achieved. Achieving a high precision for all depths of the ImageNet tree is challenging because the lower in the hierarchy a synset is, the harder it is to classify, e.g. Siamese cat versus Burmese cat.

Diversity ImageNet is constructed with the goal that objects in images should have variable appearances, positions, view points, poses as well as background clutter and occlusions. In an attempt to tackle the difficult problem of quantifying image diversity, we compute the average image of each synset and measure lossless JPG file size which reflects the amount of information in an image. Our idea is that a synset containing diverse images will result in a blurrier average image, the extreme being a gray image, whereas a synset with little diversity will result in a more structured, sharper average image. We therefore expect to see a smaller JPG file size of the average image of a more diverse synset. Fig. 5 compares the image diversity in four randomly sampled synsets in Caltech101 and the mammal subtree of ImageNet.

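The average-image heuristic above can be sketched in a few lines. In this illustrative version (names are ours), zlib compression of the average image stands in for the paper's lossless JPG file size, and inputs are assumed to be small grayscale arrays:

```python
import zlib

import numpy as np

def diversity_score(images):
    """Synset diversity heuristic, as an illustrative sketch.

    `images` is a list of equal-sized 2-D uint8 arrays (grayscale).
    We average them and return the losslessly compressed size of the
    average image: a diverse synset averages to a blurry, low-information
    image that compresses to fewer bytes, while a homogeneous synset
    keeps sharp structure and compresses to more bytes.
    """
    avg = np.mean(np.stack(images), axis=0).astype(np.uint8)
    # zlib stands in for the paper's lossless JPG file size.
    return len(zlib.compress(avg.tobytes(), level=9))
```

For example, fifty unrelated random 32×32 images score lower (blurrier, smaller average) than fifty copies of a single image, matching the intuition that a more diverse synset yields a less structured average image.
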
2.1. ImageNet and Related Datasets

We compare ImageNet with other datasets and summarize the differences in Table 1.

Small image datasets A number of well labeled small datasets (Caltech101/256, MSRC, PASCAL etc.) have served as training and evaluation benchmarks for most of today's computer vision algorithms. As computer vision research advances, larger and more challenging datasets are needed for the next generation of algorithms. The current ImageNet offers 20× the number of categories and 100× the total number of images compared to these datasets.

TinyImage TinyImage is a dataset of 80 million 32 × 32 low resolution images, collected from the Internet by sending all words in WordNet as queries to image search engines. Each synset in the TinyImage dataset contains an average of 1000 images, among which 10-25% are possibly clean images. Although the TinyImage dataset has had success with certain applications, the high level of noise and low resolution images make it less suitable for general purpose algorithm development, training, and evaluation. Compared to the TinyImage dataset, ImageNet contains high quality synsets (~ 99% precision) and full resolution images with an average size of around 400 × 350.

ESP dataset The ESP dataset is acquired through an online game. Two players independently propose labels to one image with the goal of matching as many words as possible in a certain time limit. Millions of images are labeled through this game, but its speeded nature also poses a major drawback. Rosch and Lloyd have demonstrated that humans tend to label visual objects at an easily accessible semantic level termed as "basic level" (e.g. bird), as opposed to more specific level ("sub-ordinate level", e.g. sparrow), or more general level ("super-ordinate level", e.g. vertebrate). Labels collected from the ESP game largely concentrate at the "basic level" of the semantic hierarchy as illustrated by the color bars in Fig. 6. ImageNet, however, demonstrates a much more balanced distribution of images across the semantic hierarchy. Another critical difference between ESP and ImageNet is sense disambiguation. When human players input the word "bank", it is unclear whether it means "a river bank" or a "financial institution". At this large scale, disambiguation becomes a non-trivial task. Without it, the accuracy and usefulness of the ESP data could be affected. ImageNet, on the other hand, does not have this problem by construction. See section 3.2 for more details. Lastly, most of the ESP dataset is not publicly available. Only 60K images and their labels can be accessed.

LabelMe and Lotus Hill datasets LabelMe and the Lotus Hill dataset provide 30k and 50k labeled and segmented images, respectively. These two datasets provide complementary resources for the vision community compared to ImageNet. Both only have around 200 categories, but the outlines and locations of objects are provided. ImageNet in its current form does not provide detailed object outlines (see potential extensions in Sec. 5.1), but the number of categories and the number of images per category already far exceeds these two datasets. In addition, images in these two datasets are largely uploaded or provided by users or researchers of the dataset, whereas ImageNet contains images crawled from the entire Internet. The Lotus Hill dataset is only available through purchase.

Constructing ImageNet

ImageNet is an ambitious project. Thus far, we have constructed 12 subtrees containing 3.2 million images. Our goal is to complete the construction of around 50 million images in the next two years. We describe here the method we use to construct ImageNet, shedding light on how properties of Sec. 2 can be ensured in this process.

3.1. Collecting Candidate Images

The first stage of the construction of ImageNet involves collecting candidate images for each synset. The average accuracy of image search results from the Internet is around 10%. ImageNet aims to eventually offer 500-1000 clean images per synset. We therefore collect a large set of candidate images. After intra-synset duplicate removal, each synset has over 10K images on average.

We collect candidate images from the Internet by querying several image search engines. For each synset, the queries are the set of WordNet synonyms. Search engines typically limit the number of images retrievable (in the order of a few hundred to a thousand). To obtain as many images as possible, we expand the query set by appending to each query the word from the parent synset, if that word appears in the gloss of the target synset. For example, when querying "whippet", whose WordNet gloss is "small slender dog of greyhound type developed in England", we also use "whippet dog" and "whippet greyhound".

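A minimal sketch of this expansion rule follows. The function name and plain-string inputs are hypothetical; the WordNet lookup of synonyms, parent words, and the gloss is abstracted away:

```python
def expand_queries(synonyms, parent_words, gloss):
    """Sketch of the query-expansion rule: for each synonym of the
    target synset, append any parent-synset word that also appears in
    the target synset's gloss, yielding additional search queries."""
    gloss_words = set(gloss.lower().split())
    queries = list(synonyms)
    for syn in synonyms:
        for parent in parent_words:
            if parent.lower() in gloss_words:
                queries.append(f"{syn} {parent}")
    return queries
```

With the whippet example above, `expand_queries(["whippet"], ["dog", "greyhound"], "small slender dog of greyhound type developed in England")` yields the three queries "whippet", "whippet dog", and "whippet greyhound".
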
To further enlarge and diversify the candidate pool, we translate the queries into other languages, including Chinese, Spanish, Dutch and Italian. We obtain accurate translations by WordNets in those languages.

3.2. Cleaning Candidate Images

To collect a highly accurate dataset, we rely on humans to verify each candidate image collected in the previous step for a given synset. This is achieved by using the service of Amazon Mechanical Turk (AMT), an online platform on which one can put up tasks for users to complete and to get paid. AMT has been used for labeling vision data. With a global user base, AMT is particularly suitable for large scale labeling.

In each of our labeling tasks, we present the users with a set of candidate images and the definition of the target synset (including a link to Wikipedia). We then ask the users to verify whether each image contains objects of the synset. We encourage users to select images regardless of occlusions, number of objects and clutter in the scene to ensure diversity.

While users are instructed to make accurate judgment, we need to set up a quality control system to ensure this accuracy. There are two issues to consider. First, human users make mistakes and not all users follow the instructions. Second, users do not always agree with each other, especially for more subtle or confusing synsets, typically at the deeper levels of the tree. Fig. 7(left) shows an example of how users' judgments differ for "Burmese cat".

The solution to these issues is to have multiple users independently label the same image. An image is considered positive only if it gets a convincing majority of the votes. We observe, however, that different categories require different levels of consensus among users. For example, while five users might be necessary for obtaining a good consensus on "Burmese cat" images, a much smaller number is needed for "cat" images. We develop a simple algorithm to dynamically determine the number of agreements needed for different categories of images. For each synset, we first randomly sample an initial subset of images. At least 10 users are asked to vote on each of these images. We then obtain a confidence score table, indicating the probability of an image being a good image given the user votes (Fig. 7(right) shows examples for "Burmese cat" and "cat"). For each of the remaining candidate images in this synset, we proceed with the AMT user labeling until a pre-determined confidence score threshold is reached. It is worth noting that the confidence table gives a natural measure of the "semantic difficulty" of the synset. For some synsets, users fail to reach a majority vote for any image, indicating that the synset cannot be easily illustrated by images. Fig. 4 shows that our algorithm successfully filters the candidate images, resulting in a high percentage of clean images per synset.

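The dynamic labeling procedure can be sketched as follows, assuming the per-synset confidence table has already been estimated from the 10-vote seed subset. The names `get_vote` and `confidence`, the 0.95 threshold, and the vote cap are our illustrative choices, not the paper's exact parameters:

```python
def label_image(get_vote, confidence, threshold=0.95, max_votes=10):
    """Simplified sketch of dynamic-consensus labeling.

    `get_vote()` returns one AMT worker's yes/no judgment on the image;
    `confidence[(yes, total)]` is the synset-specific probability that
    the image is good given the votes so far (estimated from the seed
    subset). We keep collecting votes until the image is confidently
    good or confidently bad."""
    yes = total = 0
    while total < max_votes:
        yes += int(get_vote())
        total += 1
        p = confidence.get((yes, total), 0.5)
        if p >= threshold:
            return True       # accept as a clean image
        if p <= 1 - threshold:
            return False      # reject
    return False              # no consensus within max_votes: discard
```

Because the confidence table is estimated per synset, an easy category like "cat" crosses the threshold after few votes, while a subtle one like "Burmese cat" automatically demands more agreement.
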
ImageNet Applications

In this section, we show three applications of ImageNet. The first set of experiments underlines the advantages of having clean, full resolution images. The second experiment exploits the tree structure of ImageNet, whereas the last experiment outlines a possible extension and gives more insights into the data.

4.1. Non-parametric Object Recognition

Given an image containing an unknown object, we would like to recognize its object class by querying similar images in ImageNet. Torralba et al. have demonstrated that, given a large number of images, simple nearest neighbor methods can achieve reasonable performance despite a high level of noise. We show that with a clean set of full resolution images, object recognition can be more accurate, especially by exploiting more feature level information.

We run four different object recognition experiments. In all experiments, we test on images from the 16 common categories between Caltech256 and the mammal subtree. We measure classification performance on each category in the form of an ROC curve. For each category, the negative set consists of all images from the other 15 categories. We now describe in detail our experiments and results (Fig. 8).

  1. NN-voting + noisy ImageNet First we replicate one of the experiments described in previous work, which we refer to as "NN-voting" hereafter. To imitate the TinyImage dataset (i.e. images collected from search engines without human cleaning), we use the original candidate images for each synset (Section 3.1) and down-sample them to 32×32. Given a query image, we retrieve the 100 nearest neighbor images by SSD pixel distance from the mammal subtree. Then we perform classification by aggregating votes (number of nearest neighbors) inside the tree of the target category.
  2. NN-voting + clean ImageNet Next we run the same NN-voting experiment described above on the clean ImageNet dataset. This result shows that having more accurate data improves classification performance.
  3. NBNN We also implement the Naive Bayesian Nearest Neighbor (NBNN) method proposed in previous work to underline the usefulness of full resolution images. NBNN employs a bag-of-features representation of images. SIFT descriptors are used in this experiment. Given a query image $Q$ with descriptors $\{d_i\}, i = 1, \dots, M$, for each object class $C$ we compute the query-class distance $D_C = \sum_{i=1}^{M} \|d_i - d_i^C\|^2$, where $d_i^C$ is the nearest neighbor of $d_i$ among all the image descriptors in class $C$. We order all classes by $D_C$ and define the classification score as the minimum rank of the target class and its subclasses. The result shows that NBNN gives substantially better performance, demonstrating the advantage of using a more sophisticated feature representation available through full resolution images.
  4. NBNN-100 Finally, we run the same NBNN experiment, but limit the number of images per category to 100. The result confirms previous findings: performance can be significantly improved by enlarging the dataset. It is worth noting that NBNN-100 outperforms NN-voting with access to the entire dataset, again demonstrating the benefit of having detailed feature level information by using full resolution images.
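
The NBNN query-class distance used in experiments 3 and 4 reduces to one nearest-neighbor search per query descriptor. A brute-force sketch, with real SIFT descriptors replaced by plain arrays and `classify_nbnn` as our hypothetical wrapper:

```python
import numpy as np

def nbnn_distance(query_descriptors, class_descriptors):
    """Query-to-class distance D_C = sum_i ||d_i - d_i^C||^2, where
    d_i^C is the nearest neighbor of d_i among all descriptors of
    class C. Brute-force version for illustration."""
    total = 0.0
    for d in query_descriptors:
        diffs = class_descriptors - d            # (N, dim) differences
        total += np.min(np.sum(diffs**2, axis=1))
    return total

def classify_nbnn(query_descriptors, classes):
    """`classes` maps class name -> (N, dim) descriptor array.
    Returns the class with the smallest query-class distance."""
    return min(classes,
               key=lambda c: nbnn_distance(query_descriptors, classes[c]))
```

In practice the per-descriptor nearest-neighbor search would use an approximate index rather than this exhaustive scan, but the distance being minimized is the same.
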

4.2. Tree Based Image Classification

Compared to other available datasets, ImageNet provides image data in a densely populated hierarchical structure. Many possible algorithms could be applied to exploit a hierarchical data structure. In this experiment, we choose to illustrate the usefulness of the ImageNet hierarchy by a simple object classification method which we call the "tree-max classifier". Imagine you have a classifier at each synset node of the tree and you want to decide whether an image contains an object of that synset or not. The idea is to not only consider the classification score at a node such as "dog", but also of its child synsets, such as "German shepherd", "English terrier", etc. The maximum of all the classifier responses in this subtree becomes the classification score of the query image.

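The tree-max classifier itself is a one-line recursion over the subtree. A sketch with hypothetical `scores` and `children` tables (the per-node classifier responses and the synset hierarchy, respectively):

```python
def tree_max_score(node, scores, children):
    """Tree-max classifier: the score for a synset is the maximum
    classifier response over the synset and all of its descendants.

    `scores[n]` is the per-node classifier response for the query image;
    `children[n]` lists the child synsets of n (empty if n is a leaf)."""
    best = scores[node]
    for child in children.get(node, []):
        best = max(best, tree_max_score(child, scores, children))
    return best
```

For example, a query on "dog" whose "German shepherd" child classifier fires strongly inherits that child's high score, even if the generic "dog" classifier is weak.
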
Fig. 9 illustrates the result of our experiment on the mammal subtree. Note that our algorithm is agnostic to any method used to learn image classifiers for each synset. In this case, we use an AdaBoost-based classifier proposed in previous work. For each synset, we randomly sample 90% of the images to form the positive training image set, leaving the remaining 10% as testing images. We form a common negative image set by aggregating 10 images randomly sampled from each synset. When training an image classifier for a particular synset, we use the positive set from this synset as well as the common negative image set excluding the images drawn from this synset, and its child and parent synsets.

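The 90/10 split and common negative pool described above can be sketched as follows (the function name and `seed` parameter are ours; the per-synset exclusion of related negatives at training time is omitted for brevity):

```python
import random

def make_splits(synset_images, neg_per_synset=10, train_frac=0.9, seed=0):
    """Per synset: 90% of images for training, 10% for testing, plus a
    common negative pool built from 10 random images of every synset."""
    rng = random.Random(seed)
    splits, negatives = {}, []
    for synset, imgs in synset_images.items():
        imgs = imgs[:]                       # leave caller's list intact
        rng.shuffle(imgs)
        cut = int(len(imgs) * train_frac)
        splits[synset] = (imgs[:cut], imgs[cut:])   # (train, test)
        negatives.extend(rng.sample(imgs, min(neg_per_synset, len(imgs))))
    return splits, negatives
```
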
We evaluate the classification results by AUC (the area under ROC curve). Fig. 9 shows the results of AUC for synsets at different levels of the hierarchy, compared with an independent classifier that does not exploit the tree structure of ImageNet. The plot indicates that images are easier to classify at the bottom of the tree (e.g. star-nosed mole, minivan, polar bear) as opposed to the top of the tree (e.g. vehicles, mammal, artifact, etc.). This is most likely due to stronger visual coherence near the leaf nodes of the tree.

At nearly all levels, the performance of the tree-max classifier is consistently higher than the independent classifier. This result shows that a simple way of exploiting the ImageNet hierarchy can already provide substantial improvement for the image classification task without additional training or model learning.

4.3. Automatic Object Localization

ImageNet can be extended to provide additional information about each image. One such piece of information is the spatial extent of the objects in each image. Two application areas come to mind. First, training a robust object detection algorithm often requires localized objects in different poses and under different viewpoints. Second, having localized objects in cluttered scenes enables users to use ImageNet as a benchmark dataset for object localization algorithms. In this section we present results of localization on 22 categories from different depths of the WordNet hierarchy. The results also shed light on the diversity of images in each of these categories.

We use the non-parametric graphical model described in previous work to learn the visual representation of objects against a global background class. In this model, every input image is represented as a "bag of words". The output is a probability for each image patch to belong to the topics $z_i$ of a given category. In order to annotate images with a bounding box, we calculate the likelihood of each image patch given a category $c$: $p(x|c) = \sum_i p(x|z_i, c)\, p(z_i|c)$. Finally, one bounding box is put around the region which accumulates the highest likelihood.

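The localization step, i.e. marginalizing patch likelihoods over topics and placing a box on the highest-likelihood region, can be sketched as follows. The fixed box size and exhaustive search are our simplifications; the topic probabilities are assumed to come from the learned model:

```python
import numpy as np

def patch_likelihoods(p_x_given_z, p_z):
    """p(x|c) for every patch, marginalizing over the topics z_i of
    class c: p(x|c) = sum_i p(x|z_i, c) * p(z_i|c).

    p_x_given_z: (num_patches, num_topics) array of p(x | z_i, c);
    p_z:         (num_topics,) array of p(z_i | c)."""
    return p_x_given_z @ p_z

def best_box(likelihood_map, box_h, box_w):
    """Place one fixed-size box around the region accumulating the
    highest total likelihood (exhaustive search for illustration)."""
    H, W = likelihood_map.shape
    best, best_pos = -np.inf, (0, 0)
    for r in range(H - box_h + 1):
        for c in range(W - box_w + 1):
            s = likelihood_map[r:r + box_h, c:c + box_w].sum()
            if s > best:
                best, best_pos = s, (r, c)
    return best_pos
```
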
We annotated 100 images in 22 different categories of the mammal and vehicle subtrees with bounding boxes around the objects of that category. Fig. 10 shows precision and recall values. Note that precision is low due to the extreme variability of the objects and because small objects have hardly any salient regions.

Fig. 11 shows sampled bounding boxes on different classes. The colored region is the detected bounding box, while the original image is in light gray.

In order to illustrate the diversity of ImageNet inside each category, Fig. 12 shows results on running k-means clustering on the detected bounding boxes after converting them to grayscale and rescaling them to 32×32. All average images, including those for the entire cluster, are created with approximately 40 images. While it is hard to identify the object in the average image of all bounding boxes (shown in the center) due to the diversity of ImageNet, the average images of the single clusters consistently discover viewpoints or common poses.

Discussion and Future Work

Our future work has two goals:

5.1. Completing ImageNet

The current ImageNet constitutes 10% of the WordNet synsets. To further speed up the construction process, we will continue to explore more effective methods to evaluate the AMT user labels and optimize the number of repetitions needed to accurately verify each image. At the completion of ImageNet, we aim to (i) have roughly 50 million clean, diverse and full resolution images spread over approximately 50K synsets; (ii) deliver ImageNet to research communities by making it publicly available and readily accessible online. We plan to use cloud storage to enable efficient distribution of ImageNet data; (iii) extend ImageNet to include more information such as localization as described in Sec. 4.3, segmentation, cross-synset referencing of images, as well as expert annotation for difficult synsets and (iv) foster an ImageNet community and develop an online platform where everyone can contribute to and benefit from ImageNet resources.

5.2. Exploiting ImageNet

We hope ImageNet will become a central resource for a broad range of vision related research. For the computer vision community in particular, we envision the following possible applications.

A training resource. Most of today's object recognition algorithms have focused on a small number of common objects, such as pedestrians, cars and faces. This is mainly due to the high availability of images for these categories. Fig. 6 has shown that even the largest datasets today have a strong bias in their coverage of different types of objects. ImageNet, on the other hand, contains a large number of images for nearly all object classes including rare ones. One interesting research direction could be to transfer knowledge of common objects to learn rare object models.

A benchmark dataset. The current benchmark datasets in computer vision such as Caltech101/256 and PASCAL have played a critical role in advancing object recognition and scene classification research. We believe that the high quality, diversity and large scale of ImageNet will enable it to become a new and challenging benchmark dataset for future research.

Introducing new semantic relations for visual modeling. Because ImageNet is uniquely linked to all concrete nouns of WordNet whose synsets are richly interconnected, one could also exploit different semantic relations for instance to learn part models. To move towards total scene understanding, it is also helpful to consider different depths of the semantic hierarchy.

Human vision research. ImageNet's rich structure and dense coverage of the image world may help advance the understanding of the human visual system. For example, the question of whether a concept can be illustrated by images is much more complex than one would expect at first. Aligning the cognitive hierarchy with the "visual" hierarchy also remains an unexplored area.
