Learning Transferable Visual Models From Natural Language Supervision
Abstract
State-of-the-art (SOTA) computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability, since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative that leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones), enabling zero-shot transfer of the model to downstream tasks. We study performance on over 30 different computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset-specific training. For instance, we match the accuracy of the original ResNet50 on ImageNet zero-shot, without using any of the 1.28 million training examples it was trained on.
Introduction and Motivating Work
Pre-training methods that learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). The development of "text-to-text" as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive with bespoke models across many tasks while requiring little to no dataset-specific training data.
These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.
Joulin et al. (2016) demonstrated that CNNs trained to predict words in image captions can learn representations competitive with ImageNet training. Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.
However, the aforementioned models still under-perform current SOTA computer vision models such as Big Transfer (Kolesnikov et al., 2019) and the weakly supervised ResNeXt (Mahajan et al., 2018). A crucial difference is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. We close this gap and study the behaviors of image models trained from natural language supervision at large scale. We demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP (Contrastive Language-Image Pre-training), is an efficient and scalable method of learning from natural language supervision. We find that CLIP learns to perform a wide set of tasks during pre-training, including OCR, geo-localization, and action recognition, and that it outperforms the best publicly available ImageNet model while being more computationally efficient. We also find that zero-shot CLIP models are much more robust than supervised ImageNet models of equivalent accuracy.
Approach
At the core of our work is the idea of learning perception from the supervision contained in natural language paired with images. In the following subsections we detail our specific approach.
2.1. Creating a Sufficiently Large Dataset
Existing work has mainly used three datasets: MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high-quality crowd-labeled datasets, they are small by modern standards, with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716_113957.JPG as "titles" or contain "descriptions" of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrank by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.
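The kind of metadata filtering described above can be illustrated with a small heuristic. This is only a sketch: the filename pattern, the word-count threshold, and the names `AUTO_FILENAME`, `looks_like_natural_language`, and `filter_records` are assumptions made for illustration, and it does not attempt the English-language detection that the actual filtering required.

```python
import re

# Assumed pattern for camera-style auto-generated names such as
# 20160716_113957.JPG or IMG_20160716; real metadata needs broader rules.
AUTO_FILENAME = re.compile(
    r"^(?:IMG|DSC|P)?[_-]?\d{6,}(?:[_-]\d+)*(?:\.\w{3,4})?$", re.IGNORECASE
)

def looks_like_natural_language(text: str) -> bool:
    """Heuristic: reject auto-generated filenames and text with fewer
    than two alphabetic words of three or more letters."""
    text = text.strip()
    if not text or AUTO_FILENAME.match(text):
        return False
    words = re.findall(r"[A-Za-z]{3,}", text)
    return len(words) >= 2

def filter_records(records):
    """Keep records whose title or description passes the heuristic."""
    return [
        r for r in records
        if looks_like_natural_language(r.get("title", ""))
        or looks_like_natural_language(r.get("description", ""))
    ]

# Hypothetical records for illustration only.
sample = [
    {"title": "20160716_113957.JPG", "description": ""},
    {"title": "Sunset over the harbor", "description": "f/2.8 1/250s ISO 100"},
    {"title": "IMG_20160716", "description": "f/2.8 1/250s ISO 100"},
]
kept = filter_records(sample)
```

Only the second sample record survives, because its title contains ordinary words; exposure-setting strings and filename-like titles are rejected.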
A major motivation for natural language supervision is the large quantity of data of this form available publicly on the internet. To test this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the internet. To cover as broad a set of visual concepts as possible, the construction process searches for (image, text) pairs whose text includes one of a set of 500,000 queries. We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a total word count similar to that of the WebText dataset used to train GPT-2. We refer to this dataset as WIT, for WebImageText.
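The query matching and per-query cap described above can be sketched as follows. The matching rule (case-insensitive substring), the choice to assign each pair to at most one query, and the function name `build_balanced_pairs` are assumptions; the text specifies only the query set size and the cap of 20,000 pairs per query.

```python
from collections import defaultdict

def build_balanced_pairs(pairs, queries, max_per_query=20_000):
    """Keep (image, text) pairs whose text contains a query, with at most
    max_per_query pairs per query. Queries are assumed to be lower-case."""
    counts = defaultdict(int)
    kept = []
    for image, text in pairs:
        lowered = text.lower()
        for query in queries:
            if query in lowered and counts[query] < max_per_query:
                counts[query] += 1
                kept.append((image, text))
                break  # assumption: each pair counts toward one query only
    return kept, dict(counts)

# Hypothetical toy data with a cap of 2 so the effect is visible.
toy_pairs = [
    ("a.jpg", "A photo of a dog"),
    ("b.jpg", "dog on grass"),
    ("c.jpg", "dog running"),
    ("d.jpg", "a cat"),
    ("e.jpg", "blue sky"),
]
kept_pairs, per_query = build_balanced_pairs(toy_pairs, ["dog", "cat"], max_per_query=2)
```

With the toy cap of 2, the third "dog" pair is dropped and the pair that matches no query is never collected, which is how the cap prevents very common queries from dominating the dataset.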
2.2. Selecting an Efficient Pre-Training Method
Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet50 image encoder, learns to recognize ImageNet classes three times slower than an approach similar to Joulin et al. (2016) that predicts a bag-of-words encoding of the same text.
Recent work in contrastive representation learning has found that contrastive objectives can outperform the equivalent predictive objective (Tian et al., 2019). Noting this finding, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image, and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.
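Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred.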
Since over-fitting is not a major concern, the details of training CLIP are simplified compared to Zhang et al. (2020). We train CLIP from scratch instead of initializing with pre-trained weights. We remove the non-linear projection between the representation and the contrastive embedding space. We use only a linear projection to map from each encoder's representation to the multi-modal embedding space. We also remove the text transformation function t_u, which samples a single sentence at uniform from the text, since many of the (image, text) pairs in CLIP's pre-training dataset are only a single sentence.
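This subsection replaces a predictive objective with a contrastive one that predicts which text is paired with which image. A minimal sketch of one such objective is given below in plain Python. The symmetric cross-entropy form, the use of cosine similarity, the fixed `temperature` of 0.07, and the name `clip_style_loss` are assumptions made for illustration and are not spelled out in the text of this section; the embeddings stand in for the outputs of the linear projections.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the true class index."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.
    Pair i is (image_embs[i], text_embs[i]); every other combination
    in the batch is treated as a negative."""
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # image -> text: each row should pick out its own column
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # text -> image: each column should pick out its own row
    loss_texts = sum(
        cross_entropy([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when each image embedding points in the same direction as its own text embedding, and large when the pairing is swapped, which is the behavior a contrastive objective is meant to reward.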
2.3. Choosing and Scaling a Model
We consider two different architectures for the image encoder. For the first, we use ResNet50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNetD improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of "transformer-style" multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation, with only the minor modifications of adding an additional layer normalization to the combined patch and position embeddings before the transformer and using a slightly different initialization scheme.
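The attention pooling idea, in which the query comes from the global average of the spatial features, can be sketched in a heavily simplified form. This version uses a single head and no learned query, key, or value projections, so it is an illustration of the mechanism under those assumptions rather than the layer used in the model; `attention_pool` is a hypothetical name.

```python
import math

def attention_pool(features):
    """Pool a list of spatial feature vectors into one vector.
    Query: the global average of the features.
    Keys and values: the features themselves (projections omitted)."""
    n, d = len(features), len(features[0])
    query = [sum(f[k] for f in features) / n for k in range(d)]
    # scaled dot-product scores between the query and every position
    scores = [sum(q * x for q, x in zip(query, f)) / math.sqrt(d) for f in features]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # weighted sum of the values
    pooled = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(d)]
    return pooled, weights

uniform_pooled, uniform_weights = attention_pool([[1.0, 2.0]] * 3)
skewed_pooled, skewed_weights = attention_pool([[4.0, 0.0], [0.0, 0.0]])
```

Unlike plain average pooling, positions whose features agree with the average query receive more weight, so the pooled vector can emphasize some regions of the image over others.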
The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 12-layer, 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text (Sennrich et al., 2015). The text sequence is bracketed with [SOS] and [EOS] tokens. The activations of the highest layer of the transformer at the [EOS] token are used as the feature representation of the text, which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to add language modeling as an auxiliary objective, though exploration of this is left as future work.
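The input handling described above can be sketched without a real model. In this illustration, whitespace splitting stands in for BPE tokenization, the activations are placeholder values, and the helper names `bracket`, `causal_mask`, and `text_feature` are hypothetical.

```python
SOS, EOS = "[SOS]", "[EOS]"

def bracket(tokens):
    """Wrap a token sequence with start and end markers."""
    return [SOS] + list(tokens) + [EOS]

def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]

def text_feature(top_layer_activations, tokens):
    """Select the top-layer activation at the [EOS] position."""
    return top_layer_activations[tokens.index(EOS)]

# Whitespace tokens stand in for lower-cased BPE tokens (assumption).
tokens = bracket("A photo of a dog".lower().split())
mask = causal_mask(len(tokens))
# Placeholder activations: one 1-dimensional vector per position.
activations = [[float(i)] for i in range(len(tokens))]
feature = text_feature(activations, tokens)
```

Under a causal mask, the final position is the only one that can attend to every token in the sequence, which is consistent with reading the text representation from the [EOS] position.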
While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms allocating it to only one dimension. We use a simple variant which allocates additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP's performance to be less sensitive to the text encoder.
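As a worked example of allocating compute equally across width, depth, and resolution, the sketch below assumes the cost model from Tan & Le (2019), in which convolutional FLOPs grow linearly with depth and quadratically with width and resolution. Whether the variant used here follows exactly this split is not stated, so the function `equal_compute_scaling` and its outputs are illustrative only.

```python
def equal_compute_scaling(compute_multiplier):
    """Split a compute multiplier equally across depth, width, and resolution.
    Assumes FLOPs ~ depth * width**2 * resolution**2, so each dimension is
    given a compute share of compute_multiplier ** (1/3)."""
    share = compute_multiplier ** (1 / 3)
    return {
        "depth": share,              # FLOPs are linear in depth
        "width": share ** 0.5,       # FLOPs are quadratic in width
        "resolution": share ** 0.5,  # FLOPs are quadratic in resolution
    }

def implied_compute(scale):
    """Recombine the per-dimension multipliers into a total compute multiplier."""
    return scale["depth"] * scale["width"] ** 2 * scale["resolution"] ** 2

scales = {c: equal_compute_scaling(c) for c in (4, 16, 64)}
```

Under this assumption, a 64x compute budget corresponds to roughly 4x depth, 2x width, and 2x input resolution, since each dimension accounts for a 4x share of the compute.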
2.4. Pre-training
We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet50, a ResNet101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as "CLIP" use this model which we found to perform best. Full model hyperparameters and details are in supplementary material.
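To put the two reported training runs on a common scale, the wall-clock days can be multiplied by the GPU counts. This assumes every GPU was in use for the full duration and ignores the additional 336-pixel epoch, so the figures are rough upper-level estimates rather than reported numbers.

```python
# (days, number of V100 GPUs) as reported for the largest model of each family
runs = {
    "RN50x64": (18, 592),
    "ViT-L/14": (12, 256),
}

# GPU-days = wall-clock days x GPUs (assumes full utilization throughout)
gpu_days = {name: days * gpus for name, (days, gpus) in runs.items()}
ratio = gpu_days["RN50x64"] / gpu_days["ViT-L/14"]

for name, total in gpu_days.items():
    print(f"{name}: {total} V100 GPU-days")
```

By this estimate the largest ResNet consumed about 10,656 V100 GPU-days against 3,072 for the largest Vision Transformer, roughly 3.5 times as much.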
2.5. Using CLIP
CLIP is pre-trained to predict whether an image and a text snippet are paired together in WIT. To apply CLIP to downstream tasks, we reuse this capability and study the zero-shot transfer performance of CLIP on standard computer vision datasets. Similar to Radford et al. (2019), we motivate this as a way of measuring the task learning capability of a system (as opposed to its representation learning capability). For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. We additionally experiment with providing CLIP with text prompts to help specify the task, as well as ensembling several such prompt templates to boost performance. However, since the vast majority of unsupervised and self-supervised computer vision research focuses on representation learning, we also investigate CLIP's representation learning capability using the common linear probe protocol.
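The zero-shot procedure described above, including prompt templates and ensembling, can be sketched as follows. The text encoder here is a toy word-count function over a four-word vocabulary, the image embedding is a hand-written vector, and the two templates are illustrative; all of these, along with the function names, are assumptions standing in for the trained encoders.

```python
import math

def normalize(v):
    """Scale to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def class_embedding(class_name, templates, embed_text):
    """Ensemble prompts by averaging their normalized text embeddings."""
    embs = [normalize(embed_text(t.format(class_name))) for t in templates]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return normalize(mean)

def zero_shot_classify(image_emb, class_names, templates, embed_text):
    """Return the class whose prompt ensemble is most similar to the image."""
    image_emb = normalize(image_emb)
    scores = {
        name: sum(a * b for a, b in zip(
            image_emb, class_embedding(name, templates, embed_text)))
        for name in class_names
    }
    return max(scores, key=scores.get), scores

# Toy stand-in for a trained text encoder: word counts over a tiny vocabulary.
VOCAB = ["dog", "cat", "car", "photo"]

def toy_embed_text(text):
    words = text.lower().replace(".", "").split()
    return [float(words.count(w)) for w in VOCAB]

TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}."]
toy_image_emb = [0.9, 0.1, 0.0, 0.5]  # hypothetical image embedding
prediction, scores = zero_shot_classify(
    toy_image_emb, ["dog", "cat", "car"], TEMPLATES, toy_embed_text)
```

No classifier weights are trained at any point: the class names themselves, wrapped in prompts, define the classifier, which is what allows the same model to be applied to a new dataset by changing only the list of names.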
Data Overlap Analysis
A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evaluations. We conducted a de-duplication analysis to investigate this, with full details in the supplementary material. Out of 35 datasets studied, 9 datasets have no detected overlap at all. There is a median overlap of 2.2% and an average overlap of 3.2%. Due to this small amount of overlap, overall accuracy is rarely shifted by more than 0.1%, with only 7 datasets above this threshold. Of these, only 2 are statistically significant after Bonferroni correction. The maximum detected improvement is only 0.6%, on Birdsnap. This echoes the findings of similar duplicate analysis in previous work on large-scale pre-training. Mahajan et al. (2018) and Kolesnikov et al. (2019) detected similar overlap rates for their models and also observed minimal changes in overall performance.
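The Bonferroni correction mentioned above divides the significance level by the number of tests performed. The sketch below shows the mechanics only: the p-values and dataset labels are made up for illustration, and the significance level of 0.05 is an assumption, not a value taken from the analysis.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag which tests remain significant after Bonferroni correction.
    Each p-value is compared against alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return {name: p < threshold for name, p in p_values.items()}

# Hypothetical p-values for three datasets (not results from the paper).
toy_p_values = {"dataset_a": 0.001, "dataset_b": 0.02, "dataset_c": 0.2}
flags = bonferroni_significant(toy_p_values)
```

In this toy case only `dataset_a` stays significant: `dataset_b` would pass an uncorrected 0.05 threshold but not the corrected one, which is the situation the correction is designed to catch when many datasets are tested at once.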
Broader Impacts
CLIP allows people to design their own classifiers and removes the need for task-specific training data. How these classes are designed heavily influences both model performance and model biases. For example, we find that when given a set of labels including Fairface race labels (Kärkkäinen & Joo, 2019) and a handful of egregious terms such as "criminal" and "animal", the model tends to classify images of people aged 0–20 in the egregious category at a rate of 32.3%. However, when we add the class "child" to the list of possible classes, this behavior drops to 8.7%. We also found discrepancies across gender and race for people categorized into the "crime" and "non-human" categories, highlighting the potential for disparate impact even when extreme care is taken for thoughtful class design.
Additionally, given that CLIP does not need task-specific training data, it can unlock certain niche tasks with greater ease. Some of these tasks may raise privacy or surveillance-related risks, which we explore by testing CLIP's performance on celebrity identification using the CelebA dataset (Liu et al., 2018). CLIP has a top-1 accuracy of 59.2% for "in the wild" celebrity image classification when choosing from 100 candidates and of 43.3% when choosing from 1000 candidates. Although it is noteworthy to achieve these results with task-agnostic pre-training, this performance is not competitive with widely available production-level models. We explore challenges that CLIP poses in our supplemental materials and hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models.
Limitations
The performance of zero-shot CLIP is often just competitive with the supervised baseline of a linear classifier on ResNet-50 features. This baseline is now well below the overall SOTA. Significant work is still needed to improve the task learning and transfer capabilities of CLIP. We estimate around a 1000x increase in compute is required for zero-shot CLIP to reach overall SOTA performance across our evaluation suite. This is infeasible to train with current hardware. Further research into improving upon the computational and data efficiency of CLIP will be necessary.
Despite our emphasis on zero-shot transfer, we repeatedly queried performance on validation sets to guide development. This is unrealistic for true zero-shot scenarios. Similar concerns have been raised in the field of semi-supervised learning (Oliver et al., 2018). Another potential issue is our selection of evaluation datasets. While we report results on the 12-dataset evaluation suite of Kornblith et al. (2019) as a standardized collection, our main analysis uses a somewhat haphazard collection of 27 datasets that is undeniably co-adapted with the capabilities of CLIP. A new benchmark of tasks designed to evaluate broad zero-shot transfer capabilities would help address this issue.
We emphasize that specifying image classifiers through natural language is a flexible interface but this has its own limitations. Many complex tasks can be difficult to specify just through text. Actual training examples are undeniably useful but CLIP does not optimize for few-shot performance directly. We fall back to fitting linear classifiers on top of CLIP's features. This results in a counter-intuitive drop in performance when transitioning from a zero-shot to a few-shot setting.
Related Work
The idea of learning to perform computer vision tasks from natural language supervision is by no means new. Rather, our main contribution is studying its behavior at large scale. Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in image captions. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. More recent work inspiring CLIP is described in the Introduction.
Learning from collections of internet images is commonly investigated in webly supervised learning. Fergus et al. (2005) demonstrated the ability to train competitive computer vision classifiers by treating image search engine results as supervision. Of this line of work, Learning Everything about Anything: Webly-Supervised Visual Concept Learning (Divvala et al., 2014) has an ambition and goal notably similar to CLIP's.
Developments in zero-shot computer vision (Larochelle et al., 2008; Lampert et al., 2009) were essential for CLIP. Socher et al. (2013a) demonstrated that connecting image and language representations enabled zero-shot transfer to unseen classes on CIFAR10 and Frome et al. (2013) improved and scaled this finding to ImageNet. The idea of generating a classifier from natural language dates back to at least Elhoseiny et al. (2013) and a form similar to CLIP's zero-shot classifier was explored in Lei Ba et al. (2015).
Natural language supervision has also been explored for tasks beyond image classification, including video understanding (Ramanathan et al., 2013; Miech et al., 2019) and reinforcement learning (Hermann et al., 2017). There has also been a burst of recent work on learning joint models of vision and language (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2019; Li et al., 2020b; Yu et al., 2020) for complex joint tasks beyond those studied here, including visual question answering.
Conclusion
We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain. We find that adopting this formula results in similar behaviors emerging in the field of computer vision and discuss the social implications of this line of research. In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pre-training. This task learning can then be leveraged via natural language prompting to enable zero-shot transfer to many existing datasets. At sufficient scale, the performance of this approach can be competitive with task-specific supervised models although there is still room for much improvement.