Deep Learning
Abstract
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Machine learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users' interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning.
Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input.
Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition and speech recognition, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules, analysing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding, particularly topic classification, sentiment analysis, question answering and language translation.
We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.
Supervised learning
The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as 'knobs' that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
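The score-vector-and-objective setup described above can be sketched in a few lines of Python. The categories, scores and the squared-error objective below are illustrative toy values chosen for this sketch, not numbers from the article:

```python
# A minimal sketch of the supervised-learning setup: the machine outputs a
# vector of scores, one per category, and an objective function measures the
# error between those scores and the desired pattern of scores.
def objective(scores, target):
    """Mean squared error between output scores and desired scores."""
    return sum((s - t) ** 2 for s, t in zip(scores, target)) / len(scores)

# Categories: house, car, person, pet. The desired category here is 'car'.
desired = [0.0, 1.0, 0.0, 0.0]
before_training = [0.4, 0.3, 0.2, 0.1]   # 'car' does not yet have the top score
after_training  = [0.1, 0.8, 0.05, 0.05] # training has reduced the error

assert objective(after_training, desired) < objective(before_training, desired)
```

Training amounts to adjusting the machine's weights so that `objective` decreases on average over the labelled examples.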
To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector.
The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.
In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.
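The SGD procedure described above can be sketched on a toy one-weight model. Everything here, the model `y = w * x`, the learning rate, the batch size, is an illustrative placeholder, not a prescription from the article; the point is the loop: sample a few examples, average their gradients, and step the weight opposite to that (noisy) gradient estimate:

```python
import random

# A minimal sketch of stochastic gradient descent on the toy model y = w * x
# with squared error. Each small batch gives a noisy estimate of the average
# gradient over all examples, hence 'stochastic'.
def sgd(data, w=0.0, lr=0.1, batch_size=2, steps=200, seed=0):
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # Average gradient of the error 0.5 * (w*x - y)^2 over the batch.
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad  # adjust the weight opposite to the gradient
    return w

# Data generated by the 'true' weight 3.0; SGD should recover it.
data = [(x, 3.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]
w = sgd(data)
assert abs(w - 3.0) < 1e-3
```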
Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
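The two-class linear classifier just described is a weighted sum compared against a threshold; the weights and threshold below are made-up placeholders for illustration:

```python
# A minimal sketch of a two-class linear classifier: compute a weighted sum
# of the feature-vector components and compare it against a threshold.
def linear_classify(features, weights, threshold):
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return weighted_sum > threshold

weights = [0.5, -0.2, 0.8]
assert linear_classify([1.0, 0.0, 1.0], weights, threshold=1.0) is True   # 1.3 > 1.0
assert linear_classify([0.0, 1.0, 1.0], weights, threshold=1.0) is False  # 0.6 < 1.0
```

Geometrically, the set of feature vectors where the weighted sum equals the threshold is a hyperplane, which is why such classifiers can only carve the input space into half-spaces, as the next paragraph discusses.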
Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other 'shallow' classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.
A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.
Backpropagation to train multilayer architectures
From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.
The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.
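The backwards application of the chain rule can be made concrete on a tiny two-module stack. The network below (one ReLU unit feeding one linear output unit with squared error) and all its numbers are illustrative, not from the article; the finite-difference check at the end confirms that the backpropagated gradient matches the numerical one:

```python
# A minimal sketch of backpropagation: gradients flow backwards from the
# output, module by module, by repeated application of the chain rule.
def forward(x, w1, w2):
    h = max(0.0, w1 * x)        # module 1: weighted input + ReLU
    y = w2 * h                  # module 2: linear output unit
    return h, y

def backward(x, w1, w2, h, y, target):
    dy = y - target             # gradient of 0.5*(y - target)^2 w.r.t. the output
    dw2 = dy * h                # gradient w.r.t. the weights of module 2
    dh = dy * w2                # propagate back to the input of module 2 ...
    dh = dh if h > 0 else 0.0   # ... and through the ReLU of module 1
    dw1 = dh * x                # gradient w.r.t. the weights of module 1
    return dw1, dw2

x, w1, w2, target = 2.0, 0.5, 1.5, 0.0
h, y = forward(x, w1, w2)
dw1, dw2 = backward(x, w1, w2, h, y, target)

# Finite-difference check that the backpropagated gradient is correct.
eps = 1e-6
_, y_plus = forward(x, w1 + eps, w2)
numeric = ((y_plus - target) ** 2 / 2 - (y - target) ** 2 / 2) / eps
assert abs(dw1 - numeric) < 1e-4
```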
Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0).
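A single such layer, weighted sums passed unit-by-unit through the half-wave rectifier, can be sketched directly; the two-unit weight matrix below is an illustrative placeholder:

```python
# The half-wave rectifier f(z) = max(z, 0), applied unit-by-unit to the
# weighted sums coming from the previous layer.
def relu(z):
    return max(z, 0.0)

def layer(inputs, weight_rows):
    """One feedforward layer: each unit takes a weighted sum of its inputs
    from the previous layer and passes it through the ReLU non-linearity."""
    return [relu(sum(w * x for w, x in zip(row, inputs)))
            for row in weight_rows]

assert relu(-3.0) == 0.0 and relu(2.5) == 2.5
assert layer([1.0, -1.0], [[1.0, 1.0], [1.0, -1.0]]) == [0.0, 2.0]
```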
In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.
In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder.
The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.
Interest in deep feedforward networks was revived around 2006 by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited.
The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.
There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.
Convolutional neural networks
ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.
The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
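The two layer types described above can be sketched in plain Python: a discrete 2D convolution with a single shared filter (one entry of a filter bank), followed by max pooling over non-overlapping local patches. The tiny image and the vertical-edge filter are illustrative values, not from the article:

```python
# A minimal sketch of a convolutional layer (shared weights slid over all
# locations) and a max-pooling layer (coarse-graining feature positions).
def conv2d(image, kernel):
    """Discrete 2D convolution (technically cross-correlation) of one filter
    over one channel; the same weights are applied at every location."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(fmap, size=2):
    """Maximum over local patches, shifted by `size` rows/columns, which
    shrinks the map and buys invariance to small shifts and distortions."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[-1, 1]]                 # responds where a 0-to-1 vertical edge occurs
fmap = conv2d(image, edge)       # each row of fmap is [0, 1, 0]
assert max_pool(fmap) == [[1], [1]]
```

In a real ConvNet each layer has many such filters (feature maps), the outputs pass through a non-linearity such as a ReLU, and all filter weights are learned by backpropagation.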
Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.
The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex. ConvNets have their roots in the neocognitron, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words.
There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands, and for face recognition.
Image understanding with deep convolutional networks
Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition. Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.
Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).
Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization has reduced training times to a few hours.
The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.
ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.
Distributed representations and language processing
Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training. Second, composing layers of representation in a deep net brings the potential for another exponential advantage.
The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple 'micro-rules'. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. 
Vector representations of words learned from text are now very widely used in natural language applications.
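A minimal sketch of the first-layer mechanics described above, using NumPy (the vocabulary and dimensions are illustrative, and the weights are random stand-ins for learned parameters): multiplying a one-of-N input vector by the first-layer weight matrix simply selects that word's learned word vector.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["Tuesday", "Wednesday", "Sweden", "Norway"]
V, d = len(vocab), 3            # vocabulary size, word-vector dimension

# First-layer weight matrix: one learned d-dimensional vector per word.
W = rng.standard_normal((V, d))

def one_hot(word):
    """Encode a word as a one-of-N vector: one component is 1, the rest are 0."""
    x = np.zeros(V)
    x[vocab.index(word)] = 1.0
    return x

# The one-hot input picks out a single row of W -- the word's distributed
# representation, whose components act as learned features of the word.
x = one_hot("Tuesday")
word_vector = x @ W
assert np.allclose(word_vector, W[vocab.index("Tuesday")])
```

Because the word vectors are dense and learned, words used in similar contexts (such as Tuesday and Wednesday) end up with similar rows of `W`, which is exactly the behaviour a one-of-N encoding by itself cannot express.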
The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast 'intuitive' inference that underpins effortless commonsense reasoning.
Before the introduction of neural language models, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams).
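The count-based approach can be sketched in a few lines (the toy corpus is illustrative):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count occurrences of every length-n symbol sequence (n-gram)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat sat on the mat the cat ran".split()
bigrams = ngram_counts(tokens, 2)
print(bigrams[("the", "cat")])   # 2
```

A count-based model estimates the probability of the next word from these frequencies, so it cannot generalize to word sequences it has never observed, whereas distributed representations can.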
Recurrent neural networks
When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a 'state vector' that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.
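A minimal sketch of the forward pass just described (NumPy; sizes and random weights are illustrative): the same weight matrices are reused at every time step, so unfolding the loop in time yields a deep feedforward network whose layers share parameters, which is what makes backpropagation applicable.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 4, 5                      # input and hidden sizes (illustrative)

# The SAME weights are applied at every time step.
W_xh = rng.standard_normal((d_in, d_h)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1

def rnn_forward(inputs):
    """Process a sequence one element at a time, carrying a state vector
    that implicitly summarizes the history of all past elements."""
    h = np.zeros(d_h)
    states = []
    for x in inputs:
        h = np.tanh(x @ W_xh + h @ W_hh)
        states.append(h)
    return states

seq = [rng.standard_normal(d_in) for _ in range(6)]
states = rnn_forward(seq)
```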
RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
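A toy linear illustration of why this happens (ignoring the nonlinearity, which is an assumption of this sketch): backpropagating through the recurrence multiplies the gradient by the transposed recurrent weight matrix once per time step, so its norm scales roughly like the largest singular value of that matrix raised to the number of steps.

```python
import numpy as np

def backprop_norms(W, T):
    """Norms of a gradient vector repeatedly multiplied by W.T,
    one factor per backpropagated time step."""
    g = np.ones(W.shape[0])
    norms = []
    for _ in range(T):
        g = W.T @ g
        norms.append(np.linalg.norm(g))
    return norms

I = np.eye(3)
shrink = backprop_norms(0.5 * I, 20)   # largest singular value 0.5 -> vanishes
grow = backprop_norms(1.5 * I, 20)     # largest singular value 1.5 -> explodes
print(shrink[-1], grow[-1])
```

After 20 steps the first gradient is negligible and the second is astronomically large, which is why plain RNNs struggle with long-range dependencies.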
Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English 'encoder' network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French 'decoder' network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state of the art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.
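The decoder loop described above can be sketched as follows (a toy greedy decoder; the vocabulary, sizes, and random weight matrices are illustrative stand-ins for a jointly trained network): the encoder's final state seeds the decoder's state, and each chosen word is fed back as input until a stop symbol is chosen.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<stop>", "le", "chat", "dort"]   # toy target vocabulary (illustrative)
d = 4
W_hh = rng.standard_normal((d, d)) * 0.1   # stand-ins for learned weights
W_emb = rng.standard_normal((len(vocab), d)) * 0.1
W_out = rng.standard_normal((d, len(vocab)))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode(thought_vector, max_len=10):
    """Greedy decoding: emit one word at a time, feeding each chosen
    word back in, until '<stop>' (a full stop) is chosen."""
    h = thought_vector
    x = np.zeros(d)                 # no previous word at the first step
    words = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + x)
        p = softmax(h @ W_out)      # probability distribution over next word
        w = int(np.argmax(p))       # pick the most probable word (could sample)
        if vocab[w] == "<stop>":
            break
        words.append(vocab[w])
        x = W_emb[w]                # feed the chosen word back as input
    return words

translation = decode(rng.standard_normal(d))
```

Sampling from the distribution instead of taking the argmax generates diverse translations according to the probability distribution conditioned on the source sentence.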
Instead of translating the meaning of a French sentence into an English sentence, one can learn to 'translate' the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).
RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.
To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
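The memory-cell idea can be sketched as a scalar gated accumulator (a simplification, not a full LSTM: the gate values here are supplied rather than learned): a self-connection of weight one copies the cell's state forward and accumulates the external signal, and a multiplicative gate decides when to clear it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_cell(signals, gate_logits):
    """Gated accumulator: c <- gate * c + external signal.
    With the gate near 1 the self-connection has weight one, so the cell
    copies its own real-valued state and accumulates inputs over long spans."""
    c, trace = 0.0, []
    for s, g in zip(signals, gate_logits):
        gate = sigmoid(g)           # the unit that decides when to clear
        c = gate * c + s
        trace.append(c)
    return trace

# Gate held open (logit +10): the cell sums its inputs like an accumulator.
open_trace = memory_cell([1.0] * 5, [10.0] * 5)
# Gate slammed shut at step 3 (logit -10): the memory content is cleared.
clear_trace = memory_cell([1.0] * 5, [10.0, 10.0, -10.0, 10.0, 10.0])
```

In a real LSTM the gate logits are themselves computed from the input and hidden state by learned weights, so the network learns when to remember and when to forget.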
LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.
Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a 'tape-like' memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.
Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught 'algorithms'. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as "where is Frodo now?".
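The priority-sort task can be stated concretely. This snippet only defines the input/target pair of the task (it is not the learned network itself): the neural Turing machine is trained to produce the output that an explicit sort would give.

```python
def sort_task_target(pairs):
    """Target output for the priority-sort task: symbols ordered by the
    real-valued priority that accompanies each one in the input sequence."""
    return [symbol for symbol, priority in sorted(pairs, key=lambda p: p[1])]

unsorted_input = [("c", 0.7), ("a", 0.1), ("d", 0.9), ("b", 0.4)]
print(sort_task_target(unsorted_input))   # ['a', 'b', 'c', 'd']
```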
The future of deep learning
Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.
Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.
Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time.
Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.