Paper Reading: "Deep Learning" by the Three Giants of Artificial Intelligence

The paper-reading column of "Swordsman Algorithm Jianghu" focuses on Chinese-English readings of classic and recent papers in deep learning, covering fields such as computer vision, natural language processing, speech recognition, and reinforcement learning. It helps beginners understand algorithm theory and lays a foundation for future work as an algorithm engineer or researcher. Reply "papers" in the backstage of the "Swordsman Algorithm Jianghu" official account to download all of the featured papers. This article is intended for academic exchange only; if there is any infringement, please let me know and it will be deleted.

The paper "Deep Learning" recommended today is a review article published on Nature by the three giants of artificial intelligence Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. The purpose is to commemorate the 60th anniversary of artificial intelligence. The article introduces the basic principles and core advantages of deep learning, introduces CNN, distributed feature representation, RNN and its different applications in detail, and looks forward to the future development of deep learning technology.

Table of contents

Title

Authors

Original paper

Abstract

Main text

(1) Introduction

(2) Supervised learning

(3) Using backpropagation to train the multi-layer structure

(4) Convolutional neural network

(5) Image understanding using deep convolutional networks

(6) Distributed representations and language processing

(7) Recurrent neural network

(8) Prospects for Deep Learning


Title

Deep Learning

 

Authors

Geoffrey Hinton: Vice President and Engineering Fellow at Google, and University Professor Emeritus at the University of Toronto. Outstanding contributions: published the landmark backpropagation paper in 1986; invented the Boltzmann machine in 1983; in 2012, improved convolutional neural networks and achieved outstanding results on ImageNet.

Yann LeCun: Vice President and Chief Artificial Intelligence Scientist at Facebook. Outstanding contributions: invented the convolutional neural network in the 1980s; in the late 1980s, was the first to apply convolutional neural networks to handwritten digit recognition.

Yoshua Bengio: Professor at the University of Montreal (and one of the authors of the "Deep Learning" textbook, the "flower book"). Outstanding contributions: in 1990, combined neural networks with probabilistic models; in 2000, published the paper "A Neural Probabilistic Language Model", which used high-dimensional word vectors to represent natural language.

The three authors received the 2018 Turing Award for their contributions to deep learning.

 

Original paper

Deep learning | Nature

Abstract

Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech. 

Deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have greatly advanced the state of the art in speech recognition, visual object recognition, object detection, and other fields such as drug discovery and genomics. Deep learning discovers complex structures in large data sets by using the backpropagation algorithm to indicate how the machine should change the internal parameters it uses to compute the representation of each layer from the representation of the previous layer. Deep convolutional networks have achieved breakthroughs in processing images, video, speech, and audio, while recurrent networks have shown their strength on sequential data such as text and speech.

Main text

(1) Introduction

Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning. 

Machine learning technology powers many aspects of modern society: from web search to content filtering on social networks to recommendations on e-commerce sites, and it is increasingly found in consumer products such as cameras and smartphones. Machine learning systems are used to recognize objects in images, convert speech to text, match news items, posts, or products with users' interests, and select relevant search results. Increasingly, these applications make use of a class of techniques called deep learning.

Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input. 

Traditional machine learning techniques are limited in their ability to process natural data in its raw form. For decades, building a pattern recognition or machine learning system required careful engineering and considerable domain expertise to design a feature extractor that transforms the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector, from which the learning subsystem, usually a classifier, can detect or classify patterns in the input.

Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations. An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure. 

Representation learning is a set of methods that allow a machine to be fed raw data and to automatically discover the representations needed for detection or classification. Deep learning methods are representation learning methods with multiple levels of representation, obtained by composing simple but non-linear modules, each of which transforms the representation at one level (starting from the raw input) into a representation at a higher and slightly more abstract level. By composing enough such transformations, very complex functions can be learned. In classification tasks, higher levels of representation amplify the aspects of the input that are important for discrimination and suppress irrelevant variations. For example, an image comes in the form of an array of pixel values, and the features learned in the first layer of representation typically indicate the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small changes in edge position. A third layer might assemble motifs into larger combinations corresponding to parts of familiar objects, and subsequent layers detect objects as combinations of these parts. The key to deep learning is that these feature layers are not designed by human engineers: they are learned from data through a general-purpose learning procedure.

Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition to beating records in image recognition 1–4 and speech recognition 5–7, it has beaten other machine-learning techniques at predicting the activity of potential drug molecules 8, analysing particle accelerator data 9,10, reconstructing brain circuits 11, and predicting the effects of mutations in non-coding DNA on gene expression and disease 12,13. Perhaps more surprisingly, deep learning has produced extremely promising results for various tasks in natural language understanding 14, particularly topic classification, sentiment analysis, question answering 15 and language translation 16,17.

Deep learning has made significant progress in solving problems that had resisted the best attempts of the artificial intelligence community for many years. It is very good at discovering complex structures in high-dimensional data, so it is applicable to many fields of science, business, and government. In addition to breaking records in image recognition and speech recognition, it has surpassed other machine learning techniques in predicting the activity of potential drug molecules, analyzing particle accelerator data, reconstructing brain circuits, and predicting the effects of mutations in non-coding DNA on gene expression and disease. Perhaps even more surprisingly, deep learning has produced very promising results on a variety of tasks in natural language understanding, notably topic classification, sentiment analysis, question answering, and language translation.

We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress. 

We believe that deep learning will achieve more success in the near future because it requires little manual engineering, so it can easily take advantage of the growing amount of computation and data available. New learning algorithms and architectures currently being developed for deep neural networks will only accelerate this progress.

(2) Supervised learning

The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine. 

The most common form of machine learning, whether deep or not, is supervised learning. Imagine we want to build a system that can classify images as containing, say, a house, a car, a person, or a pet. We first collect a large data set of images of houses, cars, people, and pets, each labeled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then reduces this error by modifying its internal adjustable parameters. These adjustable parameters, often called weights, are real numbers that can be thought of as "knobs" that define the input-output function of the machine. In a typical deep learning system, there may be hundreds of millions of these adjustable weights and hundreds of millions of labeled examples with which to train the machine.

To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector. 

To adjust the weight vector appropriately, the learning algorithm computes a gradient vector that, for each weight, indicates by how much the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the direction opposite to the gradient vector.

The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average. 

The objective function, averaged over all training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking the system closer to a minimum, where the output error is low on average.

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques 18. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.  

In practice, most practitioners use a procedure called stochastic gradient descent (SGD). It consists of taking the input vectors of a small set of training examples, computing the outputs and the errors, computing the average gradient over those examples, and adjusting the weights accordingly. This process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. Compared with far more elaborate optimization techniques, this simple procedure usually finds a good set of weights surprisingly quickly. After training, the performance of the system is measured on a different set of examples called a test set. This tests the generalization ability of the system, that is, its ability to produce sensible answers on new inputs that it has never seen during training.
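The loop below is a minimal Python/NumPy sketch of this mini-batch SGD procedure. The model, loss, and gradient computation are abstracted into a hypothetical grad_fn callback; all names and default values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sgd(params, grad_fn, data, labels, lr=0.01, batch_size=32, epochs=10):
    """Minimal mini-batch stochastic gradient descent sketch (illustrative)."""
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)                       # visit examples in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]               # a small set of examples
            grads = grad_fn(params, data[idx], labels[idx])     # noisy estimate of the average gradient
            params = [p - lr * g for p, g in zip(params, grads)]  # step opposite to the gradient
        # In practice, stop when the objective averaged over the training set stops decreasing.
    return params
```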

Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category. 

Many current machine learning applications use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the components of the feature vector. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.
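A minimal sketch of such a two-class linear classifier follows; the feature values, weights, and threshold are made-up numbers used only for illustration.

```python
import numpy as np

def linear_classify(features, weights, threshold=0.0):
    """Two-class linear classifier: weighted sum of the feature vector, then a threshold."""
    score = np.dot(weights, features)       # weighted sum of the feature components
    return 1 if score > threshold else 0    # class 1 if above the threshold, otherwise class 0

# Illustrative usage with hand-picked numbers.
print(linear_classify(np.array([0.5, 1.2, -0.3]), np.array([0.8, -0.1, 0.4])))
```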

Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane 19. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods 20, but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples 21. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

We have known since the 1960s that linear classifiers can only carve the input space into very simple regions, namely half-spaces separated by a hyperplane. However, problems such as image or speech recognition require the input-output function to be insensitive to irrelevant variations of the input, such as the position, orientation, or illumination of an object, or changes in the pitch or accent of speech, while being very sensitive to certain tiny variations, for example the difference between a white wolf and a wolf-like white dog breed called a Samoyed. At the pixel level, pictures of two Samoyeds in different poses and different environments may be very different from each other, while a Samoyed and a wolf in the same position against similar backgrounds may look very much alike. A linear classifier, or any other shallow classifier operating directly on pixels, could not possibly distinguish the latter two while putting the former two in the same category. This is why shallow classifiers need a good feature extractor that solves the selectivity-invariance dilemma: one that produces representations that are selective to the aspects of the image that matter for discrimination, but invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods, but generic features such as those produced by Gaussian kernels do not allow the learner to generalize well far from the training examples. The traditional option is to design a good feature extractor by hand, which requires a great deal of engineering skill and domain expertise. But all of this can be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

 

Figure 1: Multilayer Neural Networks and Backpropagation

Figure 1a: A multilayer neural network (represented by connected nodes) can distort the input space to make the classes of data (the samples lying on the red and blue lines) linearly separable. Note how the regular grid of the input space (left) is transformed by the hidden units (middle). This is an illustrative example with only two input nodes, two hidden nodes, and one output node, but the networks used for object recognition or natural language processing contain tens or hundreds of thousands of units.

Figure 1b: The chain rule of derivatives tells us how two small changes (the effect of a small change in x on y, and the effect of a small change in y on z) are composed. A small change Δx in x is first converted into a change Δy in y by multiplying by ∂y/∂x (this is the definition of a partial derivative). Similarly, Δy produces a change Δz in z. Substituting one equation into the other gives the chain rule: Δz is obtained from Δx by multiplying by the product of ∂y/∂x and ∂z/∂y. The same works when x, y, and z are vectors (the derivatives are then Jacobian matrices).

Figure 1c: The equations used for computing the forward pass in a neural network with two hidden layers and one output layer, each consisting of modules through which gradients can be backpropagated. At each layer, we first compute the total input z of each node, which is a weighted sum of the outputs of the previous layer, and then apply a non-linear function to z to obtain the node's output. For simplicity, we omit the bias terms. The non-linear functions used in neural networks include the rectified linear unit (ReLU) f(z) = max(z, 0), widely used in recent years, as well as the more conventional sigmoid-shaped functions, such as the hyperbolic tangent f(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z)) and the logistic function f(z) = 1 / (1 + exp(-z)).

Figure 1d: The equations used for computing the backward pass. At each hidden layer, we compute the partial derivative of the error with respect to the output of each node, which is a weighted sum of the partial derivatives of the error with respect to the total inputs of the nodes in the layer above. We then convert the partial derivative of the error with respect to the output into the derivative with respect to the input by multiplying by the gradient of f(z). At the output layer, the partial derivative of the error with respect to each node's output is obtained by differentiating the cost function. If the cost function for node l is 0.5(y_l - t_l)^2, then this derivative is y_l - t_l, where t_l is the target value. Once ∂E/∂z_k is known, the derivative of the error E with respect to the weight w_jk on the connection from node j in the layer below is y_j · ∂E/∂z_k.
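The following NumPy sketch mirrors these forward and backward passes for a small network with two ReLU hidden layers and a squared-error cost. For simplicity the output layer is taken to be linear; the layer sizes and weight matrices are assumptions made for illustration, not the paper's model.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward_backward(x, t, W1, W2, W3):
    """One forward and backward pass through a network with two ReLU hidden layers."""
    # Forward pass: total input z of each layer, then the nonlinearity.
    z1 = W1 @ x;  h1 = relu(z1)
    z2 = W2 @ h1; h2 = relu(z2)
    y = W3 @ h2                                   # linear output layer (for simplicity)
    # Backward pass: error derivatives with respect to each layer's total input.
    dE_dy = y - t                                 # derivative of 0.5 * (y - t)^2
    dE_dz2 = (W3.T @ dE_dy) * (z2 > 0)            # backpropagate through the ReLU
    dE_dz1 = (W2.T @ dE_dz2) * (z1 > 0)
    # Weight gradients: outer product of each layer's error signal with its input.
    dW3, dW2, dW1 = np.outer(dE_dy, h2), np.outer(dE_dz2, h1), np.outer(dE_dz1, x)
    return dW1, dW2, dW3

# Illustrative usage with random weights and a single training pair.
rng = np.random.default_rng(0)
x, t = rng.standard_normal(4), rng.standard_normal(2)
W1, W2, W3 = rng.standard_normal((5, 4)), rng.standard_normal((3, 5)), rng.standard_normal((2, 3))
grads = forward_backward(x, t, W1, W2, W3)
```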

 

Figure 2: The interior of a convolutional network

Figure 2: The outputs (not the filters) of each layer of a typical convolutional network applied to an image of a Samoyed dog. Each rectangular image is a feature map corresponding to the output of one of the learned features, detected at each of the image positions. Information flows bottom-up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class at the output.

A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects. 

A deep learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input-output mappings. Each module in the stack transforms its input so as to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say 5 to 20 of them, a system can implement extremely intricate functions of its input that are simultaneously sensitive to minute details (able to distinguish Samoyeds from white wolves) and insensitive to large irrelevant variations such as background, pose, lighting, and surrounding objects.

(3) Using backpropagation to train the multi-layer structure

From the earliest days of pattern recognition 22,23, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s 24–27.

From the earliest days of pattern recognition, the goal of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid-1980s. It turns out that multilayer architectures can be trained by simple stochastic gradient descent. As long as each module is a relatively smooth function of its inputs and of its internal weights, gradients can be computed using the backpropagation procedure. The feasibility and effectiveness of this approach were discovered independently by several different groups during the 1970s and 1980s.

The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module. 

The backpropagation procedure for computing the gradient of the objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module), as shown in Figure 1. The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external data is fed in). Once these gradients have been computed, the gradients with respect to the weights of each module can be computed directly.

Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z)= max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1+exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training28. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

Many deep learning applications use a feedforward neural network architecture (Figure 1), which learns to map a fixed-size input (such as an image) to a fixed-size output (such as a probability for each of several categories). To go from one layer to the next, a set of units computes a weighted sum of its inputs from the previous layer and passes the result through a non-linear function. The most popular non-linear function at present is the rectified linear unit (ReLU), a simple half-wave rectifier f(z) = max(z, 0). In past decades, neural networks used smoother non-linearities, such as tanh(z) or 1/(1+exp(-z)), but ReLU typically learns much faster in networks with many layers, allowing a deep supervised network to be trained without unsupervised pre-training. Units that do not belong to the input or output layers are conventionally called hidden units. Hidden layers can be viewed as distorting the input in a non-linear way, so that the categories become linearly separable by the last layer (see Figure 1).
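A compact sketch of this layer-to-layer computation, with the three non-linearities mentioned above; the layer function and weight shapes are illustrative assumptions.

```python
import numpy as np

relu     = lambda z: np.maximum(z, 0.0)          # f(z) = max(z, 0)
tanh     = np.tanh                               # f(z) = (exp(z) - exp(-z)) / (exp(z) + exp(-z))
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))    # f(z) = 1 / (1 + exp(-z))

def layer(x, W, nonlinearity=relu):
    """One layer: weighted sum of the inputs from the layer below, then a non-linearity."""
    return nonlinearity(W @ x)

# Illustrative usage: a hidden layer mapping 4 inputs to 3 hidden units.
hidden = layer(np.array([0.2, -1.0, 0.5, 3.0]), np.random.randn(3, 4))
```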

In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.  

In the late 1990s, neural networks and backpropagation were largely abandoned by the machine learning community and ignored by the computer vision and speech recognition communities. It was widely believed that learning useful, multi-stage feature extractors with little prior knowledge was infeasible. In particular, it was commonly believed that simple gradient descent would get trapped in poor local minima, that is, weight configurations for which no small change would further reduce the average error.

In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder 29,30. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

In practice, poor local minima are rarely a problem in large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are generally not a serious problem. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero; the surface curves up in most dimensions and down in only a few of the remaining ones. The analysis seems to show that saddle points with only a few downward-curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not matter much at which of these saddle points the algorithm gets stuck.

Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation 33–35. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited 36.

Around 2006, a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR) revived interest in deep feedforward networks. The researchers introduced unsupervised learning methods that could create layers of feature detectors without requiring labeled data. The objective in learning each layer of feature detectors is to be able to reconstruct or model the activity of the feature detectors (or the raw inputs) in the layer below. By "pre-training" several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network can be initialized to sensible values. A final output layer can then be added to the top of the network, and the whole deep system can be fine-tuned using standard backpropagation. This approach worked remarkably well for recognizing handwritten digits or detecting pedestrians, especially when the amount of labeled data was very limited.

The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program 37 and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary 38 and was quickly developed to give record-breaking results on a large vocabulary task 39. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups 6 and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting 40, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

The first major application of this pre-training approach was speech recognition, made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 to 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various speech fragments that might be represented by the frame at the center of the window. It broke the record on a standard speech recognition benchmark with a small vocabulary and was quickly developed to break the record on a large-vocabulary task. By 2012, versions of the 2009 deep network were being developed by many of the major speech groups and were already deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labeled examples is small, or in transfer settings where there are many examples for some "source" tasks but very few for some "target" tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet) 41,42. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

However, one particular kind of deep feedforward network was much easier to train and generalized much better than networks with full connectivity between adjacent layers: the convolutional neural network (ConvNet). It achieved many practical successes during the period when neural networks were out of favor, and it has recently been widely adopted by the computer vision community.

(4) Convolutional neural network

ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

Convolutional networks are designed to process data that come in the form of multiple arrays, such as a color image composed of three 2D arrays containing the pixel intensities in the three color channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. Convolutional networks exploit the properties of natural signals through four key ideas: local connections, shared weights, pooling, and the use of many layers.

The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

A typical convolutional network architecture (as shown in Figure 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. The units of a convolutional layer are organized into feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linear function such as a ReLU. All units in a feature map share the same filter bank, and different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easy to detect. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it can appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.
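The loop below is a minimal NumPy sketch of one convolutional feature map: a single shared filter slides over a 2D input, and each local weighted sum is passed through a ReLU. The filter and image are illustrative; like most deep learning libraries, it computes cross-correlation rather than flipping the kernel.

```python
import numpy as np

def conv2d_feature_map(image, kernel):
    """One feature map: shared-weight local sums over the image, followed by ReLU."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]     # local patch of the previous layer
            out[i, j] = np.sum(patch * kernel)    # local weighted sum with the shared filter
    return np.maximum(out, 0.0)                   # ReLU non-linearity

# Illustrative usage: a vertical-edge-like 3x3 filter on a random 8x8 image.
feature_map = conv2d_feature_map(np.random.rand(8, 8),
                                 np.array([[1., 0., -1.], [1., 0., -1.], [1., 0., -1.]]))
```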

Although the role of the convolutional layer is to detect local conjunctions of features from the previous layer, the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, reliably detecting the motif can be done by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighbouring pooling units take input from patches that are shifted by more than one row or column, thereby reducing the dimension of the representation and creating an invariance to small shifts and distortions. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully-connected layers. Backpropagating gradients through a ConvNet is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.

The role of the convolutional layer is to detect local conjunctions of features from the previous layer, while the role of the pooling layer is to merge semantically similar features into one. Because the relative positions of the features forming a motif can vary somewhat, the motif can be detected reliably by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map (or in a few feature maps). Neighboring pooling units take their input from patches that are shifted by more than one row or column, thereby reducing the dimensionality of the representation and creating invariance to small shifts and distortions. Two or three stages of convolution, non-linearity, and pooling are stacked, followed by more convolutional and fully connected layers. Backpropagating gradients through a convolutional network is as simple as through a regular deep network, allowing all the weights in all the filter banks to be trained.
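A matching sketch of max pooling, which keeps only the maximum of each local patch; the window size and stride of 2 are illustrative choices.

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    """Max pooling: take the maximum over each local patch of the feature map."""
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = feature_map[i * stride:i * stride + size,
                                j * stride:j * stride + size]
            out[i, j] = patch.max()   # invariant to small shifts within the patch
    return out

# Illustrative usage on any 2D feature map, such as the one from the previous sketch.
pooled = max_pool(np.random.rand(6, 6))
```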

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text from sounds to phones, phonemes, syllables, words and sentences. The pooling allows representations to vary very little when elements in the previous layer vary in position and appearance.

Deep neural networks exploit the property that many natural signals are compositional hierarchies, in which higher-level features are obtained by composing lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text, from sounds to phones, phonemes, syllables, words, and sentences. Pooling allows the representation to vary very little when the elements of the previous layer vary in position or appearance.

The convolutional and pooling layers in ConvNets are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience 43, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the visual cortex ventral pathway 44. When ConvNet models and monkeys are shown the same picture, the activations of high-level units in the ConvNet explains half of the variance of random sets of 160 neurons in the monkey’s inferotemporal cortex 45. ConvNets have their roots in the neocognitron 46, the architecture of which was somewhat similar, but did not have an end-to-end supervised-learning algorithm such as backpropagation. A primitive 1D ConvNet called a time-delay neural net was used for the recognition of phonemes and simple words 47,48.

The convolutional and pooling layers in convolutional networks are directly inspired by the classic notions of simple cells and complex cells in visual neuroscience, and the overall architecture is reminiscent of the LGN–V1–V2–V4–IT hierarchy in the ventral pathway of the visual cortex. When a convolutional network model and a monkey are shown the same picture, the activations of the network's high-level units explain half of the variance of a random set of 160 neurons in the monkey's inferotemporal cortex. Convolutional networks have their roots in the neocognitron, whose architecture was somewhat similar but which lacked an end-to-end supervised learning algorithm such as backpropagation. A primitive 1D convolutional network called a time-delay neural network was used for recognizing phonemes and simple words.

There have been numerous applications of convolutional networks going back to the early 1990s, starting with time-delay neural networks for speech recognition 47 and document reading 42. The document reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading over 10% of all the cheques in the United States. A number of ConvNet-based optical character recognition and handwriting recognition systems were later deployed by Microsoft 49. ConvNets were also experimented with in the early 1990s for object detection in natural images, including faces and hands 50,51, and for face recognition 52.

Convolutional networks have had numerous applications since the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document reading system used a convolutional network trained jointly with a probabilistic model that implemented language constraints. By the late 1990s, this system was reading over 10% of all checks in the United States. Later, Microsoft deployed a number of optical character recognition and handwriting recognition systems based on convolutional networks. In the early 1990s, convolutional networks were also experimented with for object detection in natural images, including faces and hands, and for face recognition.

(5) Image understanding using deep convolutional networks

Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition 53, the segmentation of biological images 54 particularly for connectomics 55, and the detection of faces, text, pedestrians and human bodies in natural images 36,50,51,56–58. A major recent practical success of ConvNets is face recognition 59.

Since the early 2000s, convolutional networks have been applied with great success to the detection, segmentation, and recognition of objects and regions in images. These are all tasks for which labeled data is relatively abundant, such as traffic sign recognition, segmentation of biological images (particularly for connectomics), and the detection of faces, text, pedestrians, and human bodies in natural images. A major recent practical success of convolutional networks is face recognition.

Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars 60,61. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding 14 and speech recognition 7.

Importantly, images can be labeled at the pixel level, which can be used in many technologies, including autonomous mobile robots, self-driving cars, etc. Companies such as Mobileye and Nvidia are using convolutional neural network-based approaches in their upcoming automotive vision systems. Other noteworthy applications involve natural language understanding and speech recognition.

Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches 1. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout 62, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks 4,58,59,63–65 and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

Despite these successes, convolutional networks were largely neglected by the mainstream computer vision and machine learning communities until the 2012 ImageNet competition. When deep convolutional networks were applied to a data set of roughly one million web images in 1,000 different categories, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, the use of ReLUs, a new regularization technique called dropout, and techniques for generating more training examples by deforming the existing ones. It has brought about a revolution in computer vision; convolutional networks are now the dominant approach for almost all detection and recognition tasks, and on some tasks their performance approaches that of humans. A recent stunning demonstration combines convolutional networks and recurrent network modules to generate image captions (Figure 3).

 

Figure 3: From image to text

Captions generated by a recurrent neural network (RNN) that takes as extra input the representation extracted from a test image by a deep convolutional neural network (CNN), with the RNN trained to "translate" high-level image representations into captions (top of the figure). Figure reproduced from reference 102. When the RNN is given the ability to focus its attention on different locations in the input image as it generates each word (in bold; second and third rows of the figure, where the brighter patches receive more attention), it exploits this to "translate" images into captions much more effectively.

Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours. 

Recent convolutional network architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such a large network could have taken weeks only two years ago, progress in hardware, software, and algorithm parallelization has reduced training times to a few hours.

The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

The performance of vision systems based on convolutional networks has prompted most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter, and Adobe, as well as a rapidly growing number of startups, to launch research and development projects and to deploy image understanding products and services based on convolutional networks.

ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays66,67. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

Convolutional networks are easily implemented efficiently in chips or field-programmable gate arrays (FPGAs). Companies such as NVIDIA, Mobileye, Intel, Qualcomm, and Samsung are developing convolutional network chips to enable real-time vision applications in smartphones, cameras, robots, and self-driving cars.

(6) Distributed representations and language processing

Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations 21. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure 40. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, 2n combinations are possible with n binary features) 68,69. Second, composing layers of representation in a deep net brings the potential for another exponential advantage 70 (exponential in the depth).

Deep learning theory shows that deep networks have two different exponential advantages over classical learning algorithms that do not use distributed representations. Both advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate compositional structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features). Second, composing layers of representation in a deep network brings the potential for another exponential advantage (exponential in the depth).

 

Figure 4: Visualization of learned word vectors

On the left is a visualization of word representations learned for language modeling, mapped non-linearly to two dimensions using the t-SNE algorithm. On the right is a 2D representation of phrases learned by an English-to-French encoder-decoder recurrent neural network. It can be observed that semantically similar words or phrases are mapped to nearby regions of the plot. The distributed word representations are obtained by jointly learning, with the backpropagation algorithm, each word's representation and a function that predicts a target quantity, such as the next word in a sequence (for language modeling) or a whole translated sentence (for machine translation).
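As a sketch of how such a 2D visualization could be produced with off-the-shelf tools, the snippet below runs scikit-learn's t-SNE on a small, made-up matrix of word vectors; the words and vectors are placeholders, not the embeddings from the paper.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder vocabulary and a stand-in (vocab_size, dim) matrix of learned word vectors.
words = ["tuesday", "wednesday", "sweden", "norway", "king", "queen"]
vectors = np.random.randn(len(words), 50)

# Non-linear mapping to 2 dimensions with t-SNE, then a labeled scatter plot.
coords = TSNE(n_components=2, perplexity=3, init="random", random_state=0).fit_transform(vectors)
plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```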

The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words 71. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated 27 in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable 71. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications 14,17,72–76.

The hidden layers of a multilayer neural network learn to re-represent the network's inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has the value 1 and all the others are 0. In the first layer, each word generates a different pattern of activations, or word vector (see Figure 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability of any word in the vocabulary appearing as the next word. The network learns word vectors that contain many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features are not explicitly present in the input; they are discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple "micro-rules". Learning word vectors also works very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are those for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that are not determined in advance by experts but are discovered automatically by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
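To make the mechanism concrete, here is a minimal sketch of such a next-word predictor (written in PyTorch as an assumed framework; all sizes and names are placeholders, not the paper's model): each context word enters as a one-of-N index, the first layer turns it into a learned word vector, and the remaining layers map the concatenated context vectors to a probability distribution over the vocabulary.

import torch
import torch.nn as nn

# Minimal neural language model sketch: one-of-N word ids -> learned word
# vectors -> softmax over the vocabulary for the next word.
vocab_size, embed_dim, context_len, hidden_dim = 10000, 100, 4, 256

class NextWordModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)       # first layer: word vectors
        self.hidden = nn.Linear(context_len * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)            # a score for every word

    def forward(self, context_ids):                             # (batch, context_len)
        vectors = self.embed(context_ids).flatten(1)            # concatenate context vectors
        h = torch.tanh(self.hidden(vectors))
        return torch.log_softmax(self.out(h), dim=-1)           # log P(next word)

model = NextWordModel()
logp = model(torch.randint(0, vocab_size, (8, context_len)))    # dummy batch of contexts
print(logp.shape)                                               # torch.Size([8, 10000])

After training on real text, the rows of the embedding layer are exactly the word vectors discussed above.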

The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

The question of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something whose only property is that it is either identical or non-identical to other symbol instances; it has no internal structure relevant to its use, and to reason with symbols they must be bound to the variables in carefully chosen rules of inference. By contrast, neural networks simply use large activity vectors, large weight matrices and scalar non-linearities to perform the kind of fast, "intuitive" inference that underpins effortless commonsense reasoning.

Before the introduction of neural language models 71, the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).

Before the introduction of neural language models, the standard approach to statistical language modelling did not exploit distributed representations: it was based on counting the frequencies of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the size of the vocabulary, so taking into account a context of more than a handful of words would require a very large training corpus. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related word sequences, whereas neural language models can, because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Figure 4).
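For contrast with the neural approach, the frequency-counting idea behind an N-gram model can be sketched in a few lines (a toy bigram estimator with no smoothing; the corpus is a placeholder, and this is only an illustration, not the exact estimator discussed in the paper):

from collections import Counter

# Toy bigram (N = 2) model: estimate P(next word | previous word) by counting.
corpus = "the cat sat on the mat the cat ate".split()   # placeholder corpus

bigrams = Counter(zip(corpus, corpus[1:]))               # counts of word pairs
unigrams = Counter(corpus[:-1])                          # counts of context words

def p_next(prev, word):
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_next("the", "cat"))   # 2/3 in this toy corpus

Because each word is an atomic unit here, the model assigns no probability mass to pairs it has never counted, which is exactly the generalization failure that distributed representations avoid.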

(7) Recurrent neural network

When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

When the backpropagation algorithm was first introduced, its most exciting application was training recurrent neural networks (RNNs). For tasks involving sequential inputs, such as speech and language, recurrent neural networks usually work better (see Figure 5). A recurrent neural network processes the input sequence one element at a time, maintaining in its hidden units a "state vector" that implicitly contains information about the history of all past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network, it becomes clear how the backpropagation algorithm can be used to train recurrent neural networks.
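A bare-bones sketch of this step-by-step computation (plain NumPy with made-up sizes; U, W, V follow the notation of Figure 5) shows how the same weights are reused at every step while the state vector carries the history:

import numpy as np

# Vanilla RNN forward pass: one input element at a time, shared weights U, W, V.
in_dim, state_dim, out_dim, T = 5, 8, 3, 10
U = np.random.randn(state_dim, in_dim) * 0.1     # input -> state
W = np.random.randn(state_dim, state_dim) * 0.1  # previous state -> state
V = np.random.randn(out_dim, state_dim) * 0.1    # state -> output
xs = [np.random.randn(in_dim) for _ in range(T)]

s = np.zeros(state_dim)                           # the state vector (the 'memory')
for x in xs:
    s = np.tanh(U @ x + W @ s)                    # update state from input and history
    o = V @ s                                     # output at this time step
print(o.shape)                                    # (3,)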

RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish 77,78.

Recurrent neural networks are very powerful dynamic systems, but training them has proved problematic, because the backpropagated gradients either grow or shrink at every time step, so over many time steps they typically explode or vanish.
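A one-line numerical illustration of why this happens (a toy calculation, not taken from the paper): backpropagating through T steps multiplies the gradient by roughly the same factor at each step, so it shrinks or blows up exponentially in T.

# Toy illustration of vanishing / exploding gradients over 50 time steps:
# the backpropagated gradient is repeatedly multiplied by a per-step factor.
for factor in (0.9, 1.1):
    grad = 1.0
    for _ in range(50):
        grad *= factor
    print(f"per-step factor {factor}: gradient after 50 steps ~ {grad:.2e}")
# ~5.15e-03 for 0.9 (vanishing), ~1.17e+02 for 1.1 (exploding)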

Thanks to advances in their architecture 79,80 and ways of training them 81,82, RNNs have been found to be very good at predicting the next character in the text 83 or the next word in a sequence 75, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen 17,72,76. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion 84,85.

Thanks to advances in their architecture and in the ways of training them, recurrent neural networks have been found to be very good at predicting the next character in a text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English "encoder" network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as an additional input to) a jointly trained French "decoder" network, which outputs a probability distribution over the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network, it will then output a probability distribution for the second word of the translation, and so on until a full stop is chosen. Overall, this process generates a sequence of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation quickly became competitive with the state of the art, and it raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions manipulated by inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies, each of which contributes plausibility to a conclusion.
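The encoder-decoder loop described above can be sketched as follows (PyTorch, greedy decoding; the GRU cells, sizes, and the STOP id are all assumptions for illustration, not the networks of refs 17,72,76): the encoder's final state seeds the decoder, which emits one word distribution at a time until a stop symbol is chosen.

import torch
import torch.nn as nn

# Sketch of a sequence-to-sequence translator with greedy decoding.
src_vocab, tgt_vocab, dim, STOP = 8000, 9000, 256, 0     # placeholder sizes / stop id

src_embed = nn.Embedding(src_vocab, dim)
encoder = nn.GRU(dim, dim, batch_first=True)              # reads the English sentence
tgt_embed = nn.Embedding(tgt_vocab, dim)
decoder = nn.GRU(dim, dim, batch_first=True)               # emits the French sentence
out = nn.Linear(dim, tgt_vocab)

src = torch.randint(1, src_vocab, (1, 7))                  # dummy English word ids
_, state = encoder(src_embed(src))                          # final state = 'thought vector'

word, translation = torch.tensor([[STOP]]), []              # start token reuses STOP id here
for _ in range(20):                                          # cap the output length
    _, state = decoder(tgt_embed(word), state)
    word = out(state[-1]).argmax(-1, keepdim=True)           # greedy choice of the next word
    if word.item() == STOP:
        break
    translation.append(word.item())
print(translation)

With untrained weights the output is of course meaningless; the point is only the control flow: encode once, then feed each chosen word back into the decoder.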

Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

Instead of translating the meaning of a French sentence into an English sentence, one can learn to "translate" the meaning of an image into an English sentence (Figure 3). The encoder here is a deep convolutional network that converts the pixels into an activity vector in its final hidden layer. The decoder is a recurrent neural network similar to those used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see the examples mentioned in ref. 86).

RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long 78.

Once unrolled in time (Figure 5), a recurrent neural network can be viewed as a very deep feed-forward network in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult for them to learn to store information for very long.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time 79. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.

To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) network, which uses special hidden units whose natural behaviour is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step with a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
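A compact sketch of a single LSTM-style memory-cell update (NumPy, with simplified gating and biases omitted; an illustrative assumption, not the exact formulation of ref. 79) shows the accumulator with its self-connection and the learned gate that can clear it:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One simplified LSTM cell step: the cell state c acts as a gated accumulator.
def lstm_cell_step(x, h, c, W):
    z = W @ np.concatenate([x, h])                  # all gates from input + previous hidden
    f, i, o, g = np.split(z, 4)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)    # self-connection (gated) + new input
    h = sigmoid(o) * np.tanh(c)                     # gated read-out of the memory
    return h, c

dim = 4
W = np.random.randn(4 * dim, 2 * dim) * 0.1         # placeholder parameters
h, c = np.zeros(dim), np.zeros(dim)
for x in np.random.randn(10, dim):                   # run over a toy sequence
    h, c = lstm_cell_step(x, h, c, W)
print(h)

When the forget gate sigmoid(f) stays near 1, the cell state is copied almost unchanged from step to step, which is what lets the unit remember inputs for a long time.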

LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step 87, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation 17,72,76.

Long short-term memory networks have subsequently proved to be more effective than conventional recurrent neural networks, especially when they have several layers for each time step, enabling an entire speech recognition system to go all the way from the acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used in the encoder and decoder networks that perform so well at machine translation.


Figure 5: A recurrent neural network and the unfolding in time of its forward computation

Figure 5: The artificial neurons (for example, the hidden units grouped under node s, with value s_t at time t) receive inputs from other neurons at previous time steps (represented by the black square on the left, which denotes a delay of one time step). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, where each o_t depends on all the previous x_t' (for t' ≤ t). The same parameters (the matrices U, V, W) are used at every time step. Many other architectures are possible, including variants in which the network generates a sequence of outputs (for example, words), each of which is used as an input at the next time step. The backpropagation algorithm (Figure 1) can be applied directly to the unrolled network on the right to compute the derivative of the total error (for example, the log-probability of producing the correct outputs) with respect to all the states and all the parameters.
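In symbols, the unrolled forward computation is the standard recurrence

$$ s_t = f(U x_t + W s_{t-1}), \qquad o_t = g(V s_t) $$

where f and g are element-wise non-linearities and biases are omitted; this is the usual form consistent with the caption's U, V, W, not an equation quoted from the paper.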

Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to 88, and memory networks, in which a regular network is augmented by a kind of associative memory 89. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions. 

Over the past year, several researchers have made different proposals for augmenting recurrent neural networks with a memory module. These include the Neural Turing Machine, in which the network is augmented with a "tape-like" memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented with a kind of associative memory. Memory networks have achieved excellent performance on standard question-answering benchmarks; the memory is used to remember the story about which the network is later asked to answer questions.

Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list 88. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference 90. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?” 89.

Beyond simple memorization, neural Turing machines and memory networks are also being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught "algorithms": among other things, when their input consists of an unsorted sequence in which each symbol is accompanied by a real value indicating its priority in the list, they can learn to output a sorted list of the symbols. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and, after reading a story, can answer questions that require complex inference. In one test example, the network was shown a 15-sentence version of The Lord of the Rings and correctly answered questions such as "Where is Frodo now?"

(8) Prospects for Deep Learning

Unsupervised learning 91–98 had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Unsupervised learning had a catalytic effect in reviving interest in deep learning, but it has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this review, we expect unsupervised learning to become far more important in the long run. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.

Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems 99 at classification tasks and produce impressive results in learning to play many different video games 100.

Human vision is an active process that sequentially samples the visual scene in an intelligent, task-specific way, using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and that combine convolutional neural networks with recurrent neural networks that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are still in their infancy, but they already outperform passive vision systems at classification tasks and have produced impressive results in learning to play many different video games.

Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time 76,86. 

Natural language understanding is another area in which deep learning is poised to have a large impact over the next few years. We expect systems that use recurrent neural networks to understand sentences or whole documents to become much better once they learn strategies for selectively attending to one part of the input at a time.
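One common way to realize such selective attention (a generic dot-product attention sketch under placeholder sizes, not a mechanism specified in this paper) is to weight the encoder states by a softmax over their similarity to the current decoder state:

import numpy as np

def attend(query, states):
    # states: (T, d) encoder outputs; query: (d,) current decoder state.
    scores = states @ query                      # similarity of each part to the query
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax: how much to attend to each part
    return weights @ states                      # weighted summary of the attended parts

states = np.random.randn(6, 8)                   # placeholder encoder states
context = attend(np.random.randn(8), states)
print(context.shape)                             # (8,)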

Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors 101.

Ultimately, major progress in artificial intelligence will come from systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions with operations on large vectors.

" Swordsman Algorithm Rivers and Lakes " official account backstage reply " papers ", you can download all leading papers.

The purpose of this article is for academic exchanges. If there is any infringement, please let me know and delete it.
