Deep Learning Convolutional Neural Network Learning Summary 2

Introduction

After about two weeks of study, I have gained a preliminary understanding of deep learning. My most recent task was to read the deep learning literature intensively. Because I failed to build the Caffe platform and it was too time-consuming, I did not retry, so there is no hands-on practice this time. This article only covers what I learned from reading the literature, mainly the development process and significance of deep learning and four common network models (convolutional neural network, deep belief network, recurrent neural network, stacked autoencoder network), with a focus on the implementation process of the convolutional neural network.

Development process

The first-generation neural network (1958~1969)
The earliest idea of the neural network originated from the MCP artificial neuron model of 1943, which hoped to use computers to simulate the response process of a human neuron. The model simplified the neuron into three steps: linear weighting of the input signals, summation, and non-linear activation (thresholding), as shown in the figure below.
[Figure: the MCP neuron model — weighted inputs, summation, and threshold activation]
The first application of MCP to machine learning (classification) was the perceptron algorithm invented by Rosenblatt in 1958. The algorithm uses the MCP model to classify multidimensional input data and can automatically learn and update the weights from training samples by gradient descent. In 1962, the method was proved to converge, and these theoretical and practical results set off the first wave of neural network research.
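The following is a minimal sketch, not from the original post, of the MCP neuron (linear weighting, summation, threshold activation) and the Rosenblatt-style perceptron update rule just described, written with NumPy; the data, learning rate and epoch count are illustrative choices.

```python
import numpy as np

def mcp_neuron(x, w, b):
    """MCP neuron: linear weighting + summation, then a threshold (step) activation."""
    return 1 if np.dot(w, x) + b > 0 else 0

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Learn weights from labelled samples; labels are 0/1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - mcp_neuron(xi, w, b)   # 0 when the prediction is already correct
            w += lr * err * xi                # update the weights toward the target
            b += lr * err
    return w, b

# Linearly separable toy data (logical AND). XOR would never converge,
# which is exactly Minsky's 1969 objection discussed below.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
print([mcp_neuron(xi, w, b) for xi in X])     # -> [0, 0, 0, 1]
```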
However, the development of the discipline has not always been smooth sailing. In 1969, the American mathematician and artificial intelligence pioneer Minsky proved in his book that the perceptron is essentially a linear model that can only handle linearly separable classification problems, and cannot even classify the simplest XOR (exclusive or) problem correctly. This was tantamount to a death sentence for the perceptron, and research on neural networks stagnated for nearly 20 years.
The second-generation neural network (1986~1998)
The first to break the curse of nonlinearity was Hinton, a leading figure of modern deep learning, who in 1986 invented the BP algorithm suitable for the multi-layer perceptron (MLP) and used the Sigmoid function for nonlinear mapping, effectively solving the problem of nonlinear classification and learning. This method set off the second wave of neural network research.
In 1989, Robert Hecht-Nielsen proved the universal approximation theorem for the MLP: any continuous function f on a closed interval can be approximated by a BP network with one hidden layer. This discovery greatly encouraged neural network researchers.
Also in 1989, LeCun invented the convolutional neural network LeNet and applied it to digit recognition with good results, but it did not attract enough attention at the time.
It is worth emphasizing that after 1989, since no particularly prominent new method was proposed and neural networks always lacked rigorous mathematical theory, the enthusiasm for them gradually cooled. The freezing point came in 1991, when the BP algorithm was shown to suffer from the vanishing gradient problem: as the error gradient propagates backward, the gradient of a later layer is passed to the earlier layers multiplicatively, and because of the saturation of the Sigmoid function the gradients of later layers are already small, so by the time the error gradient reaches the front layers it is almost 0 and those layers cannot learn effectively. This finding made the situation of neural networks even worse at the time.
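A small numerical sketch, added here for illustration, of the multiplicative shrinkage just described: the derivative of the Sigmoid is at most 0.25, so even in the best case a gradient passed through many Sigmoid layers decays toward 0. The depth of 10 layers is an arbitrary choice for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value 0.25, reached at x = 0

grad = 1.0                         # gradient arriving at the last layer
for layer in range(10):            # 10 hidden layers, pre-activation 0 (the best case)
    grad *= sigmoid_grad(0.0)      # multiplicative accumulation through each layer
    print(f"after layer {layer + 1}: {grad:.2e}")
# after 10 layers the factor is 0.25**10 ≈ 9.5e-07, so the front layers barely learn
```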
In 1997, the LSTM model was invented. Although it has outstanding properties for sequence modeling, it did not attract enough attention because neural networks were then in decline.
The third-generation neural network: DL (2006~present)
This stage is divided into two periods: the rapid development period (2006~2012) and the explosive period (2012~present).
Rapid development period (2006~2012)
2006 was the first year of DL. That year, Hinton proposed a solution to the vanishing gradient problem in deep network training: unsupervised pre-training to initialize the weights, followed by supervised fine-tuning. The main idea is to first learn the structure of the training data with a self-learning method (the autoencoder), and then perform supervised training and fine-tuning on top of this structure. However, because there was no particularly effective experimental verification, the paper did not attract much attention.
In 2011, the ReLU activation function was proposed, which can effectively suppress the vanishing gradient problem.
In 2011, Microsoft applied DL to speech recognition for the first time and made a major breakthrough.
Explosive period (2012~present)
In 2012, to demonstrate the potential of deep learning, Hinton's research group took part in the ImageNet image recognition competition for the first time and won the championship with the CNN they built, AlexNet, crushing the second-place (SVM-based) method in classification performance. It was because of this competition that CNN attracted the attention of many researchers.

Convolutional neural network

A convolutional neural network (CNN) consists of an input layer, convolutional layers, activation functions, pooling layers, and fully connected layers, i.e. INPUT-CONV-RELU-POOL-FC.
(1) Convolutional layer: used for feature extraction, as follows:
[Figure: a 32×32×3 input convolved with a 5×5×3 filter to produce a 28×28×1 feature map]
The input image is 32×32×3, where 3 is its depth (the R, G, B channels), and the convolutional layer uses a 5×5×3 filter (receptive field); note that the depth of the receptive field must equal the depth of the input image. Convolving one filter with the input image yields a 28×28×1 feature map. We usually use multiple convolutional layers to obtain deeper feature maps.
The elements of the filter are multiplied with the corresponding elements of the input image and summed, and finally the bias b is added to obtain the feature map. As shown in the figure, the first depth slice of filter w0 is multiplied element-wise with the blue box of the input image and summed to give 0, the other two depth slices give 2 and 0, and with the bias we get 0+2+0+1=3, the first element of the feature map on the right of the figure. After this convolution step, the blue box slides over the input image again with stride=2.
The process of convolution is illustrated as follows:
[Figures: step-by-step illustration of the convolution operation]
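Below is a minimal NumPy sketch, added for illustration, of the convolution arithmetic above: one 5×5×3 filter slid over a 32×32×3 input with stride 1 and no padding produces a 28×28 feature map, since (32 - 5) / 1 + 1 = 28. The random input and kernel are placeholders.

```python
import numpy as np

def conv2d_single_filter(image, kernel, bias=0.0, stride=1):
    """Slide one kernel over the image; depths of image and kernel must match."""
    H, W, D = image.shape
    k, _, kd = kernel.shape
    assert kd == D, "receptive field depth must equal the input depth"
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k, :]
            out[i, j] = np.sum(patch * kernel) + bias   # elementwise product, sum, plus bias
    return out

image  = np.random.rand(32, 32, 3)    # toy input (R, G, B)
kernel = np.random.rand(5, 5, 3)      # one filter / receptive field
fmap = conv2d_single_filter(image, kernel, bias=1.0, stride=1)
print(fmap.shape)                      # (28, 28): one feature map per filter
```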
(2) Pooling layer: compresses the input feature map. On the one hand it makes the feature map smaller and simplifies the computational complexity of the network; on the other hand it compresses the features and extracts the main ones. There are generally two kinds of pooling operations, average pooling (Avy Pooling) and max pooling, as follows:
[Figure: 2×2 max pooling with stride 2]
Here a 2×2 filter is used, and max pooling takes the maximum value in each region with stride=2, extracting the main features of the original feature map to obtain the result above. (Average pooling is not used much now; its method is to sum the elements of each 2×2 region and divide by 4 to obtain the main feature.) In general the filter is 2×2, at most 3×3, with stride 2, compressing the feature map to 1/4 of its original size.
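A small sketch, added for illustration, of 2×2 max pooling with stride 2 on one toy feature map: each 2×2 region is replaced by its maximum, so the map shrinks to 1/4 of its size (average pooling would take the mean of each region instead).

```python
import numpy as np

def max_pool_2x2(fmap):
    H, W = fmap.shape
    out = np.zeros((H // 2, W // 2))
    for i in range(0, H - 1, 2):
        for j in range(0, W - 1, 2):
            out[i // 2, j // 2] = np.max(fmap[i:i+2, j:j+2])  # keep the strongest response
    return out

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 0],
                 [7, 2, 9, 8],
                 [1, 0, 3, 4]], dtype=float)
print(max_pool_2x2(fmap))
# [[6. 5.]
#  [7. 9.]]
```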
(3) Fully connected layer: connects all the features and sends the output to a classifier (such as a softmax classifier).
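As a minimal sketch of this last step (not from the original post, with hypothetical sizes): the pooled feature maps are flattened, one linear map produces a score per class, and softmax turns the scores into probabilities.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

features = np.random.rand(7 * 7 * 50)           # flattened feature maps (toy size)
W = np.random.randn(10, features.size) * 0.01   # 10 classes, hypothetical dimensions
b = np.zeros(10)
probs = softmax(W @ features + b)
print(probs.sum(), probs.argmax())              # probabilities sum to 1
```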
The overall structure is roughly as follows:

[Figure: overall CNN architecture, INPUT-CONV-RELU-POOL-FC]
Weight sharing: one of the highlights of CNN is that it reduces the number of parameters the network needs to train through local receptive fields and weight sharing.
As shown in the left figure below: suppose we have a 1000x1000-pixel image and 1 million hidden-layer neurons. If the layer is fully connected (each hidden neuron is connected to every pixel of the image), there are 1000x1000x1000000 = 10^12 connections, i.e. 10^12 weight parameters. But the spatial structure of an image is local: just as a person perceives the outside world through a local receptive field, each neuron does not need to perceive the whole image; it only perceives a local region, and at a higher level these neurons that perceive different parts can be combined to obtain the global information. In this way we reduce the number of connections, i.e. the number of weight parameters the network needs to train.
[Figure: fully connected layer (left) vs. locally connected layer with 10x10 receptive fields (right)]
As shown in the right figure above: if the local receptive field is 10x10, each hidden neuron only needs to be connected to a 10x10 local patch of the image, so 1 million hidden neurons give only 10^8 connections, i.e. 10^8 parameters. Each hidden neuron is connected to a 10x10 image region, so each neuron has 10x10 = 100 connection weights. If these 100 parameters are the same for every neuron, i.e. every neuron convolves the image with the same convolution kernel, then we have only 100 parameters in total.
A filter, that is, a convolution kernel, extracts one kind of feature from the image. To extract different features we need more filters; suppose we use 50. The parameters of each filter are different, meaning each one extracts a different feature of the input image, such as different edges. Each filter convolved with the image gives a projection of one feature of the image, which we call a feature map, so 50 convolution kernels give 50 feature maps, and these 50 feature maps form one layer of neurons. 50 kernels x 100 shared parameters per kernel = 5000 parameters, which both extracts various features and reduces computation. See the right figure above: different colors represent different filters.
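The parameter counts above, reproduced as a small calculation (added only for illustration):

```python
image_pixels   = 1000 * 1000        # 1000x1000 image
hidden_neurons = 1_000_000          # 1e6 hidden neurons

fully_connected   = image_pixels * hidden_neurons   # every neuron sees every pixel: 1e12 weights
local_10x10       = hidden_neurons * 10 * 10        # 10x10 local receptive fields: 1e8 weights
shared_1_kernel   = 10 * 10                         # one shared kernel: 100 weights
shared_50_kernels = 50 * 10 * 10                    # 50 feature maps: 5000 weights

print(f"{fully_connected:.0e}, {local_10x10:.0e}, {shared_1_kernel}, {shared_50_kernels}")
# 1e+12, 1e+08, 100, 5000
```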
Backpropagation: let us take the partial derivatives of e = (a+b)*(b+1) as an example [3]; its composite relationship can be represented as follows:
[Figure: computational graph of e = (a+b)*(b+1) with intermediate nodes c and d]
In the figure, the intermediate variables c = a+b and d = b+1 are introduced.
To find the gradient of e at a=2, b=1, we can first use the definition of the partial derivative to find the partial derivative relationship between adjacent nodes of different layers, as shown in the figure below.
[Figure: the partial derivative on each edge of the graph]
Using the chain rule we know:
[Figure: chain-rule expansion of ∂e/∂a and ∂e/∂b]
As you may have noticed, this is very redundant, because many paths are visited repeatedly. For example, in the figure above, the paths a-c-e and b-c-e both traverse the edge c-e. For a deep model with tens of thousands of weights, the computation caused by this redundancy is considerable. Also using the chain rule, the BP algorithm cleverly avoids this redundancy: it visits each path only once and obtains the partial derivative of the top vertex with respect to every lower node.
As the name of the backpropagation (BP) algorithm suggests, the paths are traversed in reverse (from top to bottom). Start from the top node e with an initial value of 1 and process layer by layer: for every child node one layer below e, multiply 1 by the partial derivative on the edge from e to that node and "stack" the result in the child node. After the layer containing e has been propagated in this way, each node in the second layer has some values "stacked" in it; summing all the values stacked in a node gives the partial derivative of the top vertex e with respect to that node. Then each of these second-layer nodes is used as a starting vertex, with its initial value set to the partial derivative of e with respect to it, and the propagation above is repeated layer by layer, yielding the partial derivatives of e with respect to the nodes of every layer.
[Figures: backward propagation of the derivative values through the graph]
It can be seen that reverse-mode differentiation keeps the influence of every variable, including the intermediate ones, on the result e. If e is an error function, one pass over the graph gives the influence of every node on e, i.e. the gradient values, which can then be used to update the weights on the edges. Forward-mode differentiation, in contrast, only keeps the influence of a single input variable on the error e; to obtain the influence of several variables on e we would have to perform the computation several times, so forward mode is clearly less efficient than reverse mode. This is why the reverse-mode (backpropagation) algorithm is used.
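Below is a worked sketch, added for illustration, of the reverse-mode example above: e = (a+b)*(b+1) with intermediate nodes c = a+b and d = b+1, evaluated at a = 2, b = 1. The two paths into b are summed, exactly as described in the "stacking" rule.

```python
a, b = 2.0, 1.0

# forward pass
c = a + b            # c = 3
d = b + 1            # d = 2
e = c * d            # e = 6

# backward pass: start from ∂e/∂e = 1 at the top and push it down one layer at a time
de_de = 1.0
de_dc = de_de * d                    # ∂e/∂c = d = 2
de_dd = de_de * c                    # ∂e/∂d = c = 3
de_da = de_dc * 1.0                  # ∂c/∂a = 1, so ∂e/∂a = 2
de_db = de_dc * 1.0 + de_dd * 1.0    # the paths through c and d are summed, so ∂e/∂b = 5

print(de_da, de_db)                  # 2.0 5.0
```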

Stacked autoencoder

The stacked autoencoder is the most basic deep learning model. Its building block, the autoencoder, trains and adjusts the network parameters by requiring the output to be the same as the input, obtaining the weights of each layer. By stacking multiple autoencoder layers, several different representations of the input signal can be obtained (one representation per layer), and these representations are the features. An autoencoder is a neural network that reproduces its input signal as closely as possible; to achieve this, it must capture the most important factors that represent the input data, much like PCA finds the principal components that represent the original information.
Network structure
The network structure of the stacked autoencoder is essentially an ordinary multi-layer neural network.
[Figure: network structure of a stacked autoencoder]
Training process
The stacked autoencoder differs from an ordinary neural network in its training process, which is divided into two steps: unsupervised pre-training and supervised fine-tuning.
(1) Unsupervised pre-training. The autoencoder obtains a compressed, distributed representation of the original data through self-learning, and is generally used for high-level feature extraction and non-linear dimensionality reduction. Its structure is similar to a typical three-layer BP network, consisting of an input layer, a hidden layer and an output layer, except that the output layer has the same number of neurons as the input layer and the "label" of each training sample is the input itself, i.e. no labels are needed. The mapping from the input layer to the hidden layer is called the encoder, and the mapping from the hidden layer to the output layer is called the decoder. The hidden layer of the pre-trained autoencoder is the feature layer. After the first feature layer has been trained, the second layer is trained in the same way: the output of the first feature layer is treated as the input of the second autoencoder, the reconstruction error is again minimized to obtain the second layer's parameters, and its feature layer gives a second representation of the original input. The other feature layers are trained analogously.
(2) Supervised fine-tuning. After the training above we obtain a multi-layer stacked autoencoder, each layer of which gives a different representation of the original input. At this point the stacked autoencoder cannot yet classify data, because it has not learned how to associate an input with a class; it has only learned features that represent the original input signal as well as possible. To perform classification, we add a classifier (such as logistic regression or an SVM) on top of the highest coding layer of the autoencoder and then fine-tune the whole network with the standard supervised training method for multi-layer neural networks (gradient descent).
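A compact NumPy sketch of greedy layer-wise pre-training, added for illustration with hypothetical layer sizes and hyperparameters: each autoencoder is trained to reconstruct its own input, and its hidden activations become the input of the next layer. The real training would then add a classifier on top and fine-tune everything with backpropagation, as described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_autoencoder(X, n_hidden, lr=0.5, epochs=200):
    """One encoder/decoder pair trained to minimise the reconstruction error."""
    n_samples, n_in = X.shape
    rng = np.random.default_rng(0)
    W1 = rng.normal(0, 0.1, (n_in, n_hidden)); b1 = np.zeros(n_hidden)   # encoder
    W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)       # decoder
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)                    # encode
        R = sigmoid(H @ W2 + b2)                    # decode (reconstruct the input)
        dR = (R - X) * R * (1 - R) / n_samples      # gradient of the mean squared error
        dH = (dR @ W2.T) * H * (1 - H)
        W2 -= lr * H.T @ dR; b2 -= lr * dR.sum(axis=0)
        W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)
    return W1, b1

X = np.random.rand(100, 20)                         # toy unlabeled data
representation, layers = X, []
for n_hidden in (16, 8):                            # two stacked feature layers
    W, b = train_autoencoder(representation, n_hidden)
    layers.append((W, b))
    representation = sigmoid(representation @ W + b)    # feeds the next layer
print(representation.shape)                         # (100, 8)
# supervised fine-tuning would now place a classifier on top of `representation`
# and adjust all layers together with ordinary backpropagation.
```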

Deep belief network

Network structure
A deep belief network (DBN) is formed by stacking several restricted Boltzmann machines (RBMs), with the hidden layer of one RBM serving as the visible layer of the next. We first introduce the RBM and then the DBN.
(1) RBM
[Figure: an RBM with m visible units and n hidden units]
An ordinary RBM is shown in the figure above. It is a two-layer model consisting of m visible units and n hidden units; there are no connections within a layer, and the two layers are fully connected to each other. That is, given the states of the visible layer, the activation states of the hidden units are conditionally independent, and conversely, given the states of the hidden layer, the activation states of the visible units are conditionally independent. This conditional independence within a layer reduces the complexity of computing the probability distribution and of training. An RBM can be regarded as an undirected graphical model: the connection weights between visible and hidden neurons are bidirectional, i.e. the weights from the visible layer to the hidden layer are W and the weights from the hidden layer back to the visible layer are W'. Besides W, the parameters of an RBM include the visible-layer bias b and the hidden-layer bias c. The distributions of the visible and hidden units can be chosen according to the application, e.g. binary units, Gaussian units, or rectified linear units; the main difference between these units is their activation function.
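A minimal sketch, added for illustration with toy sizes, of one contrastive-divergence (CD-1) step for a binary RBM, the standard way each RBM layer is trained. It uses exactly the conditional independence described above: p(h_j = 1 | v) = sigmoid(c_j + v·W[:, j]) and p(v_i = 1 | h) = sigmoid(b_i + W[i, :]·h).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
m, n = 6, 3                                  # m visible units, n hidden units
W = rng.normal(0, 0.1, (m, n))               # shared (bidirectional) weights
b = np.zeros(m)                              # visible-layer bias
c = np.zeros(n)                              # hidden-layer bias

def cd1_step(v0, lr=0.1):
    """One CD-1 update for a single binary training vector v0."""
    h0_prob = sigmoid(c + v0 @ W)            # hidden units are conditionally independent given v
    h0 = (rng.random(n) < h0_prob) * 1.0     # sample the hidden states
    v1_prob = sigmoid(b + W @ h0)            # reconstruct the visible layer
    h1_prob = sigmoid(c + v1_prob @ W)
    # gradient approximation: data statistics minus reconstruction statistics
    dW = np.outer(v0, h0_prob) - np.outer(v1_prob, h1_prob)
    return lr * dW, lr * (v0 - v1_prob), lr * (h0_prob - h1_prob)

v0 = rng.integers(0, 2, m).astype(float)     # one toy binary training vector
dW, db, dc = cd1_step(v0)
W += dW; b += db; c += dc
```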
(2) DBN
[Figure: a DBN built by stacking several RBMs]
A DBN is formed by stacking several RBMs. If the training set has label data, the visible layer of the top RBM contains both the hidden units of the previous RBM and label units. For example, if the visible layer of the top RBM has 500 neurons and the training data fall into 10 classes, then the visible layer of the top RBM has 510 visible neurons; for each training sample, the corresponding label neuron is switched on (set to 1) and the others are switched off (set to 0).
The training of a DBN consists of two steps, pre-training and fine-tuning. The pre-training step amounts to training each RBM layer by layer; the pre-trained DBN can already model the training data. To further improve the discriminative performance of the network, the fine-tuning step uses the label data to adjust the network parameters with the BP algorithm.
(1) Pre-training. As mentioned above, the pre-training of a DBN is equivalent to training each RBM layer by layer, so the RBM training algorithm is used directly.
(2) Fine-tuning. Build a neural network with the same number of layers as the DBN, use the parameters obtained during pre-training as its initial parameter values, add a label layer after the last layer, and then fine-tune the whole network with the BP algorithm on the labelled training data, completing the fine-tuning step.

Recurrent neural network

In the field of deep learning, the network structures above, which are based on the traditional multi-layer perceptron, perform very well and have achieved many successes, setting records on many tasks including handwritten digit recognition and object classification. However, they also have a problem: none of them can analyze the overall logical sequence within the input information. Such information sequences are rich in content, the items have complex temporal correlations with each other, and their lengths vary, which the models above cannot handle. The recurrent neural network was born to solve this sequence problem; the key point is that the current hidden state of the network retains the previous input information and is used to produce the current output. Many tasks need to process sequence data: image captioning, speech synthesis and music generation require the model to produce sequence output; time-series prediction, video analysis and music information retrieval require sequence input; and tasks such as machine translation, dialogue systems and robot control require both the input and the output to be sequences.
Network structure
The left side of the figure below shows the original structure of the recurrent neural network. If you ignore the daunting closed loop in the middle, it is just a simple three-layer structure of "input layer => hidden layer => output layer"; the unfamiliar closed loop means that after the input reaches the hidden layer, the hidden layer also feeds back into itself, which gives the network its memory. We say the recurrent neural network has memory because it summarizes the previous input states through W and uses them to assist the next input. The hidden state can be understood as: h = f(current input + summary of past memory).
[Figure: structure of the recurrent neural network, with the recurrent (self-loop) connection on the hidden layer]
During training of the recurrent neural network, because the previous signal is superimposed on the input, backward propagation differs from that of a traditional neural network: for the input at time t, the residual comes not only from the output but also from the hidden layers at later time steps. The backpropagation algorithm uses the error of the output layer to compute the gradient of each weight, and gradient descent then updates the weights.
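A minimal sketch of the forward pass just described, added for illustration with hypothetical sizes and tanh as the activation: the hidden state update h_t = f(W_xh·x_t + W_hh·h_{t-1} + b) combines the current input with the previous state, which is the "memory" carried along the sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 8, 3                      # hypothetical layer sizes
W_xh = rng.normal(0, 0.1, (n_hidden, n_in))          # input  -> hidden
W_hh = rng.normal(0, 0.1, (n_hidden, n_hidden))      # hidden -> hidden (the closed loop)
W_hy = rng.normal(0, 0.1, (n_out, n_hidden))         # hidden -> output
b_h = np.zeros(n_hidden); b_y = np.zeros(n_out)

def rnn_forward(xs):
    """Run a sequence through the network, carrying the hidden state along."""
    h = np.zeros(n_hidden)
    ys = []
    for x in xs:                                     # one time step per input
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)       # current input + past memory
        ys.append(W_hy @ h + b_y)
    return ys, h

sequence = [rng.normal(size=n_in) for _ in range(5)]
outputs, final_h = rnn_forward(sequence)
print(len(outputs), final_h.shape)                   # 5 outputs, hidden state of size 8
```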
Typical improved recurrent neural network models can be used to process sequence data. The recurrent neural network contains a large number of parameters and is difficult to train (gradients vanish or explode along the time dimension), so there is a series of optimizations for RNNs in network structure, training algorithms and parallelization. In recent years, the bidirectional RNN (BRNN) and the LSTM have made breakthroughs in image captioning, language translation and handwriting recognition.

Summary

(1) I do not yet understand the CMRM cross-media relevance model for image annotation.
(2) Although I understand most of the network models, this is still only at the theoretical level. The next step is to build the Caffe platform and deepen my understanding of the concrete implementation and training process of each model through practice.
(3) I have only read a part of the surveys on each sub-direction of deep learning. I am now reading Fei-Fei Li's image captioning paper (Deep Visual-Semantic Alignments for Generating Image Descriptions), but there are many parts I cannot yet understand; the next step is to re-read it.

References

[1] Y. Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1-127, 2009.
[2] G. E. Hinton, R. R. Salakhutdinov. Reducing the Dimensionality of Data with Neural Networks. Science, vol. 313, no. 5786, pp. 504-507, 28 Jul 2006.
[3] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, November 1998.
[4] Soniya, S. Paul, L. Singh. A Review on Advances in Deep Learning. 2015 IEEE Workshop on Computational Intelligence: Theories, Applications and Future Directions (WCI), Kanpur, 2015, pp. 1-6.
[5] X. Du, Y. Cai, S. Wang, L. Zhang. Overview of Deep Learning. 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), Wuhan, 2016, pp. 159-164.
[6] Guo Lili, Ding Shifei. Research Progress in Deep Learning. Computer Science, 2015, 42(5): 28-33.
[7] Liu Jianwei, Liu Yuan, Luo Xionglin, et al. Research Progress in Deep Learning. Application Research of Computers, 2014, 31(7): 1921-1930, 1942. DOI: 10.3969/j.issn.1001-3695.2014.07.001.
