Introduction to common deep learning algorithms

Original link: http://www.notescloud.top/cloudSearch/detail?id=2356


Recommended books

Deep learning algorithm practice.pdf:
http://www.notescloud.top/cloudSearch/detail?id=2355

Many people mistakenly believe that deep learning is more advanced than machine learning. In fact, deep learning is a branch of machine learning: it can be understood as a model with a multilayer structure. Specifically, deep learning refers to neural network algorithms with deep structures, i.e. machine learning > neural network algorithms > deep neural networks (deep learning).
The theoretical derivations behind deep learning are long and complicated, and the common algorithms are easy to blur together: after reading about them several times, the details fade again after a while. This article organizes them systematically, starting from the history and the fatal problems, then looking at each algorithm's ideas, framework, advantages, disadvantages and directions for improvement, and finally summarizing the comparison between CNN and RNN.

1. History: Multilayer Perceptron to Neural Network and Deep Learning

Neural network technology originated in the 1950s and 1960s, when it was called the perceptron, with an input layer, an output layer and a hidden layer. The input feature vector is transformed through the hidden layer and reaches the output layer, where the classification result is obtained. (An aside: because computing technology was primitive, the perceptron's transfer function was implemented mechanically, with wires pulling rheostats to change resistances. Picture scientists tugging on bundles of dense wires...)
The single-layer perceptron proposed by psychologist Rosenblatt had a serious limitation: it cannot handle even slightly more complex functions, the most typical example being the XOR operation.
This shortcoming was not overcome until the 1980s, when the multilayer perceptron was developed by Rumelhart, Williams, Hinton, LeCun and others. The multilayer perceptron can simulate XOR logic, and the additional layers also allow the network to capture more complex situations in the real world.
Multilayer perceptrons also shed the shackles of the early discrete transfer functions, using continuous functions such as sigmoid or tanh to simulate a neuron's response to excitation, and adopting the backpropagation (BP) algorithm invented by Werbos for training. This is what we now call a [neural network]; trained with the BP algorithm it is also called a BP neural network. For the specific procedure, see the article I reposted ( http://blog.csdn.net/abc200941410128/article/details/78708319 ).
But the BP neural network (multilayer perceptron) faces fatal problems (see the next section). As the number of layers grows, two major issues appear. First, the optimization becomes increasingly likely to fall into a local optimum, and this "trap" drifts further and further from the true global optimum, so a deep network trained with limited data can perform worse than a shallower one. Second, and just as important, the phenomenon of "vanishing gradients" becomes more severe.
In 2006, Hinton used a pre-training method to alleviate the local-optimum problem and pushed the hidden layers to 7, giving neural networks "depth" in the true sense and setting off the deep learning craze. DBN, CNN, RNN, LSTM and others then appeared one after another.
There is no fixed definition of "depth" here-a 4-layer network can be considered "deeper" in speech recognition, and a network with more than 20 layers is not uncommon in image recognition.
To overcome vanishing gradients, transfer functions such as ReLU and maxout replaced sigmoid, forming the basic shape of today's DNN. In terms of structure alone, it is no different from a fully connected multilayer perceptron.

2. The fatal problems of deep neural networks

As the number of layers increases, three major problems arise: first, the non-convex optimization problem, i.e. the optimization is increasingly likely to fall into a local optimum; second, the gradient vanishing problem; third, the overfitting problem.

2.1 Non-convex optimization problem

Linear regression is essentially the optimization of a multivariate linear function, for example f(x,y) = x + y. A multilayer neural network is essentially the optimization of a multivariate higher-degree (non-linear) function, for example f(x,y) = xy. In linear regression, no matter where the search starts, it eventually lands near the global minimum, so it does no harm to initialize at 0 (which is why linear regression equations are often solved with an initial value of 0).
In a multilayer neural network, searches that start from different points may end up stuck in different local minima. The local minimum is a lingering shadow cast by the network structure itself: as the number of hidden layers increases, the non-convex objective function becomes more and more complex and local minima multiply, so a deep network trained with limited data can perform worse than a shallower one. The usual workaround is careful weight initialization; to unify the initialization scheme, inputs are typically scaled to [-1, 1], but even then there is no guarantee of reaching the global optimum. This remains an open problem that researchers are still studying.
Therefore, in essence, the non-convex optimization brought by deep structures is still unsolved (this applies to current deep learning algorithms and other non-convex optimization problems alike), which limits the development of deep structures.
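A minimal sketch of the point above, using my own toy functions rather than anything from the text: gradient descent on a convex function reaches the same minimum from any starting point, while on a non-convex function the result depends on where you start.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent from a chosen starting point."""
    x = float(x0)
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Convex case: f(x) = (x - 2)^2, gradient 2(x - 2); every start converges to x = 2.
convex_grad = lambda x: 2.0 * (x - 2.0)

# Non-convex case: f(x) = x^4 - 3x^2 + x, gradient 4x^3 - 6x + 1;
# it has two basins, so different starts end in different local minima.
nonconvex_grad = lambda x: 4.0 * x**3 - 6.0 * x + 1.0

for start in (-3.0, 0.0, 3.0):
    print("convex    start", start, "->", round(gradient_descent(convex_grad, start), 4))
    print("nonconvex start", start, "->", round(gradient_descent(nonconvex_grad, start), 4))
```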

2.2 The vanishing gradient problem (gradient vanish)

This problem is essentially caused by an unsuitable activation function. Using the sigmoid function in many layers causes the error to decay exponentially from the output layer backwards. Mathematically, the role of the activation function is to map the input to the range 0 to 1 (tanh maps to -1 to +1). Besides normalizing the data, the point of the mapping is to keep values within a controlled range. There are other subtleties: for example, sigmoid (and tanh) are most sensitive to small changes around zero (the center point) and largely ignore changes at the extremes, while ReLU can also avoid vanishing gradients. In practice, sigmoid (tanh) is mostly used in fully connected layers, and ReLU is mostly used in convolutional layers.
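A minimal sketch of the activation functions discussed above and their output ranges (plain NumPy, my own code, not taken from the article):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes input into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes input into (-1, 1)

def relu(x):
    return np.maximum(0.0, x)          # passes positive inputs through unchanged

x = np.array([-5.0, -0.5, 0.0, 0.5, 5.0])
print(sigmoid(x))  # saturates near 0 or 1 at the extremes
print(tanh(x))     # most sensitive to small changes around 0
print(relu(x))     # no saturation for positive inputs
```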
[Figure: activation function curves for Sigmoid and ReLU]
"gradient disappearance" phenomenon specifically, we often use sigmoid as the input and output function of neurons. For a signal with an amplitude of 1, when the BP propagates the gradient back, the gradient attenuates to 0.25 for each layer passed. If the number of layers is large, the lower layer basically cannot receive effective training signals after the gradient exponentially attenuates.
Fortunately, this problem was mitigated by the layer-by-layer greedy pre-training of the weight matrices proposed by Hinton in 2006, and the more recently adopted ReLU offers a more fundamental remedy.
In 2012, Alex Krizhevsky of Hinton's group was the first to use the ReLU function at scale in a CNN, which is much less affected by vanishing gradients.
In 2014, Google researcher Yangqing Jia leaned heavily on ReLU to successfully scale CNNs up to a 22-layer giant deep network.
For RNNs, which are plagued by vanishing gradients, the LSTM variant also overcomes this problem.

2.3 Overfitting problem

This is the last fatal problem of neural networks: overfitting. The huge structure and number of parameters mean that although the training error drops very low, the test error is ridiculously high.
Overfitting can also combine with vanishing gradients and local minima. The way it plays out is this:
Due to vanishing gradients, the lower layers of a deep structure are almost impossible to train, while the higher layers train very easily.
Since the lower layers cannot be trained, they tend to push the original input upward with no nonlinear transformation (or a wrong one), putting too much pressure on the higher layers to disentangle features.
If the features cannot be disentangled, forcing error-supervised training makes the model fit the input data directly.
The result is good optimization but poor generalization, which is also a failing of shallow structures such as SVMs and decision trees.
Bengio pointed out that these shallow structures, which optimize on local data, rely on a prior: smoothness. That is, given a sample (x_i, y_i), they optimize numerically so that the trained model outputs approximately y_i for inputs close to x_i.
However, once the input undergoes a generic shift, for example two different birds with different colors and different proportions in the image, SVMs and decision trees are nearly useless.
This is because for high-dimensional data (such as images, sound, text), simply doing numerical learning on the raw input instead of disentangling features is meaningless.
And then comes the final act: with the lower layers barely moving and the higher layers learning almost at random, the network quickly falls into a poor basin of attraction, completing the "triple kill" against the neural network.

3. Basic models in deep learning

The basic models in deep learning fall roughly into three categories: multilayer perceptron models, deep neural network models, and recurrent neural network models. Their representatives are DBN (Deep Belief Network), CNN (Convolutional Neural Network), and RNN (Recurrent Neural Network).

3.1 DBN (Deep Belief Network)

In 2006, Geoffrey Hinton proposed the Deep Belief Network (DBN) and its efficient learning algorithm, pre-training + fine-tuning, published in Science; it became the main framework for subsequent deep learning algorithms. A DBN is a generative model: by training the weights between its neurons, we can make the whole network generate the training data with maximum probability. So a DBN can be used not only to extract features and classify data, but also to generate data.

3.1.1 Network structure

The Deep Belief Network (DBN) is formed by stacking several Restricted Boltzmann Machines (RBM), with the hidden layer of each RBM serving as the visible layer of the next RBM.
(1) RBM
[Figure: RBM network structure]
A common RBM network structure is shown in the figure above. It is a two-layer model composed of m visible units and n hidden units. Neurons within a layer are not connected, while neurons between the two layers are fully connected. In other words, given the state of the visible layer, the activations of the hidden units are mutually independent; conversely, given the state of the hidden layer, the activations of the visible units are mutually independent. This conditional independence within a layer reduces the complexity of computing probability distributions and of training. The RBM can be regarded as an undirected graphical model: the connection weights between visible and hidden neurons are bidirectional, so if the visible-to-hidden weight matrix is W, the hidden-to-visible weights are W'. Besides the weights, the RBM's parameters include the visible-layer bias b and the hidden-layer bias c. The distributions of the visible and hidden units can be chosen according to need, including binary units, Gaussian units, rectified linear units, etc.; the main difference between these units is their activation functions.
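A minimal sketch of the conditional activation probabilities just described, assuming binary units; the sizes and variable names (W, b, c, m, n) are illustrative, not code from the article:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

m, n = 6, 4                      # visible / hidden unit counts
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (m, n))   # visible-to-hidden weights; hidden-to-visible is W.T
b = np.zeros(m)                  # visible bias
c = np.zeros(n)                  # hidden bias

v = rng.integers(0, 2, m).astype(float)   # one binary visible configuration

# Given the visible layer, the hidden units are conditionally independent:
p_h_given_v = sigmoid(c + v @ W)          # shape (n,)
h = (rng.random(n) < p_h_given_v).astype(float)

# Given the hidden layer, the visible units are conditionally independent:
p_v_given_h = sigmoid(b + h @ W.T)        # shape (m,)
print(p_h_given_v, p_v_given_h)
```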
(2) DBN
[Figure: DBN network structure]
The DBN model is formed by stacking several RBMs. If the training set contains labels, the visible layer of the top RBM contains both the hidden units of the previous RBM and the label units. For example, if the visible layer of the top RBM has 500 neurons and the training data falls into 10 classes, then the top RBM's visible layer has 510 explicit neurons; for each training example, the corresponding label neuron is switched on (set to 1) and the others are switched off (set to 0).

3.1.2 Training process and advantages and disadvantages

DBN training consists of two steps: pre-training and fine-tuning. The pre-training process is equivalent to training each RBM layer by layer; a pre-trained DBN can already be used to model the training data. To further improve the network's discriminative performance, the fine-tuning process uses labeled data to adjust the network parameters with the BP algorithm.
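A schematic sketch of the two phases, my own simplification rather than the article's code: greedy layer-wise pre-training of stacked RBMs with 1-step contrastive divergence (CD-1), with the supervised fine-tuning step only indicated.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain_rbm(data, n_hidden, lr=0.1, epochs=5):
    """Train one RBM with CD-1; return (W, b, c)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    b = np.zeros(n_visible)            # visible bias
    c = np.zeros(n_hidden)             # hidden bias
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(c + v0 @ W)                      # positive phase
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(b + h0 @ W.T)                    # reconstruction
            ph1 = sigmoid(c + pv1 @ W)                     # negative phase
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            b += lr * (v0 - pv1)
            c += lr * (ph0 - ph1)
    return W, b, c

# Greedy layer-wise pre-training: each RBM's hidden activations feed the next RBM.
X = rng.integers(0, 2, (100, 20)).astype(float)   # toy binary data
layer_sizes = [12, 8]
inputs, stack = X, []
for n_hidden in layer_sizes:
    W, b, c = pretrain_rbm(inputs, n_hidden)
    stack.append((W, c))
    inputs = sigmoid(c + inputs @ W)              # propagate up to the next layer

# Fine-tuning would now adjust the whole stack with labeled data via backprop
# (omitted here); the pre-trained weights serve as the initialization.
```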
The advantages and disadvantages of DBN largely come down to the advantages and disadvantages of generative models versus discriminative models.
1. Advantages:

  • A generative model learns the joint probability distribution, so it can describe the data distribution from a statistical point of view and reflect the similarity between data of the same class;
  • A generative model can recover the conditional probability distribution, so it can do everything a discriminative model can, whereas a discriminative model cannot recover the joint distribution and therefore cannot be used as a generative model.

2. Disadvantages:

  • A generative model does not care where the optimal decision boundary between classes lies, so when it is used for classification its accuracy may not be as high as that of a discriminative model;
  • Since the generative model learns the joint distribution of data, the complexity of the learning problem is higher to some extent.
  • The input data is required to have translation invariance.

For the discriminant model and generative model, please refer to ( http://blog.csdn.net/erlib/article/details/53585134 )

3.1.3 Improved model

There are many variants of DBN; the improvements mainly target its building block, the RBM, for example the convolutional DBN (CDBN) and the conditional RBM (Conditional RBM).
DBN does not take into account the two-dimensional structure of an image, because the input simply flattens the image matrix into a one-dimensional vector. CDBN exploits the spatial relationships between neighboring pixels through a component called the convolutional RBM (CRBM), achieving transformation invariance in the generative model and scaling readily to high-dimensional images.
DBN also does not explicitly model the temporal dependencies between observed variables. The Conditional RBM treats the visible units from previous time steps as additional conditioning inputs in order to model sequence data; this variant is widely used in speech signal processing.

3.2 CNN (Convolutional Neural Network)

The convolutional neural network is a type of artificial neural network that has become a research hotspot in speech analysis and image recognition. Its weight-sharing structure makes it more similar to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is even more apparent when the network input is a multi-dimensional image: the image can be fed into the network directly, avoiding the complicated feature extraction and data reconstruction of traditional recognition algorithms.
In a fully connected DNN, every neuron in a lower layer connects to every neuron in the layer above, which makes the number of parameters explode. For a 1000×1000-pixel image, this single layer already has on the order of 10^12 weights to train. This is where the convolutional neural network (CNN) comes in. In a CNN, neurons in adjacent layers are not all directly connected; instead they are connected through a "convolution kernel" acting as an intermediary. The same kernel is shared across the whole image, and the image keeps its original spatial relationships after the convolution. The number of parameters from the image input layer to the hidden layer is then instantly reduced, to 100×100×100 = 10^6 (a rough calculation along these lines is sketched below).
A convolutional network is a multilayer perceptron designed specifically to recognize two-dimensional shapes, and its structure is highly invariant to translation, scaling, tilt and other forms of deformation.
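A back-of-the-envelope sketch with my own assumed sizes (100 shared 10×10 kernels), which gives an even smaller count than the figure quoted above but makes the same point: weight sharing removes many orders of magnitude of parameters.

```python
# Parameters needed to connect a 1000x1000 image to a hidden layer,
# fully connected versus convolutional with shared 10x10 kernels.
image_pixels = 1000 * 1000

# Fully connected: every hidden neuron (one per pixel here) sees every pixel.
fc_params = image_pixels * image_pixels
print(f"fully connected: {fc_params:.0e} weights")      # 1e+12

# Convolutional: 100 feature maps, each defined by one shared 10x10 kernel.
n_kernels, kernel_size = 100, 10
conv_params = n_kernels * kernel_size * kernel_size
print(f"convolutional:   {conv_params:.0e} weights")    # 1e+04
```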

3.2.1 Network structure

The convolutional neural network is a multilayer neural network whose basic operations include: the convolution operation, the pooling operation, the fully connected operation and the recognition operation.
[Figure: typical CNN architecture]

  • Convolution operation: the feature maps of the previous layer are convolved with learnable convolution kernels, and the results, after passing through the activation function, form the neurons of this layer, i.e. this layer's feature maps. This is also called the feature extraction layer: each neuron's input is connected to a local receptive field of the previous layer, from which a local feature is extracted. Once a local feature is extracted, its positional relationship to other features is fixed as well.
  • Pooling operation: aggregates features and reduces dimensionality, cutting the amount of computation. It divides the input into non-overlapping regions and lowers the spatial resolution of the network by pooling (downsampling) each region; for example, max pooling takes the maximum value in a region and mean pooling takes the average. This operation removes small offsets and distortions of the signal.
  • Fully connected operation: after the input has gone through several rounds of convolution and pooling, the output consists of several groups of signals; the fully connected operation combines these groups in turn into a single signal vector.
  • Recognition operation: the steps above perform feature learning; on top of them an extra layer is added for classification or regression, depending on the task (a shape walk-through of the whole pipeline is sketched after this list).
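To make the four operations concrete, here is a minimal sketch with hypothetical layer sizes (my own example, not tied to any particular library) that traces how a single-channel 28×28 input changes shape through convolution, pooling and the fully connected stage.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a convolution on a square input."""
    return (size + 2 * padding - kernel) // stride + 1

h = w = 28
h = w = conv_out(h, kernel=5)   # convolution: 5x5 kernels -> 24x24 feature maps
h = w = h // 2                  # pooling: 2x2 non-overlapping regions -> 12x12
h = w = conv_out(h, kernel=5)   # second convolution -> 8x8
h = w = h // 2                  # second pooling -> 4x4

n_maps = 16
flattened = n_maps * h * w      # the fully connected operation sees one long vector
print(flattened)                # 256 inputs to the recognition (classifier) layer
```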

3.2.2 Training process and advantages and disadvantages

A convolutional network is essentially a mapping from input to output. It can learn a large number of input-output mappings without any precise mathematical expression relating input and output: as long as the network is trained with known patterns, it acquires the ability to map between input-output pairs. Convolutional networks are trained with supervision, so the training set consists of vector pairs of the form (input signal, label).

1. Advantages:

  • The weight-sharing strategy reduces the number of parameters that must be trained; because the same weights let a filter detect a feature regardless of where it appears in the signal, the trained model generalizes better;
  • The pooling operation reduces the network's spatial resolution, removing small offsets and distortions of the signal, so the model places only loose demands on the exact position (translation) of the input data.

2. Disadvantages:

  • Deep models are prone to the vanishing gradient problem.

3.2.3 Improved model

Convolutional neural networks are the most widely studied and applied deep neural networks in recent years because they have achieved good results in many fields. The better-known convolutional network models include LeNet (1998), AlexNet (2012), GoogLeNet (2014), VGG (2014) and Deep Residual Learning (2015). These improved versions differ in depth or in organizational structure, but their building blocks are the same, basically comprising convolution, pooling, fully connected and recognition operations.

3.3 RNN (Recurrent Neural Network)

Besides the problems above, the fully connected DNN has another weakness: it cannot model changes over time. Yet the order in which samples appear is crucial for applications such as natural language processing, speech recognition and handwriting recognition. To meet this need there is another network structure, the recurrent neural network RNN. (Incidentally, many translate it as "cyclic" network; in computing terms a cycle usually stays at the same level, whereas "recurrent" here really means recursion over time, so this article calls it a recurrent neural network.)
In an ordinary fully connected network or a CNN, the signals of each layer can only propagate to the next layer up, and samples are processed independently at each moment; such networks are therefore called feed-forward neural networks. In an RNN, a neuron's output can act on itself directly at the next time step.
That is, the network's output O(t+1) at time t+1 is the result of both the input at that moment and all of the history. An RNN can be seen as a neural network that passes information through time, and its depth is the length of time! As noted above, the "vanishing gradient" phenomenon reappears here, but this time along the time axis. To address vanishing gradients over time, the machine learning community developed the long short-term memory unit (LSTM), which uses gate switches to implement memory over time and keep the gradient from vanishing.

3.3.1 Network structure

[Figure: recurrent neural network, original form and unrolled through time]
On the left is the original structure of the recurrent neural network. If you set aside the intimidating closed loop in the middle for a moment, it is just a simple three-layer structure of "input layer => hidden layer => output layer". The one unfamiliar element in the figure is that closed loop: after the input reaches the hidden layer, the hidden layer also feeds back into itself, which is what gives the network memory. We say the recurrent neural network has memory because it summarizes the previous input states through W and uses that summary to assist the next input. The hidden state can be understood as h = f(current input + summary of past memory).
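A minimal sketch of that recurrence, h_t = f(U·x_t + W·h_{t-1} + b), with tanh as f; the sizes and names here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
U = rng.normal(0, 0.1, (hidden_size, input_size))   # weights for the current input
W = rng.normal(0, 0.1, (hidden_size, hidden_size))  # weights for the previous state
b = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """One time step: summarize 'current input + past memory'."""
    return np.tanh(U @ x_t + W @ h_prev + b)

h = np.zeros(hidden_size)                        # empty memory before the sequence
for x_t in rng.normal(size=(5, input_size)):     # a toy sequence of length 5
    h = rnn_step(x_t, h)                         # the same weights are reused every step
print(h)
```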

3.3.2 Training process and advantages and disadvantages

In a recurrent neural network, since earlier signals are superimposed on the input, backpropagation differs from that of a conventional network: for the input layer at time t, the residual comes not only from the output but also from the hidden layers at later times. Backpropagation uses the output-layer error to compute the gradient of each weight, and gradient descent then updates the weights.
1. Advantages:

  • The model is a deep model in the time dimension, which can model the sequence content.

2. Disadvantages:

  • There are many parameters to train, and vanishing or exploding gradients occur easily;
  • No feature learning ability.

3.3.3 Improved model

Recurrent neural network models can process sequence data, but they contain a large number of parameters and are difficult to train (gradients vanish or explode along the time dimension). A series of optimizations of the RNN has therefore appeared, covering network structure, solution algorithms and parallelization.
In recent years, bidirectional RNN (BRNN) and LSTM have made breakthroughs in image captioning, language translation, and handwriting recognition.
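For the LSTM just mentioned, here is a schematic cell in its standard formulation (my own code, not the article's): the input, forget and output gates control what is written to, kept in and read from the cell state, and the additive cell-state update is what lets gradients survive over long time spans.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
n_in, n_hid = 3, 4

# One weight matrix per gate plus one for the candidate cell state.
Wf, Wi, Wo, Wc = (rng.normal(0, 0.1, (n_hid, n_in + n_hid)) for _ in range(4))
bf, bi, bo, bc = (np.zeros(n_hid) for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(Wf @ z + bf)          # forget gate: how much old memory to keep
    i = sigmoid(Wi @ z + bi)          # input gate: how much new content to write
    o = sigmoid(Wo @ z + bo)          # output gate: how much memory to expose
    c_tilde = np.tanh(Wc @ z + bc)    # candidate memory content
    c = f * c_prev + i * c_tilde      # additive update -> gradients decay far less
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c)
print(h)
```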

3.4 Hybrid structure

Besides the three networks above, and the deep residual learning and LSTM mentioned earlier, deep learning has many other structures. For example, since an RNN can carry historical information forward, could it also absorb some information from the future? In sequence-signal analysis, being able to look ahead certainly helps recognition. Hence the bidirectional RNN and bidirectional LSTM, which use historical and future information at the same time.
In fact, whatever the network type, they are often mixed in practical applications. For example, CNNs and RNNs are usually topped with a fully connected layer before the output, and it is hard to say which category such a network belongs to.
It is not difficult to imagine that as the enthusiasm for deep learning continues, more flexible combinations and more network structures will be developed. Although it seems ever-changing, the starting point of the researchers is definitely to solve a specific problem. If you want to conduct research in this area, you may wish to carefully analyze the respective characteristics of these structures and the means by which they achieve their goals.

3.5 Comparison of CNN and RNN

An important feature of the RNN is that it can handle inputs of variable length and still produce a definite output. When your input can be long or short, for example when training a translation model where sentence lengths are not fixed, you cannot handle it the way a CNN handles images of fixed pixel size, but the recurrence of the RNN makes it straightforward.
In sequence-signal applications, a CNN only responds within a preset signal length (the length of its input vector), whereas the response length of an RNN is learned.
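A small illustration of that point (toy code of my own): the same RNN weights can process sequences of any length and always produce a fixed-size summary, whereas a fixed-size input layer cannot.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
U = rng.normal(0, 0.1, (n_hid, n_in))
W = rng.normal(0, 0.1, (n_hid, n_hid))

def encode(sequence):
    """Run the recurrence over a sequence of arbitrary length; return the final state."""
    h = np.zeros(n_hid)
    for x_t in sequence:
        h = np.tanh(U @ x_t + W @ h)
    return h

short = rng.normal(size=(2, n_in))    # a "sentence" of 2 tokens
long = rng.normal(size=(9, n_in))     # a "sentence" of 9 tokens
print(encode(short).shape, encode(long).shape)   # both (4,): same-sized summary
```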

The response of CNN to features is linear, and RNN is nonlinear in this progressive direction. This also makes a big difference.

CNN specializes in image problems: it can be used as a feature extraction layer placed at the input, with an MLP at the end for classification.
RNN specializes in time-series problems: it extracts temporal information and is placed after the feature extraction layer (such as a CNN).

RNN (recurrent network): used for sequence data, with a certain memory effect, often supplemented by LSTM.
CNN: focuses on spatial mapping; image data is especially suited to this scenario.

CNN convolution is good at building up global features from local ones, while RNN is good at handling time series.

4. Some basic concepts and knowledge

4.1 Linear regression, linear neural network, Logistic/Softmax regression

See http://blog.csdn.net/erlib/article/details/53585134 or other references.

4.2 About convolution, pooling, activation function, etc.

Getting-started reference: http://blog.csdn.net/u010859498/article/details/78794405
For more detail, search Google or Baidu.

4.3 Recommend a good introductory material

A brief translated version of the talk "Understanding Deep Learning in One Day" by Professor Li Hongyi (Hung-yi Lee) of the Department of Electrical Engineering, National Taiwan University:
https://www.jianshu.com/p/c30f7c944b66

Reference materials:
http://blog.csdn.net/erlib/article/details/53585134
https://www.zhihu.com/question/34681168/answer/156552873
http://blog.csdn.net/u010859498/article/details/78794405

Origin: blog.csdn.net/u013328649/article/details/113122837