In-depth notes Part 1 (reproduced)

Reprinted deep learning notes

Source: Sophia@知识, https://zhuanlan.zhihu.com/p/152362317
This article is reprinted for academic sharing only. If there is any infringement, please contact us and the article will be removed.

1. Basic concepts of deep learning

[Infographic: supervised learning, basic network architectures, structured vs. unstructured data, and the drivers of deep learning]
Supervised learning: every input has a corresponding output. In the various network architectures, the input and output node layers sit at the two ends of the network, and training consists of continually adjusting the connection weights between them.
Top left: network architectures used for supervised learning. For example, a standard neural network (NN) can learn the function from house features to house price, a convolutional neural network (CNN) can learn the function from images to categories, and a recurrent neural network (RNN) can learn the function from speech to text.
Lower left: simplified diagrams of the NN, CNN and RNN architectures. Their forward passes differ: an NN multiplies node values by a weight matrix (the connections) and propagates the result layer by layer; a CNN slides a rectangular convolution kernel over the image input, convolving at each position to produce the next layer; an RNN memorizes or forgets information from previous time steps to provide long-term context for the current computation.
Upper right: NN can handle both structured data (tables, databases, etc.) and unstructured data (images, audio, etc.).
Bottom right: the rise of deep learning is mainly driven by the emergence of big data: training neural networks requires a lot of data, and big data in turn encourages larger networks. Another major breakthrough was a new type of activation function: replacing the sigmoid with ReLU keeps gradient descent fast during backpropagation, because the sigmoid's derivative tends to zero toward positive and negative infinity, which is the main cause of vanishing gradients and slow (or even failed) training. Studying deep learning means iterating the virtuous circle of "idea → code → experiment → idea".

2. Logistic regression

[Infographic: logistic regression, loss function, gradient descent, and the training loop]
Top left: logistic regression is mainly used for binary classification. As shown in the figure, it can decide whether an image is a cat, where the image is the input (x) and cat (1) or non-cat (0) is the output. Logistic regression can be viewed as the problem of separating two sets of data points. With only linear regression (a linear activation function), data points with a nonlinear boundary (for example, one set of points surrounded by the other) cannot be separated effectively, so the linear activation must be replaced with a nonlinear one. Here we use the sigmoid activation function, a smooth function with range (0, 1), which lets the network output a continuous, normalized (probability-like) value. For example, an output of (0.2, 0.8) is judged as non-cat (0).
Bottom left: the training goal of the neural network is to find the most appropriate weights w and bias b. How is this done?
This classification task is really an optimization problem: the goal is to minimize the gap between the prediction y hat and the true value y, which formally means finding the minimum of an objective function. So we first fix the form of the objective (loss/cost) function and then use gradient descent to update w and b step by step. Once the loss is minimal, or small enough, we obtain good predictions.
Upper right: a simplified view of how the loss value moves over the parameter surface. The gradient gives the direction of steepest descent, and the learning rate determines both the speed of convergence and the final result. With a large learning rate, the initial convergence is fast and it is easier to escape local minima, but it is hard to settle to a stable value later on; with a small learning rate the situation is reversed. In general we want a larger learning rate early in training and a smaller one later; learning-rate schedules are introduced further below.
Bottom right: the whole training process in summary. Start from the input x, run forward propagation to get the prediction y hat, compute the loss from y hat and y, run backpropagation to update w and b, and repeat until convergence.
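To make that loop concrete, here is a minimal NumPy sketch of logistic regression trained with batch gradient descent. The toy data, learning rate and iteration count are made up for illustration; they are not from the original article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 100 examples with 2 features, binary labels (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 100))          # shape (n_features, m)
y = (X[0] + X[1] > 0).astype(float)    # shape (m,)

w = np.zeros((2, 1))
b = 0.0
lr = 0.1
m = X.shape[1]

for i in range(1000):
    # Forward pass: y_hat = sigmoid(w^T x + b)
    y_hat = sigmoid(w.T @ X + b)                       # shape (1, m)
    # Cross-entropy loss (the logistic-regression cost function)
    loss = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    # Backward pass: gradients of the loss w.r.t. w and b
    dz = y_hat - y                                     # shape (1, m)
    dw = (X @ dz.T) / m
    db = np.sum(dz) / m
    # Gradient-descent update
    w -= lr * dw
    b -= lr * db
    if i % 200 == 0:
        print(f"iteration {i}: loss {loss:.4f}")
```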

3. Characteristics of shallow network

[Figure: a shallow network with one hidden layer, common activation functions, why nonlinearity is needed, and parameter initialization]
Top left: a shallow network is one with few hidden layers; the figure shows a network with a single hidden layer.
Lower left: characteristics of the common activation functions (a small code sketch follows this list):
sigmoid: often used for binary classification, or in the last layer of a classifier, mainly because of its normalizing (0, 1) output. Its gradient approaches zero on both sides, which slows training.
tanh: compared with sigmoid, its gradients are larger (and its output is zero-centered), which speeds up training.
ReLU: can be understood as threshold activation (a special case of the spiking model, similar to how biological neurons work). It is very commonly used and is basically the default choice. It avoids the slow-training problem, and because nodes with zero activation do not participate in backpropagation, it also has a sparsifying effect on the network.
Leaky ReLU: avoids exactly-zero activations so that backpropagation always flows, but it is rarely used in practice.
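For reference, a minimal NumPy sketch of the four activations just listed (the leaky slope 0.01 is a common default, not a value from the original article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))       # output in (0, 1); gradient vanishes for large |z|

def tanh(z):
    return np.tanh(z)                      # output in (-1, 1); zero-centered, larger gradients than sigmoid

def relu(z):
    return np.maximum(0.0, z)              # threshold activation; exactly zero for negative inputs

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)   # small negative slope keeps gradients flowing
```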
Upper right: why use an activation function at all? More precisely, why use a nonlinear activation function?
As the example in the figure shows, a network without a nonlinear activation function produces, after two layers of propagation, the same result as a single linear layer: without nonlinear activations, a neural network of any depth is equivalent to a single-layer network (not counting the input layer).
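A quick numeric check of this point, using arbitrary random weights (an illustration, not code from the original notes): two stacked linear layers collapse into one linear map.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(3, 1))
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=(2, 1))

# Two "layers" with a linear (identity) activation...
two_layer = W2 @ (W1 @ x + b1) + b2
# ...equal a single layer with W = W2 W1 and b = W2 b1 + b2.
one_layer = (W2 @ W1) @ x + (W2 @ b1 + b2)
print(np.allclose(two_layer, one_layer))  # True
```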
Lower right: how should the parameters w and b be initialized?
If all parameters are initialized to zero, every node in a layer becomes identical and the network can only learn one and the same feature during training, not diverse, multi-level features. The fix is to initialize the parameters randomly with only a small variance, e.g. multiplying random values by 0.01; that 0.01 is itself one of the hyperparameters.
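A minimal sketch of that initialization for a one-hidden-layer network; the layer sizes and the helper name `init_params` are placeholders for illustration:

```python
import numpy as np

def init_params(n_x, n_h, n_y, scale=0.01):
    # Small random weights break the symmetry between hidden units;
    # biases can safely start at zero.
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(n_h, n_x)) * scale
    b1 = np.zeros((n_h, 1))
    W2 = rng.normal(size=(n_y, n_h)) * scale
    b2 = np.zeros((n_y, 1))
    return W1, b1, W2, b2
```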

4. Features of deep neural networks

[Figure: why depth helps, hierarchical feature composition in CNNs and RNNs, and the resource and hyperparameter demands of deep networks]
Top left: the representational capacity of a neural network grows exponentially with the number of layers; that is, problems that some deep networks can solve would require exponentially more computation from a shallow network.
Bottom left: a deep CNN composes simple low-level features into ever more complex ones; the greater the depth, the more complex and varied the images it can classify. A deep RNN works the same way: it can decompose speech into phonemes and then gradually compose them into letters, words and sentences, enabling complex speech-to-text tasks.
Right: deep networks require large amounts of training data and computing resources, involve many matrix operations that can be run in parallel on GPUs, and come with many hyperparameters: learning rate, number of iterations, number of hidden layers, choice of activation function, learning-rate schedule, batch size, regularization method, and so on.

5. Bias and variance

So what should you pay attention to when building and deploying a machine learning model? The figure below covers the dataset splitting, bias and variance issues involved in building ML applications.

[Infographic: train/dev/test splits, bias vs. variance, and the corresponding remedies]
As shown above, the number of samples needed by classic machine learning and by deep learning models differs greatly: deep learning typically uses thousands of times more samples than classic ML, so the proportions allocated to the training, development and test sets also differ. We assume, of course, that all of these sets follow the same distribution.
Bias and variance are also common challenges for machine learning models. The figure shows, in turn, the underfitting caused by high bias and the overfitting caused by high variance. Broadly, high bias is addressed by choosing a more complex network or a different architecture, while high variance is addressed by adding regularization, reducing model redundancy, or training on more data.
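As a rough illustration of that diagnosis, here is a small sketch; the error numbers and the helper name `diagnose` are hypothetical:

```python
def diagnose(train_err, dev_err, bayes_err=0.0):
    bias = train_err - bayes_err      # avoidable bias (underfitting)
    variance = dev_err - train_err    # generalization gap (overfitting)
    if bias >= variance:
        return "high bias: try a larger network or a different architecture"
    return "high variance: add regularization or use more training data"

print(diagnose(train_err=0.15, dev_err=0.16))  # mostly bias
print(diagnose(train_err=0.01, dev_err=0.11))  # mostly variance
```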
Of course there is more to building machine learning models than this, but these are the most basic and important parts of configuring an ML application. Other topics, such as data preprocessing, data normalization and hyperparameter selection, appear in the infographics below.

6. Regularization

Regularization is the main way to address high variance, i.e. model overfitting. Over the past few years researchers have proposed and developed a variety of regularization methods for machine learning algorithms, such as data augmentation, L2 regularization (weight decay), L1 regularization, Dropout, DropConnect, stochastic pooling and early stopping.
[Infographic: L1/L2 regularization, Dropout, data augmentation, and early stopping]
As shown in the left column of the figure, L1 and L2 regularization are the most widely used regularization methods in machine learning. L1 regularization adds a term to the objective function that penalizes the sum of the absolute values of the parameters, while L2 regularization penalizes the sum of their squares. L1 regularization tends to produce sparse parameter vectors, driving many parameters toward zero, so it is often used in feature-selection settings. The L2 parameter-norm penalty, in turn, makes the learning algorithm "perceive" the input x as having higher variance, so the weights of features whose covariance with the output target is small relative to this added variance are shrunk.
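For concreteness, a sketch of how the L1 and L2 penalty terms enter a cost function and its gradient; the regularization strength `lam`, batch size `m` and function names are placeholders for illustration:

```python
import numpy as np

def l2_cost_and_grad(w, base_cost, base_grad, lam, m):
    # L2 (weight decay): add (lam / 2m) * sum(w^2); the gradient gains (lam / m) * w.
    cost = base_cost + (lam / (2 * m)) * np.sum(w ** 2)
    grad = base_grad + (lam / m) * w
    return cost, grad

def l1_cost_and_grad(w, base_cost, base_grad, lam, m):
    # L1: add (lam / m) * sum(|w|); the subgradient gains (lam / m) * sign(w),
    # which tends to drive many weights exactly to zero (sparsity).
    cost = base_cost + (lam / m) * np.sum(np.abs(w))
    grad = base_grad + (lam / m) * np.sign(w)
    return cost, grad
```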
The middle column of the figure shows Dropout, a technique that temporarily drops a subset of neurons and their connections. Randomly dropping neurons prevents overfitting while implicitly and efficiently combining exponentially many different network architectures. A network that uses Dropout sets a keep probability p, and during each training pass every neuron is dropped with probability 1 - p. At inference time all neurons are kept, which yields higher accuracy.
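A minimal sketch of the common "inverted dropout" implementation of this idea; because activations are scaled by 1/p at training time, nothing special is needed at inference, where all neurons are kept:

```python
import numpy as np

def dropout_forward(a, p=0.8, training=True):
    """Apply dropout to activations a with keep probability p."""
    if not training:
        return a                           # inference: keep every neuron
    mask = (np.random.rand(*a.shape) < p)  # 1 with probability p, 0 with probability 1 - p
    return a * mask / p                    # scale so the expected activation is unchanged
```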
Bagging is a technique for reducing generalization error by combining several models: train several different models separately, then have all of them vote on the output for each test sample. Dropout can be viewed as a form of Bagging that integrates a very large number of deep neural networks, so it offers an inexpensive approximation to training and evaluating a bagged ensemble of many networks.
Finally, the figure also covers data augmentation and early stopping. Data augmentation artificially enlarges the training set by applying transformations or perturbations to the training data; techniques such as horizontal or vertical flips, cropping, color shifts, scaling and rotation are commonly used in visual representation learning and image classification. Early stopping is used to prevent the poor generalization that comes from over-training: with too few iterations the algorithm tends to underfit (low variance, high bias), while with too many it tends to overfit (high variance, low bias). Early stopping resolves this trade-off by choosing the number of iterations at which to stop.
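A sketch of early stopping on dev-set error; `train_one_epoch`, `dev_error` and `save_checkpoint` are hypothetical caller-supplied callbacks, and only the stopping logic is shown:

```python
def train_with_early_stopping(model, train_one_epoch, dev_error, save_checkpoint,
                              max_epochs=200, patience=10):
    """Stop when the dev-set error has not improved for `patience` epochs."""
    best_err, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)              # hypothetical training step
        err = dev_error(model)              # hypothetical dev-set evaluation
        if err < best_err:
            best_err, best_epoch, wait = err, epoch, 0
            save_checkpoint(model)          # remember the best weights seen so far
        else:
            wait += 1
            if wait >= patience:            # no improvement for `patience` epochs
                break
    return best_epoch, best_err
```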

7. Optimization

Optimization is a very important module of a machine learning model: it drives the entire training process and determines both the final performance of the model and the time needed to converge. The two infographics below cover the points an optimization method needs to address, from the preparatory steps to the specific algorithms.
[Infographic: input normalization, vanishing/exploding gradients, and gradient checking]
The figure above shows common issues that arise before and during optimization and how to handle them. First, before optimizing we normalize the input data, and the normalization constants (mean and variance) used for the development and test sets must be the same as those computed on the training set. The figure also shows why: if the features differ greatly in scale, the loss surface becomes a long, narrow ellipse, and gradient descent (or steepest descent) zig-zags and has difficulty converging; normalizing the surface toward a circle reduces this oscillation in the descent direction.
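A sketch of that normalization step, reusing the training-set statistics on the dev and test sets as described above (function names are placeholders):

```python
import numpy as np

def fit_normalizer(X_train):
    # Compute per-feature mean and standard deviation on the training set only.
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-8    # avoid division by zero
    return mu, sigma

def normalize(X, mu, sigma):
    # Apply the *same* constants to the training, dev and test data.
    return (X - mu) / sigma
```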
Vanishing and exploding gradients are also very common. Vanishing gradients means that the norm of the parameter gradients shrinks exponentially as the network gets deeper; tiny gradients mean the parameters change very slowly and learning stalls. Exploding gradients refers to large error gradients accumulating during training and causing huge weight updates; in extreme cases the weights become so large that they overflow to NaN.
Gradient checking is probably used less often now, since on TensorFlow and other frameworks we simply call an optimizer to run the optimization algorithm. Gradient checking computes approximate derivatives numerically, so it can verify that the gradients we derived analytically are correct.
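A minimal sketch of numerical gradient checking with the centered-difference formula; the cost function f(w) = sum(w^2) is just an example chosen for illustration:

```python
import numpy as np

def numerical_grad(f, w, eps=1e-7):
    # Centered difference: df/dw_i ≈ (f(w + eps*e_i) - f(w - eps*e_i)) / (2*eps)
    grad = np.zeros_like(w)
    for i in range(w.size):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        grad.flat[i] = (f(w_plus) - f(w_minus)) / (2 * eps)
    return grad

# Check the analytic gradient of f(w) = sum(w^2), which is 2w.
w = np.array([1.0, -2.0, 3.0])
analytic = 2 * w
numeric = numerical_grad(lambda v: np.sum(v ** 2), w)
# Relative error should be tiny (around 1e-9 or smaller).
print(np.linalg.norm(analytic - numeric) / np.linalg.norm(analytic + numeric))
```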
Next come the specific optimization algorithms, including basic mini-batch stochastic gradient descent, SGD with momentum, and adaptive learning-rate algorithms such as RMSProp.
[Infographic: mini-batch SGD, momentum, RMSProp and Adam]
Mini-batch stochastic gradient descent (usually just called SGD) updates the parameters using one batch of data at a time, greatly reducing the computation needed per iteration. It reduces the variance of the parameter updates, making convergence more stable, and it can exploit the highly optimized matrix operations of popular deep learning frameworks to compute the gradient of each mini-batch efficiently. A mini-batch typically contains between 50 and 256 samples, though this varies with the application.
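A sketch of the mini-batch loop; `compute_grads(w, X_batch, y_batch)` is a hypothetical caller-supplied gradient function, and the batch size and learning rate are placeholder defaults:

```python
import numpy as np

def minibatch_sgd(w, X, y, compute_grads, lr=0.01, batch_size=64, epochs=10):
    m = X.shape[0]
    for _ in range(epochs):
        perm = np.random.permutation(m)                    # reshuffle the data each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            grad = compute_grads(w, X[idx], y[idx])        # hypothetical gradient function
            w = w - lr * grad                              # one update per mini-batch
    return w
```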
The momentum strategy is designed to accelerate SGD, especially in regions of high curvature. Roughly, the momentum algorithm uses an exponentially decayed moving average of past gradients to correct the update direction, making better use of historical gradient information. It introduces a variable v, the velocity with which the parameters move through parameter space, which is typically set to an exponentially decayed moving average of the negative gradient.
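The momentum update written as an exponentially decayed moving average of past gradients (a sketch; beta = 0.9 is a common default, not a value from the original notes):

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # v accumulates an exponentially decayed average of gradients (the "velocity").
    v = beta * v + (1 - beta) * grad
    w = w - lr * v
    return w, v
```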
The adaptive learning-rate algorithms shown later in the figure, such as RMSProp and Adam, are currently the most commonly used optimizers. RMSProp (Hinton, 2012) modifies AdaGrad to perform better in the non-convex setting: it replaces the accumulated sum of squared gradients with an exponentially weighted moving average, thereby discarding gradient history from the distant past. RMSProp was proposed by Hinton in his online course and can in fact be regarded as a special case of AdaDelta, but it has proved to perform very well in practice and is widely used in deep learning.
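A sketch of the RMSProp update: the squared gradients are accumulated with exponential decay and used to scale each parameter's step (the hyperparameter defaults are common choices, not values from the original notes):

```python
import numpy as np

def rmsprop_step(w, s, grad, lr=0.001, beta=0.9, eps=1e-8):
    # s is an exponentially weighted moving average of the squared gradients.
    s = beta * s + (1 - beta) * grad ** 2
    w = w - lr * grad / (np.sqrt(s) + eps)
    return w, s
```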
The Adam algorithm combines the advantages of AdaGrad and RMSProp. Like RMSProp, it computes adaptive per-parameter learning rates from an exponentially decayed average of squared gradients (the uncentered second moment), and in addition it maintains an exponentially decayed average of the gradients themselves (the first moment), i.e. momentum.
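A sketch of the Adam update, combining the first-moment (momentum) and second-moment (RMSProp-style) estimates with bias correction; the hyperparameter defaults are the commonly cited ones:

```python
import numpy as np

def adam_step(w, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # first moment: mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment: uncentered variance
    m_hat = m / (1 - beta1 ** t)                 # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```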

To be continued...

Origin: blog.csdn.net/weixin_45680994/article/details/108553837