"Computer Vision Fundamentals Blue Book" Part 2 Deep Learning Fundamentals

This column systematically explains the basics of computer vision: Part 1 machine learning fundamentals, Part 2 deep learning fundamentals, Part 3 convolutional neural networks, Part 4 classic popular network structures, Part 5 object detection fundamentals, Part 6 network construction and training, Part 7 model optimization methods and ideas, Part 8 model hyperparameter tuning strategies, Part 9 model improvement techniques, Part 10 model deployment fundamentals, and more. The full column runs to 100,000+ words and will make it easy for you to get started with computer vision. You are welcome to subscribe: "Blue Book Catalogue of Basic Knowledge of Computer Vision"


0. Basic Concepts


Machine learning

Machine learning uses computers together with probability theory, statistics, and related knowledge: given input data, the computer learns new knowledge from it. The process of machine learning is the process of optimizing an objective function on training data.

Deep learning

Deep learning is a research direction within machine learning. It learns the inherent laws and representation levels of sample data, and the information obtained in this process greatly helps the interpretation of data such as text, images, and sound. Its ultimate goal is to give machines the ability to analyze and learn like humans, and to recognize data such as words, images, and sounds. Deep learning is a complex class of machine learning algorithms that has achieved results in speech and image recognition far exceeding earlier techniques.

1. Forward Propagation and Back Propagation

Neural network computation consists of two main phases: forward propagation (FP) acts on each layer's input and computes the output layer by layer; backpropagation (BP) acts on the network's output and updates the network parameters from the deep layers to the shallow ones using the computed gradients.

1.1 Forward propagation

(figure: forward propagation through a single node)

Suppose nodes $i, j, k, \ldots$ in the previous layer are connected to node $w$ in the current layer. How is the value of node $w$ computed? The outputs of nodes $i, j, k, \ldots$ are combined in a weighted sum with the corresponding connection weights, a bias term is added (omitted in the figure for simplicity), and the result is passed through a nonlinear function (the activation function), such as $ReLU$ or $sigmoid$; the final result is the output of node $w$.

Repeating this computation layer by layer eventually yields the output-layer result.
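As a concrete illustration, here is a minimal sketch of this single-node computation in PyTorch (the numeric values are arbitrary examples):

import torch

# Forward step for one node w: weighted sum of the previous layer's
# outputs plus a bias, passed through an activation function.
x = torch.tensor([0.5, -1.2, 0.3])   # outputs of nodes i, j, k
w = torch.tensor([0.8, 0.1, -0.4])   # connection weights into node w
b = torch.tensor(0.2)                # bias term
z = torch.dot(w, x) + b              # weighted sum
a = torch.sigmoid(z)                 # activation -> output of node w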

1.2 Backpropagation

(figure: backpropagation of the error through the network)

The result obtained by forward propagation (take classification as an example) always contains some error. How can this error be reduced? A widely used method is the gradient descent algorithm, which requires partial derivatives to compute the gradient. Let us take the middle layers as an example:

Let the final error be $E$ and let the output layer use a linear activation function. Then the partial derivative of $E$ with respect to output node $y_l$ is $\frac{\partial E}{\partial y_l} = y_l - t_l$, where $t_l$ is the true value; $\frac{\partial y_l}{\partial z_l}$ is the derivative of the activation function mentioned above, and $z_l$ is the weighted sum mentioned above. The partial derivative of $E$ with respect to $z_l$ in this layer is therefore $\frac{\partial E}{\partial z_l} = \frac{\partial E}{\partial y_l} \frac{\partial y_l}{\partial z_l}$. The next layer back is computed in the same way, except that the computation of $\frac{\partial E}{\partial y_k}$ changes, and so on back to the input layer, where finally $\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial z_j} \frac{\partial z_j}{\partial x_i}$, with $\frac{\partial z_j}{\partial x_i} = w_{ij}$. The weights along the way are then adjusted with these gradients, forward propagation and backpropagation are repeated, and a better result is eventually obtained.
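A minimal sketch of this chain rule using PyTorch's autograd, assuming a single linear output unit y = wx + b and squared error E = ½(y − t)²:

import torch

x = torch.tensor(1.5)                       # input x_i
t = torch.tensor(2.0)                       # true value t_l
w = torch.tensor(0.8, requires_grad=True)   # weight
b = torch.tensor(0.1, requires_grad=True)   # bias

z = w * x + b           # weighted sum z_l
y = z                   # linear activation, so dy/dz = 1
E = 0.5 * (y - t) ** 2  # dE/dy = y - t

E.backward()            # backpropagation
print(w.grad)           # dE/dw = (y - t) * x
print(b.grad)           # dE/db = y - t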


2. How to choose a deep learning framework?

In recent years, with the development of deep learning algorithms, many deep learning frameworks have emerged. Each framework has its own characteristics and advantages, and you can choose which framework to master first according to your task needs.

2.1 TensorFlow


Google's TensorFlow is arguably the most popular open-source deep learning framework in the industry today and can be used for various deep learning-related tasks.

  • TensorFlow supports multiple programming languages, including Python, JavaScript, C++, Java, Go, C#, Julia, and R.
  • TensorFlow not only has a powerful computing cluster, but can also run models on mobile platforms such as iOS and Android.
  • Getting started with TensorFlow programming is difficult. Beginners need to carefully consider the architecture of a neural network, properly evaluating the dimensions and quantity of input and output data.
  • TensorFlow operates on a static computational graph: the graph must be defined first and then run, and changing the architecture means retraining the model. This approach was chosen for efficiency, but many modern neural network tools can already refine the learning process on the fly without significantly slowing down training. In this regard, TensorFlow's main competitor is PyTorch.

2.2 Keras


Keras is a very friendly and simple deep learning framework for novice users. If you want to get started with deep learning quickly, Keras will be a good choice.

Keras is a high-level API closely integrated with TensorFlow. At a high level, Keras can call TensorFlow, CNTK, or Theano as its backend, and more excellent libraries are gradually being supported. Keras is characterized by its ability to build models quickly, which is key to efficient research.

The basic features of Keras are as follows:

  • Highly modular, building a network is very simple;
  • The API is simple and has a unified style;
  • Easy to extend, easy to add new modules, just need to write new classes or functions imitating existing modules.

2.3 Caffe


Caffe was developed by AI scientist Yangqing Jia during his Ph.D. at the University of California, Berkeley. It is one of the early deep learning frameworks based on C++/CUDA code, predating TensorFlow, MXNet, and PyTorch. Caffe must be compiled and installed. It supports command-line, Python, and MATLAB interfaces, and can easily be used for single-machine multi-GPU and multi-machine multi-GPU training.

The basic features of Caffe are as follows:

  • Based on C++/CUDA/Python code, it is fast and has high performance.
  • Factory design pattern, clear code structure, strong readability and extensibility.
  • Support command line, Python and Matlab interface, easy to use.
  • It is convenient to switch between CPU and GPU, and multi-GPU training is convenient.
  • Tools are abundant and the community is active.

The shortcomings of Caffe are also obvious, mainly including the following points:

  • The threshold for source code modification is high, and forward/backward propagation needs to be implemented.
  • Auto-derivation is not supported.
  • Model-level parallelism is not supported, only data-level parallelism is supported.
  • Not suitable for non-image tasks.

2.4 PyTorch


PyTorch is a deep learning framework released by the Facebook team in January 2017. Although it arrived later than TensorFlow, Keras, and other frameworks, attention to it has risen steadily since its release, and its popularity on GitHub has surpassed Theano, Caffe, MXNet, and other frameworks. PyTorch is currently the most widely used framework in academia.

PyTorch mainly provides the following two core functions:

  • Support GPU-accelerated tensor computing;
  • Automatic differentiation mechanism to facilitate optimization of models.

The main advantages of PyTorch are as follows:

  • Concise and easy to understand: PyTorch's API design is concise and consistent; it is essentially three levels of abstraction, tensor, autograd, and nn, which makes it very easy to learn.
  • Easy to debug: PyTorch uses dynamic graphs and can be debugged like normal Python code. Unlike TensorFlow, PyTorch's error messages are usually easy to understand.
  • Powerful and efficient: PyTorch provides a rich set of model components, so ideas can be implemented quickly.

2.5 Theano


Theano was born in 2008 and is developed and maintained by the LISA laboratory at the University of Montreal. It is a high-performance symbolic computation and deep learning framework, built entirely in Python and dedicated to the definition, evaluation, and optimization of mathematical expressions. Thanks to its transparent use of the GPU, Theano is especially suited to mathematical expressions involving high-dimensional arrays, and its computational efficiency is relatively high.

Because Theano appeared earlier, a group of deep learning libraries based on Theano emerged later, and completed the upper-level encapsulation and functional expansion of Theano. Among these derived libraries, the most famous is Keras. Keras encapsulates some basic components into modules, making it easier for users to write, debug, and read network code.

2.6 CNTK


CNTK (Microsoft Cognitive Toolkit) is Microsoft's open-source deep learning toolkit, which describes a neural network as a series of computational steps in a directed graph. In the graph, leaf nodes represent input values or network parameters, and other nodes represent matrix operations on their inputs.

CNTK lets users easily implement and combine popular model types, including feedforward neural networks (DNNs), convolutional neural networks (CNNs), and recurrent neural networks (RNNs/LSTMs). Like most current frameworks, CNTK implements automatic differentiation and uses stochastic gradient descent for optimization.

The basic features of CNTK are as follows:

  • CNTK has better performance. According to its official statement, it has better performance than other open source frameworks.
  • It is well suited to speech tasks. CNTK was open-sourced by Microsoft's speech team, so it is naturally well adapted to speech, and it is convenient when using models such as RNNs and convolutions over spatiotemporal scales.

2.7 MXNet


The MXNet framework allows mixing symbolic and imperative programming to maximize efficiency and productivity. At the core of MXNet is a dynamic dependency scheduler that automatically parallelizes symbolic and imperative operations on the fly. Its graph optimization layer makes symbolic execution faster and more memory-efficient.

The basic features of MXNet are as follows:

  • Flexible programming model: Supports imperative and symbolic programming models.
  • Multi-language support: supports C++, Python, R, Julia, JavaScript, Scala, Go, Perl, etc. In fact, it is the only framework that supports all R functions.
  • Local Distributed Training: Supports distributed training on multiple CPU/GPU devices, enabling it to take full advantage of the scale advantages of cloud computing.
  • Performance optimization: Use an optimized C++ backend engine for parallel I/O and computation to achieve the best performance regardless of the language used.
  • Cloud friendly: Directly compatible with S3, HDFS and Azure.

2.8 ONNX


The ONNX (Open Neural Network eXchange) project was jointly developed by Microsoft, Amazon, Facebook, IBM, and other companies to define an open format for deep learning models. ONNX simplifies the process of transferring models between different AI workflows, combining the benefits of the various deep learning frameworks.

The basic features of ONNX are as follows:

  • ONNX enables models to be trained in one framework and transferred to make predictions in another.
  • ONNX models are currently supported in Caffe2, CNTK, MXNet, and PyTorch, and there are also connectors to other common frameworks and libraries.
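As an illustration, here is a minimal sketch of exporting a PyTorch model to ONNX via torch.onnx.export (the choice of ResNet-18 and the file name are arbitrary examples):

import torch
import torchvision

model = torchvision.models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)    # example input that traces the graph
torch.onnx.export(model, dummy_input, "resnet18.onnx")
# The .onnx file can then be loaded for inference in another framework or runtime.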

3. Hyperparameters

3.1 What are hyperparameters?

There are two types of parameters in machine learning models:

  • One class is learned and estimated from the data: model parameters (Parameters), i.e. the parameters of the model itself, such as convolution kernel weights and BN-layer parameters.
  • The other class consists of tuning parameters of the machine learning algorithm that must be set manually: hyperparameters, such as the learning rate, batch size, optimizer-specific parameters, and tunable parameters of some loss functions.

3.2 How do hyperparameters affect model performance?

| Hyperparameter | How it affects model capacity | Reason |
| --- | --- | --- |
| Learning rate | Tuned to the optimum, increases effective capacity | A learning rate that is too high or too low reduces the model's effective capacity through failed optimization. |
| Loss-function hyperparameters | Tuned to the optimum, increase effective capacity | In most cases these hyperparameters affect optimization; inappropriate values make the model hard to optimize even when the loss function suits the target, reducing effective capacity. |
| Batch size | Too large or too small easily reduces effective capacity | In most cases, choosing a batch size suited to your hardware capacity will not affect model capacity. |
| Dropout ratio | Lowering the drop ratio increases model capacity | Dropping fewer parameters means more effective parameters and more co-adaptation between them, increasing model capacity, though not necessarily effective capacity. |
| Weight decay coefficient | Tuned to the optimum, increases effective capacity | Weight decay effectively limits the range of parameter changes and acts as a regularizer. |
| Optimizer momentum | Tuned to the optimum, may increase effective capacity | Momentum is often used to speed up training, and it makes it easier to jump out of extreme points and avoid getting stuck in local optima. |
| Model depth | Other conditions equal, greater depth increases capacity | Other conditions equal, greater depth means more parameters and stronger fitting ability. |
| Convolution kernel size | Larger size increases capacity | A larger kernel means more parameters, so under the same conditions the model's parameter count grows accordingly. |

3.3 Why hyperparameter tuning?

Essentially, this is about the relationship between model optimization (finding the optimal solution) and the regularization term. The optimization of the network model aims to find the global optimum, while the regularization term constrains how the model fits; the two are usually somewhat opposed, yet share the same goal of minimizing the expected risk. Model optimization minimizes the empirical risk but easily falls into overfitting, and the regularization term is used to constrain model complexity. How to balance the two and obtain an optimal or better solution is the purpose of hyperparameter tuning.

Example of tuning hyperparameters:

The learning rate is arguably the most important hyperparameter in model training. A good learning rate, or schedule of learning rates, not only speeds up training but can also yield better or even optimal accuracy, while a learning rate that is too large or too small directly harms convergence. When a model has trained to a certain point, the loss stops decreasing: the first-order gradient is close to zero, and the corresponding Hessian matrix typically falls into one of two cases. First, it is positive definite, i.e. all eigenvalues are positive; this usually indicates a local minimum. If this local minimum is close to the global minimum, the model already performs well, but if the gap is large, performance still needs improvement; the latter is most common early in training. Second, the eigenvalues have mixed signs; the model has then likely reached a saddle point, where performance is very poor. Both situations arise in the early and middle stages of training, and if the learning rate stays fixed, the model oscillates back and forth or gets stuck at the saddle point and cannot be optimized further. Decaying, or at times increasing, the learning rate therefore helps the model reduce oscillation or escape from saddle points.

3.4 How to find the optimal value of hyperparameters?

When working with machine learning algorithms, there are always hyperparameters that are hard to tune, such as the weight decay coefficient or the Gaussian kernel width. They must be set manually, and the chosen values greatly affect the results. Common ways to set hyperparameters are:

  1. Guess and check: choose parameters based on experience or intuition, and iterate.
  2. Grid search: have the computer try a set of values uniformly spaced within some range (see the sketch after this list).
  3. Random search: have the computer pick a set of values at random (also sketched below).
  4. Bayesian optimization: using Bayesian optimization for hyperparameters runs into the difficulty that the Bayesian optimization algorithm itself requires many parameters.
  5. Genetic algorithms.
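A minimal sketch contrasting grid search and random search over two hyperparameters; train_and_validate is a hypothetical stand-in for a real training run that returns a validation score:

import itertools
import random

def train_and_validate(lr, batch_size):
    # Stand-in for real training; returns a dummy validation score.
    return random.random()

lrs = [1e-4, 1e-3, 1e-2]
batch_sizes = [16, 32, 64]

# Grid search: try every combination on a uniform grid.
grid_results = {(lr, bs): train_and_validate(lr, bs)
                for lr, bs in itertools.product(lrs, batch_sizes)}

# Random search: sample combinations at random from the same ranges.
random_results = {}
for _ in range(5):
    lr = 10 ** random.uniform(-4, -2)   # log-uniform sample for the learning rate
    bs = random.choice(batch_sizes)
    random_results[(lr, bs)] = train_and_validate(lr, bs)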

3.5 General process of hyperparameter search?

The general process of hyperparameter search:

  1. Divide the dataset into training, validation, and test sets.
  2. On the training set, the model parameters are optimized according to the performance indicators of the model.
  3. The model's hyperparameters are searched against the model's performance metrics on the validation set.
  4. Steps 2 and 3 are iterated alternately to finally determine the model's parameters and hyperparameters, and the model's quality is then verified on the test set.

Among them, the search process requires a search algorithm, which generally includes: grid search, random search, heuristic intelligent search, and Bayesian search.


4. Activation function

4.1 The concept of activation function

Activation functions are essential for artificial neural networks to learn and represent very complex, nonlinear functions; they introduce nonlinear properties into the network. In the figure below, the inputs are weighted and summed, and a function $f$ is then applied; this function $f$ is the activation function. It is introduced to increase the nonlinearity of the neural network model.

(figure: a neuron applying the activation function f to its weighted input)

4.2 Why introduce activation function?

Without an activation function, each layer's output is a linear function of the previous layer's input, and no matter how many layers the network has, the output remains a linear combination of the input; this is the most primitive perceptron (Perceptron).

If the activation function is used, the activation function introduces nonlinear factors to the neurons, so that the neural network can approximate any nonlinear function arbitrarily, so that the neural network can be applied to many nonlinear models.

4.3 Why is the activation function a nonlinear function?

If a linear activation function is used, the relationship between input and output is linear, and no matter how many layers of the neural network are linear combinations, it is impossible to approximate any function with nonlinearity.

Nonlinear activation functions are used to increase the nonlinearity of the model, making the network more powerful: it can then learn complex things and complex forms of data, and represent arbitrary nonlinear mappings between inputs and outputs.
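A minimal sketch demonstrating this point: without an activation in between, two stacked linear layers collapse into a single equivalent linear layer.

import torch
import torch.nn as nn

f1 = nn.Linear(4, 8, bias=False)
f2 = nn.Linear(8, 3, bias=False)

x = torch.randn(5, 4)
stacked = f2(f1(x))                            # two linear layers, no activation
merged = x @ (f2.weight @ f1.weight).T         # one equivalent linear layer
print(torch.allclose(stacked, merged, atol=1e-6))  # True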

4.4 Common Activation Functions

4.4.1 Sigmoid
$$\operatorname{Sigmoid}(x)=\sigma(x)=\frac{1}{1+\exp(-x)}$$

(figure: plot of the Sigmoid function)

import torch
import torch.nn as nn   # imports shared by the activation snippets below

m = nn.Sigmoid()
input = torch.randn(2)
output = m(input)

advantage:

  • The output of the sigmoid function is between (0,1), the output range is limited, the optimization is stable, and it can be used as an output layer.

  • Continuous function, easy to derive.

shortcoming:

  • The sigmoid function saturates when its input takes very large positive or negative values: the function becomes flat and insensitive to small changes in the input. During backpropagation, when the gradient is close to 0 the weights are barely updated, and the gradient easily vanishes, making deep networks impossible to train.

  • The output of the sigmoid function is not 0 mean, which will cause the input of the neurons in the later layer to be non-zero mean signals, which will affect the gradient.

  • The computational complexity is high because the sigmoid function is in exponential form.


4.4.2 Tanh
$$\operatorname{Tanh}(x)=\tanh(x)=\frac{\exp(x)-\exp(-x)}{\exp(x)+\exp(-x)}$$
(figure: plot of the Tanh function)

m = nn.Tanh()
input = torch.randn(2)
output = m(input)

4.4.3 ReLU
$$\operatorname{ReLU}(x)=(x)^{+}=\max(0, x)$$
(figure: plot of the ReLU function)

m = nn.ReLU()
input = torch.randn(2)
output = m(input)

An implementation of CReLU - https://arxiv.org/abs/1603.05201
m = nn.ReLU()
input = torch.randn(2).unsqueeze(0)
output = torch.cat((m(input), m(-input)))  # CReLU concatenates ReLU(x) and ReLU(-x)

4.4.4 LeakyReLU
$$\text{LeakyReLU}(x)=\max(0, x)+\text{negative\_slope} \cdot \min(0, x)$$
(figure: plot of the LeakyReLU function)

m = nn.LeakyReLU(0.1)
input = torch.randn(2)
output = m(input)

4.4.5 Softmax
$$\operatorname{Softmax}(x_{i})=\frac{\exp(x_{i})}{\sum_{j}\exp(x_{j})}$$

m = nn.Softmax(dim=1)
input = torch.randn(2, 3)
output = m(input)

4.4.6 SiLU
$$\operatorname{SiLU}(x)=x \cdot \sigma(x), \text{ where } \sigma(x) \text{ is the logistic sigmoid}$$
(figure: plot of the SiLU function)

m = nn.SiLU()
input = torch.randn(2)
output = m(input)

4.4.7 ReLU6
$$\operatorname{ReLU6}(x)=\min(\max(0, x), 6)$$

(figure: plot of the ReLU6 function)

m = nn.ReLU6()
input = torch.randn(2)
output = m(input)

4.4.8 Mish
$$\operatorname{Mish}(x)=x \cdot \operatorname{Tanh}(\operatorname{Softplus}(x))$$

(figure: plot of the Mish function)

m = nn.Mish()
input = torch.randn(2)
output = m(input)

4.5 Activation function properties

  1. Nonlinearity: when the activation function is nonlinear, a two-layer neural network can approximate essentially any function. An identity activation function $f(x) = x$ does not satisfy this property: if an MLP uses identity activations, the whole network is equivalent to a single-layer neural network;
  2. Differentiability : Differentiability guarantees the computability of gradients in optimization. Traditional activation functions such as sigmoid are differentiable everywhere. For piecewise linear functions such as ReLU, it is only differentiable almost everywhere (that is, only non-differentiable at a finite number of points). For the SGD algorithm, since it is almost impossible to converge to a position where the gradient is close to zero, the finite non-differentiable points will not have a great influence on the optimization result.
  3. The calculation is simple : there are many nonlinear functions. The number of computations of the activation function in the forward direction of the neural network is proportional to the number of neurons, so a simple nonlinear function is naturally more suitable as an activation function. This is one of the reasons why ReLU is more popular than other activation functions that use operations such as Exp.
  4. Saturation : Saturation refers to the problem that the gradient is close to zero in some intervals (that is, the gradient disappears), so that the parameters cannot continue to be updated. The most classic example is the sigmoid, whose derivative is close to 0 when x is a relatively large positive value and a relatively small negative value. A more extreme example is a step function, which saturates everywhere because its gradient is 0 at almost all positions and cannot be used as an activation function. The derivative of ReLU is always 1 when x>0, so it will not saturate for even larger positive values. But at the same time, for x<0, its gradient is always 0, and it will also be saturated at this time. Leaky ReLU and PReLU are proposed to solve this problem.
  5. Monotonicity : The sign of the derivative does not change. When the activation function is monotonic, a single-layer network is guaranteed to be a convex function.
  6. Limited output range : The limited output range makes the network more stable for some relatively large inputs, which is why the early activation functions are dominated by such functions, such as Sigmoid and Tanh. But this leads to the aforementioned vanishing gradient problem, and forcing the output of each layer to a fixed range limits its expressiveness. Therefore, this type of function is only used in some occasions that require a specific output range, such as probability output (the log operation in the loss function can offset the effect of its gradient disappearance), and the gate function in LSTM.
  7. Approaching the identity transformation (identity): f(x)≈x, which is approximately equal to x. The advantage of this is that the magnitude of the output does not increase significantly with depth, which makes the network more stable and gradients can be passed back more easily. This is somewhat contradictory to nonlinearity, so the activation function basically only partially satisfies this condition.
  8. Few parameters: most activation functions have no parameters. A single parameter, as in PReLU, slightly increases the size of the network. Another exception is Maxout: although it has no parameters of its own, for the same number of output channels a k-way Maxout needs k times the input channels of other activations, meaning the number of neurons must also grow k-fold; if one does not insist on maintaining the number of output channels, the parameter count can instead be viewed as reduced by a factor of k.
  9. Normalization: the representative activation function is SELU. The main idea is to automatically normalize the sample distribution to zero mean and unit variance, thereby stabilizing training. The idea of normalization is also used in network design, e.g. Batch Normalization.

4.6 How to choose an activation function?

There are many activation functions available today, and choosing the most suitable one is not easy; many factors have to be considered. The usual practice: if you are unsure which activation function works best, try them all and evaluate on a validation or test set, then keep whichever performs best. In practice, the most widely used ones, such as ReLU and SiLU, are usually a reasonable default.

The following are common choices:

  1. If the output is a 0/1 value (binary classification), choose the sigmoid function for the output layer and ReLU for all other units.
  2. If you are not sure which activation to use in the hidden layers, ReLU is the usual choice. Sometimes tanh is used as well; one property of ReLU is that its derivative is 0 for negative inputs, which works well in practice.
  3. sigmoid activation function: basically unused except in the output layer of binary classification problems.
  4. tanh activation function: very good and suitable for almost all occasions.
  5. ReLU activation function: the most common default. If you are not sure which activation function to use, use ReLU or Leaky ReLU first, then try the others.
  6. If you encounter dead neurons, the Leaky ReLU function can help.

5. BatchSize

5.1 Relationship between Epoch, Iteration and BatchSize

We often see epoch, iteration and batchsize in deep learning. Let's talk about the difference between these three according to our own understanding:

batchsize: batch size. In deep learning, training generally uses (mini-batch) SGD, taking batchsize samples from the training set for each update;
iteration: 1 iteration means training once on batchsize samples;
epoch: 1 epoch means training once on all samples of the training set.

For example, if the training set has 1000 samples and batchsize = 10, then training over the entire sample set takes 100 iterations, i.e. 1 epoch.
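A minimal sketch of these relationships with a PyTorch DataLoader (random tensors stand in for a real dataset):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=10, shuffle=True)

for epoch in range(2):                 # each epoch passes over all 1000 samples
    for iteration, (x, y) in enumerate(loader):
        pass                           # one iteration = one batch of 10 samples
    print(epoch, iteration + 1)        # prints 100 iterations per epoch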

5.2 Selection of BatchSize

The batch first determines the direction of descent: the more data in a batch, the more accurately it represents the gradient direction. For small datasets, BatchSize can be the size of the whole dataset, but for large datasets an oversized BatchSize leads to problems such as running out of memory and being unable to train. For online learning, BatchSize is set to 1.

BatchSize should not be too small: the descent direction then fluctuates from batch to batch, which can prevent convergence or require many more epochs to converge. Nor should it be too large: an oversized batch can exhaust GPU memory, and the reduced number of iterations slows parameter updates. If the dataset is sufficient, training with half (or even far less) of the data yields almost the same gradient as training with the full data.

5.3 Is the larger the BatchSize the better?

  1. Using a large BatchSize improves the memory utilization and the parallelization efficiency of large matrix multiplications, but the memory may explode.
  2. The number of iterations required to run an epoch (full data set) with a large BatchSize is reduced, and the processing speed for the same amount of data is further accelerated.
  3. Within a certain range, generally speaking, the larger the BatchSize is, the more accurate the determined descending direction will be, and the smaller the training oscillation will be.
  4. If the BatchSize is too large, however, it easily causes problems such as memory overflow, longer training time, slower convergence, falling into local optima, and poor generalization.

5.4 What effect does adjusting BatchSize have on the training effect?

  1. BatchSize is too small, the model performance is extremely bad (error soars).
  2. As BatchSize increases, the same amount of data is processed faster, but more epochs are needed to reach the same accuracy.
  3. Because these two factors pull in opposite directions, there is a BatchSize at which the total training time is optimal.
  4. Because the final convergence falls into different local extrema, there is likewise a BatchSize at which the final convergence accuracy is optimal.

6. Normalization and Standardization

$$\text{Normalization}: \frac{x_i - x_{\min}}{x_{\max} - x_{\min}}$$

$$\text{Standardization}: \frac{x_i - \mu}{\sigma}$$

Normalization limits the data to be processed (via some algorithm) to a certain required range. It is done, first, for the convenience of subsequent data processing, and second, to ensure faster convergence when the program runs. Concretely, normalization unifies the statistical distribution of the samples: normalization to [0, 1] expresses a statistical probability distribution, while normalization to some other interval expresses a statistical coordinate distribution.

Standardization and normalization are both essentially linear transformations of the data and do not change its ordering. Their biggest difference is that normalization (Normalization) maps the original data into a given interval, while standardization (Standardization) rescales the data to a distribution with standard deviation 1 and mean 0.

In addition, normalization depends only on the maximum and minimum of the data, with scaling factor $\alpha = X_{\max} - X_{\min}$, whereas for standardization the scaling equals the standard deviation ($\alpha = \sigma$) and the shift equals the mean ($\beta = \mu$), so its scaling and shift also change when data other than the extrema change.
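A minimal sketch of the two rescalings defined above (using the housing-style numbers from the next subsection as an arbitrary example):

import torch

x = torch.tensor([80., 90., 100., 1., 2., 3.])

normalized = (x - x.min()) / (x.max() - x.min())   # min-max: values in [0, 1]
standardized = (x - x.mean()) / x.std()            # z-score: mean 0, std 1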

6.1 Why normalization?

Normalization is introduced because different features often have different dimensions or units, and their ranges can differ by orders of magnitude. Without normalization, some indicators may be effectively ignored, distorting the results of data analysis.
For example, the two characteristics that affect the housing price are area and the number of rooms. The area is 80, 90, 100, etc., and the number of rooms is 1, 2, 3, etc. The measurement methods of these two indicators are not on the same order of magnitude.
In order to eliminate the dimensional influence between features, it is necessary to perform normalization processing to solve the comparability between feature indicators. After normalization of the original data, each indicator is in the same order of magnitude and can be directly compared and evaluated.

6.2 Why does normalization increase the speed of finding the optimal solution?

(figure: gradient descent contours before and after normalization)

The figure above shows the search for the optimal solution with and without normalized data (the ellipses can be understood as contour lines). The left plot is the search without normalization; the right plot is the search after normalization.

When gradient descent is used on unnormalized data, it is very likely to take a zigzag route (perpendicular to the contour lines), requiring many iterations to converge. In the right plot, after normalizing the two original features, the corresponding contour lines become nearly circular, and gradient descent converges much faster.

Therefore, if the machine learning model uses the gradient descent method to find the optimal solution, normalization is often very necessary, otherwise it will be difficult to converge or even unable to converge.

6.3 Common normalization methods

  1. Linear normalization

$$x^{\prime} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Scope of application: It is more suitable for the case where the numerical value is relatively concentrated.

Disadvantage: If max and min are unstable, it is easy to make the normalization result unstable, and the subsequent use effect is also unstable.

  2. Standard deviation (z-score) standardization

$$x^{\prime} = \frac{x - \mu}{\sigma}$$

Meaning: the processed data follows a standard normal distribution, i.e. mean 0 and standard deviation 1, where $\mu$ is the mean of all sample data and $\sigma$ is the standard deviation of all sample data.

  3. Nonlinear normalization

    Scope of application: often used when the data spans widely different magnitudes, with some values very large and some very small. The original values are mapped through a mathematical function such as log, exponent, or tangent.

6.4 Local Response Normalization (LRN)

This is a controversial method. LRN is mainly a technique for improving accuracy during deep learning training, and it used to be very common in Caffe, TensorFlow, and other frameworks. It differs from activation functions: LRN is generally applied after activation and pooling. The concept was first proposed in the AlexNet model, where experiments did show it improves the model's generalization ability, but the improvement is so small that it was later abandoned; some even consider it a "pseudo-proposition". Since it is controversial, it is not introduced further here.

For a detailed description, you can refer to this paper: http://www.cs.toronto.edu/~hinton/absps/imagenet.pdf

6.5 Batch Normalization (BN)

Earlier neural network training normalized only the input-layer data and performed no normalization in the intermediate layers. Note that although we normalize the input data, after it passes through matrix multiplication and a nonlinear operation, $\sigma(WX+b)$, its distribution is likely to change, and after the many layers of a deep network the change in distribution grows larger and larger. If we could also normalize in the middle of the network, would training improve? The answer is yes.

This method of normalization in the middle layer of the neural network to make the training effect better is Batch Normalization (BN).

So what are the advantages of using Batch Normalization?

  1. Reduced manual selection of parameters. In some cases, dropout and L2 regular term parameters can be cancelled, or a smaller L2 regular term constraint parameter can be adopted;
  2. Reduced learning rate requirements. Now we can use a large initial learning rate or choose a small learning rate, and the algorithm can also quickly train and converge;
  3. Local response normalization is no longer needed: BN itself is a normalization layer (local response normalization was used in the AlexNet network);
  4. It perturbs the original data distribution and alleviates overfitting to some extent (it prevents particular samples from being selected too often within training batches; the literature reports this can improve accuracy by about 1%).
  5. Reduce gradient disappearance, speed up convergence, and improve training accuracy.

6.6 Batch Normalization Algorithm Process

The process of the BN algorithm during training is given below

Input: the output of the previous layer, $X = \{x_1, x_2, \ldots, x_m\}$, and learnable parameters $\gamma, \beta$

Algorithm flow:

  1. Calculate the mean of the output data of the previous layer

$$\mu_{\beta} = \frac{1}{m} \sum_{i=1}^{m} x_i$$

where $m$ is the size of the training batch.

  2. Calculate the variance of the output data of the previous layer

$$\sigma_{\beta}^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_{\beta})^2$$

  3. Normalize to get

$$\hat{x}_i = \frac{x_i - \mu_{\beta}}{\sqrt{\sigma_{\beta}^2 + \epsilon}}$$

where $\epsilon$ is a small value close to 0, added to avoid a zero denominator.

  4. Reconstruction: rescale and shift the normalized data to get

$$y_i = \gamma \hat{x}_i + \beta$$

where $\gamma, \beta$ are learnable parameters.

Note: the above is the BN procedure during training. At inference time, often only a single sample is input, so there is no batch mean $\mu_{\beta}$ or variance $\sigma_{\beta}^2$. Instead, the mean is taken as the average of the $\mu_{\beta}$ values over all training batches, and the variance uses an unbiased estimate formed from the per-batch $\sigma_{\beta}^2$ values.
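A minimal sketch of the training-time forward pass above, written out by hand (PyTorch's nn.BatchNorm1d implements the same computation plus the running statistics used at inference):

import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    mu = x.mean(dim=0)                        # batch mean, one per feature
    var = x.var(dim=0, unbiased=False)        # batch variance
    x_hat = (x - mu) / torch.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta               # learnable scale and shift

x = torch.randn(32, 10)                       # batch of 32 samples, 10 features
gamma, beta = torch.ones(10), torch.zeros(10)
y = batch_norm_train(x, gamma, beta)
# At inference time, running averages of mu and var collected during training
# replace the current batch's statistics.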

6.7 Group Normalization (GN)

Group Normalization was proposed by Kaiming He's team in 2018. GN addresses the weakness that BN performs poorly when the mini-batch is relatively small.

Group Normalization (GN) first divides the channels into multiple groups and then computes the mean and variance within each group for normalization. GN's computation is independent of the batch size, so it is very stable even for high-resolution images with a small BatchSize.
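A minimal sketch of GN in PyTorch (the group and channel counts are arbitrary examples):

import torch
import torch.nn as nn

gn = nn.GroupNorm(num_groups=8, num_channels=32)   # 32 channels in 8 groups
x = torch.randn(1, 32, 56, 56)                     # works even with batch size 1
y = gn(x)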

6.8 Weight Normalization (WN)

BN/LRN/GN normalize at the data level, while Weight Normalization (WN) normalizes the network weights $W$. WN decouples the weight vector $w$ into a parameter vector $v$ for its direction and a parameter scalar $g$ for its Euclidean norm, and then optimizes both with SGD. WN is also independent of the sample size, so it can be used with small batches and in dynamic networks such as RNNs. Furthermore, BN uses mini-batch statistics in place of global statistics, which amounts to introducing noise into the gradient computation. WN has no such problem, so in noise-sensitive settings such as generative models and reinforcement learning, WN often works better than BN.

WN introduces no additional parameters and needs no extra storage for mini-batch means and variances, saving memory. WN's computational efficiency is also better than that of BN, which must compute normalization statistics.

However, WN does not, like BN, constrain each layer's output Y to a bounded range; therefore, when using WN, special attention should be paid to the choice of initial parameter values.
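A minimal sketch of WN in PyTorch via nn.utils.weight_norm, which reparameterizes the layer's weight into a magnitude g and a direction v:

import torch.nn as nn

layer = nn.utils.weight_norm(nn.Linear(20, 40), name='weight')
print(layer.weight_g.shape)   # magnitude g, one per output unit
print(layer.weight_v.shape)   # direction v, same shape as the weight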

6.9 The location of batch normalization in the network

  • Placing batch normalization before the activation layer effectively avoids batch normalization destroying the distribution of the nonlinear features; in addition, it keeps data points out of the activation function's saturation region as much as possible, alleviating the vanishing gradient problem.
  • Since the prevalent activation function is now ReLU, which does not have the problems of the sigmoid and tanh functions, batch normalization can also be placed after the activation function, preventing the data from converging to similar patterns before the next layer and thus keeping the nonlinear feature distributions from assimilating.

In practice, the original paper placed batch normalization before the activation function, but many people in academia and industry have expressed a preference for placing it after the activation function (the Keras author François Chollet, Kaggle chief scientist Jeremy Howard, and others). Judging from papers of the past couple of years, a large fraction place batch normalization after the activation layer, e.g. MobileNet V2 and ShuffleNet V2. Exactly where to place it remains a debated question.


7. Pre-training and fine-tuning

7.1 Feature extraction and model fine-tuning

Feature extraction

Fine-tuning starts with understanding one concept: feature extraction.
A convolutional neural network for image classification consists of two parts: a stack of convolutional and pooling layers (the convolutional base) plus a densely connected classifier. For convolutional neural networks, feature extraction means taking the convolutional base of a previously trained network and training a new classifier on new data. Why reuse the old convolutional base but train a new classifier? Because what the convolutional base learns is more general, while what the classifier learns is specific to the classes the model was trained on, and densely connected layers discard spatial information.
How general a convolutional base's features are depends on the layer's depth in the model: layers closer to the input extract more general features, while layers closer to the output extract more abstract ones.
During feature extraction the convolutional base should be frozen, i.e. its weights must not change during training, and only the final dense layers are trained. In Keras, freezing is done by setting the trainable attribute of each layer of the convolutional base to False.

Model fine-tuning

Model fine-tuning and feature extraction complement each other. Starting from a convolutional base frozen for feature extraction, fine-tuning means unfreezing several layers near the output and training them jointly with the classifier, so that the model better fits the problem currently being solved.

Fine-tuning case: CNNs have made huge progress in image recognition. When we want to apply a CNN to our own dataset, we usually face a problem: our dataset is not particularly large, generally no more than 10,000 images and often far fewer, with each class containing only dozens of images or so. Directly training a network from scratch on such data is not feasible, because a key factor in deep learning's success is a training set of abundant labeled data; with only the data at hand, even an excellent network structure will not reach high performance. Fine-tuning solves this problem very well: we take a model pre-trained on ImageNet (such as CaffeNet, VGGNet, or ResNet), fine-tune it (freeze the convolutional base and unfreeze the part near the output for training), and then apply it to our own dataset.
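A minimal sketch of this recipe in PyTorch: load ImageNet weights, freeze the convolutional base, and train only a new classifier head (the 10-class setting is a hypothetical example):

import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                 # freeze the convolutional base

model.fc = nn.Linear(model.fc.in_features, 10)  # new, trainable classifier
optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
# To fine-tune further, unfreeze the last block (e.g. model.layer4) and train it
# jointly with the classifier at a small learning rate.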

7.2 What is the difference between fine-tuning and direct training

  1. The process of fine-tuning is equivalent to continuing training; the difference from direct training lies in the initialization.
  2. Direct training initializes parameters as specified by the network definition.
  3. Fine-tuning initializes from a parameter file you already have.

7.3 Three ways to fine-tune the model

  1. Method 1: Only predict, not train.
    Features: Relatively fast and simple, very efficient for those projects that have been trained and now need to actually label unknown data;

  2. Method 2: train, but train only the final classification layer.
    Features: the fine-tuned model's features already meet the requirements, and only the classification layer is retrained on top of them to match the new set of categories.

  3. Method 3: full training, where both the classification layer and the preceding convolutional layers are trained.
    Features: similar in spirit to Method 2, but more time-consuming and demanding of GPU resources. It is well suited when you want to fine-tune the whole model toward your task, and prediction accuracy improves considerably compared with Method 2.

7.4 Fine-tuning methods in practical applications

Whether to fine-tune and the method of fine-tuning should be selected according to the size of your own data set and the similarity of the data set to the pre-trained model data set.

Fine-tuning methods in different situations:

Small amount of data, high similarity: modify the last few layers;
small amount of data, low similarity: keep the first few layers of the pre-trained model and train the latter layers;
large amount of data, high similarity: this is the most ideal situation. Initialize the model with pre-trained weights and retrain the entire model;
large amount of data and low similarity: directly retrain the entire model.


8. Weight initialization

Weight initialization is also known as parameter initialization. Training a deep learning model is essentially a process of updating the weights (the parameters W), but updating cannot begin without an initial value for each parameter. Once the weights are initialized, the neural network can iteratively update the weight parameters w toward better performance.

8.1 All-zero initialization

All-zero initialization is something we want to avoid in neural networks: it makes the network untrainable. After all-zero initialization, the gradients during backpropagation are identical and the parameter updates are identical; in the end the output-layer weights stay the same as one another and the hidden-layer neurons stay identical, which means the neural network has lost its ability to learn features.

In layman's terms, think of gradient descent in a neural network as descending a mountain, but you stand in a straight valley with symmetric slopes on both sides. Because of the symmetry, the gradient where you stand can only point along the valley, never up the slopes; after you take a step, the situation is unchanged. The result is that you can only converge to an extremum along the valley and never reach anywhere on the slopes.

8.2 Random initialization

Random initialization is the method many people use: the initial weights are drawn at random from a Gaussian or uniform distribution. But this has drawbacks: once the random distribution is chosen improperly, network optimization runs into trouble.

If the weights are initialized to small values, such as a Gaussian with mean 0 and variance 0.01, the output values shrink rapidly toward 0 as the number of layers increases; in the later layers almost all outputs x are very close to 0. Giving the weights small Gaussian values is very common for small networks, but in deep networks it makes the gradient vanish.

If instead the weights are initialized to relatively large values, such as a Gaussian with mean 0 and variance 1, almost all activations concentrate around -1 or 1, so neurons become inhibited or saturated during forward propagation. As the number of layers grows, the activation outputs of the later layers almost all sit in the saturated regions (for tanh, the gradients near -1 and 1 are close to 0). During the update the gradient is then very close to 0, and again the gradient vanishes.

8.3 Xavier initialization

Also known as Glorot initialization after its inventor, Xavier Glorot. Xavier initialization was proposed by Glorot et al. to solve the problems of plain random initialization. Their idea is to make the inputs and outputs of each layer follow the same distribution as far as possible, thus preventing the activation outputs of later layers from collapsing toward 0.
Xavier initialization performs well with the sigmoid and tanh activation functions, but poorly with ReLU.

8.4 He initialization

He initialization is a robust initialization method for neural network parameters proposed by Kaiming He et al. It ensures that information flows effectively during both forward and backward propagation, making the variances of the input signals of different layers roughly equal. He initialization corresponds to the nonlinear activation functions ReLU and PReLU.

8.5 Bias initialization

Initializing the biases to zero is possible and common, since symmetry breaking is already provided by the small random numbers in the weights. Because ReLU is nonlinear, some people like to set all biases to a small constant such as 0.01, which ensures every ReLU unit fires at the very beginning and can thus receive and propagate some gradient. However, it is unclear whether this provides a lasting improvement (some results even suggest it makes performance worse), so simply initializing the biases to 0 is more common.

8.6 Summary of initialization methods

A good initialization method can prevent the disappearance of information during the forward pass, and can also solve the gradient disappearance during the backward pass.

When the activation function is the hyperbolic tangent or sigmoid, the Xavier initialization method is recommended.

When the activation function is ReLU or Leaky ReLU, the He initialization method is recommended.
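A minimal sketch of these recommendations using torch.nn.init:

import torch.nn as nn

linear_tanh = nn.Linear(128, 64)
nn.init.xavier_uniform_(linear_tanh.weight)    # Xavier for tanh/sigmoid layers
nn.init.zeros_(linear_tanh.bias)               # bias initialized to 0

linear_relu = nn.Linear(128, 64)
nn.init.kaiming_normal_(linear_relu.weight, nonlinearity='relu')  # He init for ReLU
nn.init.zeros_(linear_relu.bias)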


9. Learning rate

9.1 The role of learning rate

In machine learning, supervised learning works by defining a model and estimating its optimal parameters from data on the training set. Gradient descent is a widely used parameter optimization algorithm for minimizing model error: it estimates the model's parameters over many iterations, minimizing the cost function at each step. The learning rate controls how fast the model learns during this iterative process.

In plain gradient descent, a single learning rate is given and the whole optimization proceeds with a fixed step size. Early in optimization, a larger learning rate gives longer steps and faster descent; later in optimization, gradually reducing the learning rate (and hence the step size) helps the algorithm converge and approach the optimal solution more easily. How to update the learning rate has therefore become a research focus.

9.2 Common Learning Rate Decay Parameters

| Parameter | Description |
| --- | --- |
| learning_rate | Initial learning rate |
| global_step | Global step count used in the decay computation; non-negative, used to compute the decay exponent step by step |
| decay_steps | Number of decay steps; must be positive, determines the decay period |
| decay_rate | Decay rate |
| end_learning_rate | Minimum final learning rate |
| cycle | Whether the learning rate rises again after decaying |
| alpha | Minimum learning rate |
| num_periods | Number of periods in the cosine part of the decay |
| initial_variance | Initial variance of the noise |
| variance_decay | Decay of the noise variance |

9.3 Common Learning Rate Decay Methods

In model optimization, several learning rate decay methods are commonly used: piecewise constant decay, polynomial decay, exponential decay, natural exponential decay, cosine decay, linear cosine decay, noise linear cosine decay.

9.3.1 Exponential decay

The learning rate is updated in an exponential decay manner. The size of the learning rate is exponentially related to the number of training times. The update rule is:
$$decayed\_learning\_rate = learning\_rate \cdot decay\_rate^{\frac{global\_step}{decay\_steps}}$$
This decay method is simple, direct, and fast to converge, and it is the most commonly used learning rate decay method. In the figure below, the green curve is the learning rate under exponential decay as training proceeds, and the red curve is piecewise constant decay, which keeps the learning rate constant within each training interval.

(figure: exponential decay vs. piecewise constant decay)
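A minimal sketch of exponential decay with a PyTorch scheduler; gamma plays the role of decay_rate, applied here once per epoch (the toy parameter exists only to build an optimizer):

import torch

params = [torch.nn.Parameter(torch.randn(2, 2))]
optimizer = torch.optim.SGD(params, lr=0.1)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)

for epoch in range(10):
    optimizer.step()       # one (dummy) training step
    scheduler.step()       # lr = 0.1 * 0.96 ** (epoch + 1)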

9.3.2 Natural Exponential Decay

It is similar to exponential decay, except that the decay base is $e$, so it converges faster; it is generally used for networks that are relatively easy to train, to help them converge quickly. The update rule is as follows:
decayed _ learning _ rate = learning _ rate ∗ e − decay _ rateglobal _ step decayed{\ _}learning{\_}rate =learning{\_}rate*e^{\frac{-decay{\_rate}}{global{\_}step}}decayed_learning_rate=learning_rateandglobal_stepdecay_rate
The figure below compares three methods: piecewise constant decay, exponential decay, and natural exponential decay. The red stepped curve is piecewise constant decay, the blue curve is exponential decay, and the green curve is natural exponential decay. It is clear that the learning rate falls more steeply under natural exponential decay than under ordinary exponential decay, which helps the network converge faster.

*(figure: piecewise constant decay vs. exponential decay vs. natural exponential decay)*
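The two update rules above can be written as small Python functions. This is a sketch: the parameter names follow the TensorFlow-style table in section 9.2, but the exact signatures are this article's assumption, not a library API.

```python
import math

def exponential_decay(learning_rate, global_step, decay_steps, decay_rate):
    # lr * decay_rate^(global_step / decay_steps)
    return learning_rate * decay_rate ** (global_step / decay_steps)

def natural_exp_decay(learning_rate, global_step, decay_steps, decay_rate):
    # lr * e^(-decay_rate * global_step / decay_steps): base e, faster decay
    return learning_rate * math.exp(-decay_rate * global_step / decay_steps)

for step in (0, 500, 1000):
    print(step,
          round(exponential_decay(0.1, step, 1000, 0.96), 5),
          round(natural_exp_decay(0.1, step, 1000, 0.96), 5))
```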

9.3.3 Polynomial decay

The learning rate is updated with polynomial decay: an initial learning rate and a minimum learning rate are given, and the learning rate is decayed from the initial value down to the minimum according to the chosen polynomial. The update rule is as follows:

$$global\_step = \min(global\_step,\ decay\_steps)$$

$$decayed\_learning\_rate = (learning\_rate - end\_learning\_rate) \times \left(1 - \frac{global\_step}{decay\_steps}\right)^{power} + end\_learning\_rate$$

Note that there are two mechanisms. By default, once the learning rate has decayed to the minimum, the minimum is kept for the rest of training. Alternatively, with cycling enabled, decay_steps is enlarged as shown in the following formula, so that the learning rate rises again after each decay period; this prevents the network from stalling around a local minimum late in training because the learning rate has become too small, letting it jump out of local extrema by temporarily increasing the learning rate:
$$decay\_steps = decay\_steps \times \left\lceil \frac{global\_step}{decay\_steps} \right\rceil$$
As shown in the figure below, the red curve keeps the learning rate unchanged once it has decayed to the minimum, while the green curve (with cycling) rises and decays again repeatedly; a code sketch follows the figure.

*(figure: polynomial decay, with and without cycling)*
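A sketch of polynomial decay with the optional cycle mechanism, under the same assumed parameter conventions as before:

```python
import math

def polynomial_decay(learning_rate, global_step, decay_steps,
                     end_learning_rate=0.0001, power=1.0, cycle=False):
    if cycle:
        # Enlarge decay_steps to the next multiple beyond global_step,
        # so the learning rate climbs again after each decay period.
        decay_steps = decay_steps * math.ceil(max(global_step, 1) / decay_steps)
    else:
        global_step = min(global_step, decay_steps)
    fraction = 1 - global_step / decay_steps
    return (learning_rate - end_learning_rate) * fraction ** power + end_learning_rate

print(polynomial_decay(0.1, 100, 100))               # has reached the minimum
print(polynomial_decay(0.1, 101, 100, cycle=True))   # jumps back up, decays again
```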

9.3.4 Cosine decay

Cosine decay shapes the learning rate with a cosine function, so the decay curve resembles a cosine. Its update mechanism is as follows:
$$global\_step = \min(global\_step,\ decay\_steps)$$

$$cosine\_decay = 0.5 \times \left(1 + \cos\left(\pi \cdot \frac{global\_step}{decay\_steps}\right)\right)$$

$$decayed = (1 - \alpha) \times cosine\_decay + \alpha$$

$$decayed\_learning\_rate = learning\_rate \times decayed$$

As shown in the figure below, the red curve is standard cosine decay, where the learning rate stays at the minimum after dropping from the initial value. The blue curve is linear cosine decay, in which the learning rate decreases linearly from the initial value to the minimum. The green curve is the noisy linear cosine decay.

*(figure: cosine decay, linear cosine decay, and noisy linear cosine decay)*
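Standard cosine decay maps directly onto the four formulas above; here is a sketch (again, the signature itself is an assumption):

```python
import math

def cosine_decay(learning_rate, global_step, decay_steps, alpha=0.0):
    global_step = min(global_step, decay_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * global_step / decay_steps))
    decayed = (1 - alpha) * cosine + alpha  # alpha sets the floor of the schedule
    return learning_rate * decayed

for step in (0, 500, 1000, 1500):
    print(step, round(cosine_decay(0.1, step, 1000, alpha=0.1), 5))
```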


9.3.5 Piecewise constant decay

Piecewise constant decay requires training-step intervals to be defined in advance, with a constant learning rate set for each interval. In general, the learning rate starts larger and then becomes smaller and smaller; the interval boundaries should be chosen according to the sample size, with larger datasets using shorter intervals. The figure below shows the learning rate under piecewise constant decay: the abscissa is the number of training steps and the ordinate is the learning rate.

*(figure: learning rate under piecewise constant decay)*

The following introduces some concrete learning rate schedulers used in actual training, taking PyTorch's torch.optim.lr_scheduler as an example.
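All the one-line examples below assume a setup along these lines (the model and training loop are placeholders; in PyTorch, scheduler.step() is typically called once per epoch, after optimizer.step()):

```python
import torch
from torch import nn
from torch.optim import lr_scheduler

model = nn.Linear(10, 2)  # placeholder model; any nn.Module works
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Any scheduler from the subsections below can be plugged in here:
scheduler = lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    # ... forward pass, loss computation, loss.backward() would go here ...
    optimizer.step()
    scheduler.step()  # advance the learning rate schedule once per epoch
```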

9.3.6 StepLR

This is the simplest and most commonly used learning rate adjustment method: after every step_size rounds, the learning rate is multiplied by gamma.

scheduler=lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

*(figure: StepLR learning rate curve)*

9.3.7 MultiStepLR

MultiStepLR is also a very common learning rate adjustment strategy, which multiplies the previous learning rate by gamma at each milestone.

scheduler = lr_scheduler.MultiStepLR(optimizer, milestones=[30,80], gamma=0.5)

*(figure: MultiStepLR learning rate curve)*

9.3.8 ExponentialLR

ExponentialLR is an exponentially decaying learning rate scheduler: every round, the learning rate is multiplied by gamma. Be careful not to set gamma too small, or the learning rate will shrink to nearly 0 within a few rounds.

scheduler=lr_scheduler.ExponentialLR(optimizer, gamma=0.9) 

*(figure: ExponentialLR learning rate curve)*

9.3.9 LinearLR

LinearLR scales the learning rate linearly. Given a start factor and an end factor, LinearLR interpolates linearly between them. For example, if the learning rate is 0.1, the start factor is 1, and the end factor is 0.1, then at iteration 0 the learning rate is 0.1 and in the final round it is 0.01. The total number of rounds total_iters is set to 80 below, so beyond 80 rounds the learning rate stays at 0.01.

scheduler=lr_scheduler.LinearLR(optimizer,start_factor=1,end_factor=0.1,total_iters=80)

*(figure: LinearLR learning rate curve)*

9.3.10 CyclicLR

CyclicLR has more parameters, and its curve looks like walking uphill and downhill over and over. base_lr is the learning rate at the bottom of the valley, max_lr the learning rate at the peak, step_size_up the number of rounds needed to go from valley to peak, and step_size_down the number of rounds from peak to valley. For the reasoning behind this design, refer to the original paper; in short, the optimal learning rate lies somewhere between base_lr and max_lr. CyclicLR is not blind decay but includes rising phases, which helps avoid getting stuck at saddle points.

scheduler=lr_scheduler.CyclicLR(optimizer,base_lr=0.1,max_lr=0.2,step_size_up=30,step_size_down=10)

*(figure: CyclicLR learning rate curve)*

9.3.11 OneCycleLR

OneCycleLR, as the name suggests, is like a one-cycle version of CyclicLR. It also has several parameters: max_lr is the maximum learning rate, pct_start is the fraction of steps spent in the rising phase, the initial learning rate is max_lr/div_factor, the final learning rate is max_lr/final_div_factor, and total_steps is the total number of iterations.

scheduler=lr_scheduler.OneCycleLR(optimizer,max_lr=0.1,pct_start=0.5,total_steps=120,div_factor=10,final_div_factor=10)

*(figure: OneCycleLR learning rate curve)*

9.3.12 CosineAnnealingLR

CosineAnnealingLR applies cosine annealing to the learning rate: T_max is half a period, the maximum learning rate is the one specified in the optimizer, and the minimum is eta_min. This also helps escape saddle points. Note that the maximum learning rate should not be too large, or the loss may oscillate violently with a period similar to that of the learning rate.

scheduler=lr_scheduler.CosineAnnealingLR(optimizer,T_max=20,eta_min=0.05)

*(figure: CosineAnnealingLR learning rate curve)*

9.3.13 CosineAnnealingWarmRestarts

This one is a bit more involved. The formula is given below: T_0 is the length of the first cycle, during which the learning rate drops from the value set in the optimizer down to eta_min; each subsequent cycle is the previous cycle length multiplied by T_mult.
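Concretely, the cosine annealing with warm restarts rule (from the SGDR paper, as also given in the PyTorch documentation) is:

$$\eta_t = \eta_{min} + \frac{1}{2}\left(\eta_{max} - \eta_{min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$

where $\eta_{max}$ is the learning rate set in the optimizer, $\eta_{min}$ is eta_min, $T_{cur}$ counts the epochs since the last restart, and $T_i$ is the current cycle length (T_0 at first, multiplied by T_mult after each restart).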

scheduler=lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=20, T_mult=2, eta_min=0.01)

*(figure: CosineAnnealingWarmRestarts learning rate curve)*

9.3.14 LambdaLR

LambdaLR has no fixed learning rate curve of its own. The lambda in the name refers to a user-defined function of the epoch number that returns a multiplicative factor for the learning rate. For example, below we define an exponential function to reproduce the behavior of ExponentialLR.

scheduler=lr_scheduler.LambdaLR(optimizer,lr_lambda=lambda epoch:0.9**epoch)

*(figure: LambdaLR learning rate curve)*

9.3.15 SequentialLR

SequentialLR can connect multiple learning rate adjustment strategies in sequence, and switch to the next learning rate adjustment strategy at milestone. The following is a combination of an exponentially decaying learning rate and a linearly decaying learning rate.

scheduler = lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        lr_scheduler.ExponentialLR(optimizer, gamma=0.9),
        lr_scheduler.LinearLR(optimizer, start_factor=1, end_factor=0.1, total_iters=80),
    ],
    milestones=[50],
)

*(figure: SequentialLR learning rate curve)*

9.3.16 ChainedScheduler

ChainedScheduler is similar to SequentialLR, and it also calls multiple learning rate adjustment strategies connected in series in sequence. The difference is that the learning rate change in ChainedScheduler is continuous.

scheduler = lr_scheduler.ChainedScheduler([
    lr_scheduler.LinearLR(optimizer, start_factor=1, end_factor=0.5, total_iters=10),
    lr_scheduler.ExponentialLR(optimizer, gamma=0.95),
])

*(figure: ChainedScheduler learning rate curve)*

9.3.17 ConstantLR

ConstantLR is very simple: for the first total_iters rounds, the learning rate specified in the optimizer is multiplied by factor; after total_iters rounds, the original learning rate is restored.

scheduler=lr_scheduler.ConstantLR(optimizer,factor=0.5,total_iters=80)

*(figure: ConstantLR learning rate curve)*


10. Regularization

Regularization can be understood as imposing rules, where each rule acts as a restriction. Adding a regularization term to the loss function limits the model's fitting capacity; the purpose of regularization is to prevent overfitting.

10.1 Why regularization?

Overfitting is a common problem in deep learning. There are two remedies: regularization, or preparing more data. More data is a very reliable solution, but you may not always be able to collect enough training data, and obtaining more can be expensive; regularization, on the other hand, often helps avoid overfitting and reduce the network's error on new data.

10.2 Why does regularization reduce overfitting?

*(figure: underfitting, a good fit, and overfitting on the same data)*

Suppose the model we want to build should separate the red and white regions in the figure. Consider how the three models above fit the training set:

The first model: underfitting; it cannot separate the red and white parts of the graph well.

The second model: the fit is just right. Although some red points are not separated, considering that the real test set contains noise, this degree of fitting is appropriate.

The third model: overfitting. It fits the training set extremely closely, so its generalization ability ("generalization" refers to a model's ability to apply to new samples) is low. Since the real test set contains noise, the accuracy obtained on it will not be high either; such a model is also more complex and computationally expensive, so it does not serve the intended purpose.

We can use regularization to solve overfitting, which roughly works as follows:

insert image description here

Our purpose is to fit the data in the figure. In the first plot a 2nd-order function fits the data, and it seems to work well. When we fit with a higher-order function, as in the second plot, the curve matches the training data even better, but this is not the model we want, because it overfits the data. We can attribute this to the high-order terms, so we penalize their coefficients.

We add $1000\,\theta_3^2$ and $1000\,\theta_4^2$ to the end of the loss function, where 1000 is just an arbitrary large value. That is,

$$\min_{\theta} \frac{1}{2m} \sum_{i=1}^{m}\left(h_{\theta}\left(x^{(i)}\right)-y^{(i)}\right)^{2} + 1000\,\theta_3^2 + 1000\,\theta_4^2$$

If we want to minimize this loss function, $\theta_3$ and $\theta_4$ must be very small, because the two added terms make the loss very large whenever $\theta_3$ or $\theta_4$ is large; so $\theta_3$ and $\theta_4$ are driven toward 0.

That is, the $\theta_3$ and $\theta_4$ terms of the fitted function are approximately 0, so the fitted function is close to a quadratic, and the degree of fit is just right.

Here we deliberately penalized the two terms $\theta_3$ and $\theta_4$; but what if we do not know in advance which coefficients of the fitted function belong to the high-order terms?

We then penalize the coefficients of all terms, i.e., we add a regularization term to the loss function:

$$\lambda \sum_{j=1}^{n} \theta_{j}^{2}$$

Note that $\theta_0$ is not penalized here, which makes only a small difference. Penalizing the coefficients of all terms still punishes the high-order terms the most.

Here $\lambda$ is called the regularization parameter. The larger $\lambda$ is, the heavier the penalty, but larger is not always better: when $\lambda$ is too large, the fitted parameters become so small that the fitted function degenerates to the constant $\theta_0$, a horizontal line, resulting in underfitting.

In addition, regularization is divided into L1 regularization and L2 regularization, based on the L1 and L2 norms, which are defined as follows:

L1 norm:
$$\|x\|_{1}=\sum_{i=1}^{N}\left|x_{i}\right|$$
i.e., the sum of the absolute values of the vector's elements.

L2 norm:

$$\|\vec{x}\|_{2}=\sqrt{\sum_{i=1}^{N}\left|x_{i}\right|^{2}}$$

i.e., the square root of the sum of the squares of the vector's elements.

Lp norm:
$$L_{p}=\|\vec{x}\|_{p}=\sqrt[p]{\sum_{i=1}^{N}\left|x_{i}\right|^{p}}$$
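In practice, here is a sketch of how these penalties are commonly added in PyTorch (the model and values are illustrative): L2 is usually applied through the optimizer's weight_decay argument, while L1 is added to the loss by hand.

```python
import torch
from torch import nn

model = nn.Linear(10, 1)  # placeholder model
criterion = nn.MSELoss()

# L2 regularization: in PyTorch, weight_decay adds an L2-style penalty
# (lambda * w) to each parameter's gradient inside the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 10), torch.randn(32, 1)
loss = criterion(model(x), y)

# L1 regularization has no built-in switch; add lambda * sum(|w|) manually.
l1_lambda = 1e-4
l1_penalty = sum(p.abs().sum() for p in model.parameters())
loss = loss + l1_lambda * l1_penalty

loss.backward()
optimizer.step()
```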

10.3 Dropout Regularization

*(figure: a standard network vs. the thinned network left after dropout)*

The Dropout operation means that during the training phase, at each iteration a certain proportion of neurons is randomly dropped from the base network, and forward propagation and error backpropagation are then performed on this modified network. Note: during the test phase, the model restores all neurons. Dropout is a common regularization method that alleviates network overfitting.

Dropout randomly deletes units from the network; why does such a simple operation act as an effective regularizer?

An intuitive understanding: a unit should not rely on any single feature, because any of its inputs may be zeroed out at any time, so the unit is pushed to spread a little weight over each of its inputs rather than concentrating on one. Spreading the weights this way shrinks their squared norm, similar to the L2 regularization mentioned earlier; the net effect of dropout is to compress the weights, performing a kind of outer regularization that prevents overfitting. Unlike L2, however, the effective decay differs from weight to weight, depending on the size of the activations each weight multiplies.

Another understanding is that Dropout reduces complex co-adaptations between neurons. Since the dropped neurons are chosen at random each time, the retained sub-network is different at every iteration, so weight updates cannot come to depend on fixed relationships between hidden nodes. In other words, no neuron becomes overly sensitive to any other particular neuron, which forces the network to learn more robust, generalizable features.
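A minimal sketch of Dropout in PyTorch (the layer sizes are illustrative): nn.Dropout zeroes activations with probability p during training, and model.train()/model.eval() toggle the behavior between the two phases.

```python
import torch
from torch import nn

net = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # each hidden activation is zeroed with probability 0.5
    nn.Linear(50, 10),
)

x = torch.randn(4, 100)

net.train()            # training phase: dropout active; kept activations are
out_train = net(x)     # scaled by 1/(1-p) so the expected output is unchanged

net.eval()             # test phase: dropout disabled, all neurons participate
out_eval = net(x)
```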

10.4 Disadvantages of Dropout

  • A major disadvantage of dropout is that the cost function is no longer well defined: because each iteration randomly removes some neurons, the cost can no longer be guaranteed to decrease monotonically.

  • Training time increases significantly, because introducing dropout means each iteration effectively trains only a sub-network of the original network, so more iterations are needed to reach the same accuracy; typically, training with dropout takes 2-3 times as long as training the same network without it.


