"Deep learning notes" accumulation of deep learning knowledge points


One, optimization algorithms

1.1 Batch gradient descent (BGD)

  • Compute the gradient over the entire data set, then update the weights (a minimal sketch of one full-batch update follows this list).
  • Advantages: converges to the global minimum for convex optimization problems, and at least to a local minimum for non-convex problems.
  • Disadvantages: slow; cannot be used for online learning; each update is expensive in computation and memory.
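A minimal sketch of one BGD step, assuming a linear model with squared loss purely for illustration (the model, learning rate, and variable names are placeholders, not from the original notes):

```python
import numpy as np

def bgd_step(w, X, y, lr=0.1):
    """One batch-gradient-descent update on the FULL data set (illustrative linear model)."""
    n = X.shape[0]
    grad = 2.0 / n * X.T @ (X @ w - y)  # gradient averaged over every sample
    return w - lr * grad
```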

1.2 Stochastic gradient descent (SGD)

  • If you strictly follow the back-propagation gradient-descent formulas and feed all samples into every iteration, training becomes too slow. Stochastic gradient descent solves this by randomly selecting a single sample $(x_i, y_i)$ each time, updating the parameters with the gradient computed on that sample, then drawing another sample, and so on (see the sketch after this list).
  • With a large data set there is no need to use every sample; a loss value within an acceptable range can still be reached.
  • Advantages: compared with BGD, the loss of SGD drops faster per unit of computation, and it supports online learning.
  • Disadvantages: parameter updates are noisy and may never settle exactly at a local optimum; training on a single sample introduces a lot of noise, so the loss function oscillates heavily and convergence slows down in the later stages.
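A minimal sketch of the per-sample update loop, assuming the same illustrative linear model and squared loss as above (names and hyperparameters are placeholders):

```python
import numpy as np

def sgd_epoch(w, X, y, lr=0.01):
    """One pass of plain SGD: one randomly chosen sample per parameter update."""
    idx = np.random.permutation(X.shape[0])
    for i in idx:
        xi, yi = X[i], y[i]
        grad = 2.0 * xi * (xi @ w - yi)  # gradient computed on a single sample
        w = w - lr * grad
    return w
```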

1.3 Adam


  • Adam is also an adaptive algorithm that adjusts the learning rate of each parameter individually, using first-order and second-order moment estimates of the gradient to dynamically adapt each parameter's learning rate (a sketch of the update rule follows this list).
  • Advantages: after bias correction, the effective step size of every iteration stays within a bounded range, which keeps the parameter updates stable.
    1. Inertia preservation: Adam keeps a first-moment estimate of the gradient, an average of past gradients and the current gradient, so consecutive updates do not differ too much; this smooth, stable transition lets it adapt to non-stationary objective functions.
    2. Environment perception: Adam also keeps a second-moment estimate, an average of past squared gradients and the current squared gradient, which reflects awareness of the local landscape and yields an adaptive learning rate for each parameter.
    3. The hyperparameters $\alpha, \beta_1, \beta_2, \epsilon$ have clear interpretations and usually require no tuning, or only minor fine-tuning.
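A minimal sketch of a single Adam update for one parameter array, using the commonly cited default hyperparameters (the function and variable names are placeholders, not from the original notes):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: inertia preservation
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: environment perception
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```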

Two, performance optimization recommendations

2.1 Optimization skills before training

2.1.1 Network structure optimization

  • Choosing ReLU and batch normalization (BN) is very effective for speeding up neural-network training. For some RNNs, because the number of time steps is very long, tanh is chosen instead to prevent activations from growing too large, while sigmoid is usually used only for outputs with specific targets.
  • Replace a large convolution kernel with several small ones, or factorize an n*n kernel into an n*1 kernel followed by a 1*n kernel; this reduces the number of parameters (see the sketch after this list).
  • Concatenating the outputs of different branches can also be used to combine features; this is usually more effective than simply increasing the number of feature channels.
  • Cross-layer (skip) branches can be introduced to alleviate the gradient problem; ResNet uses this idea.
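A minimal sketch of the n*n → n*1 + 1*n factorization in PyTorch, with an arbitrary placeholder channel count; the printed parameter counts just illustrate the reduction:

```python
import torch.nn as nn

c = 64  # channel count, arbitrary for illustration

# A single 3x3 convolution ...
full = nn.Conv2d(c, c, kernel_size=3, padding=1)

# ... versus the factorized 3x1 followed by 1x3 convolution
factored = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0)),
    nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1)),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(factored))  # the factorized version has fewer parameters
```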

2.1.2 Selection of initial value

  • Weight initialization is a very important technique that is easily overlooked. A good initialization method can not only speed up the convergence, but sometimes even improve the accuracy.

1. Xavier

  • When it comes to initialization, the first reaction of many people may be to generate random numbers from a Gaussian distribution, but with a naive scale this causes the variance of the neuron outputs to grow during forward propagation.
  • Xavier initialization is designed so that, right after initialization, the variance of each layer's output does not depend on the number of inputs, and the variance of the gradient is not affected by the number of outputs (see the sketch below).
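A minimal sketch of Xavier (Glorot) initialization for a fully connected weight matrix, using the common uniform variant with limit sqrt(6 / (fan_in + fan_out)); the layer sizes are arbitrary placeholders:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    """Xavier/Glorot uniform initialization: Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(784, 256)  # e.g. the first layer of an MLP (sizes are illustrative)
```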

2. MSRA

  • MSRA (He) initialization is an initialization method designed for the ReLU activation. For a convolutional layer $y_i = W_i x_i + b_i$, the derivation shows that MSRA initialization draws weights from a Gaussian distribution with mean 0 and variance 2/n, where n is the number of inputs to the layer (see the sketch below).
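A minimal sketch of MSRA (He) initialization for a convolution kernel, taking n as fan_in = in_channels * kernel_height * kernel_width; the shapes are placeholders:

```python
import numpy as np

def msra_normal(out_channels, in_channels, kh, kw):
    """He/MSRA initialization: zero-mean Gaussian with variance 2 / n, n = fan_in."""
    n = in_channels * kh * kw
    std = np.sqrt(2.0 / n)
    return np.random.normal(0.0, std, size=(out_channels, in_channels, kh, kw))

W = msra_normal(64, 3, 3, 3)  # e.g. a 3x3 conv from RGB to 64 channels (illustrative)
```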

2.1.3 Data preprocessing

  • Subtract the mean
  • Normalize (balance) the variance (a sketch of both steps follows)
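A minimal sketch of both steps, computing per-feature statistics on the training set only (the variable names are placeholders):

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Zero-mean, unit-variance preprocessing using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    X_train = (X_train - mean) / (std + eps)   # remove the mean, balance the variance
    X_test = (X_test - mean) / (std + eps)     # apply the SAME statistics to test data
    return X_train, X_test
```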

2.2 Optimization skills during training

2.2.1 Selection of optimization algorithm

Adam is considered superior in many situations.

2.2.2 Gradient step size (learning rate)

You can start iterating with a relatively large step size and then gradually reduce the learning rate during training (see the sketch below).
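A minimal sketch of a simple step-decay schedule; the initial rate, decay factor, and interval are arbitrary placeholders:

```python
def step_decay(epoch, lr0=0.1, drop=0.5, every=10):
    """Start from a large learning rate and halve it every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

# e.g. epochs 0-9 use 0.1, epochs 10-19 use 0.05, and so on
```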

2.2.3 batch_size selection

2.2.4 model ensembles

Train several models from different initial values at the same time, and average the outputs of the models at prediction time; this can effectively improve the accuracy of the results (see the sketch below).
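A minimal sketch of averaging the predictions of several independently initialized models; the `models` list and its `predict` interface are hypothetical placeholders:

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the outputs of models trained from different initial values."""
    preds = [m.predict(X) for m in models]   # each model was trained independently
    return np.mean(preds, axis=0)            # e.g. averaged class probabilities
```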

Three, template matching

  • Purpose: detection of a target at the same scale as the template
  • Template: a real image of the target
  • Operation: slide the template image over the entire picture
  • Matching result: a similarity measure
    • Returns a similarity map
    • Similarity (distance) computation (see the OpenCV sketch below)
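A minimal sketch using OpenCV's cv2.matchTemplate, which slides the template over the image and returns a similarity map; the file paths are placeholders:

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)      # image to search (placeholder path)
tpl = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)   # template of the target (placeholder path)

# Slide the template over the image; result[y, x] is the similarity at that position
result = cv2.matchTemplate(img, tpl, cv2.TM_CCOEFF_NORMED)

# Best match location (for TM_CCOEFF_NORMED, higher means more similar)
_, max_val, _, max_loc = cv2.minMaxLoc(result)
h, w = tpl.shape
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
```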


Origin: blog.csdn.net/libo1004/article/details/111031419