"Deep learning notes" accumulation of deep learning knowledge points


One, optimization algorithms

1.1 Batch gradient descent (BGD)

  • Compute the gradient over the entire data set, then update the weights (a minimal sketch of one full-batch update follows this list).
  • Advantages: converges to the global minimum for convex optimization problems, and at least to a local minimum for non-convex problems.
  • Disadvantages: slow; cannot be used for online learning; each update is expensive in computation and memory.
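A minimal sketch of one BGD step, assuming a linear model with squared loss purely for illustration (the model, learning rate, and variable names are placeholders, not from the original notes):

```python
import numpy as np

def bgd_step(w, X, y, lr=0.1):
    """One batch-gradient-descent update on the FULL data set (illustrative linear model)."""
    n = X.shape[0]
    grad = 2.0 / n * X.T @ (X @ w - y)  # gradient averaged over every sample
    return w - lr * grad
```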

1.2 Stochastic gradient descent (SGD)

  • If you strictly follow the back-propagation gradient-descent formulas and feed all samples into every iteration, training becomes too slow. Stochastic gradient descent solves this by randomly selecting a single sample $(x_i, y_i)$ each time, updating the parameters with the gradient computed on that sample, then drawing another sample, and so on (see the sketch after this list).
  • With a large data set there is no need to use every sample; a loss value within an acceptable range can still be reached.
  • Advantages: compared with BGD, the loss of SGD drops faster per unit of computation, and it supports online learning.
  • Disadvantages: parameter updates are noisy and may never settle exactly at a local optimum; training on a single sample introduces a lot of noise, so the loss function oscillates heavily and convergence slows down in the later stages.
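A minimal sketch of the per-sample update loop, assuming the same illustrative linear model and squared loss as above (names and hyperparameters are placeholders):

```python
import numpy as np

def sgd_epoch(w, X, y, lr=0.01):
    """One pass of plain SGD: one randomly chosen sample per parameter update."""
    idx = np.random.permutation(X.shape[0])
    for i in idx:
        xi, yi = X[i], y[i]
        grad = 2.0 * xi * (xi @ w - yi)  # gradient computed on a single sample
        w = w - lr * grad
    return w
```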

1.3 Adam


  • Adam is also an adaptive algorithm that adjusts the learning rate of each parameter individually, using first-order and second-order moment estimates of the gradient to dynamically adapt each parameter's learning rate (a sketch of the update rule follows this list).
  • Advantages: after bias correction, the effective step size of every iteration stays within a bounded range, which keeps the parameter updates stable.
    1. Inertia preservation: Adam keeps a first-moment estimate of the gradient, an average of past gradients and the current gradient, so consecutive updates do not differ too much; this smooth, stable transition lets it adapt to non-stationary objective functions.
    2. Environment perception: Adam also keeps a second-moment estimate, an average of past squared gradients and the current squared gradient, which reflects awareness of the local landscape and yields an adaptive learning rate for each parameter.
    3. The hyperparameters $\alpha, \beta_1, \beta_2, \epsilon$ have clear interpretations and usually require no tuning, or only minor fine-tuning.
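A minimal sketch of a single Adam update for one parameter array, using the commonly cited default hyperparameters (the function and variable names are placeholders, not from the original notes):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based iteration counter."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: inertia preservation
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: environment perception
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```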

Two, performance optimization recommendations

2.1 Optimization skills before training

2.1.1 Network structure optimization

  • Choosing ReLU and batch normalization (BN) is very effective for speeding up neural-network training. For some RNNs, because the number of time steps is very long, tanh is chosen instead to prevent activations from growing too large, while sigmoid is usually used only for outputs with specific targets.
  • Replace a large convolution kernel with several small ones, or factorize an n*n kernel into an n*1 kernel followed by a 1*n kernel; this reduces the number of parameters (see the sketch after this list).
  • Concatenating the outputs of different branches can also be used to combine features; this is usually more effective than simply increasing the number of feature channels.
  • Cross-layer (skip) branches can be introduced to alleviate the gradient problem; ResNet uses this idea.
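A minimal sketch of the n*n → n*1 + 1*n factorization in PyTorch, with an arbitrary placeholder channel count; the printed parameter counts just illustrate the reduction:

```python
import torch.nn as nn

c = 64  # channel count, arbitrary for illustration

# A single 3x3 convolution ...
full = nn.Conv2d(c, c, kernel_size=3, padding=1)

# ... versus the factorized 3x1 followed by 1x3 convolution
factored = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=(3, 1), padding=(1, 0)),
    nn.Conv2d(c, c, kernel_size=(1, 3), padding=(0, 1)),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full), count(factored))  # the factorized version has fewer parameters
```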

2.1.2 Selection of initial value

  • Weight initialization is a very important technique that is easily overlooked. A good initialization method can not only speed up the convergence, but sometimes even improve the accuracy.

1. Xavier

  • When it comes to initialization, the first reaction of many people may be to generate random numbers from a Gaussian distribution, but with a naive scale this causes the variance of the neuron outputs to grow during forward propagation.
  • Xavier initialization is designed so that, right after initialization, the variance of each layer's output does not depend on the number of inputs, and the variance of the gradient is not affected by the number of outputs (see the sketch below).
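A minimal sketch of Xavier (Glorot) initialization for a fully connected weight matrix, using the common uniform variant with limit sqrt(6 / (fan_in + fan_out)); the layer sizes are arbitrary placeholders:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out):
    """Xavier/Glorot uniform initialization: Var(W) = 2 / (fan_in + fan_out)."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

W = xavier_uniform(784, 256)  # e.g. the first layer of an MLP (sizes are illustrative)
```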

2. MSRA

  • MSRA (He) initialization is an initialization method designed for the ReLU activation. For a convolutional layer $y_i = W_i x_i + b_i$, the derivation shows that MSRA initialization draws weights from a Gaussian distribution with mean 0 and variance 2/n, where n is the number of inputs to the layer (see the sketch below).
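A minimal sketch of MSRA (He) initialization for a convolution kernel, taking n as fan_in = in_channels * kernel_height * kernel_width; the shapes are placeholders:

```python
import numpy as np

def msra_normal(out_channels, in_channels, kh, kw):
    """He/MSRA initialization: zero-mean Gaussian with variance 2 / n, n = fan_in."""
    n = in_channels * kh * kw
    std = np.sqrt(2.0 / n)
    return np.random.normal(0.0, std, size=(out_channels, in_channels, kh, kw))

W = msra_normal(64, 3, 3, 3)  # e.g. a 3x3 conv from RGB to 64 channels (illustrative)
```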

2.1.3 Data preprocessing

  • Subtract the mean
  • Normalize (balance) the variance (a sketch of both steps follows)
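A minimal sketch of both steps, computing per-feature statistics on the training set only (the variable names are placeholders):

```python
import numpy as np

def standardize(X_train, X_test, eps=1e-8):
    """Zero-mean, unit-variance preprocessing using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    X_train = (X_train - mean) / (std + eps)   # remove the mean, balance the variance
    X_test = (X_test - mean) / (std + eps)     # apply the SAME statistics to test data
    return X_train, X_test
```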

2.2 Optimization skills during training

2.2.1 Selection of optimization algorithm

Adam is considered superior in many situations.

2.2.2 Gradient step size (learning rate)

You can start iterating with a relatively large step size and then gradually reduce the learning rate during training (see the sketch below).
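A minimal sketch of a simple step-decay schedule; the initial rate, decay factor, and interval are arbitrary placeholders:

```python
def step_decay(epoch, lr0=0.1, drop=0.5, every=10):
    """Start from a large learning rate and halve it every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

# e.g. epochs 0-9 use 0.1, epochs 10-19 use 0.05, and so on
```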

2.2.3 batch_size selection

2.2.4 model ensembles

Train several models from different initial values at the same time, and average the outputs of the models at prediction time; this can effectively improve the accuracy of the results (see the sketch below).
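A minimal sketch of averaging the predictions of several independently initialized models; the `models` list and its `predict` interface are hypothetical placeholders:

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the outputs of models trained from different initial values."""
    preds = [m.predict(X) for m in models]   # each model was trained independently
    return np.mean(preds, axis=0)            # e.g. averaged class probabilities
```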

Three, template matching

  • Purpose: detection of a target at the same scale as the template
  • Template: a real image of the target
  • Operation: slide the template image over the entire picture
  • Matching result: a similarity measure
    • Returns a similarity map
    • Similarity (distance) computation (see the OpenCV sketch below)
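A minimal sketch using OpenCV's cv2.matchTemplate, which slides the template over the image and returns a similarity map; the file paths are placeholders:

```python
import cv2

img = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)      # image to search (placeholder path)
tpl = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)   # template of the target (placeholder path)

# Slide the template over the image; result[y, x] is the similarity at that position
result = cv2.matchTemplate(img, tpl, cv2.TM_CCOEFF_NORMED)

# Best match location (for TM_CCOEFF_NORMED, higher means more similar)
_, max_val, _, max_loc = cv2.minMaxLoc(result)
h, w = tpl.shape
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
```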


Origin: blog.csdn.net/libo1004/article/details/111031419