Deep learning: error propagation, weight updates, and optimizers for image processing

Take a three-layer BP (backpropagation) neural network as an example:

[Figure: three-layer BP network]

The output is computed as follows. The activation of the last layer is softmax for all output nodes, so it is pulled out and treated separately rather than folded into the layer. Why use softmax? The raw outputs y1 and y2 do not follow any distribution; to make the outputs a valid probability distribution, we apply softmax:

[Figure: softmax formula, o_i = e^{y_i} / (e^{y_1} + e^{y_2})]

It is easy to verify that o1 + o2 = 1.
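As a sketch, softmax can be written in a few lines of plain Python (the two-output case from the text generalizes to any number of raw outputs):

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two raw outputs y1, y2 -> probabilities o1, o2 that sum to 1.
o = softmax([2.0, 1.0])
```

Subtracting the maximum logit before exponentiating avoids overflow for large inputs without changing the result.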
The loss is usually computed as cross entropy. For different classification problems, the activation of the last layer differs, which leads to slightly different cross-entropy formulas:
[Figure: cross-entropy formulas for different output activations]
Substituting o1 and o2 into the cross-entropy formula:
[Figure: cross entropy expressed in terms of o1 and o2]
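A minimal sketch of the multi-class cross-entropy loss, assuming one-hot labels y and softmax outputs o:

```python
import math

def cross_entropy(y_true, y_pred):
    # L = -sum_i y_i * log(o_i); eps guards against log(0).
    eps = 1e-12
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, y_pred))

# A confident correct prediction gives a small loss,
# an uncertain 50/50 prediction gives a larger one.
loss_good = cross_entropy([1, 0], [0.9, 0.1])
loss_flat = cross_entropy([1, 0], [0.5, 0.5])
```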

Backpropagation of errors

[Figure: setup for the backpropagation example]

Take w11(2) as an example. By the chain rule, the loss gradient with respect to w11(2) factors into a product of partial derivatives, as shown below. Then evaluate each partial derivative:

[Figure: chain-rule factorization of dLoss/dw11(2)]

In the figure below, the right side gives each partial derivative, and the left side gives the resulting loss gradient of w11(2):

[Figure: partial derivatives (right) and the loss gradient of w11(2) (left)]
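A chain-rule gradient like this can be sanity-checked numerically. A sketch using a hypothetical one-weight loss (not the network from the figures):

```python
def numerical_grad(f, w, h=1e-6):
    # Central difference approximates df/dw without any calculus.
    return (f(w + h) - f(w - h)) / (2 * h)

# Toy loss L(w) = (w*x - t)^2; the chain rule gives dL/dw = 2*(w*x - t)*x.
x, t, w = 2.0, 1.0, 1.5
loss = lambda w: (w * x - t) ** 2
analytic = 2 * (w * x - t) * x      # chain-rule result
numeric = numerical_grad(loss, w)   # should match closely
```

The same trick works weight by weight on a real network, and is a standard way to verify a hand-derived backpropagation formula.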
Update of the weights:

[Figure: weight update rule, w_new = w_old - learning_rate * dLoss/dw]
Does the direction of the loss gradient point to the global optimum (the fastest loss reduction)? The figure below gives a clear answer: training in batches tends to point to a local optimum, not the global one:

[Figure: batch gradients leading to a local optimum]
Optimizers: an optimizer speeds up convergence by determining how the network weights are updated, often by dynamically adjusting the learning rate. As just mentioned, training proceeds in batches, and each batch gets one loss calculation and one error backpropagation; this is what is usually called the SGD optimizer.
[Figure: SGD update rule]
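A sketch of the SGD rule wt+1 = wt - lr * g(wt) on a one-dimensional toy loss (an illustrative problem, not the network above):

```python
def sgd_update(w, grad, lr=0.1):
    # w_{t+1} = w_t - lr * g(w_t)
    return w - lr * grad

# Minimize L(w) = (w - 3)^2, whose gradient is 2*(w - 3); the minimum is w = 3.
w = 0.0
for _ in range(50):
    w = sgd_update(w, 2 * (w - 3))
```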
The figure above gives the update expression of the SGD optimizer: wt+1 is the parameter after the update and wt the parameter before it. But the shortcomings of SGD are also obvious. For example, if some labels in the training set are wrong, the resulting loss gradient will certainly be affected. In addition, the ideal case is to reach the optimal solution along the black path in the figure, but with batch training the gradient direction may not be so ideal, and the deviation can trap the optimization in a local optimum. The remedy is the SGD+Momentum optimizer.
[Figure: ideal optimization path vs. deviated batch-gradient path]
SGD with momentum: besides the current gradient, the previous gradients are also taken into account. Once momentum is introduced, the previous gradient direction influences each update, as shown in the lower-left corner, which effectively suppresses the interference of sample noise. The momentum coefficient is usually 0.9.
[Figure: SGD+Momentum update rule]
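One common formulation of the momentum update, as a sketch (variable names are generic, and some texts scale the gradient term by (1 - beta) instead):

```python
def momentum_step(w, v, grad, lr=0.01, beta=0.9):
    # The velocity v keeps beta (typically 0.9) of the previous direction
    # and adds the current gradient step.
    v = beta * v + lr * grad
    return w - v, v

w, v = 1.0, 0.0
w, v = momentum_step(w, v, grad=1.0)   # v = 0.01
w, v = momentum_step(w, v, grad=1.0)   # v = 0.9*0.01 + 0.01 = 0.019
```

With a constant gradient the steps grow, so consistent directions accelerate, while noisy, alternating gradients partly cancel out.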
Another optimizer is Adagrad, which mainly plays tricks on the learning rate: St is the running sum of the squared loss gradients, which makes the denominator grow during training, so the effective learning rate gets smaller and smaller. This achieves an adaptive learning rate. But it also has a problem: the learning rate drops too fast. The solution is the RMSProp optimizer.
[Figure: Adagrad update rule]
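A sketch of the Adagrad idea for a single weight: the accumulator sums squared gradients, so the effective step shrinks over time:

```python
def adagrad_step(w, s, grad, lr=0.1, eps=1e-8):
    s = s + grad ** 2                        # running sum of squared gradients
    return w - lr * grad / (s ** 0.5 + eps), s

w, s = 1.0, 0.0
w1, s = adagrad_step(w, s, grad=1.0)    # step ~ 0.1
w2, s = adagrad_step(w1, s, grad=1.0)   # step ~ 0.1 / sqrt(2): already smaller
```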
Compared with Adagrad, RMSProp adds two more coefficients, whose purpose is to control the decay speed of the accumulated squared gradients.
[Figure: RMSProp update rule]
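A sketch of RMSProp for a single weight: the squared gradients are averaged with a decay coefficient instead of summed, so the denominator stays bounded:

```python
def rmsprop_step(w, s, grad, lr=0.01, beta=0.9, eps=1e-8):
    s = beta * s + (1 - beta) * grad ** 2    # decaying average, not a raw sum
    return w - lr * grad / (s ** 0.5 + eps), s

w, s = 1.0, 0.0
for _ in range(100):
    w, s = rmsprop_step(w, s, grad=1.0)
# After 100 identical gradients, Adagrad's accumulator would be 100,
# while here s stays below 1, so the step size does not collapse.
```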
Finally, the Adam optimizer, which looks a little more complicated than the previous ones: it contains both first-order momentum and second-order momentum.
[Figure: Adam update rule]
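A sketch of one Adam step for a single weight, with first-order momentum m, second-order momentum v, and the usual bias correction (standard default coefficients 0.9 and 0.999):

```python
def adam_step(w, m, v, grad, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad             # first-order momentum
    v = b2 * v + (1 - b2) * grad ** 2        # second-order momentum
    m_hat = m / (1 - b1 ** t)                # bias correction; t starts at 1
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (v_hat ** 0.5 + eps), m, v

w, m, v = 1.0, 0.0, 0.0
w, m, v = adam_step(w, m, v, grad=1.0, t=1)  # first step moves by about lr
```

The bias correction matters early on: without it, m and v start near zero and the first steps would be far too small.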
How to choose a suitable optimizer? The blogger offers only one sentence: practice is the sole criterion for testing truth.


Origin: blog.csdn.net/qq_42308217/article/details/109910215