(1) Understanding the Backpropagation Algorithm (Back-Propagation)

This article refers to Section 6.5, "Back-Propagation and Other Differentiation Algorithms", in Chapter 6 of deeplearningbook.org.

This discussion of the backpropagation algorithm is divided into two parts: the first covers the conceptual understanding, and the second gives a simple derivation using an RNN.

The first part is this article: (1) Understanding the Backpropagation Algorithm (Back-Propagation)

The second part is linked here: (2) Detailed derivation of the backpropagation algorithm for RNNs

Let's start now~

First, let's be clear about what the backpropagation algorithm is used for: computing gradients.

1. Forward propagation

" Backpropagation Algorithm", in which  the word "reverse" just as the name implies, there is a forward process first, and then there is a reverse. So before you can understand backpropagation , you need to figure out forward propagation first . A simple flow chart is drawn as follows: (LOSS in the figure is a real value).

[Figure: flow chart of forward propagation from the input to LOSS.]

In the figure above, the forward propagation path runs from the input to LOSS, and the backpropagation path runs from LOSS back to the parameters (since ultimately we only need the gradients with respect to the parameters).

Forward and backward passes differ in two ways: the direction is opposite, and the quantities computed are different. What they share is that both proceed step by step. Backpropagation computes the gradient step by step, not in a single shot.
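To make the forward pass concrete, here is a minimal sketch of forward propagation for a tiny two-layer network down to a scalar LOSS. The weight names W1 and W2 and the squared-error loss are illustrative assumptions of mine, not something fixed by the article.

```python
import numpy as np

# A hypothetical tiny network: input -> hidden layer -> output -> LOSS (a real scalar).
rng = np.random.default_rng(0)
x = rng.normal(size=3)             # input
t = np.array([1.0])                # target
W1 = rng.normal(size=(4, 3))       # first-layer parameters (assumed name)
W2 = rng.normal(size=(1, 4))       # second-layer parameters (assumed name)

# Forward propagation: step by step from the input to LOSS.
h = np.tanh(W1 @ x)                # intermediate result
y = W2 @ h                         # network output
loss = 0.5 * np.sum((y - t) ** 2)  # LOSS, a real value
print(loss)
```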

2. Why use the backpropagation algorithm to find the gradient

First, we could compute the gradient numerically (evaluating each derivative from its definition, e.g. by finite differences), but this has a drawback: when there are many parameters it is very slow and time-consuming. So this method is ruled out.
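As a rough sketch of why the numerical method is slow: finite differences perturb each parameter separately, so the cost grows with the number of parameters. The loss function below is an arbitrary example of mine, not from the article.

```python
import numpy as np

def loss_fn(w, x=1.5):
    # Arbitrary illustrative scalar LOSS as a function of the parameter vector w.
    return np.sin(w[0] * x) + (w[1] * x - 2.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    # Derivative by definition: one extra evaluation of f per parameter,
    # which is what makes this approach slow when there are many parameters.
    base = f(w)
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus = w.copy()
        w_plus[i] += eps
        grad[i] = (f(w_plus) - base) / eps
    return grad

w = np.array([0.3, -0.7])
print(numerical_gradient(loss_fn, w))
```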

Alternatively, since there is a complete functional expression from the input to LOSS, we can derive an analytic expression for the derivative with respect to each parameter from that expression and then plug in numbers to evaluate it. There are two ways to derive this expression: one is to simplify the composite function and differentiate it directly, the other is to apply the chain rule.

Suppose y = g(x) and z = f(g(x)) = f(y) = h(x), and we want the derivative of z with respect to x:

Direct differentiation: \frac{\mathrm{d} z}{\mathrm{d} x} = h'(x)     (Formula 1)

Chain rule: \frac{\mathrm{d} z}{\mathrm{d} x} = \frac{\mathrm{d} z}{\mathrm{d} y}\frac{\mathrm{d} y}{\mathrm{d} x} = f'(y)\,g'(x) = f'(g(x))\,g'(x)    (Formula 2)

For the direct method (Formula 1), the function h must be obtained first, i.e. f(g(x)) must be simplified. When many functions are nested, or the functions are complex, this simplification becomes complicated, the simplified expression is rarely simple enough to differentiate easily, and the step feels like an unnecessary detour. So this method is also ruled out.
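A quick numeric check that Formula 1 and Formula 2 give the same value, using my own illustrative choice g(x) = x² and f(y) = sin(y), so h(x) = sin(x²):

```python
import numpy as np

x = 1.3

# Formula 1: simplify first, h(x) = sin(x**2), then differentiate h directly.
dz_dx_direct = np.cos(x**2) * 2 * x     # h'(x)

# Formula 2: chain rule, f'(y) * g'(x) with y = g(x).
y = x**2                                # y = g(x)
dz_dx_chain = np.cos(y) * (2 * x)       # f'(y) * g'(x)

print(dz_dx_direct, dz_dx_chain)        # both print the same number
```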

Therefore, from now on we discuss only Formula 2, i.e. the chain rule. In practice, however, this raises the question of whether to store the intermediate results computed during forward propagation. The difference between the two options is illustrated in the figure below:

Forward option 1 in the figure does not save the intermediate result y, so y has to be recomputed before f'(y) can be evaluated in the backward pass. That is, during forward propagation we compute g(x), then compute f(y), and release the memory occupied by y once f(y) is computed. When computing the gradient in the backward pass, we compute g(x) again, then f'(y), then g'(x). So g(x) is computed twice.

Forward option 2 in the figure saves the intermediate result y, so g(x) is not recomputed when the gradient is computed in the backward pass. That is, during forward propagation we compute g(x), then compute f(y), and keep y in memory. When computing the gradient in the backward pass, we only compute f'(y) and then g'(x). This way g(x) is computed only once.
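A minimal sketch of the two options for z = f(g(x)), with f = exp and g = sin standing in for arbitrary differentiable functions (my choice, purely for illustration):

```python
import math

f, df = math.exp, math.exp       # f(y) = e^y and its derivative f'(y) = e^y
g, dg = math.sin, math.cos       # g(x) = sin(x) and its derivative g'(x) = cos(x)

def grad_without_cache(x):
    # Option 1: the forward pass discards y, so the backward pass recomputes g(x);
    # in total g is evaluated twice.
    z = f(g(x))                  # forward pass, y not kept
    y = g(x)                     # recomputed in the backward pass
    return df(y) * dg(x)

def grad_with_cache(x):
    # Option 2: the forward pass stores y, so the backward pass reuses it;
    # g is evaluated only once.
    y = g(x)                     # forward pass, y kept in memory
    z = f(y)
    return df(y) * dg(x)

print(grad_without_cache(0.5), grad_with_cache(0.5))   # same gradient either way
```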

The more intermediate functions there are (like g), the more functions get recomputed. For example, for z = f(g(a(b(x)))) = f(y) = f(g(v)) = f(g(a(w))), where w = b(x), v = a(w), and y = g(v), the derivative is:

\frac{\mathrm{d} z}{\mathrm{d} x}=\frac{\mathrm{d} z}{\mathrm{d} y}\frac{\mathrm{d} y}{\mathrm{d} v}\frac{\mathrm{d} v}{\mathrm{d} w}\frac{\mathrm{d} w}{\mathrm{d} x}=f'(g(a(b(x))))\,g'(a(b(x)))\,a'(b(x))\,b'(x)

If the intermediate results are not saved in the forward pass, forward propagation computes b(x), then a(w) and releases the memory occupied by w, then g(v) and releases v, then f(y) and releases y. When computing the gradient in the backward pass (still without saving intermediate results), we compute b(x), a(w), g(v), f'(y); then b(x), a(w), g'(v); then b(x), a'(w); and finally b'(x). Counting these up, during backpropagation b(x) is computed 3 times, a(w) 2 times, and g(v) 1 time, even though all of these were already computed during forward propagation. If we save b(x), a(w), and g(v), then backpropagation only needs to compute f'(y), g'(v), a'(w), and b'(x).
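The 3/2/1 recomputation counts above can be verified with a small sketch; b, a, g, f below are illustrative functions of my own choosing, and a counter records how often each is called during the backward pass:

```python
import math
from collections import Counter

calls = Counter()

def b(x): calls['b'] += 1; return x + 1.0
def a(w): calls['a'] += 1; return 2.0 * w
def g(v): calls['g'] += 1; return math.sin(v)
def f(y): calls['f'] += 1; return y * y

# Derivatives of the illustrative functions above.
db = lambda x: 1.0
da = lambda w: 2.0
dg = lambda v: math.cos(v)
df = lambda y: 2.0 * y

def backward_no_cache(x):
    # Backward pass only, nothing saved: every chain-rule factor recomputes its argument.
    calls.clear()
    grad = df(g(a(b(x)))) * dg(a(b(x))) * da(b(x)) * db(x)
    return grad, dict(calls)        # b counted 3 times, a twice, g once

def backward_with_cache(x):
    # Forward pass saves the intermediates, then the backward pass reuses them.
    calls.clear()
    w = b(x); v = a(w); y = g(v); z = f(y)
    grad = df(y) * dg(v) * da(w) * db(x)
    return grad, dict(calls)        # b, a, g, f each counted exactly once

print(backward_no_cache(0.3))
print(backward_with_cache(0.3))
```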

The cost of saving the intermediate results is memory; the cost of not saving them is extra computation.

The backpropagation algorithm is a gradient algorithm that saves the intermediate results and applies the chain rule: it trades space for time.

3. Computation graph

As mentioned above, if the intermediate results are stored during the forward pass, then the backward pass only needs to compute f'(y), g'(v), a'(w), and b'(x). Is there an order to these? There is. The formulas above all compute the gradient of the final result z with respect to the bottom-most variable x. In deep learning, however, the intermediate layers also hold many parameters, each layer with its own, and what we actually want are the gradients of z with respect to those intermediate-layer parameters. For that, backpropagation must compute f'(y) first, then g'(v), then a'(w), then b'(x); in this order we obtain the gradient of z with respect to every intermediate variable. This is explained below.

As mentioned above, backpropagation, like forward propagation, proceeds step by step, and these steps are reflected in the order of computation, which is where computation graphs come in.

Let's take \frac{\mathrm{d} z}{\mathrm{d} x}=\frac{\mathrm{d} z}{\mathrm{d} y}\frac{\mathrm{d} y}{\mathrm{d} v}\frac{\mathrm{d} v}{\mathrm{d} w}\frac{\mathrm{d} w}{\mathrm{d} x}=f'(g(a(b(x))))\,g'(a(b(x)))\,a'(b(x))\,b'(x) as an example.

The computation graph uses variables as nodes and operations as edges. For the formula above, the graph is a simple chain: x → w = b(x) → v = a(w) → y = g(v) → z = f(y).

The order of computing the gradients during backpropagation is then as follows:

First compute \frac{\mathrm{d} z}{\mathrm{d} y} and save it; then compute \frac{\mathrm{d} y}{\mathrm{d} v}, multiply it by the saved \frac{\mathrm{d} z}{\mathrm{d} y} to obtain \frac{\mathrm{d} z}{\mathrm{d} v}, and save it;

then compute \frac{\mathrm{d} v}{\mathrm{d} w}, multiply it by the saved \frac{\mathrm{d} z}{\mathrm{d} v} to obtain \frac{\mathrm{d} z}{\mathrm{d} w}, and save it;

then compute \frac{\mathrm{d} w}{\mathrm{d} x}, multiply it by the saved \frac{\mathrm{d} z}{\mathrm{d} w} to obtain \frac{\mathrm{d} z}{\mathrm{d} x}, and save it.

In other words, by computing from back to front, we obtain the gradient of z (the LOSS) with respect to every intermediate variable.
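Here is a minimal sketch of that back-to-front order, using the same kind of illustrative functions as above (my own choices for b, a, g, f): each step multiplies the saved gradient-so-far by one local derivative.

```python
import math

# Illustrative chain: x --b--> w --a--> v --g--> y --f--> z
b, db = (lambda x: x + 1.0), (lambda x: 1.0)
a, da = (lambda w: 2.0 * w), (lambda w: 2.0)
g, dg = (lambda v: math.sin(v)), (lambda v: math.cos(v))
f, df = (lambda y: y * y), (lambda y: 2.0 * y)

x = 0.3

# Forward pass: save every intermediate node of the computation graph.
w = b(x); v = a(w); y = g(v); z = f(y)

# Backward pass, from back to front: each step multiplies the saved
# gradient by one local derivative and saves the result.
dz_dy = df(y)             # dz/dy, saved
dz_dv = dz_dy * dg(v)     # dz/dv = dz/dy * dy/dv, saved
dz_dw = dz_dv * da(w)     # dz/dw = dz/dv * dv/dw, saved
dz_dx = dz_dw * db(x)     # dz/dx = dz/dw * dw/dx

print(dz_dy, dz_dv, dz_dw, dz_dx)
```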

4. Summary

To sum up, the backpropagation algorithm trades space for time: it uses the chain rule to compute the gradient step by step, from back to front.

This concludes the general understanding of the backpropagation algorithm. If anything here is wrong, please correct me~
