The derivative of a tensor operation: the gradient

The gradient is the derivative of a tensor operation. It extends the concept of the derivative to functions of multidimensional inputs, that is, functions that take tensors as input.
Suppose you have an input vector x, a matrix W, a target y, and a loss function loss. You can use W to compute a prediction y_pred from x, and then compute the loss, that is, the distance between the prediction y_pred and the target y.
y_pred = dot(W, x)
loss_value = loss(y_pred, y)
If the inputs x and y are kept fixed, this can be viewed as a function mapping W to a loss value.
loss_value = f(W)
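
To make this concrete, here is a minimal NumPy sketch of the same setup; the shapes, the random weight values, and the mean-squared-error loss are illustrative assumptions, not part of the original text.

import numpy as np

# Illustrative data: a 2-element input vector, a 3-element target,
# and a 3x2 weight matrix (all values are made up for this example).
x = np.array([0.5, -1.0])
y = np.array([1.0, 0.0, 0.0])
W = np.random.random((3, 2))

def loss(y_pred, y_true):
    # Example distance between prediction and target: mean squared error.
    return np.mean((y_pred - y_true) ** 2)

def f(W):
    # With x and y held fixed, the loss value is a function of W alone.
    y_pred = np.dot(W, x)
    return loss(y_pred, y)

loss_value = f(W)
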
Suppose the current value of W is W0. The derivative of f at the point W0 is a tensor gradient(f)(W0) with the same shape as W, where each coefficient gradient(f)(W0)[i, j] indicates the direction and magnitude of the change in loss_value you would observe when modifying W0[i, j].
The tensor gradient(f)(W0) is the derivative of the function f(W) = loss_value at W0. You saw earlier that the derivative of a univariate function f(x) can be interpreted as the slope of the curve of f. Likewise, gradient(f)(W0) can be interpreted as a tensor describing the curvature of f(W) around W0.
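
One way to see what gradient(f)(W0)[i, j] means is to approximate it with finite differences. The sketch below (reusing the f and W from the sketch above) is only an illustration of that interpretation; the step size eps is an arbitrary choice.

def numerical_gradient(f, W0, eps=1e-6):
    # Returns a tensor with the same shape as W0 whose [i, j] entry
    # approximates how loss_value changes when W0[i, j] is nudged by eps.
    grad = np.zeros_like(W0)
    for i in range(W0.shape[0]):
        for j in range(W0.shape[1]):
            W_step = W0.copy()
            W_step[i, j] += eps
            grad[i, j] = (f(W_step) - f(W0)) / eps
    return grad

grad_W0 = numerical_gradient(f, W)  # same shape as W
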

Stochastic gradient descent

Given a differentiable function, its minimum can in theory be found analytically: the minimum of a function is a point where the derivative is 0, so you only need to find all points where the derivative is 0 and determine at which of these points the function takes its smallest value.
Applied to a neural network, this means analytically finding the combination of weight values that yields the smallest possible loss. This can be done by solving the equation gradient(f)(W) = 0 for W, which is a polynomial equation in N variables, where N is the number of coefficients in the network.
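
As a small worked illustration of the analytic approach, here is a toy univariate example (invented for this sketch, not taken from the text): setting the derivative to 0 and solving gives the minimum directly.

# Toy function: f(x) = (x - 3)**2 + 1, with derivative f'(x) = 2 * (x - 3).
# Setting the derivative to 0 gives x = 3, the point of minimum value.

def f_toy(x):
    return (x - 3) ** 2 + 1

def f_toy_prime(x):
    return 2 * (x - 3)

x_min = 3.0
assert f_toy_prime(x_min) == 0.0
print(f_toy(x_min))        # 1.0, the minimum value
print(f_toy(x_min + 0.5))  # 1.25, nearby points give larger values
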

Chaining derivatives: the backpropagation algorithm

In the previous algorithm, we assumed that because the function is differentiable, its derivative can be computed explicitly. In practice, a neural network function consists of many tensor operations chained together, each of which has a simple, known derivative. For example, the following network f is composed of three tensor operations a, b, and c, with three weight matrices W1, W2, and W3.
f(W1, W2, W3) = a(W1, b(W2, c(W3)))
According to calculus, such a chain of functions can be differentiated using the following identity, called the chain rule: (f(g(x)))' = f'(g(x)) * g'(x). Applying the chain rule to the computation of gradient values in a neural network gives rise to an algorithm called backpropagation (sometimes also called reverse-mode differentiation). Backpropagation starts from the final loss value and works backward from the top layers to the bottom layers, applying the chain rule to compute the contribution of each parameter to the loss value.
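
Here is a small sketch, using an invented composition rather than an actual network, that checks the chain-rule identity numerically; backpropagation applies this same identity repeatedly, layer by layer.

# Illustrative composition: g(x) = 3 * x + 1 and f(u) = u ** 2,
# so (f(g(x)))' = f'(g(x)) * g'(x) = 2 * (3 * x + 1) * 3.

def g(x):
    return 3 * x + 1

def f_outer(u):
    return u ** 2

def g_prime(x):
    return 3.0

def f_outer_prime(u):
    return 2 * u

def chain_rule_derivative(x):
    # (f(g(x)))' = f'(g(x)) * g'(x)
    return f_outer_prime(g(x)) * g_prime(x)

x = 2.0
eps = 1e-6
numerical = (f_outer(g(x + eps)) - f_outer(g(x))) / eps
print(chain_rule_derivative(x), numerical)  # both close to 42.0
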

Iterating over all training data once is called an epoch
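
As a rough illustration of this definition, the following sketch visits the training data once, in mini-batches; train_x, train_y, batch_size, and update_weights are hypothetical placeholders, not part of the original text.

import numpy as np

# Hypothetical training set: 1000 samples with 2 features and 3 targets each.
train_x = np.random.random((1000, 2))
train_y = np.random.random((1000, 3))
batch_size = 128

def update_weights(batch_x, batch_y):
    # Placeholder for one training step (forward pass, gradient, weight update).
    pass

# One epoch: every training sample is visited exactly once, in mini-batches.
for start in range(0, len(train_x), batch_size):
    update_weights(train_x[start:start + batch_size],
                   train_y[start:start + batch_size])
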
