[Linear Algebra Foundations of Deep Learning] Deriving the backpropagation algorithm with the chain rule

chain rule

To put it simply, the chain rule still computes the partial derivative of y with respect to x; but when the function is complicated, we split it into simpler pieces and differentiate each piece in turn along the chain, which makes the whole calculation easier.

Formula: f = k(a + bc)

In layman's terms, the chain rule says that if we know the instantaneous rate of change of z with respect to y and the instantaneous rate of change of y with respect to x, then the instantaneous rate of change of z with respect to x is the product of the two: dz/dx = (dz/dy) · (dy/dx). In other words, it is exactly the rule for differentiating a composite function.
Split the function into intermediate variables, for example u = bc and v = a + u, so that f = kv.

Use the chain rule (chain these local gradient expressions and multiply them) to differentiate with respect to the variables a, b, and c separately:

∂f/∂a = (∂f/∂v)(∂v/∂a) = k · 1 = k
∂f/∂b = (∂f/∂v)(∂v/∂u)(∂u/∂b) = k · 1 · c = kc
∂f/∂c = (∂f/∂v)(∂v/∂u)(∂u/∂c) = k · 1 · b = kb
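To make the split concrete, here is a minimal Python sketch (the intermediate names u and v and the finite-difference check are illustrative additions, not from the original) that computes the three partial derivatives by the chain rule and verifies them numerically:

# Chain rule for f = k(a + b*c), split as u = b*c, v = a + u, f = k*v
def grads_chain_rule(k, a, b, c):
    # df/dv = k, dv/da = 1, dv/du = 1, du/db = c, du/dc = b
    df_da = k * 1          # = k
    df_db = k * 1 * c      # = k*c
    df_dc = k * 1 * b      # = k*b
    return df_da, df_db, df_dc

def grads_numeric(k, a, b, c, eps=1e-6):
    # Finite-difference approximation as a sanity check
    f = lambda a, b, c: k * (a + b * c)
    return ((f(a + eps, b, c) - f(a, b, c)) / eps,
            (f(a, b + eps, c) - f(a, b, c)) / eps,
            (f(a, b, c + eps) - f(a, b, c)) / eps)

print(grads_chain_rule(2.0, 1.0, 3.0, 4.0))  # (2.0, 8.0, 6.0)
print(grads_numeric(2.0, 1.0, 3.0, 4.0))     # approximately the same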

forward propagation

Forward propagation (or forward pass) refers to computing and storing the result of each layer of the neural network in order, from the input layer to the output layer.

For the intermediate variable:

z = Wx + b

where W is the parameter weight and b is the bias. The result then passes through the activation function C (common activation functions include Sigmoid, tanh, and ReLU):

a = C(z)

Assuming the loss function is l and the actual value is h, we can compute the loss term for a single data sample:

L = l(a, h)
Regardless of the optimization function, this completes one pass of a single neuron from input to output; afterwards the error must be backpropagated, the weights updated, and the output recomputed.
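As a small illustration of the forward pass described above, the sketch below (choosing sigmoid for the activation C, a squared-error form for l, and toy values for x, W, b, and h are all assumptions for illustration) computes z = Wx + b, applies the activation, and evaluates the loss against the actual value h:

import numpy as np

def forward_pass(x, W, b, h):
    # Intermediate variable: weighted sum of the inputs plus the bias
    z = np.dot(W, x) + b
    # Activation function C (sigmoid chosen here as an example)
    a = 1 / (1 + np.exp(-z))
    # Loss term for a single data sample (squared error as an example of l)
    loss = 0.5 * np.sum((a - h) ** 2)
    return z, a, loss

x = np.array([0.5, -1.0])     # two inputs
W = np.array([[0.2, 0.4]])    # 1x2 weight matrix
b = np.array([0.1])           # bias
h = np.array([1.0])           # actual value
print(forward_pass(x, W, b, h))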

backpropagation

Backpropagation (backward propagation or backpropagation) refers to the method of calculating the gradient of neural network parameters. Briefly, the method traverses the network in reverse order from the output layer to the input layer according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) needed to compute gradients for certain parameters.

(1) Gradient descent
Before discussing the backpropagation algorithm, let's briefly review gradient descent. For the loss function (here assumed to be MSE, the mean squared error) over m samples,

J(W, b) = 1/(2m) * Σ_i (ŷ_i − y_i)²

where ŷ_i is the prediction and y_i is the actual value, gradient descent updates the parameters in the direction of the negative gradient with learning rate α:

W ← W − α · ∂J/∂W,    b ← b − α · ∂J/∂b

There are other gradient descent methods optimized on this basis: mini-batch gradient descent (Mini Batch GD), stochastic gradient descent (Stochastic GD), and so on.
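To show how these variants differ only in how many samples feed each update, here is a hedged sketch of mini-batch gradient descent on an MSE loss for a plain linear model (the model, the data shapes, and the hyperparameters are illustrative, not from the original):

import numpy as np

def mini_batch_gd(X, y, lr=0.1, batch_size=16, epochs=100):
    # Linear model y_hat = X @ w with MSE loss J(w) = 1/(2m) * sum((y_hat - y)^2)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)           # reshuffle each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb) / len(batch)  # dJ/dw on the mini-batch
            w -= lr * grad                            # gradient descent step
    return w

# batch_size = 1 gives stochastic GD; batch_size = n gives full-batch GD.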
(2) Backpropagation
Backpropagation calculates the gradient of the loss function with respect to the network weights for a single input-output example. To illustrate the process, a 2-layer neural network with 2 inputs and 1 output is used, as shown in the figure below:
[Figure: a 2-layer neural network with 2 inputs and 1 output]
Regardless of the optimization algorithm, a single neuron has the structure shown in the figure below: the first unit sums the products of the weight coefficients and the input signals, and the second unit applies the neuron's activation function (backpropagation requires the activation function to be differentiable, which constrains the network design):
[Figures: the structure of a single neuron, and the step-by-step chain-rule derivation of the error terms and weight gradients]
Following the chain rule, compute the gradients of all the weights W to be updated (the bias values are handled the same way). Backpropagation thus gives the direction of steepest descent of the loss function with respect to the current neuron's weights, and the weights can then be modified along that direction to reduce the loss efficiently.
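The derivation summarized above can be followed step by step in code. Below is a minimal sketch under the setup described (two inputs, one sigmoid output; the hidden-layer size, the squared-error loss, and the variable names are illustrative assumptions): each gradient is obtained by multiplying the local derivatives along the chain, exactly as the chain rule prescribes.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    # Forward pass: store every intermediate result
    z1 = W1 @ x + b1           # hidden pre-activation
    a1 = sigmoid(z1)           # hidden activation
    z2 = W2 @ a1 + b2          # output pre-activation
    a2 = sigmoid(z2)           # network output
    loss = 0.5 * np.sum((a2 - y) ** 2)

    # Backward pass: chain rule from the output back to every weight
    dz2 = (a2 - y) * a2 * (1 - a2)      # dL/dz2 = dL/da2 * da2/dz2
    dW2 = np.outer(dz2, a1)             # dL/dW2
    db2 = dz2
    dz1 = (W2.T @ dz2) * a1 * (1 - a1)  # propagate the error back through W2
    dW1 = np.outer(dz1, x)              # dL/dW1
    db1 = dz1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
x, y = np.array([0.5, -0.2]), np.array([1.0])
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # 2 inputs -> 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 3 hidden -> 1 output
print(forward_backward(x, y, W1, b1, W2, b2)[0])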
(3) Backpropagation code

import numpy as np

def basic_sigmoid(z):
    # Sigmoid activation; helper assumed by propagate below
    return 1 / (1 + np.exp(-z))

def optimize(w, b, X, Y, num_iterations, learning_rate):
    costs = []

    for i in range(num_iterations):
        # Compute the gradients and the current cost
        grads, cost = propagate(w, b, X, Y)

        # Pull out the gradients of the two parameters
        dw = grads['dw']
        db = grads['db']

        # Gradient descent update
        w = w - learning_rate * dw
        b = b - learning_rate * db

        if i % 100 == 0:
            costs.append(cost)
            print("loss at iteration %i: %f" % (i, cost))
            print(b)

    params = {"w": w, "b": b}
    grads = {"dw": dw, "db": db}
    return params, grads, costs

def propagate(w, b, X, Y):
    m = X.shape[1]

    # Forward propagation
    A = basic_sigmoid(np.dot(w.T, X) + b)
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

    # Backward propagation
    dz = A - Y
    dw = 1 / m * np.dot(X, dz.T)
    db = 1 / m * np.sum(dz)

    grads = {"dw": dw, "db": db}

    return grads, cost
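A quick way to exercise the two functions above is a toy logistic-regression fit; the synthetic data below is purely illustrative:

import numpy as np

np.random.seed(1)
X = np.random.randn(2, 200)                                  # 2 features, 200 samples
Y = (X[0, :] + X[1, :] > 0).astype(float).reshape(1, 200)    # toy labels
w = np.zeros((2, 1))
b = 0.0

params, grads, costs = optimize(w, b, X, Y, num_iterations=1000, learning_rate=0.1)
print(params["w"], params["b"])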

