chain rule
To put it simply, the chain rule computes the partial derivative of y with respect to x for a composite function: when the function is complicated, we split it into simpler pieces, differentiate each piece separately, and chain the results together, which makes the whole calculation easier.
Formula: f = k(a + bc)
In layman's terms, the chain rule says that if we know the instantaneous rate of change of z with respect to y and the instantaneous rate of change of y with respect to x, then the instantaneous rate of change of z with respect to x is the product of these two rates. In effect, it is the process of differentiating a composite function.
Using the chain rule (multiplying the chained gradient expressions together), differentiate f with respect to the variables a, b, and c separately:
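The partial derivatives of f = k(a + bc) can be worked out factor by factor and checked numerically. A minimal sketch (the numeric values are illustrative assumptions, not from the original):

```python
# Chain-rule check for f = k(a + bc); the sample values are assumptions.
k, a, b, c = 2.0, 1.0, 3.0, 4.0

u = b * c          # inner term
v = a + u          # intermediate sum
f = k * v          # f = k(a + bc)

# Analytic partials via the chain rule:
df_da = k * 1.0    # df/dv * dv/da = k
df_db = k * c      # df/dv * dv/du * du/db = k * c
df_dc = k * b      # df/dv * dv/du * du/dc = k * b

# Verify one of them against a central finite difference
eps = 1e-6
f_of_b = lambda bb: k * (a + bb * c)
approx = (f_of_b(b + eps) - f_of_b(b - eps)) / (2 * eps)
print(df_da, df_db, df_dc)  # 2.0 8.0 6.0
```

The finite-difference value of df/db agrees with the chain-rule result k*c, which is exactly the "chain these gradient expressions to multiply" idea.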
forward propagation
Forward propagation (or forward pass) refers to: calculating and storing the results of each layer in the neural network in order (from the input layer to the output layer).
For intermediate variables:
W is the weight parameter and b is the bias; the linear result is then passed through an activation function (common activation functions include Sigmoid, tanh, and ReLU).
Assuming the loss function is l and the actual (target) value is h, we can compute the loss term for a single data sample. Regardless of the optimization function, a single neuron runs from input to output; we then need to backpropagate the error, update the weights, and recompute the output.
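The forward pass for a single neuron can be sketched as follows. This is a minimal example assuming a sigmoid activation and a squared-error loss; the input, weight, and target values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes the linear result into (0, 1)
    return 1 / (1 + np.exp(-z))

x = np.array([0.5, -1.0])   # input features
W = np.array([0.3, 0.8])    # weights
b = 0.1                     # bias
h = 1.0                     # actual (target) value

z = np.dot(W, x) + b        # linear combination: Wx + b
a = sigmoid(z)              # activation output
loss = 0.5 * (a - h) ** 2   # squared-error loss for one sample
print(z, a, loss)
```

Each intermediate result (z, a, loss) is stored, because backpropagation will reuse these values when computing gradients in reverse order.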
backpropagation
Backpropagation (backward propagation or backpropagation) refers to the method of calculating the gradient of neural network parameters. Briefly, the method traverses the network in reverse order from the output layer to the input layer according to the chain rule from calculus. The algorithm stores any intermediate variables (partial derivatives) needed to compute gradients for certain parameters.
(1) Gradient descent
Before discussing the backpropagation algorithm, let’s briefly review gradient descent. For the loss function (here assumed to be MSE, the mean squared error), there are several methods optimized on top of plain gradient descent: mini-batch gradient descent (Mini-Batch GD), stochastic gradient descent (Stochastic GD), and so on.
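The variants differ only in how many samples feed each update. A minimal sketch on an MSE loss for a 1-D linear model (the data, learning rate, and batch size are assumptions for illustration; batch size 1 gives SGD, the full set gives batch GD):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)  # true weight is 3.0

def mse_grad(w, xb, yb):
    # d/dw mean((w*x - y)^2) = 2 * mean((w*x - y) * x)
    return 2 * np.mean((w * xb - yb) * xb)

w, lr = 0.0, 0.1
for epoch in range(50):
    idx = rng.permutation(len(x))          # shuffle each epoch
    for start in range(0, len(x), 10):     # mini-batches of 10
        batch = idx[start:start + 10]
        w -= lr * mse_grad(w, x[batch], y[batch])
print(round(w, 2))  # converges near the true weight 3.0
```

Mini-batch GD trades the noisy updates of SGD against the per-step cost of full-batch GD, which is why it is the usual default in practice.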
(2) Backpropagation
Backpropagation calculates the gradient of the loss function with respect to the network weights for a single input-output example. To illustrate the process, a 2-layer neural network with 2 inputs and 1 output is used, as shown in the figure below:
Regardless of the optimization algorithm, a single neuron's structure is shown in the figure below. The first unit sums the products of the weight coefficients and the input signals; the second unit applies the neuron's activation function (backpropagation requires the activation function to be differentiable, which must be ensured during network design), as shown in the figure below:
Compute the gradients of all weights W to be updated according to the chain rule, and handle the bias values in the same way. Backpropagation yields the direction of steepest descent of the loss function with respect to the current neuron's weights; the weights can then be moved along that direction to reduce the loss efficiently.
(3) Backpropagation code
import numpy as np

def basic_sigmoid(x):
    # Sigmoid activation function
    return 1 / (1 + np.exp(-x))

def optimize(w, b, X, Y, num_iterations, learning_rate):
    costs = []
    for i in range(num_iterations):
        # Compute the gradients and the cost
        grads, cost = propagate(w, b, X, Y)
        # Retrieve the gradients of the two parameters
        dw = grads['dw']
        db = grads['db']
        # Update according to the gradient descent formula
        w = w - learning_rate * dw
        b = b - learning_rate * db
        if i % 100 == 0:
            costs.append(cost)
            print("cost at iteration %i: %f" % (i, cost))
    params = {
        "w": w,
        "b": b}
    grads = {
        "dw": dw,
        "db": db}
    return params, grads, costs

def propagate(w, b, X, Y):
    m = X.shape[1]
    # Forward propagation
    A = basic_sigmoid(np.dot(w.T, X) + b)
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    # Backward propagation
    dz = A - Y
    dw = 1 / m * np.dot(X, dz.T)
    db = 1 / m * np.sum(dz)
    grads = {
        "dw": dw,
        "db": db}
    return grads, cost