[Deep Learning] Basic Backpropagation Guidelines

This is a topic I have always wanted to write down. I came across an easy-to-understand explanation of it in the book "Neural Networks and Deep Learning"; combined with a routine I ran recently, let's go through it visually.

In many introductory books, backpropagation is treated as a black box: the reader only needs to know that it updates the weights to train the network. But the basic mechanism of this training is, in essence, an interpretation of the neural network itself, and that interpretation is key to understanding the applicability and scalability of neural networks. That is why this article expands on it.


Network structure and parameter definitions


Before deriving the backpropagation mechanism itself, we first define the network structure.

Take a simple three-layer network as an example. The first layer is the input layer; here we assume three input neurons. The second layer is the hidden layer (or middle layer), where the network performs the more complex feature-extraction operations; a so-called deep network is simply a network with multiple hidden layers, and deepening the network improves its ability to express nonlinear functions. The third layer is the output layer, where the two output activation values are fed, together with the expected values, into the cost function, which outputs the error.

[Figure: the three-layer network structure described above]

From the perceptron model in the previous section, we already know the basic equation formed by the connection weight $w$, the bias $b$, the activation function $\sigma$ and the corresponding activation value $a$. For a neural network built from multiple layers of such units, the equation for a single node stays the same. Using vector form, the activation values of a given layer can be expressed more concisely as:

$$a^{l} = \sigma(w^{l} a^{l-1} + b^{l}), \qquad \text{where the weighted input is defined as } z^{l} = w^{l} a^{l-1} + b^{l}$$

That is, the product of the weight matrix between layers and the activation vector of the previous layer, plus the bias of the current layer, gives the weighted input $z^{l}$ of the current layer; passing the weighted input through the activation function gives the layer's activation values.
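To make the notation concrete, here is a minimal NumPy sketch of the layer-by-layer forward pass for the three-layer example network. The hidden-layer width, the sigmoid activation and the random initialization are assumptions made purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes: 3 inputs, 4 hidden neurons, 2 outputs (the hidden width is an arbitrary assumption).
sizes = [3, 4, 2]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]  # w^l, shape (out, in)
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]                       # b^l

def forward(a):
    """Compute a^l = sigma(w^l a^{l-1} + b^l) layer by layer, keeping every z^l and a^l."""
    zs, activations = [], [a]
    for w, b in zip(weights, biases):
        z = w @ a + b        # weighted input z^l
        a = sigmoid(z)       # activation a^l
        zs.append(z)
        activations.append(a)
    return zs, activations

zs, activations = forward(rng.standard_normal((3, 1)))  # one sample as a column vector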


We also introduce one piece of matrix notation, the Hadamard product (also called the Schur product) $\odot$, which will simplify the equations later.

Assuming $s$ and $t$ are two vectors of the same dimension, $s \odot t$ denotes their element-wise product, i.e. $(s \odot t)_{j} = s_{j} t_{j}$.

Examples are as follows:

$$\begin{bmatrix} 1 \\ 2 \end{bmatrix} \odot \begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1 \cdot 3 \\ 2 \cdot 4 \end{bmatrix} = \begin{bmatrix} 3 \\ 8 \end{bmatrix}$$
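In PyTorch (used later in this article) the Hadamard product is simply the element-wise `*` operator; a quick check of the example above:

import torch

s = torch.tensor([1.0, 2.0])
t = torch.tensor([3.0, 4.0])
print(s * t)  # tensor([3., 8.]) -- element-wise (Hadamard) product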


Derivation of the equations


Backpropagation is really about how changing the weights and biases changes the cost function; ultimately this means computing the partial derivatives $\partial C / \partial w_{jk}^{l}$ and $\partial C / \partial b_{j}^{l}$. A useful intermediate quantity is $\partial C / \partial z_{j}^{l}$, which can be read as a measure of the error on a neuron (choosing $z_{j}^{l}$ here keeps the later equations simple). We therefore define the error $\delta_{j}^{l}$ of the $j$-th neuron in layer $l$ as:

$$\delta_{j}^{l} \equiv \frac{\partial C}{\partial z_{j}^{l}}$$

By convention, $\delta^{l}$ denotes the error vector associated with layer $l$. Backpropagation lets us compute $\delta^{l}$ for every layer and then relate these errors to the quantities we actually need, $\partial C / \partial w_{jk}^{l}$ and $\partial C / \partial b_{j}^{l}$.

Before starting the proofs, here are the four backpropagation equations:


Summary: the backpropagation equations

$$\delta^{L} = \nabla_{a} C \odot \sigma'(z^{L}) \qquad \text{output layer error} \qquad (1)$$

$$\delta^{l} = \left((w^{l+1})^{T} \delta^{l+1}\right) \odot \sigma'(z^{l}) \qquad \text{error propagated to the previous layer} \qquad (2)$$

$$\frac{\partial C}{\partial b_{j}^{l}} = \delta_{j}^{l} \qquad \text{rate of change with respect to the bias} \qquad (3)$$

$$\frac{\partial C}{\partial w_{jk}^{l}} = a_{k}^{l-1} \delta_{j}^{l} \qquad \text{rate of change with respect to the weight} \qquad (4)$$
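As a preview of where the proofs lead, the four equations translate almost line by line into code. The sketch below is a minimal NumPy implementation that assumes a quadratic cost $C = \frac{1}{2}\lVert a^{L} - y\rVert^{2}$ (so $\nabla_{a} C = a^{L} - y$), sigmoid activations and random data; it is only an illustration of (1)-(4), not the training code used later in the article.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

sizes = [3, 4, 2]                         # 3 inputs, 4 hidden, 2 outputs (assumed sizes)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def backprop(x, y):
    # Forward pass: store every z^l and a^l for later use.
    a, activations, zs = x, [x], []
    for w, b in zip(weights, biases):
        z = w @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)

    grad_w = [np.zeros_like(w) for w in weights]
    grad_b = [np.zeros_like(b) for b in biases]

    # (1) output layer error: delta^L = (a^L - y) ⊙ sigma'(z^L)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    grad_b[-1] = delta                        # (3) dC/db^L = delta^L
    grad_w[-1] = delta @ activations[-2].T    # (4) dC/dw^L_jk = a^{L-1}_k delta^L_j

    # (2) pass the error back through the hidden layers
    for l in range(2, len(sizes)):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        grad_b[-l] = delta
        grad_w[-l] = delta @ activations[-l - 1].T
    return grad_w, grad_b

grad_w, grad_b = backprop(rng.standard_normal((3, 1)), rng.standard_normal((2, 1)))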


Now let's prove them one by one; keep the multivariable chain rule in mind throughout, since it is the core of every step.


(1) Output layer error

For the output layer, the error is defined as
$$\delta_{j}^{L} = \frac{\partial C}{\partial z_{j}^{L}}$$
where $L$ is the number of layers in the network, so this is the error on the $j$-th neuron of the output layer. We can rewrite this partial derivative in terms of the partial derivatives with respect to the output activations:
$$\delta_{j}^{L} = \sum_{k} \frac{\partial C}{\partial a_{k}^{L}} \frac{\partial a_{k}^{L}}{\partial z_{j}^{L}}$$
where the sum runs over all neurons in the output layer. The output activation $a_{k}^{L}$ of the $k$-th neuron depends on the weighted input $z_{j}^{L}$ of the $j$-th neuron only when $k = j$; when $k \ne j$, $\partial a_{k}^{L} / \partial z_{j}^{L} = 0$. The equation therefore simplifies to
$$\delta_{j}^{L} = \frac{\partial C}{\partial a_{j}^{L}} \frac{\partial a_{j}^{L}}{\partial z_{j}^{L}}$$
This shows that the error of an output neuron depends only on the partial derivative of the cost with respect to that neuron's activation and on the derivative of the activation with respect to the weighted input (that is, the derivative of the activation function at that point). Since $a_{j}^{L} = \sigma(z_{j}^{L})$, the second factor can be written as $\sigma'(z_{j}^{L})$, giving
$$\delta_{j}^{L} = \frac{\partial C}{\partial a_{j}^{L}} \, \sigma'(z_{j}^{L})$$
Writing this out for every neuron in the output layer and using the Hadamard product yields the form of equation (1):
$$\delta^{L} = \nabla_{a} C \odot \sigma'(z^{L})$$
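Equation (1) can be checked numerically with autograd. The snippet below assumes a quadratic cost and a sigmoid output layer, and verifies that the $\partial C / \partial z^{L}$ computed by PyTorch equals $(a^{L} - y) \odot \sigma'(z^{L})$:

import torch

torch.manual_seed(0)
z_L = torch.randn(2, requires_grad=True)    # weighted inputs of the output layer (arbitrary values)
y = torch.rand(2)                           # target values
a_L = torch.sigmoid(z_L)
C = 0.5 * ((a_L - y) ** 2).sum()            # quadratic cost (an assumption)
C.backward()

delta_manual = ((a_L - y) * a_L * (1 - a_L)).detach()   # (a^L - y) ⊙ sigma'(z^L)
print(torch.allclose(z_L.grad, delta_manual))           # True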


(2) Error propagated to the previous layer

This is the formula that best explains the word "backward" in backpropagation: the error is passed from the back of the network to the front. Concretely, we express the error $\delta^{l}$ in terms of the next layer's error $\delta^{l+1}$. As before, we rewrite $\delta_{j}^{l} = \partial C / \partial z_{j}^{l}$ using $\delta_{k}^{l+1} = \partial C / \partial z_{k}^{l+1}$:
$$\delta_{j}^{l} = \frac{\partial C}{\partial z_{j}^{l}} = \sum_{k} \frac{\partial C}{\partial z_{k}^{l+1}} \frac{\partial z_{k}^{l+1}}{\partial z_{j}^{l}} = \sum_{k} \frac{\partial z_{k}^{l+1}}{\partial z_{j}^{l}} \, \delta_{k}^{l+1}$$
The last step swaps the two factors and substitutes the definition of $\delta_{k}^{l+1}$, which brings in the error of the next layer. To evaluate the remaining factor, expand
$$z_{k}^{l+1} = \sum_{j} w_{kj}^{l+1} a_{j}^{l} + b_{k}^{l+1} = \sum_{j} w_{kj}^{l+1} \sigma(z_{j}^{l}) + b_{k}^{l+1}$$
Differentiating gives
$$\frac{\partial z_{k}^{l+1}}{\partial z_{j}^{l}} = w_{kj}^{l+1} \, \sigma'(z_{j}^{l})$$
Substituting back into the original expression, we get
$$\delta_{j}^{l} = \sum_{k} w_{kj}^{l+1} \, \delta_{k}^{l+1} \, \sigma'(z_{j}^{l})$$
This is the component form of (2). Introducing the Hadamard product operator, the vector form that passes the error from the next layer back to the previous one can be written as
$$\delta^{l} = \left((w^{l+1})^{T} \delta^{l+1}\right) \odot \sigma'(z^{l})$$
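Equation (2) can be verified the same way: build one extra layer on top of $z^{l}$, let autograd compute $\partial C / \partial z^{l}$, and compare with the formula. The layer sizes and the quadratic cost are again assumptions for illustration only:

import torch

torch.manual_seed(0)
z_l = torch.randn(4, requires_grad=True)     # weighted inputs of layer l (arbitrary values)
w_next = torch.randn(2, 4)                   # w^{l+1}
b_next = torch.randn(2)                      # b^{l+1}
y = torch.rand(2)

a_l = torch.sigmoid(z_l)
z_next = w_next @ a_l + b_next               # z^{l+1}
a_next = torch.sigmoid(z_next)
C = 0.5 * ((a_next - y) ** 2).sum()          # quadratic cost on layer l+1 (an assumption)
C.backward()

delta_next = ((a_next - y) * a_next * (1 - a_next)).detach()    # delta^{l+1} from equation (1)
delta_l = (w_next.T @ delta_next) * (a_l * (1 - a_l)).detach()  # ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)
print(torch.allclose(z_l.grad, delta_l))                        # True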


(3) Rate of change of bias

As in the previous two proofs, the argument is again an application of the chain rule. Since the cost depends on the bias $b_{j}^{l}$ only through the weighted input $z_{j}^{l}$, we have
$$\frac{\partial C}{\partial b_{j}^{l}} = \frac{\partial C}{\partial z_{j}^{l}} \frac{\partial z_{j}^{l}}{\partial b_{j}^{l}} = \delta_{j}^{l} \, \frac{\partial z_{j}^{l}}{\partial b_{j}^{l}}$$
where $z_{j}^{l}$ expands to
$$z_{j}^{l} = \sum_{k} w_{jk}^{l} a_{k}^{l-1} + b_{j}^{l}$$
Taking the partial derivative of both sides with respect to $b_{j}^{l}$, and noting that $w_{jk}^{l}$ is unrelated to $b_{j}^{l}$ (each is a freely varying parameter), we get
$$\frac{\partial z_{j}^{l}}{\partial b_{j}^{l}} = 1$$
which proves (3):
$$\frac{\partial C}{\partial b_{j}^{l}} = \delta_{j}^{l}$$


(4) Rate of change of weight

Similar to the proof of (3), the cost depends on the weight $w_{jk}^{l}$ only through $z_{j}^{l}$, so
$$\frac{\partial C}{\partial w_{jk}^{l}} = \frac{\partial C}{\partial z_{j}^{l}} \frac{\partial z_{j}^{l}}{\partial w_{jk}^{l}} = \delta_{j}^{l} \, \frac{\partial z_{j}^{l}}{\partial w_{jk}^{l}}$$
As above, expand $z_{j}^{l}$:
$$z_{j}^{l} = \sum_{k} w_{jk}^{l} a_{k}^{l-1} + b_{j}^{l}$$
Taking the partial derivative of both sides with respect to $w_{jk}^{l}$: the weight $w_{jk}^{l}$ is unrelated to $b_{j}^{l}$ and to all the other weights, so only its own term survives, giving
$$\frac{\partial z_{j}^{l}}{\partial w_{jk}^{l}} = a_{k}^{l-1}$$
Rearranging gives the proof of (4):
$$\frac{\partial C}{\partial w_{jk}^{l}} = a_{k}^{l-1} \delta_{j}^{l}$$
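Equations (3) and (4) can also be confirmed with autograd: after a backward pass, the gradients PyTorch stores for the bias and weight should equal $\delta^{l}_{j}$ and $a^{l-1}_{k}\delta^{l}_{j}$ respectively. The shapes and the quadratic cost below are assumptions for illustration:

import torch

torch.manual_seed(0)
w = torch.randn(4, 3, requires_grad=True)    # w^l
b = torch.randn(4, requires_grad=True)       # b^l
a_prev = torch.rand(3)                       # a^{l-1}
y = torch.rand(4)

z = w @ a_prev + b                           # z^l
a = torch.sigmoid(z)
C = 0.5 * ((a - y) ** 2).sum()               # quadratic cost applied directly to layer l (an assumption)
C.backward()

delta = ((a - y) * a * (1 - a)).detach()                    # delta^l via equation (1)
print(torch.allclose(b.grad, delta))                        # (3): dC/db^l_j = delta^l_j
print(torch.allclose(w.grad, torch.outer(delta, a_prev)))   # (4): dC/dw^l_jk = a^{l-1}_k delta^l_j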

Summary

From the proofs above, it is easy to see how the backpropagation algorithm passes the error backwards layer by layer, and the mathematical principle behind updating the weights and biases.

Now look back at equation (1), and in its component form consider the term $\sigma'(z_{j}^{L})$. Recalling the graph of the sigmoid function, when $\sigma(z_{j}^{L})$ is approximately 0 or 1 the sigmoid becomes very flat, so $\sigma'(z_{j}^{L}) \approx 0$. Therefore, if an output neuron has a very small activation (near 0) or a very large activation (near 1), learning of the final-layer weights will be very slow. In this case the output neuron is said to be saturated, weight learning effectively stops (or proceeds very slowly), and the same holds for the output neuron's bias. Looking back at equations (3) and (4), the updates of the weights and biases depend on the error $\delta_{j}^{l}$, and for the output layer the error depends on the derivative of the activation function $\sigma'(z_{j}^{L})$, so in a saturated state learning becomes difficult.
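A quick numerical look at $\sigma'$ makes the saturation effect concrete; its maximum is 0.25 at $z = 0$ and it is essentially zero once $|z|$ is large:

import torch

z = torch.tensor([-10.0, -2.0, 0.0, 2.0, 10.0])
s = torch.sigmoid(z)
print(s * (1 - s))   # roughly [4.5e-05, 0.105, 0.25, 0.105, 4.5e-05]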

[Figure: graph of the sigmoid function]

Finally, look at (2) and notice the term $\sigma'(z_{j}^{l})$: each time the error is passed back one layer, it is multiplied by the derivative of the activation function. For the sigmoid, that derivative lies between 0 and 0.25. If the network has many layers, then as the error propagates from the output layer back to the shallow layers it is shrunk many times over and eventually drops to a negligible level (close to 0), so learning in the shallow layers becomes very difficult. This is the vanishing gradient problem.
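The shrinking can be observed directly by stacking many sigmoid layers and comparing the gradient magnitudes at the two ends of the network. The depth, width and toy loss below are arbitrary assumptions chosen only to make the effect visible:

import torch
import torch.nn as nn

torch.manual_seed(0)
layers = []
for _ in range(20):                      # 20 sigmoid layers of width 10
    layers += [nn.Linear(10, 10), nn.Sigmoid()]
net = nn.Sequential(*layers)

x = torch.randn(8, 10)
loss = net(x).pow(2).mean()              # a toy loss, just to get gradients flowing
loss.backward()

first = net[0].weight.grad.abs().mean().item()    # first (shallowest) linear layer
last = net[-2].weight.grad.abs().mean().item()    # last linear layer (net[-1] is the final Sigmoid)
print(f"mean |grad|: first layer {first:.2e}, last layer {last:.2e}")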

Also, consistent with the analysis of output-neuron saturation, saturation of intermediate neurons makes $\delta_{j}^{l}$ small as well, because $\sigma'(z_{j}^{l})$ is very small. In other words: any weight feeding into a saturated neuron learns slowly.


The backpropagation equations hold for any activation function. The choice of cost function is not discussed here, since that is a separate topic, but it can be shown that the reasoning above does not depend on any specific cost function. At the same time, this analysis suggests that the sigmoid may not be an ideal activation function in some applications; in fact, designing activation functions with specific learning properties is an important part of neural network design.


Observing it in practice

Here is a simple example showing how to observe the iterative update of the weights in a neural network; it uses a neural network for nonlinear fitting.

First we generate a noisy sinusoidal signal (a nonlinear signal):

x = torch.unsqueeze(torch.linspace(-np.pi, np.pi, 100), dim = 1)  # evenly spaced inputs, shape (100, 1)
y = torch.sin(x) + 0.5*torch.rand(x.size())  # sine values plus uniform noise

Then we define a network structure:

class Net(nn.Module):  # class that stores the network structure
    def __init__(self):
        super(Net, self).__init__()
        self.predict = nn.Sequential(
            nn.Linear(1,10),  # fully connected layer, 1 input, 10 outputs
            nn.ReLU(),  # ReLU activation
            nn.Linear(10,1)  # fully connected layer, 10 inputs, 1 output
        )

The activation function used here is ReLU. Compared with the sigmoid, ReLU does not suffer from vanishing gradients on the positive interval and its derivative is simple to compute; it is a commonly used activation function in CNNs.
[Figure: the ReLU activation function]
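A small check of the two derivatives illustrates the difference: the gradient of ReLU is exactly 1 on the positive side, while the sigmoid's derivative never exceeds 0.25:

import torch

z = torch.linspace(-3, 3, 7, requires_grad=True)
torch.relu(z).sum().backward()
print(z.grad)                     # 0 for z < 0, 1 for z > 0: no shrinking factor on the positive side

s = torch.sigmoid(z.detach())
print(s * (1 - s))                # sigmoid derivative, at most 0.25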
During training, we can obtain the weights of the fully connected layers by calling net.modules() and checking each layer's type. The corresponding code is as follows:

        # print the weights as they change; net.modules() returns a generator
        for layer in net.modules():
            if isinstance(layer, nn.Linear):
                print(layer.weight)

The complete code is as follows:

# Fitting a function with a neural network
import numpy as np
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

x = torch.unsqueeze(torch.linspace(-np.pi, np.pi, 100), dim = 1)  # evenly spaced inputs, shape (100, 1)
y = torch.sin(x) + 0.5*torch.rand(x.size())  # sine values plus uniform noise

class Net(nn.Module):  # class that stores the network structure
    def __init__(self):
        super(Net, self).__init__()
        self.predict = nn.Sequential(
            nn.Linear(1,10),  # fully connected layer, 1 input, 10 outputs
            nn.ReLU(),  # ReLU activation
            nn.Linear(10,1)  # fully connected layer, 10 inputs, 1 output
        )

    def forward(self,x):  # forward pass
        prediction = self.predict(x)  # feed x through the network
        return prediction


net = Net()
optimizer = torch.optim.SGD(net.parameters(), lr = 0.05)  # optimizer
loss_func = nn.MSELoss()  # loss function
plt.ion()   # interactive mode on

for epoch in range(1000):
    out = net(x)   # actual output
    loss = loss_func(out, y)  # compare actual output with expected output
    optimizer.zero_grad()  # clear gradients
    loss.backward()  # backpropagate the error

    optimizer.step()  # optimizer update step
    if epoch % 25 == 0:  # refresh the display every 25 epochs
        plt.cla()   # clear the previous plot
        plt.scatter(x, y)  # scatter plot of the data

        # print the weights as they change; net.modules() returns a generator
        for layer in net.modules():
            if isinstance(layer, nn.Linear):
                print(layer.weight)

        plt.plot(x, out.data.numpy(), 'r', lw = 5)  # plot the current fit
        plt.text(0, 0, f'loss={loss}', fontdict={'size': 20, 'color': 'red'})  # show the loss value

        plt.pause(0.1)  # pause for 0.1 s

    plt.show()

plt.ioff()  # interactive mode off
plt.show()  # keep the final result on screen

You can see the fit displayed in real time in the graphical window; as the loss decreases, the network gradually fits the shape of the sine curve.

[Figure/animation: the fit improving as training progresses]

We can also see the printed network weights in the console. Taking the weights from the last displayed iteration as an example, you can see a 10×1 weight matrix (1→10) and a 1×10 weight matrix (10→1), which correspond exactly to the two fully connected layers defined earlier.
[Figure: the weight matrices printed in the console]
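As a small hypothetical extension (not part of the original routine), the gradients that backpropagation just computed, i.e. the quantities in equations (3) and (4), are also available after loss.backward() and could be printed inside the training loop above in the same way as the weights:

        # hypothetical addition, placed right after loss.backward() in the loop above
        for layer in net.modules():
            if isinstance(layer, nn.Linear):
                print(layer.weight.grad, layer.bias.grad)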


References:
"Neural Networks and Deep Learning", Michael Nielsen (Chinese translation by Zhu Xiaohu), Chapter 2
"Python Neural Network Introduction and Practice", Wang Kai, Chapter 7
