Quick Learning in One Article: Neural Networks Demystified, Learn the Basics of Neural Networks in One Day (7) - Error Backpropagation

Foreword

I thought for a long time about whether to publish deep learning content, since more than half of the machine learning material in the mathematical modeling column has not been updated yet. After much consideration, I decided to put out this series of articles on neural networks. Otherwise, if neural networks, or models built on them, come up in future mathematical modeling competitions (for example, using LSTM for time-series prediction), it would be hard to explain both how to use them and how they work. Deep learning is not easy to master: it involves a lot of mathematical theory and formula derivations, and without hands-on practice it is hard to understand what the code we write actually represents inside a neural network framework. However, I will do my best to simplify the knowledge and relate it to things we are already familiar with, so that everyone can understand and become comfortable with the neural network framework. I will keep the derivations smooth and avoid excessive mathematical formulas and specialized theory, so that you can understand and implement the algorithm in one article and master this knowledge in the most efficient way.

Although many competitions do not restrict the choice of algorithms, more and more award-winning teams are using deep learning, and traditional machine learning algorithms are gradually losing ground. For example, in Problem C of the 2022 American college mathematical modeling contest, the participating teams that used deep learning networks won awards at a very high rate. Artificial intelligence and data mining competitions keep appearing, and the demand for neural network knowledge keeps growing, so it is well worth mastering the various neural network algorithms.

The blogger has focused on modeling for four years, has taken part in dozens of mathematical modeling competitions large and small, and understands the principles of various models, the modeling process for each of them, and various ways of analyzing problems. The purpose of this column is to let readers with zero background quickly make use of various mathematical models, machine learning, deep learning, and code. Each article contains a practical project and runnable code. The blogger keeps up with all kinds of modeling competitions; for each one, the latest ideas and code are written into this column, with detailed explanations and complete code. I hope readers who need it will not miss this carefully prepared column.

The last article was originally intended to wrap up the neural network series, but I forgot to cover how to compute gradients with error backpropagation, as opposed to the numerical differentiation used so far. Numerical differentiation is very time-consuming, which makes each training epoch slow and, for a fixed amount of training time, leads to lower accuracy. If you are comfortable with error backpropagation, there is no need to rely on numerical differentiation, so it is important to master it thoroughly.


As before, we will work through the calculation of backpropagation step by step from first principles, so that the foundation is solid and easy to follow.

1. Implementing ReLU backpropagation

We already have a basic understanding of the ReLU activation function:

The ReLU (Rectified Linear Unit) function is one of the most commonly used nonlinear activation functions in deep learning. It is widely used in neural networks because it is simple and effective, it alleviates the vanishing gradient problem, and it has achieved good results in practice.

The definition of the ReLU function is simple: for any input value x, the output equals x if x is greater than zero, and zero otherwise. The mathematical expression is as follows:

y=\begin{cases} x & \text{ if } x>0 \\ 0 & \text{ if } x\leq 0 \end{cases}

That is to say, if the input to the forward pass is greater than 0, it is passed directly to the next layer; if it is less than or equal to 0, a 0 is passed to the next layer instead.

From the definition above, we can work out the derivative of y with respect to x:

\frac{\partial y}{\partial x}=\begin{cases} 1 & \text{ if } x>0 \\ 0 & \text{ if } x\leq 0 \end{cases}

Then the implementation code of ReLU's backpropagation is:

import numpy as np

class Relu:
    def __init__(self):
        self.x = None

    def forward(self, x):
        # Keep the rectified input so backward() knows where the input was <= 0
        self.x = np.maximum(0, x)
        out = self.x
        return out

    def backward(self, dout):
        # Pass the upstream gradient through where the input was positive,
        # and block it (set it to 0) where the input was <= 0
        dx = dout.copy()
        dx[self.x <= 0] = 0
        return dx
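
Here is a minimal usage sketch (the array values are made up purely for illustration): the backward pass lets the upstream gradient through wherever the forward input was positive and blocks it everywhere else.

relu = Relu()
x = np.array([[1.0, -0.5],
              [-2.0, 3.0]])
out = relu.forward(x)       # [[1. 0.]
                            #  [0. 3.]]
dout = np.ones_like(x)      # pretend the upstream gradient is all ones
dx = relu.backward(dout)    # [[1. 0.]
                            #  [0. 1.]]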

Isn't it easier now to understand that backpropagation is just taking the partial derivative of the original forward computation? Next, let's look at the backpropagation of Sigmoid.

2. Sigmoid backpropagation

We know the formula of the Sigmoid function as:

\sigma (z)=\frac{1}{1+e^{-z}}, which is usually used as the output of binary classification models.

Here is a book for learning neural networks more systematically: Deep Learning and Image Recognition: Principles and Practice.

It gives a very detailed derivation; the Sigmoid computational graph shown here is borrowed from that book:

For backpropagation, we then work through the graph backwards, from right to left:

  1. The rightmost node computes y=\frac{1}{1+exp(-x)} as y=\frac{1}{t} with t=1+e^{-x}. Depending on how well you remember college calculus, its local derivative is \frac{\partial y}{\partial t}=-\frac{1}{t^{2}}=-y^{2}.
  2. In the second step of backpropagation, the upstream value -y^{2} is multiplied by the local derivative of this stage. The "+1" node passes the gradient through unchanged, and the exp node has local derivative e^{-x}, so the accumulated gradient becomes -y^{2}*e^{-x}.
  3. The third step is the x*(-1) node, whose derivative is naturally -1. Therefore the final result is y^{2}*e^{-x}, which equals y(1-y) (as worked out below); multiplied by the upstream gradient, the value passed downstream is dout*y(1-y).
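
Putting the three steps together, the full chain rule reads (a worked version of the computational graph above, using the same symbols):

\frac{\partial y}{\partial x}=(-y^{2})\cdot e^{-x}\cdot (-1)=y^{2}e^{-x}=y^{2}\cdot \frac{1-y}{y}=y(1-y), \text{ since } e^{-x}=\frac{1}{y}-1=\frac{1-y}{y}

This is why the backward pass of Sigmoid only needs the cached output y rather than the original input x.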

Finally, let's implement it in Python:

class _sigmoid:
    def __init__(self):
        self.out = None

    def forward(self, x):
        # Cache the output, since the backward pass only needs y = sigmoid(x)
        out = 1 / (1 + np.exp(-x))
        self.out = out
        return out

    def backward(self, dout):
        # dL/dx = dout * y * (1 - y)
        dx = dout * self.out * (1 - self.out)
        return dx
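
As a quick sanity check (a minimal sketch; the test values are arbitrary), we can compare the analytic gradient from backward with a central-difference estimate, which is the same idea that section 4 applies to the whole network:

layer = _sigmoid()
x = np.array([0.5, -1.0, 2.0])
y = layer.forward(x)
dx = layer.backward(np.ones_like(x))   # analytic gradient with an upstream gradient of 1

eps = 1e-6
numeric = (1/(1 + np.exp(-(x + eps))) - 1/(1 + np.exp(-(x - eps)))) / (2*eps)
print(np.max(np.abs(dx - numeric)))    # should be tiny (close to floating-point precision)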

3. Affine layer

The Affine layer (also known as the fully connected layer or linear layer) plays an important role in neural networks. Its main function is to introduce a linear transformation with learnable weight parameters: in a feed-forward neural network, this layer multiplies the input data by the weights and then adds a bias to produce the output.

Affine layers are usually placed in convolutional or recurrent neural networks as the top output layer before the final prediction. The general form of a fully connected layer is:

y=f(x\cdot W+b), where x is the layer input, W is the weight matrix, b is a bias, and f is a nonlinear activation function (the Affine layer itself computes just the linear part x\cdot W+b).

Note that x is generally a batch of samples, i.e. a matrix; the bias b is broadcast so that it is added to every row of x\cdot W.
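
To make the broadcasting concrete, here is a tiny made-up example (a batch of 2 samples, 3 features, 2 output units) showing that the same bias vector is added to every row of x\cdot W:

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])     # batch of 2 samples, 3 features each
W = np.ones((3, 2)) * 0.1           # 3x2 weight matrix
b = np.array([10.0, 20.0])          # one bias per output unit

out = np.dot(x, W) + b              # b is broadcast and added to each row
print(out)
# [[10.6 20.6]
#  [11.5 21.5]]

With that in mind, the Affine layer's forward and backward passes can be implemented as follows: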

class Affine:
    def __init__(self, W, b):
        self.W = W
        self.b = b
        self.x = None
        self.dW = None
        self.db = None

    def forward(self, x):
        # Cache the input; it is needed to compute dW in the backward pass
        self.x = x
        out = np.dot(x, self.W) + self.b
        return out

    def backward(self, dout):
        # Gradients of the matrix product y = x.W + b
        dx = np.dot(dout, self.W.T)        # gradient w.r.t. the input
        self.dW = np.dot(self.x.T, dout)   # gradient w.r.t. the weights
        self.db = np.sum(dout, axis=0)     # bias gradient: sum over the batch
        return dx
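
The backward formulas used above follow from matrix calculus on y=x\cdot W+b. With the same shapes as the code (x is N\times D, W is D\times H, b has H elements, and dout=\frac{\partial L}{\partial y} is N\times H):

\frac{\partial L}{\partial x}=\frac{\partial L}{\partial y}\cdot W^{T},\qquad \frac{\partial L}{\partial W}=x^{T}\cdot \frac{\partial L}{\partial y},\qquad \frac{\partial L}{\partial b}=\sum_{i=1}^{N}\left ( \frac{\partial L}{\partial y} \right )_{i}

Checking shapes confirms the code: (N\times H)(H\times D) gives N\times D for dx, (D\times N)(N\times H) gives D\times H for dW, and summing over the batch axis gives an H-dimensional vector for db, matching x, W and b respectively.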

4. Comparison of numerical differentiation and error backpropagation

We have now seen two ways of computing gradients: one based on numerical differentiation and the other based on error backpropagation. Numerical differentiation is computationally expensive and slow, so error backpropagation is generally recommended. The code below compares the two on the same batch of data.
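
The comparison relies on a SoftmaxWithLoss layer and a numerical_gradient function that are not reproduced in this article. So that the snippet can run on its own, here is a minimal, textbook-style sketch of what they might look like; treat it as an assumption, not necessarily the exact versions the author used:

def softmax(x):
    x = x - np.max(x, axis=-1, keepdims=True)   # subtract the max for numerical stability
    return np.exp(x) / np.sum(np.exp(x), axis=-1, keepdims=True)

def cross_entropy_error(p, y):
    # p: predicted probabilities, y: one-hot labels (an assumption here)
    batch_size = p.shape[0]
    return -np.sum(y * np.log(p + 1e-7)) / batch_size

class SoftmaxWithLoss:
    def __init__(self):
        self.p = None   # softmax output
        self.y = None   # one-hot labels

    def forward(self, x, y):
        self.y = y
        self.p = softmax(x)
        return cross_entropy_error(self.p, self.y)

    def backward(self, dout=1):
        # Gradient of softmax + cross-entropy: (p - y) averaged over the batch
        batch_size = self.y.shape[0]
        return dout * (self.p - self.y) / batch_size

def numerical_gradient(f, x):
    # Central-difference gradient of f with respect to every element of x
    h = 1e-4
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        tmp = x[idx]
        x[idx] = tmp + h
        fxh1 = f(x)
        x[idx] = tmp - h
        fxh2 = f(x)
        grad[idx] = (fxh1 - fxh2) / (2 * h)
        x[idx] = tmp
        it.iternext()
    return grad

With these helpers in place, the two-layer network and the gradient comparison look like this: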

from collections import OrderedDict
import numpy as np
class TwoLayerNet:
    
    def __init__(self,input_size,hidden_size,output_size,weight_init_std = 0.01):
        # Weights
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size,hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size,output_size)
        self.params['b2'] = np.zeros(output_size)
        
        # Build the layers
        self.layers = OrderedDict()
        self.layers['Affine1'] = Affine(self.params['W1'],self.params['b1'])
        self.layers['Relu1'] = Relu()
        self.layers['Affine2'] = Affine(self.params['W2'],self.params['b2'])
        self.layers['Relu2'] = Relu()
        self.lastLayer = SoftmaxWithLoss()
    
    def predict(self,x):
        for layer in self.layers.values():
            x = layer.forward(x)
            
        return x
    # x: input data, y: label (supervised) data
    def loss(self,x,y):
        p = self.predict(x)
        return self.lastLayer.forward(p,y)
    
    def accuracy(self,x,y):
        p = self.predict(x)
        p = np.argmax(p,axis=1)      # predicted class index for each sample
        if y.ndim != 1:
            y = np.argmax(y,axis=1)  # convert one-hot labels to class indices
            
        accuracy = np.sum(p==y)/float(x.shape[0])
        return accuracy

    # x: input data, y: label (supervised) data
    def numerical_gradient(self,x,y):
        loss_W = lambda W: self.loss(x,y)
        
        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W, self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W, self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W, self.params['b2'])
        
        return grads
    
    def gradient(self , x, y):
        #forward
        self.loss(x,y)
        
        #backward
        dout = 1
        dout = self.lastLayer.backward(dout)
        
        layers = list(self.layers.values())
        layers.reverse()
        for layer in layers:
            dout = layer.backward(dout)
            
        # Collect the gradients stored by each Affine layer during the backward pass
        grads = {}
        grads['W1'], grads['b1'] = self.layers['Affine1'].dW, self.layers['Affine1'].db
        grads['W2'], grads['b2'] = self.layers['Affine2'].dW, self.layers['Affine2'].db
        
        return grads
    
# x_train / y_train are assumed to come from a dataset loader such as MNIST
# (784-dimensional inputs, 10 classes, one-hot labels); the loading code is not shown here
network = TwoLayerNet(input_size = 784,hidden_size = 50 , output_size = 10)
x_batch = x_train[:100]
y_batch = y_train[:100]
grad_numerical = network.numerical_gradient(x_batch,y_batch)
grad_backprop = network.gradient(x_batch,y_batch)

for key in grad_numerical.keys():
    diff = np.average(np.abs(grad_backprop[key]-grad_numerical[key]))
    print(key+":"+str(diff))
    

The differences between the two gradients are very small, which confirms that error backpropagation produces essentially the same result as numerical differentiation, only far more efficiently.

Doesn't it feel powerful? With this, the basic content of neural networks comes to an end. We have built the complete computing framework of a network, from the input layer through forward propagation, weights and biases, activation functions, backpropagation, and back to forward propagation again, and the fundamentals have been covered. We can now move on to studying deep learning networks in more depth, so stay tuned for the next article.

Original article: blog.csdn.net/master_hunter/article/details/132685677