Multi-layer neural networks and a code derivation of backpropagation

I found another study note from a long time ago sitting in my drafts folder. I hope it helps anyone who needs it~

Table of contents

Preface

Construction of multi-layer fully connected neural network

(1) input -> (affine_forward) -> out* -> (relu_forward) -> out: full connection and ReLU activation

(2) Batch normalization

(3) Dropout (random deactivation)

(4) A neural network of arbitrary depth

Neural network optimization - gradient descent updates used during training

1. SGD with momentum

2. RMSProp

3. Adam


Preface

       Previously I had been working in C++ on traditional image segmentation algorithms, mainly clustering-based segmentation, level sets, and graph cuts. You are welcome to discuss and learn together.

       I have just started the CS231n course, and since I happen to be learning Python at the same time, I also did some hands-on exercises to deepen my understanding of the models.

       Course link

       1. This is my own study note. I refer to other people's content; if there is any infringement, please contact me and I will delete it.

       2. The code is based on WILL and Duke, with a lot of my own study notes added.

       3. I won't explain some of the basics, but I will link to blog posts that I think explain them well.

       4. Since I have not used NumPy much before and am not very familiar with Python, this also doubles as a study note on Python and the NumPy module.

       5. References for this article

       Preface to this chapter: this chapter implements multi-layer fully connected neural networks together with optimization techniques such as batch normalization, SGD+Momentum, and Adam. The focus of this chapter is backpropagation and the optimization algorithms.

       Code written in Jupyter needs to be downloaded as a .py file before it can be imported. If the .py file is modified after being imported, Jupyter normally has to be restarted, which is troublesome. Adding the following code after the import means Jupyter no longer needs to be restarted after the .py file changes.

# automatically reload external modules
%reload_ext autoreload
%autoreload 2

Construction of multi-layer fully connected neural network

       What was implemented previously was a two-layer neural network with the structure input -> hidden -> relu -> score -> softmax -> output. With so few layers, the derivation there works backward step by step from the output. That is not realistic for multi-layer networks: as the number of layers grows, pushing the gradient through layer by layer by hand becomes far too tedious. In practice, a modular backpropagation derivation is used instead.

       The multi-layer fully connected network structure is divided into the following modules:

(1) input -> (affine_forward) -> out* -> (relu_forward) -> out: full connection and ReLU activation

Next is the code to implement forward propagation:

def affine_forward(x,w,b):
    out = None
    x_reshape = np.reshape(x,(x.shape[0],-1))
    out = x_reshape.dot(w) + b 
    cache = (x,w,b)
    return out,cache     # return the linear output and the cached inputs (x,w,b)


def relu_forward(x):
    out = np.maximum(0,x)
    cache = x     # cache the linear (pre-activation) output a
    return out,cache


# modular combination: affine followed by ReLU
def affine_relu_forward(x,w,b):
    a,fc_cache = affine_forward(x,w,b)   # a is the linear output; fc_cache stores (x,w,b)
    out,relu_cache = relu_forward(a)     # relu_cache stores the linear output a
    cache = (fc_cache,relu_cache)        # cache tuple: ((x,w,b),(a))
    return out,cache                     # return the activation out and the cache ((x,w,b),(a))

Now that we have the forward-propagation modules, we also need the corresponding backpropagation modules:

def affine_backward(dout,cache):
    """
    Backward pass for the affine (fully connected) layer.
    dout: gradient of this layer's forward output out, i.e. the dz coming from softmax_loss / relu_backward
    cache: tuple with the input x that flowed into this layer and the layer parameters (w,b)
    """
    z,w,b = cache    # z is the previous layer's activation, i.e. this layer's input
    dx,dw,db = None, None,None
    x_reshape = np.reshape(z, (z.shape[0],-1))
    dz = np.reshape(dout.dot(w.T),z.shape)    # see the formula: dx = dout . w^T, reshaped back to x's shape
    dw = (x_reshape.T).dot(dout)              # see the formula: dw = x^T . dout
    db = np.sum(dout,axis=0)                  # see the formula: db = sum of dout over the batch
    return dz,dw,db

def relu_backward(dout,cache):    # cache holds the pre-activation input x
    """
    ReLU backward: the local gradient is 0 where the input was <= 0 and 1 where it was > 0
    """
    dx,x = None, cache
    dx = (x>0) * dout
    return dx

def affine_relu_backward(dout,cache):
    fc_cache, relu_cache = cache    # relu_cache stores the linear output a
    da = relu_backward(dout,relu_cache)    # gradient through relu; a composite function: z=relu(a), a=w1x+b1 -> dz/dx = dz/da * da/dx
    dx,dw,db = affine_backward(da,fc_cache)
    return dx,dw,db
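
A quick way to sanity-check these modules is a numerical gradient check. Here is a minimal sketch, assuming the functions above are in scope; the small num_grad helper is written here just for the check and is not part of the original code.

import numpy as np

def num_grad(f, x, h=1e-5):
    # centered finite-difference gradient of the scalar function f at x
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        ix = it.multi_index
        old = x[ix]
        x[ix] = old + h; fp = f(x)
        x[ix] = old - h; fm = f(x)
        x[ix] = old
        grad[ix] = (fp - fm) / (2*h)
        it.iternext()
    return grad

np.random.seed(0)
x = np.random.randn(4, 6); w = np.random.randn(6, 5); b = np.random.randn(5)
dout = np.random.randn(4, 5)

out, cache = affine_relu_forward(x, w, b)
dx, dw, db = affine_relu_backward(dout, cache)

# the numerical gradient of sum(out * dout) w.r.t. w should match the analytic dw
dw_num = num_grad(lambda w_: np.sum(affine_relu_forward(x, w_, b)[0] * dout), w)
print(np.max(np.abs(dw - dw_num)))   # should be tiny, around 1e-9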

(2) Batch normalization

       Batch normalization is said to reduce the impact of random weight initialization, accelerate convergence, allow a somewhat larger learning rate, and reduce overfitting, so that a lower dropout rate and a smaller L2 regularization coefficient can be used.

First, the input to the layer is normalized so that each feature has mean 0 and variance 1, i.e. it follows a standard Gaussian distribution. The normalized data is then scaled and shifted (transformed and reconstructed) so that the original feature distribution can be recovered if needed.
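
Written out, this is the standard batch-normalization transform (the symbols match the code below; $\gamma$ and $\beta$ are the learnable scale and shift):

\mu = \frac{1}{N}\sum_{i=1}^{N} x_i ,\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2 ,\qquad
\hat{x}_i = \frac{x_i-\mu}{\sqrt{\sigma^2+\epsilon}} ,\qquad
y_i = \gamma\,\hat{x}_i + \beta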

       The network structure then becomes: input -> affine_forward -> BN (batch_normalize) -> relu_forward, i.e. BN is inserted after the fully connected layer and before the activation.

# batch normalization (speeds up convergence, allows a somewhat larger learning rate,
# reduces overfitting, allows a lower dropout rate and a smaller L2 regularization coefficient)
def batchnorm_forward(x,gamma,beta,bn_param):
    mode = bn_param['mode']
    eps = bn_param.get('eps',1e-5)          # avoid division by zero
    momentum = bn_param.get('momentum',0.9) 
    
    N,D = x.shape            # N: number of samples, D: number of features
    # running mean and variance, updated continuously during training
    running_mean = bn_param.get('running_mean',np.zeros(D, dtype = x.dtype))
    running_var = bn_param.get('running_var',np.zeros(D, dtype = x.dtype))
    
    out,cache=None,None
    if mode =='train':
        sample_mean = np.mean(x,axis=0)
        sample_var = np.var(x,axis = 0)
        x_hat = (x-sample_mean)/(np.sqrt(sample_var+eps))
        
        out = gamma*x_hat +beta
        cache = (x,sample_mean,sample_var,x_hat,eps,gamma,beta)
        running_mean = momentum*running_mean + (1-momentum)*sample_mean   # note: decay the old running_mean (the original had running_var here by mistake)
        running_var = momentum*running_var+(1-momentum)*sample_var
    elif mode == 'test':
        out = (x-running_mean)*gamma/(np.sqrt(running_var+eps))+beta
    else:
        raise ValueError('invalid forward batchnorm mode "%s"' %mode)
    
    bn_param['running_mean'] = running_mean
    bn_param['running_var'] = running_var
    
    return out,cache   # cache: (input, mean, variance, normalized value, eps, gamma, beta); out: the scaled-and-shifted value


def affine_bn_relu_forward(x,w,b,gamma,beta,bn_param):
    a,fc_cache = affine_forward(x,w,b)
    a_bn, bn_cache = batchnorm_forward(a,gamma,beta,bn_param)  # BN layer
    out,relu_cache = relu_forward(a_bn)    # activate the normalized values; relu_cache stores the scaled-and-shifted value
    cache = (fc_cache,bn_cache,relu_cache)
    return out,cache
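
As a quick sanity check (a small sketch added here, assuming batchnorm_forward above is in scope): with gamma = 1 and beta = 0, the per-feature mean and variance of the BN output in train mode should be approximately 0 and 1.

import numpy as np

np.random.seed(1)
x = 10 * np.random.randn(200, 5) + 7          # deliberately far from zero mean / unit variance
gamma, beta = np.ones(5), np.zeros(5)
out, _ = batchnorm_forward(x, gamma, beta, {'mode': 'train'})
print(out.mean(axis=0))   # approximately 0
print(out.var(axis=0))    # approximately 1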

With forward propagation in place, backpropagation naturally follows. To be honest, I did not re-derive the formula here and mostly copied the code, because the chain-rule principle behind backpropagation is the same as before.
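
For reference, the gradients computed below are the standard batch-normalization backward formulas (a derivation sketch added here for clarity), where $y_i = \gamma\hat{x}_i + \beta$ is the BN output and $\partial L/\partial y_i$ corresponds to dout in the code:

\frac{\partial L}{\partial \gamma} = \sum_i \frac{\partial L}{\partial y_i}\,\hat{x}_i ,\qquad
\frac{\partial L}{\partial \beta} = \sum_i \frac{\partial L}{\partial y_i} ,\qquad
\frac{\partial L}{\partial \hat{x}_i} = \frac{\partial L}{\partial y_i}\,\gamma

\frac{\partial L}{\partial \sigma^2} = -\tfrac{1}{2}\,(\sigma^2+\epsilon)^{-3/2}\sum_i \frac{\partial L}{\partial \hat{x}_i}\,(x_i-\mu) ,\qquad
\frac{\partial L}{\partial \mu} = -\frac{1}{\sqrt{\sigma^2+\epsilon}}\sum_i \frac{\partial L}{\partial \hat{x}_i} + \frac{\partial L}{\partial \sigma^2}\cdot\frac{1}{N}\sum_i -2\,(x_i-\mu)

\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial \hat{x}_i}\,\frac{1}{\sqrt{\sigma^2+\epsilon}} + \frac{\partial L}{\partial \sigma^2}\,\frac{2\,(x_i-\mu)}{N} + \frac{\partial L}{\partial \mu}\,\frac{1}{N}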

def batchnorm_backward(dout,cache):
    x,mean,var,x_hat,eps,gamma,beta = cache
    N = x.shape[0]
    dgamma = np.sum(dout*x_hat,axis=0)            # dL/dgamma
    dbeta = np.sum(dout*1, axis=0)                # dL/dbeta
    dx_hat = dout*gamma                           # dL/dx_hat
    dx_hat_numerator = dx_hat/np.sqrt(var +eps)
    dx_hat_denominator = np.sum(dx_hat * (x-mean),axis=0)
    dx_1 = dx_hat_numerator                       # direct path through x_hat
    dvar = -0.5*((var+eps)**(-1.5))*dx_hat_denominator    # dL/dvar
    dmean = -1.0*np.sum(dx_hat_numerator,axis = 0)+dvar*np.mean(-2.0*(x-mean),axis=0)   # dL/dmean
    dx_var = dvar*2.0/N*(x-mean)                  # path through the variance
    dx_mean =dmean*1.0/N                          # path through the mean
    dx = dx_1+dx_var+dx_mean
    
    return dx,dgamma,dbeta


def affine_bn_relu_backward(dout,cache):
    fc_cache,bn_cache,relu_cache = cache
    da_bn = relu_backward(dout,relu_cache)
    da,dgamma,dbeta = batchnorm_backward(da_bn,bn_cache)
    dx,dw,db = affine_backward(da,fc_cache)
    return dx,dw,db,dgamma,dbeta

(3) Dropout (random deactivation)

In a fully connected network, the deeper the layers and the more parameters there are, the higher the training-set accuracy becomes; however, performance on the validation set suffers because of overfitting. In layman's terms, in order to fit the training set the network extracts too many features, and many of them are not actually that useful. Dropout randomly disables neurons, i.e. the contribution of certain features is cancelled, which prevents overfitting to some extent.

# dropout (random deactivation)
def dropout_forward(x,dropout_param):
    """
    dropout_param['p']: probability of dropping a neuron
    """
    p,mode = dropout_param['p'],dropout_param['mode']
    if 'seed' in dropout_param:
        np.random.seed(dropout_param['seed'])
        
    mask = None
    out = None
    
    if mode == 'train':
        keep_prob = 1-p 
        mask = (np.random.rand(*x.shape)<keep_prob)/keep_prob   # inverted dropout: divide by keep_prob so the expected activation stays the same
        out = mask * x 
    elif mode == 'test':
        out = x
        
    cache = (dropout_param,mask)
    out = out.astype(x.dtype,copy=False)
    return out,cache
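
A quick check (a small sketch assuming dropout_forward above is in scope): in train mode roughly a fraction p of the activations are dropped, and because of the 1/keep_prob scaling the mean of the output stays close to the mean of the input.

import numpy as np

np.random.seed(2)
x = np.random.rand(500, 500) + 10
out_train, _ = dropout_forward(x, {'mode': 'train', 'p': 0.3})
out_test, _ = dropout_forward(x, {'mode': 'test', 'p': 0.3})
print((out_train == 0).mean())                        # about 0.3 of the activations are dropped
print(x.mean(), out_train.mean(), out_test.mean())    # all roughly equal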

Backpropagation: the principle is similar to ReLU. Dropout is implemented with a scaled probability mask: a mask entry is 1/keep_prob if the neuron is kept and 0 if it is dropped, so the backward pass simply multiplies the upstream gradient by the same mask.

def dropout_backward(dout,cache):
    dropout_param,mask = cache
    mode = dropout_param['mode']
    
    dx = None
    if mode == 'train':
        dx = mask * dout
    elif mode == 'test':
        dx = dout
    return dx 

(4) A neural network of arbitrary depth

       With all the modules built, the neural network itself can be assembled in a modular way. It is divided into three parts in total: parameter initialization, loss-and-gradient computation, and training. Only the first two parts are covered here for now.

Forward propagation (with BN and dropout): input -> (affine -> BN -> relu -> dropout) repeated for each hidden layer -> affine_forward -> softmax -> loss

Backpropagation (with BN and dropout): softmax_loss -> affine_backward -> (dropout_backward -> affine_bn_relu_backward) repeated for each hidden layer
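
The class below imports everything from layers and relies on a softmax_loss(scores, y) helper that is not shown in this note. For completeness, a standard implementation (following the CS231n assignments) looks roughly like this:

def softmax_loss(x, y):
    # x: scores of shape (N, C); y: integer labels of shape (N,)
    shifted = x - np.max(x, axis=1, keepdims=True)        # shift for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted), axis=1, keepdims=True))
    probs = np.exp(log_probs)
    N = x.shape[0]
    loss = -np.sum(log_probs[np.arange(N), y]) / N        # average cross-entropy loss
    dx = probs.copy()
    dx[np.arange(N), y] -= 1                              # gradient of the loss w.r.t. the scores
    dx /= N
    return loss, dx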

from layers import *
import numpy as np


class FullyConnectedNet(object):
    def __init__(self
                 ,hidden_dims            # list; its length is the number of hidden layers, each value the number of neurons
                 ,input_dim = 3*32*32    # 3072 input neurons
                 ,num_classes = 10       # 10 output classes
                 ,dropout = 0            # dropout disabled by default; valid values lie in (0,1)
                 ,use_batchnorm = False  # batch normalization disabled by default
                 ,reg = 0.0              # no L2 regularization by default
                 ,weight_scale =1e-2     # standard deviation for weight initialization
                 ,dtype=np.float64       # numerical precision
                 ,seed = None            # no random seed by default; controls the dropout layers
                ):
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout>0     # dropout of 0 disables random deactivation
        self.reg = reg                   # regularization strength
        self.num_layers = 1+len(hidden_dims)
        
        self.dtype =dtype
        self.params = {}                 # parameter dictionary
        
        in_dim = input_dim
        # one W per hidden layer, plus one more W for the output layer
        for i,h_dim in enumerate(hidden_dims):
            self.params['W%d' %(i+1,)] = weight_scale*np.random.randn(in_dim,h_dim)
            self.params['b%d' %(i+1,)] = np.zeros((h_dim,))
            if use_batchnorm:
                # batch normalization parameters
                self.params['gamma%d' %(i+1,)] = np.ones((h_dim,))
                self.params['beta%d'  %(i+1,)] = np.zeros((h_dim,))
            in_dim = h_dim   # pass this hidden layer's feature count on to the next layer
        
        # output layer parameters
        self.params['W%d'%(self.num_layers,)] = weight_scale*np.random.randn(in_dim,num_classes)
        self.params['b%d'%(self.num_layers,)] = np.zeros((num_classes,))
        
        # dropout
        self.dropout_param = {}   # dropout parameter dictionary
        if self.use_dropout:      # if dropout is in (0,1), enable it
            self.dropout_param = {'mode':'train','p':dropout}
        
        if seed is not None:
            self.dropout_param['seed'] = seed
            
        # batch normalization
        self.bn_params = []  # list of BN parameter dicts
        if self.use_batchnorm:   # if BN is enabled, set each layer's mode to train
            self.bn_params=[{'mode':'train'} for i in range(self.num_layers - 1)]
        
        # cast all parameters to the requested precision (np.float64 by default)
        for k,v in self.params.items():
            self.params[k] = v.astype(dtype)
    
    
    def loss(self,X,y = None):
        # adjust precision
        # X has shape (N, 3*32*32)
        # y has shape (N,)
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'
        
        if self.dropout_param is not None:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_params in self.bn_params:
                bn_params['mode'] = mode
        
        scores = None
        
        
        
        # forward pass
        fc_mix_cache = {}       # caches for the combined (affine/BN/relu) forward layers
        if self.use_dropout:    # dropout enabled
            dp_cache = {}       # caches for the dropout forward layers
            
        out = X
        # forward pass through the hidden layers only; the output layer is a separate affine layer
        for i in range(self.num_layers -1):
            w = self.params['W%d'%(i+1,)]
            b = self.params['b%d'%(i+1,)]
            if self.use_batchnorm:
                # forward pass through the affine+BN+relu module
                gamma = self.params['gamma%d'%(i+1,)]
                beta = self.params['beta%d'%(i+1,)]
                out,fc_mix_cache[i] = affine_bn_relu_forward(out,w,b,gamma,beta,self.bn_params[i])
            else:
                out,fc_mix_cache[i] = affine_relu_forward(out,w,b)
            if self.use_dropout:
                # dropout enabled: also record the dropout cache
                out,dp_cache[i] = dropout_forward(out,self.dropout_param)
        
        # output layer forward pass
        w = self.params['W%d'%(self.num_layers,)]
        b = self.params['b%d'%(self.num_layers,)]
        out,out_cache = affine_forward(out,w,b)
        scores = out 
        
        if mode == 'test':
            return scores
        
        # backward pass
        loss,grads=0.0, {}
        # softmax loss
        loss,dout = softmax_loss(scores,y)
        # regularization loss; only the output layer's W is added here (hidden layers are added in the loop below)
        loss += 0.5*self.reg*np.sum(self.params['W%d'%(self.num_layers,)]**2)
        
        # output layer backward pass; store the gradients in the dictionary
        dout,dw,db = affine_backward(dout,out_cache)
        # add the regularization gradient
        grads['W%d'%(self.num_layers,)] = dw+self.reg*self.params['W%d'%(self.num_layers,)]
        grads['b%d'%(self.num_layers,)] = db
        
        # backpropagation through the hidden layers
        for i in range(self.num_layers-1):
            ri = self.num_layers -2 - i  # hidden-layer index, walking from the last hidden layer back to the first
            loss+=0.5*self.reg*np.sum(self.params['W%d'%(ri+1,)]**2)    # keep accumulating the regularization loss
            if self.use_dropout:     # if dropout is used, backprop through the dropout layer first
                dout = dropout_backward(dout,dp_cache[ri])
            if self.use_batchnorm:   # if BN is used, backprop through the affine+BN+relu module
                dout,dw,db,dgamma,dbeta = affine_bn_relu_backward(dout,fc_mix_cache[ri])
                grads['gamma%d'%(ri+1,)] = dgamma
                grads['beta%d'%(ri+1,)] = dbeta
            else:             # otherwise backprop through the plain affine+relu module
                dout,dw,db = affine_relu_backward(dout,fc_mix_cache[ri])
                # store the gradients in the dictionary
            grads['W%d'%(ri+1,)] = dw+self.reg * self.params['W%d'%(ri+1,)]
            grads['b%d'%(ri+1,)] = db 
            # return this pass's loss and gradients
        return loss,grads
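
As a quick usage sketch (added here; it assumes softmax_loss and the layer functions above are available through layers.py), you can instantiate the network on fake data and check that loss() returns (loss, grads) in train mode and class scores in test mode:

import numpy as np

np.random.seed(3)
N, D, C = 4, 3*32*32, 10
X = np.random.randn(N, D)
y = np.random.randint(C, size=N)

model = FullyConnectedNet([100, 50], dropout=0.25, use_batchnorm=True, reg=1e-2)
loss, grads = model.loss(X, y)    # train mode: returns the loss and a gradient for every parameter
scores = model.loss(X)            # test mode (y=None): returns class scores of shape (N, 10)
print(loss, scores.shape, sorted(grads.keys())[:4])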


Neural network optimization - gradient descent updates used during training

1. SGD with momentum

       The loss function above outputs the current loss value and the gradients of the model parameters. During gradient descent, i.e. during training, the model parameters are updated in the negative gradient direction.

       Plain stochastic gradient descent (SGD) was used before: w -= learning_rate * dW.
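
For reference, plain SGD written in the same config-dict style as the momentum update below (a minimal sketch, not part of the original code):

def sgd(w, dw, config=None):
    # vanilla SGD: step in the negative gradient direction
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    next_w = w - config['learning_rate'] * dw
    return next_w, config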

       SGD + momentum (the momentum update for stochastic gradient descent): v = mu * v - learning_rate * dW, then w += v. My personal understanding: originally the step depended only on the gradient, but with this update the parameters carry their own velocity, and velocity cannot change instantaneously. Think of the gradient as a force that gradually changes the magnitude and direction of that velocity.

def sgd_momentum(w,dw,config = None):
    if config is None:
        config = {}
    config.setdefault('learning_rate',1e-2)
    config.setdefault('momentum',0.9)
    v = config.get('velocity',np.zeros_like(w))   # velocity, initialized to zero on the first call
    
    next_w = None
    
    v = config['momentum']*v - config['learning_rate']*dw   # decay the old velocity and add the gradient step
    next_w = w + v 
    
    config['velocity'] = v 
    return next_w,config

2. RMSProp
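
A minimal sketch of the RMSProp update, in the same config-dict style as sgd_momentum above (added for reference): RMSProp keeps a decaying moving average of squared gradients and scales the step by its inverse square root.

def rmsprop(w, dw, config=None):
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    config.setdefault('decay_rate', 0.99)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('cache', np.zeros_like(w))
    
    # moving average of squared gradients
    config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * dw**2
    next_w = w - config['learning_rate'] * dw / (np.sqrt(config['cache']) + config['epsilon'])
    return next_w, config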

3. Adam
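
Likewise, a minimal sketch of Adam (added for reference): it combines a momentum-style moving average of the gradient with RMSProp-style scaling of the step, plus bias correction for the first few iterations.

def adam(w, dw, config=None):
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(w))   # first moment (momentum term)
    config.setdefault('v', np.zeros_like(w))   # second moment (RMSProp-style cache)
    config.setdefault('t', 0)                  # time step, used for bias correction
    
    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dw
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * dw**2
    m_hat = config['m'] / (1 - config['beta1']**config['t'])   # bias-corrected first moment
    v_hat = config['v'] / (1 - config['beta2']**config['t'])   # bias-corrected second moment
    next_w = w - config['learning_rate'] * m_hat / (np.sqrt(v_hat) + config['epsilon'])
    return next_w, config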


Instantiating neural networks and training

Reference link: Solver on page 52 of the coursework


Origin blog.csdn.net/qq_41828351/article/details/90257272