Summary 2 - Deep learning network construction and learning

  1. A shared open-source PyTorch library for medical image segmentation
  2. Useful from a practical perspective: the video "CNN helps you quickly get started with PyTorch and TensorFlow from 0 to 1" and the GitHub page "Convolutional Neural Network"

  3. Understanding and summary of commonly used activation functions (excitation functions)

------------------Pytorch learning--------------

  1. Book "Deep Learning Framework PyTorch: Introduction and Practice", Chapter 4 "Neural Network Toolbox nn"

  2. "Detailed explanation of PyTorch's nn.Linear()", Official website "torch.nn.Linear"
    1) Learn the specific parameters:

    class: torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)

    2) How Linear's in_features and out_features sizes are determined:
         in_features is the size of the input two-dimensional tensor, i.e. the size in the input [batch_size, size].
      out_features is the size of the output two-dimensional tensor, i.e. the output has shape [batch_size, out_features]; it also equals the number of neurons in the fully connected layer.
     In terms of tensor shapes, the layer transforms an input of shape [batch_size, in_features] into an output of shape [batch_size, out_features].
    3) Fully connected layer:

    Fully connected (FC) layers act as the "classifier" in a convolutional neural network. While the convolutional, pooling and activation layers map the raw data into a hidden feature space, the fully connected layer maps the learned "distributed feature representation" into the sample label space. In practice, a fully connected layer can be implemented by a convolution: an FC layer whose previous layer is also fully connected can be converted into a convolution with a 1x1 kernel, while an FC layer whose previous layer is a convolutional layer can be converted into a global convolution whose kernel is h x w, where h and w are the height and width of the previous layer's convolution result (Note 1). Reference articles: "Introduction to CNN: What is a fully connected layer (Fully Connected Layer)?" and "What is the role of fully connected layers?".

    In computer vision there is not much difference between fully connected layers and convolutional layers: a convolutional layer is partially connected spatially and fully connected across channels. If you drop the partial spatial connection and replace the kernel with 1x1, only the full connection across channels remains, and at that point the convolutional layer equals a fully connected layer. This is why the last stage of an ImageNet classifier is pooling + FC: after pooling the feature map is 1x1xC, and the FC layer fully connects the channels to map the features to the labels. Careful readers will also notice that many feature-fusion convolutions use a 1x1 kernel, which likewise plays the role of a fully connected layer.
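    A minimal sketch (my addition, not from the referenced articles) of the equivalence described above: applied to a 1x1xC feature map, a 1x1 convolution computes the same mapping as a fully connected layer over the C channels, provided the two layers share the same weights.

    import torch
    from torch import nn
    
    # 1x1xC feature map (e.g. after global pooling), batch size 2, C = 8 channels
    x = torch.randn(2, 8, 1, 1)
    
    fc = nn.Linear(8, 5)                   # fully connected: 8 channels -> 5 classes
    conv = nn.Conv2d(8, 5, kernel_size=1)  # 1x1 convolution playing the same role
    
    # Copy the FC weights into the 1x1 conv so both layers compute the same mapping
    with torch.no_grad():
        conv.weight.copy_(fc.weight.view(5, 8, 1, 1))
        conv.bias.copy_(fc.bias)
    
    out_fc = fc(x.flatten(1))              # shape [2, 5]
    out_conv = conv(x).flatten(1)          # shape [2, 5]
    print(torch.allclose(out_fc, out_conv, atol=1e-6))  # True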

     
    # Example 1
    import torch
    from torch import nn
    
    # in_features is the size in the input two-dimensional tensor [batch_size, size]
    input = torch.randn(128, 20) # input(batch_size, size)
    m = nn.Linear(20, 30) # Linear(in_features, out_features), equivalent to Linear(size, out_features)
    output = m(input) # output is a 2-D tensor of shape [batch_size, out_features]
    print(output.size()) # torch.Size([128, 30])
    
    # Example 2
    import torch as t
    from torch import nn
    
    # in_features is determined by the shape of the input tensor; out_features determines the shape of the output tensor
    connected_layer = nn.Linear(in_features = 64*64*3, out_features = 1)
    
    # Assume the input image has shape [64, 64, 3]
    input = t.randn(1, 64, 64, 3)
    
    # The 4-D tensor must be flattened into a 2-D tensor before it can be fed to the fully connected layer
    input = input.view(1, 64*64*3)
    print(input.shape) # torch.Size([1, 12288])
    output = connected_layer(input) # call the fully connected layer
    print(output.shape) # torch.Size([1, 1])
    
    # Example 3
    import torch
    from torch import nn
    
    class LeNet(nn.Module):
        def __init__(self):
            super(LeNet, self).__init__()
            # The first argument of Conv2d is the number of input channels,
            # the second is the number of output channels, the third is the kernel size
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.conv2 = nn.Conv2d(6, 16, 5)
            # The previous layer outputs 16 channels and each feature map is 5*5,
            # so the input size of the linear layer is 16*5*5
            self.fc1 = nn.Linear(16*5*5, 120)
            self.fc2 = nn.Linear(120, 84)
            # There are 10 classes in the end, so the last Linear layer outputs 10
            self.fc3 = nn.Linear(84, 10)
            self.pool = nn.MaxPool2d(2, 2)
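    Continuing Example 3, a small sketch (my addition, assuming a 32x32 RGB input, which the original snippet does not state) of why fc1 expects 16*5*5 input features:
    
    x = torch.randn(1, 3, 32, 32)           # assumed input: one 32x32 RGB image
    net = LeNet()
    x = net.pool(torch.relu(net.conv1(x)))  # -> [1, 6, 14, 14]
    x = net.pool(torch.relu(net.conv2(x)))  # -> [1, 16, 5, 5]
    x = x.view(1, -1)                       # -> [1, 400] = [1, 16*5*5]
    x = net.fc3(torch.relu(net.fc2(torch.relu(net.fc1(x)))))
    print(x.shape)                          # torch.Size([1, 10])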
  3. "Detailed explanation of the Conv2d function (PyTorch)", official website "torch.nn.Conv2d".
    Learn the specific parameters and understand how the output size is derived.
    Definition: torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None)
     
    # With square kernels and equal stride
    m = nn.Conv2d(16, 33, 3, stride=2) # nn.Conv2d(Cin, Cout, kernel_size, stride)
    input = torch.randn(20, 16, 50, 100) # torch.Size([20, 16, 50, 100])
    output = m(input) # torch.Size([20, 33, 24, 49])
    
    # Comparing input and output, Conv2d only changes the number of channels and the image size

    Shape: for an input (N, C_in, H_in, W_in) the output is (N, C_out, H_out, W_out), where
        H_out = floor((H_in + 2*padding[0] - dilation[0]*(kernel_size[0] - 1) - 1) / stride[0] + 1)
        W_out = floor((W_in + 2*padding[1] - dilation[1]*(kernel_size[1] - 1) - 1) / stride[1] + 1)
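    A small helper (my addition, not part of the referenced article) that applies the formula above to check the output size of the example:
    
    import math
    
    def conv2d_out_size(h_in, w_in, kernel_size, stride=1, padding=0, dilation=1):
        # Conv2d shape formula for a square kernel, stride, padding and dilation
        h_out = math.floor((h_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
        w_out = math.floor((w_in + 2*padding - dilation*(kernel_size - 1) - 1) / stride + 1)
        return h_out, w_out
    
    print(conv2d_out_size(50, 100, kernel_size=3, stride=2))  # (24, 49)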

  4. "pytorch avg_pool2d", official website "torch.nn.AvgPool2d". Walk through the example to get familiar with the specific parameters of average pooling.
    Definition: average pooling is a downsampling operation; only the three parameters kernel_size, stride and padding need attention. N and C stay unchanged, and only the changes in image size H and W matter.
    torch.nn.AvgPool2d(kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True, divisor_override=None)
    Shape: for an input (N, C, H_in, W_in) the output is (N, C, H_out, W_out), where
        H_out = floor((H_in + 2*padding[0] - kernel_size[0]) / stride[0] + 1)
        W_out = floor((W_in + 2*padding[1] - kernel_size[1]) / stride[1] + 1)

     Code example:
    # pool of square window of size=3, stride=2
    m = nn.AvgPool2d(3, stride=2)
    input = torch.randn(20, 16, 50, 32) # torch.Size([20, 16, 50, 32])
    output = m(input) # torch.Size([20, 16, 24, 15])
    
    # Between input and output, N and C are unchanged; only the image size H, W changes.
  5. "torch.nn.MaxPool2d detailed explanation", official website "torch.nn.MaxPool2d".
    Function: take the maximum of the feature points in a neighborhood. This reduces the shift of the estimated mean caused by convolution-layer parameter errors, and therefore retains more texture information.
    Definition: torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)
    Parameters:
    - kernel_size (int or tuple) – size of the max pooling window
    - stride (int or tuple, optional) – stride of the pooling window; the default value is kernel_size
    - padding (int or tuple, optional) – number of zero-padding layers added to each input edge
    - dilation (int or tuple, optional) – controls the stride of elements within the window
    - return_indices – if True, the indices of the maximum values are also returned, which is helpful for upsampling operations (see the sketch after the code example)
    - ceil_mode – if True, the output size is computed with ceiling instead of the default floor
    Shape: for an input (N, C, H_in, W_in) the output is (N, C, H_out, W_out), where
        H_out = floor((H_in + 2*padding[0] - dilation[0]*(kernel_size[0] - 1) - 1) / stride[0] + 1)
        W_out = floor((W_in + 2*padding[1] - dilation[1]*(kernel_size[1] - 1) - 1) / stride[1] + 1)
    Code example:
    # pool of square window of size=3, stride=2
    m = nn.MaxPool2d(3, stride=2)
    # pool of non-square window
    m = nn.MaxPool2d((3, 2), stride=(2, 1))
    input = torch.randn(20, 16, 50, 32)
    output = m(input)
    
    
    input.shape
    # Out[33]: torch.Size([20, 16, 50, 32])
    output.shape
    # Out[34]: torch.Size([20, 16, 24, 31])
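    A minimal sketch (my addition, not from the referenced article) of the return_indices option listed above: the indices returned by max pooling can be passed to nn.MaxUnpool2d for upsampling.
    
    import torch
    from torch import nn
    
    pool = nn.MaxPool2d(2, stride=2, return_indices=True)
    unpool = nn.MaxUnpool2d(2, stride=2)
    
    x = torch.randn(1, 1, 4, 4)
    y, indices = pool(x)        # y: [1, 1, 2, 2], indices of the max positions
    up = unpool(y, indices)     # [1, 1, 4, 4]: max values placed back, zeros elsewhere
    print(up.shape)             # torch.Size([1, 1, 4, 4])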
    
    

     
  6. Batch normalization.
    In machine learning, data is normalized before model training so that its distribution is consistent. When training a deep neural network, each step usually trains on one batch rather than on all the data. Different batches have different distributions, which causes the internal covariate shift problem: the data distribution changes during training, making it harder for the next layer of the network to learn.

    Batch Normalization brings the data back to a distribution with mean 0 and variance 1 (normalization). On the one hand this keeps the data distribution consistent; on the other hand it helps avoid vanishing and exploding gradients. Batch Normalization makes the input follow the same distribution - but which features of the input should follow the same distribution? The mean and variance of all pixel values of the same channel are made the same across images. For example, given two 3-channel images and considering only the R channel, we want the mean of the R channel of the first image to equal the mean of the R channel of the second image, and likewise for the variance (a verification sketch follows at the end of this item).

    To reduce the internal covariate shift it would seem enough to normalize each layer of the neural network, so that the output of every layer has mean 0 and variance 1 and follows a normal distribution. But then there is a problem: if the data distribution of every layer is a standard normal distribution, the network cannot learn the characteristics of the input data at all, because the feature distribution learned with great effort is normalized away. Directly normalizing each layer is therefore clearly unreasonable. With a slight modification, however - adding trainable parameters to the normalization - we obtain exactly what Batch Norm implements.

    This article explains the whole process: "Pytorch: nn.BatchNorm2d() Function".
    Descriptions of each variant: "pytorch - nn.BatchNorm1d()", "BatchNorm1d, BatchNorm2d, BatchNorm3d in pytorch", "nn.BatchNorm1d".

    For code that verifies the theoretical process, refer to "[PyTorch] Explanation of the BatchNorm2d() function in pytorch".
    Definition:
    torch.nn.BatchNorm2d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)

    >>> # With Learnable Parameters
    >>> m = nn.BatchNorm2d(100)
    >>> input = torch.randn(20, 100, 35, 45)
    >>> output = m(input)


    Advantages:

    ① It not only greatly improves the training speed but also greatly accelerates convergence;

    ② It can also improve the classification accuracy. One explanation is that it acts as a regularizer similar to Dropout and prevents over-fitting, so equivalent results can be achieved without Dropout;

    ③ In addition, tuning becomes much simpler: the initialization requirements are less strict and a large learning rate can be used.
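    A minimal sketch (my addition, not from the referenced articles) of the per-channel behaviour described above: in training mode, nn.BatchNorm2d normalizes each channel with the batch statistics computed over the (N, H, W) dimensions, which can be reproduced by hand (assuming the default affine parameters, i.e. weight=1 and bias=0 right after initialization).
    
    import torch
    from torch import nn
    
    x = torch.randn(4, 3, 8, 8)    # N, C, H, W
    bn = nn.BatchNorm2d(3)         # weight=1, bias=0 at initialization
    out = bn(x)                    # training mode: uses the batch statistics
    
    # Manual per-channel normalization over the (N, H, W) dimensions
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    manual = (x - mean) / torch.sqrt(var + bn.eps)
    
    print(torch.allclose(out, manual, atol=1e-5))  # True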

  7. nn.Dropout:
    During forward propagation, the activation of each neuron is set to zero ("stops working") with probability p. This makes the model generalize better, because it cannot rely too heavily on particular local features, so the network is less prone to overfitting.
    For test code, refer to "The Difference between torch.nn.dropout and torch.nn.dropout2d"; for practical usage, refer to "Dropout of pytorch".
    Definition:
    torch.nn.Dropout(p=0.5, inplace=False)
    >>> m = nn.Dropout(p=0.2)
    >>> input = torch.randn(20, 16)
    >>> output = m(input)
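    A minimal sketch (my addition, not from the referenced articles) showing that Dropout is only active in training mode; calling .eval() turns it into an identity mapping:
    
    import torch
    from torch import nn
    
    m = nn.Dropout(p=0.5)
    x = torch.ones(1, 8)
    
    m.train()
    print(m(x))   # roughly half the entries are zeroed, survivors are scaled by 1/(1-p) = 2
    
    m.eval()
    print(m(x))   # identity: all ones, dropout is disabled at inference time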

  8. Activation function:
    1) Principle:
    The activation function plays a very important role in enabling an artificial neural network to learn and represent very complex, non-linear functions: it introduces non-linearity into the network. In a neuron, the inputs are weighted and summed and a function is then applied to the result - that function is the activation function. It is introduced to add non-linearity to the neural network model. Without an activation function, each layer is equivalent to a matrix multiplication, and even after stacking several layers it is still nothing more than matrix multiplication.
           If no activation function is used (equivalently, the activation is f(x) = x), the input of every node in a layer is a linear function of the previous layer's output. It is easy to verify that, no matter how many layers the neural network has, the output is then a linear combination of the inputs - equivalent to having no hidden layers at all, i.e. the most primitive perceptron - so the approximation ability of the network is quite limited. Precisely for this reason, a non-linear function is introduced as the activation, which makes a deep neural network far more expressive (it is no longer a linear combination of the inputs but can approximate almost any function).
    2) Parameter explanation: refer to the official website "nn.ReLU".
         Example reference: "About nn.ReLU function".
    3) torch.nn.ReLU: ReLU stands for rectified linear unit. nn.ReLU() sets input values less than 0 to 0 and leaves values greater than 0 unchanged.


    4) torch.nn.Sigmoid: maps values to the range (0, 1).

    5) torch.nn.Tanh: maps a real-valued input to the range [-1, 1].

    6) For a summary of the differences, refer to "Detailed explanation of activation functions (Sigmoid/Tanh/ReLU/Leaky ReLU, etc.)". A short comparison sketch follows at the end of this item.

    The activation function introduces nonlinear factors into the neural network. Through the activation function neural network, various curves can be fitted. Activation functions are mainly divided into saturated activation functions (Saturated Neurons) and non-saturated functions (One-sided Saturations). Sigmoid and Tanh are saturated activation functions, while ReLU and its variants are non-saturated activation functions. The main advantages of non-saturated activation functions are as follows:
    1. Non-saturated activation functions can solve the vanishing gradient problem.
    2. Non-saturated activation function can speed up convergence.
     

    1. Sigmoid easily leads to the vanishing gradient problem, and saturated neurons make it worse: if the value fed into the Sigmoid is very large or very small, the corresponding gradient is approximately 0. Even if the gradient passed back from the previous step is large, the gradients of the neuron's weight (w) and bias will still approach 0, so the parameters cannot be updated effectively.
    2. It is computationally expensive: neural network training repeatedly evaluates the Sigmoid, which requires computing exponentials and increases the time cost.
    3. The Sigmoid function is not zero-centered (its output is not symmetric about the origin).
    4. The Tanh activation function solves the zero-centering problem: its output is symmetric about the origin.

    5. The ReLU activation function was proposed to solve the vanishing gradient problem. The gradient of ReLU can only take two values: 0 or 1. When the input is less than 0, the gradient is 0; when the input is greater than 0, the gradient is 1. The advantage is that the continuous multiplication of ReLU gradients will not converge to 0, and the result of the continuous multiplication can only take on two values: 0 or 1. If the value is 1, the gradient remains unchanged and forward propagates; if the value is 0, the gradient stops forward propagation from this position.
    6. One-sided saturation like ReLU's also makes neurons more robust to noise.
    7. The ReLU activation function is also computationally efficient: compared with computing the gradient of the Sigmoid function, the gradient of ReLU is only 0 or 1. In addition, ReLU truncates negative values to 0, which introduces sparsity into the network and further improves efficiency.
    8. Although ReLU's sparsity improves computational efficiency, it can also hinder training. The input to the activation function usually carries a bias term; if the bias becomes so small that the input to the activation function is always negative, the gradient flowing back through that point is always 0 and the corresponding weights and bias can no longer be updated. If the activation input is negative for every sample, the neuron can no longer learn - this is called the neuron "death" problem.
    9. Leaky ReLU can solve the neuron "death" problem.



    Leaky ReLU was proposed to solve the neuron "death" problem. It is very similar to ReLU; the only difference is the part where the input is less than 0: ReLU outputs 0 there, while Leaky ReLU outputs a negative value with a slight slope (function graph (d) in the referenced article).

    The advantage of Leaky ReLU is that during backpropagation a gradient can still be computed for inputs below zero (instead of being 0 as with ReLU), so the gradient does not vanish in that region.

    10. Gradient error
    During neural network training, the direction and magnitude of the gradient are computed and used to update the network weights in the right direction by the right amount. In deep networks or recurrent neural networks, gradient errors can accumulate during the updates and become very large; this in turn causes huge updates of the network weights and makes the network unstable. In extreme cases the weight values become so large that they overflow and turn into NaN - the gradient explosion phenomenon.

    Gradient explosion arises from repeatedly multiplying gradients through the network layers (when their values are greater than 1.0), which makes them grow exponentially.
    11. Gradient explosion phenomena

    Obvious phenomena:
    1. The model cannot fit the training data, e.g. the loss stays poor.
    2. The model is unstable and the loss changes greatly with each update.
    3. The model loss becomes NaN during the training process

    There are also some less obvious phenomena:
    1. Model weights change greatly quickly during training.
    2. The model weight becomes NaN during the training process.
    3. The gradient error of each node and layer during training is always higher than 1.0.
    12. How to solve gradient explosion

    1. Redesign the neural network
    Reduce the number of network layers, reduce the batch size, and use truncation.
    2. Use LSTM
    3. Use gradient clipping
    e.g. clipnorm=1.0 or clipvalue=0.5 in Keras; a PyTorch sketch follows after this list
    4. Use weight regularization
    L1 & L2
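    A minimal PyTorch sketch (my addition; the model, loss and data are placeholders) of gradient clipping applied between backward() and step():
    
    import torch
    from torch import nn
    
    model = nn.Linear(10, 1)                        # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.MSELoss()
    
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # placeholder data
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    
    # Clip by norm (analogous to clipnorm) or by value (analogous to clipvalue)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    # torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
    
    optimizer.step()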

    13. How to choose activation function

    1. Please use the Sigmoid function with caution except in binary classification problems.
    2. You can try Tanh, but in most cases its effect will not be as good as ReLU and Maxout.
    3. If you don’t know which activation function should be used, then please choose ReLU first.
    4. If you use ReLU, pay attention to the Dead ReLU problem: choose the learning rate carefully to avoid large gradients that leave too many neurons "dead".
    5. If the Dead ReLU problem occurs, you can try leaky ReLU, ELU and other ReLU variants, which may have good results.
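    The short comparison sketch mentioned in 6) above (my addition), applying the activations discussed here to the same inputs:
    
    import torch
    from torch import nn
    
    x = torch.linspace(-3, 3, 7)    # [-3, -2, -1, 0, 1, 2, 3]
    
    print(nn.ReLU()(x))             # negatives -> 0, positives unchanged
    print(nn.LeakyReLU(0.1)(x))     # negatives scaled by 0.1 instead of zeroed
    print(nn.Sigmoid()(x))          # squashed into (0, 1), saturates at both ends
    print(nn.Tanh()(x))             # squashed into (-1, 1), zero-centered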

  9. torch.nn.Sequential: the network modules are initialized in order and the forward function is implemented automatically. You must make sure that the output size of each module matches the input size of the next module. For a detailed explanation, refer to "pytorch Series 7 ----- nn.Sequential Explanation".
     
    from collections import OrderedDict
    net3= nn.Sequential(OrderedDict([
              ('conv1', nn.Conv2d(3, 3, 3)),
              ('bn1', nn.BatchNorm2d(3)),
              ('relu1', nn.ReLU())
            ]))
    print('net3:', net3)

    Output:
    net3: Sequential(
      (conv1): Conv2d(3, 3, kernel_size=(3, 3), stride=(1, 1))
      (bn1): BatchNorm2d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu1): ReLU()
    )

  10. nn.ModuleList: compared with Sequential, ModuleList has no built-in forward function, no ordering requirement, and no requirement that consecutive modules chain together. ModuleList is a subclass of Module; when used inside a Module it is automatically recognized as a submodule. A small sketch follows below.
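    A minimal sketch (my addition) of using nn.ModuleList inside a Module; unlike Sequential, the forward pass has to iterate over the list explicitly:
    
    import torch
    from torch import nn
    
    class MLP(nn.Module):
        def __init__(self):
            super().__init__()
            # Layers registered through ModuleList are recognized as submodules
            self.layers = nn.ModuleList([nn.Linear(10, 10) for _ in range(3)])
    
        def forward(self, x):
            # We choose the order of execution ourselves
            for layer in self.layers:
                x = torch.relu(layer(x))
            return x
    
    print(MLP()(torch.randn(2, 10)).shape)  # torch.Size([2, 10])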
  11. Loss function
    1) Comparison of different loss functions: refer to the article "Summary of Loss Functions (pytorch)". Cross entropy is generally used as the loss function for classification problems.
    Cross entropy mainly describes the distance between the actual output (probability) and the expected output (probability): the smaller the cross entropy, the closer the two probability distributions are. Assume the probability distribution p is the expected output, the probability distribution q is the actual output, and H(p, q) is the cross entropy. The CrossEntropyLoss() function in PyTorch then uses the following calculation formula:
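    The formula below is written out to match the manual verification code later in this item (x is the raw network output for one sample, class is its target label, and reduction='mean' averages over the N samples):
    
    loss(x, class) = -x[class] + log( sum_j exp(x[j]) )
    total_loss = (1/N) * sum_i loss(x_i, class_i)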



    2) Calculation method in nn.CrossEntropyLoss():
    The last step reduces the losses of the multiple outputs:
    If reduction='mean', the mean is taken.
    If reduction='sum', they are summed.

    import torch
    import torch.nn as nn
    import math
    criterion = nn.CrossEntropyLoss()
    output = torch.randn(3, 5, requires_grad=True)
    label = torch.empty(3, dtype=torch.long).random_(5)
    loss = criterion(output, label)
    print("网络输出为3个5类:")
    print(output)
    print("要计算loss的类别:")
    print(label)
    print("计算loss的结果:")
    print(loss)
    first = [0, 0, 0]
    for i in range(3):
        first[i] = -output[i][label[i]]
    second = [0, 0, 0]
    for i in range(3):
        for j in range(5):
            second[i] += math.exp(output[i][j])
    res = 0
    for i in range(3):
        res += (first[i] + math.log(second[i]))
    print("自己的计算结果:")
    print(res/3)
    
    ##########################
    Network output: 3 samples, 5 classes each:
    tensor([[ 1.1499, -0.2208, -0.8943, -1.5002,  0.3065],
            [-0.0155,  0.7495,  0.4617,  0.5376,  1.2006],
            [-0.3464,  0.7741, -1.9237,  0.0350,  0.5038]], requires_grad=True)
    Target classes for the loss:
    tensor([1, 0, 4])
    Result of the loss computation:
    tensor(1.8443, grad_fn=<NllLossBackward0>)
    Manual computation result:
    tensor(1.8443, grad_fn=<DivBackward0>)


    nn.CrossEntropyLoss example:
    # Example of target with class probabilities
    import torch as t
    from torch import nn
    
    loss = nn.CrossEntropyLoss()
    input = t.randn(3, 5, requires_grad=True)
    target = t.randn(3, 5).softmax(dim=1)
    output = loss(input, target)
    output.backward()
    
    # batch_size=3, compute the score for each class (only two classes here)
    score = t.randn(3, 2)
    # the three samples belong to classes 1, 0, 1; the label must be a LongTensor
    label = t.Tensor([1, 0, 1]).long()
    
    # the loss behaves no differently from an ordinary layer
    criterion = nn.CrossEntropyLoss()
    loss = criterion(score, label)
    loss

  12. Optimizer torch.optim: for usage, refer to the tutorial "Optimizer: torch.optim".
    1) For a comparison of four commonly used optimizers, refer to "The four commonly used optimizers in Pytorch: SGD, Momentum, RMSProp, and Adam".
    The author recommends:
        a) The RMSProp algorithm has been empirically proven to be an effective and practical deep neural network optimization algorithm. It is currently one of the optimization methods often adopted by deep learning practitioners.
        b) SGD is the most common optimizer and can be said to have no acceleration; Momentum is an improved version of SGD that adds the momentum principle; RMSProp is an upgraded version of Momentum; and Adam is an upgraded version of RMSProp.
        c) In actual operation, Adam is recommended as the default algorithm, which is generally better than RMSProp.
    In order to verify the performance of the four algorithms, the same network was optimized in pytorch and the changes in the loss functions of the four algorithms over time were compared. The code is as follows:
    opt_SGD=torch.optim.SGD(net_SGD.parameters(),lr=LR)
    opt_Momentum=torch.optim.SGD(net_Momentum.parameters(),lr=LR,momentum=0.8)
    opt_RMSprop=torch.optim.RMSprop(net_RMSprop.parameters(),lr=LR,alpha=0.9)
    opt_Adam=torch.optim.Adam(net_Adam.parameters(),lr=LR,betas=(0.9,0.99))
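    In the snippet above, LR and the net_* networks come from the referenced article. A minimal sketch (my addition; the model and data are placeholders) of how one of these optimizers is used in a training step:
    
    import torch
    from torch import nn
    
    net = nn.Linear(10, 1)                                  # placeholder network
    optimizer = torch.optim.Adam(net.parameters(), lr=0.01)
    criterion = nn.MSELoss()
    
    x, y = torch.randn(32, 10), torch.randn(32, 1)          # placeholder batch
    for step in range(100):
        optimizer.zero_grad()        # clear the gradients accumulated by the last step
        loss = criterion(net(x), y)  # forward pass and loss
        loss.backward()              # backpropagate
        optimizer.step()             # update the parameters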


    2) How to adaptively adjust the learning rate: refer to "PyTorch Notes (18) - Use of torch.optim Optimizer"; a scheduler sketch follows below.
    3) Interpretation of torch.optim through its source code: "PyTorch source code interpretation of torch.optim: Detailed explanation of the optimization algorithm interface"; "PyTorch: Introduction to the 6 optimizers and optimization algorithms of torch.optim".
    4) "Distinguishing between different optimizers".
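    A minimal sketch (my addition) of adjusting the learning rate with torch.optim.lr_scheduler; StepLR multiplies the learning rate by gamma every step_size epochs:
    
    import torch
    from torch import nn
    
    net = nn.Linear(10, 1)                                   # placeholder network
    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    
    for epoch in range(30):
        # ... training loop for one epoch goes here ...
        optimizer.step()             # update parameters (after loss.backward() in a real loop)
        scheduler.step()             # then update the learning rate once per epoch
    
    print(optimizer.param_groups[0]['lr'])  # 0.1 * 0.5**3 = 0.0125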
  13. Initialization strategy torch.nn.init: the module parameters of nn.Module in PyTorch use reasonable default initialization strategies, so we generally do not need to worry about them. When we use Parameter directly, custom initialization becomes especially important, because t.Tensor() returns whatever values happen to be in memory, possibly including extreme values, which would cause overflow or vanishing gradients when training the actual network. The nn.init module in PyTorch is designed specifically for initialization.
    For initializing tensor distributions, refer to the article "torch.nn.init official user manual summary" and the official website "torch.nn.init".
    Example:
    import torch 
    import torch.nn as nn
    
    w = torch.empty(3,5)
    
    # 1. Uniform distribution - U(a, b)
    # torch.nn.init.uniform_(tensor, a=0.0, b=1.0)
    print(nn.init.uniform_(w))
    # =======================================================
    # tensor([[0.9160, 0.1832, 0.5278, 0.5480, 0.6754],
    #         [0.9509, 0.8325, 0.9149, 0.8192, 0.9950],
    #         [0.4847, 0.4148, 0.8161, 0.0948, 0.3787]])
    # =======================================================
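    A minimal sketch (my addition, not from the referenced manual) of applying nn.init functions to the weights of an existing network via Module.apply:
    
    import torch
    from torch import nn
    
    def init_weights(m):
        # Initialize Linear layers with Xavier-uniform weights and zero bias
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)
            nn.init.zeros_(m.bias)
    
    net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
    net.apply(init_weights)   # apply() visits every submodule recursively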

  14. In-depth analysis of nn.Module:
    If you want to understand nn.Module more deeply, it is necessary to study its principles. First, let’s take a look at the constructor of the nn.Module base class:
    def __init__(self):
        self._parameters = OrderedDict()
        self._modules = OrderedDict()
        self._buffers = OrderedDict()
        self._backward_hooks = OrderedDict()
        self._forward_hooks = OrderedDict()
        self.training = True
    The attributes are explained as follows:
    _parameters: a dictionary holding parameters set directly by the user. self.param1 = nn.Parameter(t.randn(3, 3)) is detected and adds the key 'param1' to the dictionary, with the Parameter as its value. Parameters inside self.submodule = nn.Linear(3, 4) are not stored here.
    _modules: submodules; the submodule created by self.submodel = nn.Linear(3, 4) is saved here.
    _buffers: caches, e.g. BatchNorm's momentum mechanism needs the result of the previous forward propagation in each new forward propagation.
    _backward_hooks and _forward_hooks: hook technology used to extract intermediate variables, similar to variable hooks.
    training: BatchNorm and Dropout layers use different strategies in the training and testing phases; the forward-propagation strategy is chosen by checking the training flag.
    For the three dictionaries _parameters, _modules and _buffers above, a key can be accessed as self.key, which is equivalent to self._parameters['key'] (see the sketch below).
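    A minimal sketch (my addition) that prints these dictionaries to confirm where parameters and submodules end up:
    
    import torch as t
    from torch import nn
    
    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.param1 = nn.Parameter(t.randn(3, 3))  # goes into _parameters
            self.submodel = nn.Linear(3, 4)            # goes into _modules
    
    net = Net()
    print(list(net._parameters.keys()))             # ['param1']
    print(list(net._modules.keys()))                # ['submodel']
    print(net.param1 is net._parameters['param1'])  # True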


     
  15. "Pytorch - Gradient Calculation"The lecture is good

 

 

 

 

Source: blog.csdn.net/qimo601/article/details/121285195