Neural Network Study Notes 1 - ResNet Residual Network, Batch Normalization Understanding and Code


Foreword

Intuitively, more complex features have stronger representational power. In a deep network, each layer transforms its input through successive linear and nonlinear computations, so the deeper a layer is, the stronger the features it outputs. Network depth is therefore crucial for learning expressive features; that is, the deeper the network structure (more complex, more parameters), the stronger its expressive ability. VGGNet illustrates this well.

In deep models, the size of each layer's output feature map usually changes with network depth: the height and width shrink while the channel depth grows. Reducing the height and width helps cut the amount of computation, while increasing the channel depth gives each layer more features to output.

The first problem with increasing depth is exploding/vanishing gradients. As the number of layers grows, the backpropagated gradient is repeatedly multiplied through the layers and becomes unstable, ending up extremely large or extremely small; in practice, vanishing gradients are the more common failure. Before residual networks were proposed, network structures could not be made very deep: the convolutional network reached 19 layers in VGG and 22 layers in GoogLeNet. A second problem is degradation: as the layer count grows, the training loss first decreases gradually and then saturates, and adding still more depth makes the training loss increase again.

The ResNet residual network is built mainly from residual blocks. Once residual blocks are introduced, the network can be made very deep and its accuracy improves. In short, ResNet is designed to solve the degradation problem in deep networks, i.e., the phenomenon where adding more layers makes performance on the dataset worse.
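In formula form (from the original ResNet paper), a residual block computes y = F(x) + x: the stacked layers learn only the residual mapping F(x) = H(x) - x rather than the full desired mapping H(x), while the identity shortcut carries x across unchanged.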


1. Structure overview

1. ResNet18 as a diagram:

2. ResNet as a table:

[figure: ResNet per-layer configuration table]

2. Structure walkthrough


1. Overall

Comparing the ResNet18 diagram with the 18-layer column of the table, the structure breaks down as follows (a shape trace in code follows the list):

  1. conv1: one 7x7 convolution with 64 kernels
  2. conv2_x: a 3x3 max-pooling downsampling layer, then two residual blocks, each stacking two 3x3x64 convolutions
  3. conv3_x: two residual blocks, each stacking two 3x3x128 convolutions
  4. conv4_x: two residual blocks, each stacking two 3x3x256 convolutions
  5. conv5_x: two residual blocks, each stacking two 3x3x512 convolutions
  6. 1+1+1: an average-pooling downsampling layer, a fully connected layer, and softmax
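As a quick check of the stage shapes, here is a minimal sketch; it assumes torchvision is installed and borrows its reference resnet18 purely for verification (on older torchvision versions, use pretrained=False instead of weights=None):

import torch
from torchvision.models import resnet18

model = resnet18(weights=None)
x = torch.randn(1, 3, 224, 224)   # assumed 224x224 RGB input
x = model.conv1(x)     # -> [1, 64, 112, 112]  (7x7 conv, stride 2)
x = model.bn1(x)
x = model.relu(x)
x = model.maxpool(x)   # -> [1, 64, 56, 56]    (3x3 max pool, stride 2)
x = model.layer1(x)    # -> [1, 64, 56, 56]    (conv2_x)
x = model.layer2(x)    # -> [1, 128, 28, 28]   (conv3_x)
x = model.layer3(x)    # -> [1, 256, 14, 14]   (conv4_x)
x = model.layer4(x)    # -> [1, 512, 7, 7]     (conv5_x)
x = model.avgpool(x)   # -> [1, 512, 1, 1]
print(torch.flatten(x, 1).shape)  # torch.Size([1, 512]), fed to the fully connected layer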

2. Residual structure

The forward propagation through a residual block splits into two paths:
the first is the main (convolutional) path, which passes through several convolutional layers;
the second is the shortcut connection, which bypasses the main path entirely. The feature maps output by the two paths are then added together element-wise.

[figure: residual block with main path and shortcut connection]

In either case, the two paths must produce outputs of the same size so they can be added; when the first block of a stage changes the feature-map size, the shortcut must be adjusted accordingly (and the sum is then passed through a nonlinearity), so that input and output sizes match.

1. Take resnet18 as an example:

[figure: ResNet18 residual structures: identity block (left), downsampling block (right)]

Conv(n) --> Conv(n+1) (right structure):
the main path's input is [56,56,64] --> output is [28,28,128]:

  1. A 3x3 convolution with 128 kernels and stride=2 halves the height and width from 56 to 28 and changes the channel count from 64 to 128.
  2. ReLU activation.
  3. A second 3x3 convolution with 128 kernels and stride=1 repeats feature extraction without changing the shape.

The shortcut connection's input is [56,56,64] --> (1x1 convolution, 128 kernels) --> output is [28,28,128]:

  1. A 1x1 convolution with 128 kernels and stride=2 halves the height and width from 56 to 28 and changes the channel count from 64 to 128.
  2. The result is added to the output of the main path (a code sketch of this block follows).
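A minimal PyTorch sketch of this downsampling block (shapes taken from the walkthrough above; the layer names are illustrative, not from the original code):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)   # input [56,56,64]

main = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False),  # halves H,W; 64 -> 128
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1, bias=False),
    nn.BatchNorm2d(128),
)

shortcut = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False),  # 1x1, stride 2: matches main's output shape
    nn.BatchNorm2d(128),
)

out = torch.relu(main(x) + shortcut(x))
print(out.shape)  # torch.Size([1, 128, 28, 28])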

The block with an identity shortcut (left structure):
main path:

  1. A 3x3 convolution with 64 kernels and stride=1 leaves the height, width, and channel count unchanged.
  2. ReLU activation.
  3. A second 3x3 convolution with 64 kernels and stride=1 repeats feature extraction.

Shortcut connection:

  1. The input is added directly to the output of the main path.


2. Take resnet50 as an example:

[figure: ResNet50 bottleneck structures: identity block (left), downsampling block (right)]
Conv(n) --> Conv(n+1) (right structure):
the main path's input is [56,56,256] --> output is [28,28,512]:

  1. A 1x1 convolution with 128 kernels and stride=1 keeps the height and width unchanged and reduces the channel count from 256 to 128.
  2. ReLU activation.
  3. A 3x3 convolution with 128 kernels and stride=2 halves the height and width from 56 to 28; the channel count stays at 128.
  4. ReLU activation.
  5. A 1x1 convolution with 512 kernels and stride=1 keeps the height and width unchanged and raises the channel count from 128 to 512.

The shortcut connection's input is [56,56,256] --> (1x1 convolution, 512 kernels) --> output is [28,28,512]:

  1. A 1x1 convolution with 512 kernels and stride=2 halves the height and width from 56 to 28 and changes the channel count from 256 to 512.
  2. The result is added to the output of the main path (see the sketch after this list).
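The corresponding sketch for the bottleneck downsampling block (again with assumed shapes and illustrative names):

import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)   # input [56,56,256]

main = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=1, stride=1, bias=False),  # 256 -> 128
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, stride=2, padding=1, bias=False),  # halves H,W
    nn.BatchNorm2d(128),
    nn.ReLU(),
    nn.Conv2d(128, 512, kernel_size=1, stride=1, bias=False),  # 128 -> 512
    nn.BatchNorm2d(512),
)

shortcut = nn.Sequential(
    nn.Conv2d(256, 512, kernel_size=1, stride=2, bias=False),
    nn.BatchNorm2d(512),
)

out = torch.relu(main(x) + shortcut(x))
print(out.shape)  # torch.Size([1, 512, 28, 28])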

The block with an identity shortcut (left structure):
main path:

  1. A 1x1 convolution with 64 kernels and stride=1 keeps the height and width unchanged and reduces the channel count from 256 to 64.
  2. ReLU activation.
  3. A 3x3 convolution with 64 kernels and stride=1 leaves the height, width, and channel count unchanged.
  4. ReLU activation.
  5. A 1x1 convolution with 256 kernels and stride=1 keeps the height and width unchanged and raises the channel count from 64 to 256.

Shortcut connection:

  1. The input is added directly to the output of the main path.



3. Batch Normalization

1. Preliminary understanding

Refer to the blog.
A feature map is what a convolution kernel produces: convolving the original image with different kernels yields different feature maps.

When an image is fed into the network, preprocessing is usually applied so that the input follows a certain distribution, which speeds up feature extraction. After a convolution produces a feature map, however, that feature map no longer necessarily follows the required distribution.

The purpose of Batch Normalization is to make the feature maps of a batch of data follow a distribution with mean 0 and variance 1. Note that the statistics are not computed over a single image's feature map but over the feature maps of the whole batch (per channel), because BN needs the mean and variance of the entire batch.

The values BN computes are given by the formulas below.
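For reference, these are the standard formulas from the original Batch Normalization paper (Ioffe & Szegedy, 2015). For a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$:

$$\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_{\mathcal{B}}^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^{2}$$

$$\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^{2} + \varepsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

where $\gamma$ and $\beta$ are learnable scale and shift parameters and $\varepsilon$ is a small constant for numerical stability. In BatchNorm2d the statistics are computed per channel, over the batch and spatial dimensions.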

2. Usage notes

Points to note when using BN:
(1) During training, the training flag should be True so that BN computes statistics from each batch; during validation/inference it should be False so that BN uses the accumulated statistics. In PyTorch this is controlled with the model's model.train() and model.eval() methods.

(2) Set the batch size as large as possible; performance may degrade when it is small. The larger the batch, the closer the batch mean and variance are to the mean and variance of the whole training set.

(3) Place the BN layer between the convolutional layer (Conv) and the activation layer (e.g., ReLU), and do not give that convolutional layer a bias: BN subtracts the per-channel mean, so the bias is cancelled out and the result is the same with or without it (see the sketch below).
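A minimal sketch of points (1) and (3): Conv without bias, then BN, then ReLU, plus switching BN behavior with train()/eval() (the layer sizes here are illustrative):

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(16),
    nn.ReLU(),
)

block.train()   # BN normalizes with batch statistics and updates its running estimates
y = block(torch.randn(8, 3, 32, 32))

block.eval()    # BN normalizes with the stored running mean/variance
with torch.no_grad():
    y = block(torch.randn(1, 3, 32, 32))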

4. Python code

Combining ResNet and Batch Normalization, here is a PyTorch version of the network model:

import torch.nn as nn
import torch



class BasicBlock(nn.Module):
    # Basic residual block used by ResNet18/34: two 3x3 convolutions.
    # expansion = 1: the block's output channels equal its nominal channel count.
    expansion = 1

    def __init__(self,in_channel,out_channel,stride=1,downsample=None):
        super(BasicBlock,self).__init__()

        self.conv1 = nn.Conv2d(in_channels=in_channel,out_channels=out_channel,
                               kernel_size=3,stride=stride,padding=1,bias=False)

        self.bn1 = nn.BatchNorm2d(out_channel)

        self.relu = nn.ReLU()

        self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,
                               kernel_size=3, stride=1, padding=1, bias=False)

        self.bn2= nn.BatchNorm2d(out_channel)

        self.downsample = downsample

    def forward(self,x):
        identity = x

        # Apply the 1x1 downsample conv to the shortcut only when it exists
        # (i.e., when this block changes the feature-map size or channel count).
        if self.downsample is not None:
            identity = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        out += identity
        out = self.relu(out)

        return out



class Bottleneck(nn.Module):
    # Bottleneck block used by ResNet50/101/152: 1x1 -> 3x3 -> 1x1 convolutions.
    # expansion = 4: the final 1x1 conv outputs 4x the nominal channel count.
    expansion = 4

    def __init__(self,in_channel,out_channel,stride=1,downsample=None):
        super(Bottleneck,self).__init__()

        self.conv1 = nn.Conv2d(in_channels=in_channel,out_channels=out_channel,
                               kernel_size=1,stride=1,bias=False)

        self.bn1 = nn.BatchNorm2d(out_channel)

        self.conv2 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel,
                               kernel_size=3, stride=stride, bias=False,padding=1)

        self.bn2 = nn.BatchNorm2d(out_channel)

        self.conv3 = nn.Conv2d(in_channels=out_channel, out_channels=out_channel*self.expansion,
                               kernel_size=1, stride=1, bias=False)

        self.bn3 = nn.BatchNorm2d(out_channel*self.expansion)

        self.relu = nn.ReLU(inplace=True)

        self.downsample = downsample

    def forward(self,x):
        identity = x

        if self.downsample is not None:
            identity = self.downsample(x)

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        out += identity
        out = self.relu(out)

        return out




class ResNet(nn.Module):

    def __init__(self,block,blocks_num,num_classes=1000,include_top=True):
        super(ResNet,self).__init__()

        self.include_top = include_top

        self.in_channel = 64

        self.conv1 = nn.Conv2d(3,self.in_channel,kernel_size=7,stride=2,
                               padding=3,bias=False)

        self.bn1 = nn.BatchNorm2d(self.in_channel)

        self.relu = nn.ReLU(inplace=True)

        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = self._make_layer(block, 64, blocks_num[0])
        self.layer2 = self._make_layer(block, 128, blocks_num[1], stride=2)
        self.layer3 = self._make_layer(block, 256, blocks_num[2], stride=2)
        self.layer4 = self._make_layer(block, 512, blocks_num[3], stride=2)

        if self.include_top:
            self.avgpool = nn.AdaptiveAvgPool2d((1,1))
            self.fc = nn.Linear(512*block.expansion,num_classes)

        for m in self.modules():
            if isinstance(m,nn.Conv2d):
                nn.init.kaiming_normal_(m.weight,mode='fan_out',nonlinearity='relu')

    def _make_layer(self,block,channel,blocks_num,stride=1):
        downsample = None
        if stride != 1 or self.in_channel != channel*block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channel,channel*block.expansion,kernel_size=1,stride=stride,bias=False),
                nn.BatchNorm2d(channel*block.expansion))

        layers = []
        layers.append(block(self.in_channel,channel,downsample=downsample,stride=stride))

        self.in_channel = channel*block.expansion

        for _ in range(1,blocks_num):
            layers.append(block(self.in_channel,channel))

        return nn.Sequential(*layers)

    def forward(self,x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)

        if self.include_top:
            x = self.avgpool(x)
            x = torch.flatten(x,1)
            x = self.fc(x)

        return x



def resnet18(num_classes=1000, include_top=True):
    return ResNet(BasicBlock, [2, 2, 2, 2], num_classes=num_classes, include_top=include_top)



def resnet50(num_classes=1000, include_top=True):
    return ResNet(Bottleneck, [3, 4, 6, 3], num_classes=num_classes, include_top=include_top)
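
A quick sanity check of the two constructors (a usage sketch, assuming 224x224 RGB inputs):

net18 = resnet18(num_classes=1000)
net50 = resnet50(num_classes=1000)
x = torch.randn(1, 3, 224, 224)
print(net18(x).shape)  # torch.Size([1, 1000])
print(net50(x).shape)  # torch.Size([1, 1000])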

Source: blog.csdn.net/qq_45848817/article/details/127023232