[OUC Deep Learning Introduction] Week 3 Learning Record: ResNet+ResNeXt

Part1 ResNet

1 Paper reading

It is well known that the deeper a network is, the harder it is to train. To address this problem, the paper proposes a residual learning module. This module is easy to optimize and allows networks to reach high accuracy even at very large depths.

Training deep networks mainly faces two problems: one is vanishing/exploding gradients, the other is network degradation. The former can largely be solved by normalization and by adjusting the optimizer and learning rate; the latter means that as the network gets deeper, the training accuracy saturates and then degrades, and this degradation is not caused by overfitting.

Therefore, starting from the identity mapping, the paper proposes a residual learning module to solve the degradation problem of deep networks: each module can either perform a nonlinear transformation to obtain a better result, or fall back to an identity mapping so that the result stays at least as good as before.

ResNet mainly has the following characteristics:

  1. Residual representation
  2. Shortcut connections (to realize the identity mapping)

2 Network structure

Highlights of the network:

  • Very deep structure (breaking through 1,000 layers)
  • Residual modules
  • Dropout is discarded; Batch Normalization is used to speed up training

Problems faced by traditional deep network structures:

  • Vanishing or exploding gradients
  • The degradation problem

Residual structure:

The plain residual block (BasicBlock) is used for the shallower networks (ResNet18/34), while the Bottleneck block is used for ResNet50/101/152. The output feature maps of the main branch and of the shortcut must have the same shape so that they can be added.

The 1×1 convolutions in the Bottleneck block first reduce the channel dimension and then expand it back, which greatly reduces the number of parameters.
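
As a quick sanity check of the parameter savings (a minimal sketch; the channel sizes 256 → 64 → 64 → 256 follow the bottleneck design, and the baseline of two 3×3 convolutions at full width is chosen only for illustration):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Bottleneck main branch: 1x1 reduce -> 3x3 -> 1x1 expand (256 -> 64 -> 64 -> 256)
bottleneck = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1, bias=False),
    nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(64, 256, kernel_size=1, bias=False),
)

# Two plain 3x3 convolutions operating directly on 256 channels
plain = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(256, 256, kernel_size=3, padding=1, bias=False),
)

print(n_params(bottleneck))  # 69632
print(n_params(plain))       # 1179648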

In the ResNet architecture diagram, a dotted shortcut indicates a residual block that changes the feature map dimensions; a 1×1 convolution is therefore added on the shortcut branch to match the output dimensions.

The ResNet in the original paper differs slightly from the official PyTorch implementation: on the main branch of the dotted Bottleneck block, the original paper uses stride 2 for the 1×1 convolution and stride 1 for the 3×3 convolution, while the official PyTorch implementation uses stride 1 for the 1×1 convolution and stride 2 for the 3×3 convolution, which slightly improves accuracy.

Batch Normalization (BN): adjusts the feature maps within a batch so that each channel follows a distribution with mean 0 and variance 1. The mean and variance are vectors whose length equals the channel depth; they are computed as statistics during forward propagation, while γ and β are learned during backpropagation. (Reference: Detailed explanation of Batch Normalization and pytorch experiment_Sunflower's Mung Bean Blog-CSDN Blog_batchnormalization pytorch)
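
In formula form: for each channel, the forward pass computes the batch mean μ_B and variance σ_B², normalizes x̂ = (x − μ_B) / sqrt(σ_B² + ε) (ε is a small constant for numerical stability), and outputs y = γ·x̂ + β, where γ and β are the learnable scale and shift.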

 

When using BN, pay attention to the following (a short usage sketch follows the list):

  • Use training=True during training and training=False during validation/inference; in PyTorch this is controlled via the model's model.train() and model.eval() methods
  • The larger the batch size, the closer the batch mean and variance are to those of the whole training set
  • It is recommended to place the BN layer between the convolutional layer (Conv) and the activation layer (e.g. ReLU), and the convolutional layer should not use a bias
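
A minimal sketch of these three points (the tensor sizes are arbitrary and chosen only for illustration):

import torch
import torch.nn as nn

# Conv -> BN -> ReLU; the conv uses bias=False because BN's beta already provides a shift
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(inplace=True),
)

x = torch.randn(8, 3, 32, 32)   # the larger the batch, the better the batch statistics

block.train()                   # BN normalizes with the current batch's mean/var and updates its running estimates
out_train = block(x)

block.eval()                    # BN switches to the accumulated running mean/var
with torch.no_grad():
    out_eval = block(x)

print(torch.allclose(out_train, out_eval))  # usually False: the two modes normalize differently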

Advantages of transfer learning: good results can be trained quickly, and even with a small dataset the desired effect can be achieved (note that the same preprocessing as in pre-training must be used).

Common transfer-learning strategies: load the pre-trained weights and train all parameters; load the weights and train only the last few layers; load the weights and append new fully connected layers, training only those.
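
For example, the last two strategies can be sketched as follows (this uses torchvision's ResNet34 rather than the model built below; pretrained=True is the older torchvision API, newer versions use the weights= argument instead):

import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet34(pretrained=True)        # load ImageNet pre-trained weights

# Freeze the backbone (training all parameters instead would be the first strategy)
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head for a 2-class problem (e.g. cat vs. dog)
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head's parameters are passed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)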

3 Build ResNet based on PyTorch

Code link: (colab) Build ResNet based on PyTorch

from torch.nn.modules.batchnorm import BatchNorm2d
import torch
import torch.nn as nn


# Residual block used by the shallower ResNets (18/34)
class BasicBlock(nn.Module):
  expansion = 1 # expansion factor: how much the number of kernels changes within the main branch
  def __init__(self,in_channel,out_channel,stride=1,downsample=None):
    super(BasicBlock,self).__init__()
    self.conv1 = nn.Conv2d(in_channels=in_channel,out_channels=out_channel,
                kernel_size=3,stride=stride,padding=1,bias=False)
    self.bn1 = nn.BatchNorm2d(out_channel)
    self.relu = nn.ReLU()
    self.conv2 = nn.Conv2d(in_channels=out_channel,out_channels=out_channel,
                kernel_size=3,stride=1,padding=1,bias=False)
    self.bn2 = nn.BatchNorm2d(out_channel)
    self.downsample = downsample

  def forward(self,x):
    identity = x
    # if a downsample module is passed in, this is a dashed (projection) shortcut block
    if self.downsample is not None:
      identity = self.downsample(x)

    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)

    out = self.conv2(out)
    out = self.bn2(out)

    out += identity
    out = self.relu(out)

    return out


# Bottleneck residual block used by the deeper ResNets (50/101/152)
class Bottleneck(nn.Module):
  expansion = 4
  def __init__(self,in_channel,out_channel,stride=1,downsample=None):
    super(Bottleneck,self).__init__()
    # 1x1 convolution: reduce the channel dimension
    self.conv1 = nn.Conv2d(in_channels=in_channel,out_channels=out_channel,
                kernel_size=1,stride=1,bias=False)
    self.bn1 = nn.BatchNorm2d(out_channel)

    self.conv2 = nn.Conv2d(in_channels=out_channel,out_channels=out_channel,
                kernel_size=3,stride=stride,padding=1,bias=False)
    self.bn2 = nn.BatchNorm2d(out_channel)

    # 1x1 convolution: expand the channel dimension back
    self.conv3 = nn.Conv2d(in_channels=out_channel,out_channels=out_channel*self.expansion,
                kernel_size=1,stride=1,bias=False)
    self.bn3 = nn.BatchNorm2d(out_channel*self.expansion)

    self.relu = nn.ReLU(inplace=True)
    self.downsample = downsample

  def forward(self,x):
    identity = x
    # if a downsample module is passed in, this is a dashed (projection) shortcut block
    if self.downsample is not None:
      identity = self.downsample(x)

    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)

    out = self.conv2(out)
    out = self.bn2(out)
    out = self.relu(out)

    out = self.conv3(out)
    out = self.bn3(out)

    out += identity
    out = self.relu(out)

    return out



# ResNet
class ResNet(nn.Module):
  def __init__(self,block,blocks_num,num_classes=1000,include_top=True):
    super(ResNet,self).__init__()
    self.include_top = include_top
    self.in_channel = 64

    self.conv1 = nn.Conv2d(3,self.in_channel,kernel_size=7,stride=2,padding=3,bias=False)
    self.bn1 = nn.BatchNorm2d(self.in_channel)
    self.relu = nn.ReLU(inplace=True)
    self.maxpool = nn.MaxPool2d(kernel_size=3,stride=2,padding=1)

    self.layer1 = self._make_layer(block,64,blocks_num[0])
    self.layer2 = self._make_layer(block,128,blocks_num[1],stride=2)
    self.layer3 = self._make_layer(block,256,blocks_num[2],stride=2)
    self.layer4 = self._make_layer(block,512,blocks_num[3],stride=2)

    # optionally include the global average pooling and fully connected layer
    if self.include_top:
      self.avgpool = nn.AdaptiveAvgPool2d((1,1))
      self.fc = nn.Linear(512*block.expansion,num_classes)

    for m in self.modules():
      if isinstance(m,nn.Conv2d):
        nn.init.kaiming_normal_(m.weight,mode='fan_out',nonlinearity='relu')

  def _make_layer(self,block,channel,block_num,stride=1):
    downsample = None
    # a projection shortcut is needed when the spatial size or channel number changes
    if stride!=1 or self.in_channel!=channel*block.expansion:
      downsample = nn.Sequential(
          nn.Conv2d(self.in_channel,channel*block.expansion,kernel_size=1,stride=stride,bias=False),
          nn.BatchNorm2d(channel*block.expansion)
      )

    layers = []
    layers.append(block(self.in_channel,channel,downsample=downsample,stride=stride))
    self.in_channel = channel*block.expansion

    for _ in range(1,block_num):
      layers.append(block(self.in_channel,channel))

    return nn.Sequential(*layers) # unpack the list as positional arguments

  def forward(self,x):
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)

    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)

    if self.include_top:
      x = self.avgpool(x)
      x = torch.flatten(x,1)
      x = self.fc(x)

    return x


def resnet18(num_classes=1000,include_top=True):
  return ResNet(BasicBlock,[2,2,2,2],num_classes=num_classes,include_top=include_top)

def resnet34(num_classes=1000,include_top=True):
  return ResNet(BasicBlock,[3,4,6,3],num_classes=num_classes,include_top=include_top)

def resnet50(num_classes=1000,include_top=True):
  return ResNet(Bottleneck,[3,4,6,3],num_classes=num_classes,include_top=include_top)

def resnet101(num_classes=1000,include_top=True):
  return ResNet(Bottleneck,[3,4,23,3],num_classes=num_classes,include_top=include_top)

def resnet152(num_classes=1000,include_top=True):
  return ResNet(Bottleneck,[3,8,36,3],num_classes=num_classes,include_top=include_top)
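
A quick usage check of the builders above (continuing the same file; the 5-class setting is just an example):

model = resnet34(num_classes=5)
x = torch.randn(1, 3, 224, 224)                     # dummy ImageNet-sized input
print(model(x).shape)                               # torch.Size([1, 5])
print(sum(p.numel() for p in model.parameters()))   # total number of parameters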

Part2 ResNeXt

1 Paper reading

Building on ResNet, the paper proposes the concept of "cardinality" (the size of the set of transformations) and points out that cardinality is another important factor affecting performance besides network depth and width. Increasing the cardinality is more efficient than increasing depth or width; simply enlarging those hyperparameters does not always improve results and increases the difficulty and uncertainty of training. In contrast, a well-designed network structure can achieve better results than simply deepening an existing network.

From this perspective, the paper improves the ResNet block by stacking multiple transformations with the same topology inside one residual block, and argues that this structure is less prone to overfitting. The input of the residual block is split into several parts of equal size, the same convolution operations are applied to each part, and the results are aggregated to form the output. Compared with the Inception modules, this design is simpler and easier to implement.

Experiments show that the 101-layer ResNeXt is more accurate than the 200-layer ResNet while having only about half of its complexity. The main ideas behind ResNeXt are as follows:

  • Multi-branch convolutional networks
  • Grouped convolutions
  • Compressing convolutional networks
  • Aggregated (summed) transformations

The residual blocks of ResNeXt are constructed following two guidelines:

  1. Blocks that produce spatial maps of the same size share the same hyperparameters
  2. Each time the spatial map is downsampled by a factor of 2, the width of the blocks is multiplied by 2

2 Network structure

Compared with ResNet, ResNeXt improves the block structure: the 3×3 convolution in the bottleneck is replaced by a grouped convolution. At the same computational cost, ResNeXt achieves a lower error rate.

Grouped convolution uses fewer parameters than ordinary convolution: with g groups the parameter count is divided by g. In the extreme case where the number of groups equals the number of input channels (and the output has the same number of channels), it is equivalent to assigning each input channel its own single-channel kernel, i.e. a depthwise convolution.
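
A quick check of the parameter savings (the channel sizes are arbitrary; 32 groups is the cardinality used in the paper, and groups equal to the channel count gives the depthwise case):

import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

std   = nn.Conv2d(64, 64, kernel_size=3, padding=1, bias=False)             # ordinary convolution
group = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=32, bias=False)  # grouped convolution, 32 groups
depth = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64, bias=False)  # groups == channels: depthwise

print(n_params(std))    # 36864 = 3*3*64*64
print(n_params(group))  # 1152  = 36864 / 32
print(n_params(depth))  # 576   = 36864 / 64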

The three block forms shown in the paper (split-transform-sum, early concatenation followed by a 1×1 convolution, and grouped convolution) are computationally equivalent.

 ResNet50 and ResNeXt50:

A meaningful grouped-convolution block can only be built when the block has at least three layers (the bottleneck form), so this improvement brings little benefit to the shallow ResNets, whose blocks have only two layers.

3 Build ResNeXt based on PyTorch

 Code link: (colab) Build ResNeXt based on PyTorch

# Build ResNeXt with PyTorch

from torch.nn.modules.batchnorm import BatchNorm2d
import torch
import torch.nn as nn


# Residual block for shallow networks (unchanged from ResNet)
class BasicBlock(nn.Module):
  expansion = 1 # expansion factor: how much the number of kernels changes within the main branch
  def __init__(self,in_channel,out_channel,stride=1,downsample=None):
    super(BasicBlock,self).__init__()
    self.conv1 = nn.Conv2d(in_channels=in_channel,out_channels=out_channel,
                kernel_size=3,stride=stride,padding=1,bias=False)
    self.bn1 = nn.BatchNorm2d(out_channel)
    self.relu = nn.ReLU()
    self.conv2 = nn.Conv2d(in_channels=out_channel,out_channels=out_channel,
                kernel_size=3,stride=1,padding=1,bias=False)
    self.bn2 = nn.BatchNorm2d(out_channel)
    self.downsample = downsample

  def forward(self,x):
    identity = x
    # if a downsample module is passed in, this is a dashed (projection) shortcut block
    if self.downsample is not None:
      identity = self.downsample(x)

    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)

    out = self.conv2(out)
    out = self.bn2(out)

    out += identity
    out = self.relu(out)

    return out


# Bottleneck residual block for deep networks (supports grouped convolution)
class Bottleneck(nn.Module):
  expansion = 4
  def __init__(self,in_channel,out_channel,stride=1,downsample=None,groups=1,width_per_group=64):
    super(Bottleneck,self).__init__()

    width = int(out_channel*(width_per_group/64.))*groups

    # 1x1 convolution: reduce the channel dimension to the group width
    self.conv1 = nn.Conv2d(in_channels=in_channel,out_channels=width,
                kernel_size=1,stride=1,bias=False)
    self.bn1 = nn.BatchNorm2d(width)

    self.conv2 = nn.Conv2d(in_channels=width,out_channels=width,groups=groups,
                kernel_size=3,stride=stride,padding=1,bias=False)
    self.bn2 = nn.BatchNorm2d(width)

    # 1x1 convolution: expand the channel dimension back
    self.conv3 = nn.Conv2d(in_channels=width,out_channels=out_channel*self.expansion,
                kernel_size=1,stride=1,bias=False)
    self.bn3 = nn.BatchNorm2d(out_channel*self.expansion)

    self.relu = nn.ReLU(inplace=True)
    self.downsample = downsample

  def forward(self,x):
    identity = x
    # if a downsample module is passed in, this is a dashed (projection) shortcut block
    if self.downsample is not None:
      identity = self.downsample(x)

    out = self.conv1(x)
    out = self.bn1(out)
    out = self.relu(out)

    out = self.conv2(out)
    out = self.bn2(out)
    out = self.relu(out)

    out = self.conv3(out)
    out = self.bn3(out)

    out += identity
    out = self.relu(out)

    return out



# ResNeXt
class ResNeXt(nn.Module):
  def __init__(self,block,blocks_num,num_classes=1000,include_top=True,groups=1,width_per_group=64):
    super(ResNeXt,self).__init__()
    self.include_top = include_top
    self.in_channel = 64
    self.groups = groups
    self.width_per_group = width_per_group

    self.conv1 = nn.Conv2d(3,self.in_channel,kernel_size=7,stride=2,padding=3,bias=False)
    self.bn1 = nn.BatchNorm2d(self.in_channel)
    self.relu = nn.ReLU(inplace=True)
    self.maxpool = nn.MaxPool2d(kernel_size=3,stride=2,padding=1)

    self.layer1 = self._make_layer(block,64,blocks_num[0])
    self.layer2 = self._make_layer(block,128,blocks_num[1],stride=2)
    self.layer3 = self._make_layer(block,256,blocks_num[2],stride=2)
    self.layer4 = self._make_layer(block,512,blocks_num[3],stride=2)

    # optionally include the global average pooling and fully connected layer
    if self.include_top:
      self.avgpool = nn.AdaptiveAvgPool2d((1,1))
      self.fc = nn.Linear(512*block.expansion,num_classes)

    for m in self.modules():
      if isinstance(m,nn.Conv2d):
        nn.init.kaiming_normal_(m.weight,mode='fan_out',nonlinearity='relu')

  def _make_layer(self,block,channel,block_num,stride=1):
    downsample = None
    # a projection shortcut is needed when the spatial size or channel number changes
    if stride!=1 or self.in_channel!=channel*block.expansion:
      downsample = nn.Sequential(
          nn.Conv2d(self.in_channel,channel*block.expansion,kernel_size=1,stride=stride,bias=False),
          nn.BatchNorm2d(channel*block.expansion)
      )

    layers = []
    layers.append(block(self.in_channel,channel,downsample=downsample,
              stride=stride,groups=self.groups,width_per_group=self.width_per_group))
    self.in_channel = channel*block.expansion

    for _ in range(1,block_num):
      layers.append(block(self.in_channel,channel,groups=self.groups,
                width_per_group=self.width_per_group))

    return nn.Sequential(*layers) # unpack the list as positional arguments

  def forward(self,x):
    x = self.conv1(x)
    x = self.bn1(x)
    x = self.relu(x)
    x = self.maxpool(x)

    x = self.layer1(x)
    x = self.layer2(x)
    x = self.layer3(x)
    x = self.layer4(x)

    if self.include_top:
      x = self.avgpool(x)
      x = torch.flatten(x,1)
      x = self.fc(x)

    return x


def resnext50_32_4d(num_classes=1000,include_top=True):
  groups = 32
  width_per_group = 4
  return ResNeXt(Bottleneck,[3,4,6,3],num_classes=num_classes,include_top=include_top,
                groups=groups,width_per_group=width_per_group)
  
def resnext101_32_8d(num_classes=1000,include_top=True):
  groups = 32
  width_per_group = 8
  return ResNeXt(Bottleneck,[3,4,23,3],num_classes=num_classes,include_top=include_top,
                groups=groups,width_per_group=width_per_group)
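
As with ResNet, a quick shape check of the ResNeXt builders (continuing the same file; the 2-class setting is just an example):

model = resnext50_32_4d(num_classes=2)
x = torch.randn(1, 3, 224, 224)
print(model(x).shape)   # torch.Size([1, 2])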

Part3 Code exercise: Cats vs. Dogs

Code link: (colab) cat and dog war

Data loader:

import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

data_dir = '/content/drive/MyDrive/Colab Notebooks/cat_dog/'
train_dir = data_dir+'train/'
test_dir = data_dir+'test/'
val_dir = data_dir+'val/'

train_imgs = os.listdir(train_dir)
train_labels = []


normalize = transforms.Normalize(mean=[0.485,0.456,0.406],std=[0.229,0.224,0.225])

transform = transforms.Compose([transforms.Resize([32,32]),transforms.ToTensor(),normalize])

class CatDogDataset(Dataset):
  def __init__(self, root, transform=None):
    self.root = root
    self.transform = transform
    # list all image files in the directory
    self.imgs = os.listdir(self.root)
    self.labels = []
    
    for img in self.imgs:
      if img.split('_')[0]=='cat':
        self.labels.append(0)
      if img.split('_')[0]=='dog':
        self.labels.append(1)

  def __len__(self):
    return len(self.imgs)

  def __getitem__(self, index):
    label = self.labels[index]
    img_dir = self.root + str(self.imgs[index])
    img = Image.open(img_dir)

    # apply the transform if one was provided
    if self.transform is not None:
      img = self.transform(img)

    return img,torch.from_numpy(np.array(label))  # return the image and its label
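
A minimal sketch of how these datasets could be wrapped into loaders (the batch size is an assumption, not taken from the original notebook, and the validation images are assumed to follow the same cat_/dog_ naming):

train_dataset = CatDogDataset(train_dir, transform=transform)
val_dataset = CatDogDataset(val_dir, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=128, shuffle=False)

imgs, labels = next(iter(train_loader))
print(imgs.shape, labels.shape)   # e.g. torch.Size([128, 3, 32, 32]) torch.Size([128])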

LeNet5:

class LeNet5(nn.Module):
    def __init__(self): 
        super(LeNet5, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, 5)    # 3x32x32 -> 16x28x28
        self.pool1 = nn.MaxPool2d(2, 2)     # -> 16x14x14
        self.conv2 = nn.Conv2d(16, 32, 5)   # -> 32x10x10
        self.pool2 = nn.MaxPool2d(2, 2)     # -> 32x5x5
        self.fc1 = nn.Linear(32*5*5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)        # 10 outputs; only classes 0 (cat) and 1 (dog) are used here
                            
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = x.view(-1, 32*5*5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

Related parameters:

Training (ResNet34 on the left, LeNet5 on the right):
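
For reference, a minimal training/validation loop for these models might look as follows (the optimizer, learning rate and number of epochs are assumptions, and train_loader / val_loader refer to the loader sketch above, not the original notebook):

def train_model(model, epochs=10, lr=1e-3):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        model.train()
        for imgs, labels in train_loader:
            imgs, labels = imgs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(imgs), labels)
            loss.backward()
            optimizer.step()

        model.eval()                      # switch BN to running statistics for validation
        correct = total = 0
        with torch.no_grad():
            for imgs, labels in val_loader:
                imgs, labels = imgs.to(device), labels.to(device)
                preds = model(imgs).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        print(f'epoch {epoch+1}: val acc = {correct/total:.4f}')
    return model

# model = train_model(LeNet5())
# model = train_model(resnet34(num_classes=2))   # requires the ResNet code from Part 1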

 

 

Generate result csv file:
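
A sketch of how the csv could be generated (a minimal version under assumptions: model and device come from the training sketch above, the test images are named by their index, and the grader expects rows of image id and predicted label with no header):

import csv

model.eval()
results = []
with torch.no_grad():
    for fname in sorted(os.listdir(test_dir)):
        img = transform(Image.open(test_dir + fname)).unsqueeze(0).to(device)
        pred = model(img).argmax(dim=1).item()          # 0 = cat, 1 = dog
        results.append((fname.split('.')[0], pred))

with open('result.csv', 'w', newline='') as f:
    csv.writer(f).writerows(results)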

Scoring results: 

 

It can be seen that, with the same optimizer settings and the same number of training epochs, ResNet34 performs clearly better than LeNet5.

Part4 thinking questions

1. Residual learning

Suppose the desired underlying mapping is h(x), and the two branches of the residual block compute f(x) and x respectively, so that h(x) = x + f(x), where f(x) is the residual mapping to be learned by the main branch. This structure makes it easy for a block to realize the identity mapping, which alleviates the network degradation problem. Taking the derivative gives h'(x) = 1 + f'(x), so the gradient always contains a direct term of 1, which also alleviates gradient vanishing.
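
Stacking blocks makes this more concrete: assuming pure identity shortcuts (and ignoring the ReLU after the addition), each block computes x_(l+1) = x_l + f(x_l), so x_L = x_l + Σ f(x_i) for i from l to L−1, and by the chain rule ∂Loss/∂x_l = ∂Loss/∂x_L · (1 + ∂(Σ f(x_i))/∂x_l). The gradient reaching any shallower layer therefore always contains the term ∂Loss/∂x_L directly, without being scaled down layer by layer.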

2. The principle of Batch Normalization

A convolutional neural network contains many hidden layers, and the parameters of every layer change during training, so the input distribution of each hidden layer keeps shifting; this slows down learning and pushes the activation functions into their saturated regions where gradients vanish. Normalization is the idea used to solve this, but normalizing over the whole training set at every step would make training very expensive, while normalizing over too few samples is not effective. BN therefore splits the data into mini-batches and performs the normalization within each batch.

3. Why can grouped convolution improve accuracy? Since grouped convolution can improve accuracy and reduce computation, shouldn't the number of groups be made as large as possible?

According to the ResNeXt paper, the original motivation for grouped convolution was simply to make it convenient to train a model on multiple GPUs at the same time. Grouped convolution does reduce computation, but there is little evidence that it improves accuracy by itself, and I have not found convincing evidence so far either. The improvement from ResNet to ResNeXt mainly comes from the concept of cardinality. Personally, I think that if grouped convolution does improve accuracy, it is because the groups can learn different aspects of the features separately and thus capture richer information, somewhat like the difference between convolutional networks and traditional fully connected layers.

The number of groups should not be made too large: too many groups fragments the feature extraction, since each group sees only a small slice of the channels, which is not conducive to extracting the key features.


Origin blog.csdn.net/qq_55708326/article/details/125957382