Training ResNet on your own data with PyTorch

1. Introduction to the ResNet algorithm

The Residual Neural Network (ResNet) was proposed by Kaiming He and colleagues at Microsoft Research. ResNet won first place in ILSVRC 2015.

Experiments show that as network depth increases, model accuracy first keeps improving until it saturates at a maximum; as the depth continues to grow, accuracy then unexpectedly drops. This plainly conflicts with the intuition that "the deeper the network, the higher the accuracy". The ResNet authors call this phenomenon "degradation".

Adding more layers should, in principle, never hurt: the solution space of a shallow network is contained in that of a deeper one. Set the added layers to identity mappings and copy the shallow network's weights into the remaining layers, and the deep network reproduces the shallow network's performance exactly. A solution at least this good clearly exists, so why does training find a worse one?

Because the degradation appears on the training set, overfitting can be ruled out, and the introduction of batch normalization largely solves the vanishing- and exploding-gradient problems of plain networks. If neither overfitting nor vanishing gradients is the cause, what is?

Evidently this is an optimization problem: networks of similar structure can differ greatly in how hard they are to optimize, the difficulty does not grow linearly with depth, and deeper models are markedly harder to optimize.

There are two kinds of remedy. One is to improve the solver, e.g. better initialization or a better gradient-descent algorithm. The other is to adjust the model structure so that the model is easier to optimize; changing the structure effectively changes the shape of the error surface.

ResNet takes the structural route and introduces the residual block. The underlying idea is to let each residual block in the deeper part of the network learn something close to an identity mapping. This simplifies the learning task, so the network can be made much deeper.

Why is the residual block designed this way?

ResNet aims to build a network that can realize identity mappings, but fitting an identity directly with stacked nonlinear layers is hard; it is easier to learn the residual instead. If the residual is driven to zero, the block reduces exactly to an identity mapping. Figure 2 shows a residual block: F(x) denotes the residual learning path, x the shortcut path, and the mapping H(x) learned by the block is:

H(x) = F(x) + x

In the original paper, the residual path takes one of two forms. One has a bottleneck structure, shown on the right of the figure below: 1×1 convolutional layers first reduce and then restore the channel dimension, mainly to cut computational cost; this is called the "bottleneck block". The other form, on the left, has no bottleneck and is called the "basic block"; it consists of two 3×3 convolutional layers (a minimal sketch follows).
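As an illustration (not from the original post), here is a minimal sketch of a basic block in PyTorch, assuming stride 1 and equal input/output channel counts so the shortcut needs no projection:

import torch.nn as nn

class BasicBlock(nn.Module):
    """Minimal sketch of the two-3x3-conv 'basic block'.
    Assumes stride 1 and matching channels, so the shortcut is a plain identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # F(x) + x, then the activation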

ResNet is a stack of residual blocks, and its structure is easy to modify and extend. By adjusting the number of channels within blocks and the number of stacked blocks, the width and depth of the network can be tuned to obtain networks of different expressive power. There is little need to worry about "degradation": as long as the training data is sufficient, gradually deepening the network yields better performance. ResNet is currently the most common backbone for detection networks, with ResNet-50 and ResNet-101 among the most widely used configurations.
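For reference, these are the standard per-stage block counts from the ResNet paper that distinguish the common bottleneck variants (the [3, 4, 6, 3] configuration is the one built in section 3.1):

# Per-stage bottleneck block counts for the common ResNet variants
resnet_layers = {
    "resnet50":  [3, 4, 6, 3],
    "resnet101": [3, 4, 23, 3],
    "resnet152": [3, 8, 36, 3],
}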

2. Dataset introduction

This experiment uses an open-source gesture-recognition dataset to train a gesture classifier. The dataset comes from the project https://codechina.csdn.net/EricLee/classification and contains 2850 samples in 14 categories.

Defining the data in PyTorch follows the standard recipe: subclass Dataset and override a few methods to match your data's layout. In this experiment, every fifth sample is held out for validation, giving roughly a 4:1 train/validation split.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader,Dataset
from torchvision import transforms as T
import matplotlib.pyplot as plt
import os
from PIL import Image
import numpy as np
import random
 
class hand_pose(Dataset):
    def __init__(self, root, train=True, transforms=None):
        imgs = []
        for path in os.listdir(root):
            # Folder names begin with a three-digit class id, "000" through "013".
            path_prefix = path[:3]
            if path_prefix.isdigit() and int(path_prefix) < 14:
                label = int(path_prefix)
            else:
                print("data label error:", path)
                continue  # skip anything that does not match the naming scheme
 
            childpath = os.path.join(root, path)
            for imgpath in os.listdir(childpath):
                imgs.append((os.path.join(childpath, imgpath), label))
        
        train_path_list, val_path_list = self._split_data_set(imgs)
        if train:
            self.imgs = train_path_list
        else:
            self.imgs = val_path_list

        if transforms is None:
            normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
 
            self.transforms = T.Compose([
                    T.Resize(256),
                    T.CenterCrop(224),
                    T.ToTensor(),
                    normalize
            ])
        else:
            self.transforms = transforms
             
    def __getitem__(self, index):
        img_path = self.imgs[index][0]
        label = self.imgs[index][1]
 
        data = Image.open(img_path)
        if data.mode != "RGB":
            data = data.convert("RGB")
        data = self.transforms(data)
        return data,label
 
    def __len__(self):
        return len(self.imgs)

    def _split_data_set(self, imags):
        """
        Split the samples into training and validation sets.
        Tailored to this dataset's layout; not general-purpose.
        """
        val_path_list = imags[::5]
        train_path_list = []
        for item in imags:
            if item not in val_path_list:
                train_path_list.append(item)
        return train_path_list, val_path_list
 
if __name__ == "__main__":
    root = "handpose_x_gesture_v1"
   
    train_dataset = hand_pose(root, train=True)
    train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    for data, label in train_dataloader:
        print(data.shape)
        print(label)
        break

Because nn.CrossEntropyLoss applies (log-)softmax internally and works directly with integer class indices, there is no need to one-hot encode the labels in the dataset definition; plain integer labels (0, 1, 2, ...) are enough.
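A minimal illustration (hypothetical tensors, not from the post) of how the loss consumes raw logits and integer labels:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
logits = torch.randn(4, 14)           # raw model outputs for 4 samples, 14 classes (no softmax)
labels = torch.tensor([0, 3, 7, 13])  # integer class indices, not one-hot vectors
loss = criterion(logits, labels)
print(loss.item())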

3. Model training

3.1 Model network definition

import torch
from torch import nn

class Bottleneck(nn.Module):
    # Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 expand
    extention = 4  # channel expansion factor of the block's output
    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(Bottleneck, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=1, stride=stride, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)

        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

        self.conv3 = nn.Conv2d(planes, planes*self.extention, kernel_size=1, stride=1, bias=False)
        self.bn3 = nn.BatchNorm2d(planes*self.extention)

        self.relu = nn.ReLU(inplace=True)

        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        shortcut = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)
        # no ReLU here: the activation is applied after the shortcut addition

        if self.downsample is not None:
            shortcut = self.downsample(x)
        
        out = out + shortcut   # must not be written as out += shortcut (see note below)
        out = self.relu(out)
        return out


class ResNet50(nn.Module):
    def __init__(self, block, layers, num_class):
        super(ResNet50, self).__init__()
        self.inplane = 64

        self.block = block
        self.layers = layers

        self.conv1 = nn.Conv2d(3, self.inplane, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(self.inplane)
        self.relu = nn.ReLU()
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.stage1 = self.make_layer(self.block, 64, layers[0], stride=1)
        self.stage2 = self.make_layer(self.block, 128, layers[1], stride=2)
        self.stage3 = self.make_layer(self.block, 256, layers[2], stride=2)
        self.stage4 = self.make_layer(self.block, 512, layers[3], stride=2)

        self.avgpool = nn.AvgPool2d(7)
        self.fc = nn.Linear(512*block.extention, num_class)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.maxpool(out)

        # residual stages
        out = self.stage1(out)
        out = self.stage2(out)
        out = self.stage3(out)
        out = self.stage4(out)

        out = self.avgpool(out)
        out = torch.flatten(out, 1)
        out = self.fc(out)

        return out

    def make_layer(self, block, plane, block_num, stride=1):
        block_list = []
        downsample = None
        # A projection shortcut is needed when the spatial size or channel count changes.
        if stride != 1 or self.inplane != plane*block.extention:
            downsample = nn.Sequential(
                nn.Conv2d(self.inplane, plane*block.extention, stride=stride, kernel_size=1, bias=False),
                nn.BatchNorm2d(plane*block.extention)
            )
        conv_block = block(self.inplane, plane, stride=stride, downsample=downsample)
        block_list.append(conv_block)
        self.inplane = plane*block.extention

        for i in range(1,block_num):
            block_list.append(block(self.inplane, plane, stride=1))

        return nn.Sequential(*block_list)


if __name__ == "__main__":
    resnet = ResNet50(Bottleneck,[3,4,6,3],14)
    x = torch.randn(64,3,224,224)
    x = resnet(x)
    print(x.shape)

The network definition has two parts: Bottleneck is the basic module of the residual network, and ResNet50 is the overall architecture, corresponding to the network structure in the figure below.

Note that in the residual block Bottleneck, the shortcut addition must not be written as out += shortcut. The reason is that out must be saved for the backward pass's gradient computation, while += is an in-place operation that overwrites it.

If the in-place form is used, PyTorch raises an error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation:
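A minimal reproduction of this class of error (an illustrative snippet, not from the original post), using sigmoid, whose backward pass needs its own output:

import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid's backward needs y itself: dy/dx = y * (1 - y)
y += 1                 # in-place edit destroys the saved value
y.sum().backward()     # RuntimeError: one of the variables needed for gradient
                       # computation has been modified by an inplace operation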

3.2 Training

import torch
import torch.nn as nn
from torch.utils.data import DataLoader,Dataset
from Data import hand_pose
from Model import ResNet50, Bottleneck
import os


def main():
    # 1. load dataset
    root = "handpose_x_gesture_v1"
    batch_size = 64
    train_data = hand_pose(root, train=True)
    val_data = hand_pose(root, train=False)
    train_dataloader = DataLoader(train_data,batch_size=batch_size,shuffle=True)
    val_dataloader = DataLoader(val_data,batch_size=batch_size,shuffle=True)
    
    # 2. load model
    num_class = 14
    model = ResNet50(Bottleneck,[3,4,6,3], num_class)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
    
    # 3. prepare super parameters
    criterion = nn.CrossEntropyLoss()
    learning_rate = 1e-3
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    epochs = 30

    # 4. train
    val_acc_list = []
    out_dir = "checkpoints/"
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    for epoch in range(epochs):
        print('\nEpoch: %d' % (epoch + 1))
        model.train()
        sum_loss = 0.0
        correct = 0.0
        total = 0.0
        for batch_idx, (images, labels) in enumerate(train_dataloader):
            length = len(train_dataloader)
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            outputs = model(images) # torch.size([batch_size, num_class])
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
        
            sum_loss += loss.item()
            _, predicted = torch.max(outputs.data, dim=1)
            total += labels.size(0)
            correct += predicted.eq(labels.data).cpu().sum()
            print('[epoch:%d, iter:%d] Loss: %.03f | Acc: %.3f%% ' 
                % (epoch + 1, (batch_idx + 1 + epoch * length), sum_loss / (batch_idx + 1), 100. * correct / total))
            
        # evaluate accuracy on the validation set after each epoch
        print('Waiting Val...')
        model.eval()  # switch to eval mode: freezes BN statistics
        with torch.no_grad():
            correct = 0.0
            total = 0.0
            for batch_idx, (images, labels) in enumerate(val_dataloader):
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, dim=1)
                total += labels.size(0)
                correct += (predicted == labels).sum()
            print('Val\'s ac is: %.3f%%' % (100 * correct / total))
            
            acc_val = 100 * correct / total
            val_acc_list.append(acc_val)


        torch.save(model.state_dict(), out_dir+"last.pt")
        if acc_val == max(val_acc_list):
            torch.save(model.state_dict(), out_dir+"best.pt")
            print("save epoch {} model".format(epoch))

if __name__ == "__main__":
    main()

During training, the accuracy on the training and validation sets is computed every epoch, and the model weights are saved: the latest to last.pt, and the best-so-far (by validation accuracy) to best.pt.
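To use the saved weights later, here is a sketch of reloading the best checkpoint (assuming the same model definition from Model.py):

import torch
from Model import ResNet50, Bottleneck

# Rebuild the architecture, then load the best weights saved during training.
model = ResNet50(Bottleneck, [3, 4, 6, 3], 14)
model.load_state_dict(torch.load("checkpoints/best.pt", map_location="cpu"))
model.eval()  # inference mode: disables BN statistic updates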

The final training results: the training-set accuracy reaches about 88% while the validation set only reaches 72.6%, so the model clearly overfits to some degree. The cause is the small amount of data: 2850 samples in total, split 4:1 into training and validation sets. Initializing the model with pre-trained weights via transfer learning should work much better.

3.3 Transfer Learning

Transfer learning, as the name implies, transfers the parameters of an already-trained model to a new model to help train it. Since most data and tasks are related, transfer learning lets us hand the learned parameters (the "knowledge" the model has acquired) over to a new model, speeding up and improving its training instead of learning from scratch.

Advantages: 1. faster training, with the loss converging quickly; 2. less overfitting, yielding a model that generalizes better.

Because the model defined above differs from the resnet50 used for the public pre-trained weights, those weights cannot be loaded into it directly. Instead, we use the resnet50 that ships with torchvision, load the pre-trained weights, and replace the last fully connected layer before training. Only the model-loading step in train.py needs to change:

    # 2. load model
    num_class = 14
    # model = ResNet50(Bottleneck, [3, 4, 6, 3], num_class)
    model = models.resnet50(pretrained=True)  # needs: from torchvision import models
    fc_inputs = model.fc.in_features
    model.fc = nn.Linear(fc_inputs, num_class)  # swap in a 14-class classification head
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = model.to(device)
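As an aside, in torchvision 0.13 and later the pretrained flag is deprecated in favor of a weights argument; the equivalent call would be:

from torchvision import models

# torchvision >= 0.13 style; IMAGENET1K_V1 matches the old pretrained=True weights
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)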

The final training results: by epoch 22 the loss is already very small, with about 88% accuracy on the validation set and about 99% on the training set.

Why is the gap between training from scratch and fine-tuning so large? The loss differs by roughly an order of magnitude, and validation accuracy improves by about 15 percentage points (roughly 20% in relative terms). In my view this is mostly down to initialization: a good starting point keeps optimization from stalling around a poor local minimum and lets it reach a lower one, which improves the model, although the overfitting problem remains.

 
