[DataWhale Learning Record 13-02] Introductory CV Competition from Scratch - Task03 - Character Recognition Model

1. Goal:

1. Understand the basics and principles of CNNs;
2. Use the PyTorch framework to build a CNN model and complete training.

2. About CNN

1) Introduction to CNN

A Convolutional Neural Network (CNN) is a feed-forward neural network whose artificial neurons respond to the surrounding units within their coverage area [1]; it performs excellently on large-scale image processing.

A convolutional neural network consists of one or more convolutional layers and fully connected layers at the top (corresponding to a classic neural network), together with associated weights and pooling layers. This structure allows convolutional neural networks to exploit the two-dimensional structure of the input data. Compared with other deep learning architectures, convolutional neural networks give better results in image and speech recognition. The model can also be trained with the backpropagation algorithm. Compared with other deep feed-forward neural networks, convolutional neural networks have far fewer parameters to learn, which makes them an attractive deep learning architecture [2].

A convolutional neural network (CNN for short) is a special type of artificial neural network and an important branch of deep learning. CNNs perform excellently in many fields, with accuracy and speed far beyond traditional machine learning algorithms. In computer vision in particular, CNNs are the mainstream models for image classification, image retrieval, object detection and semantic segmentation.

Each layer of a CNN is composed of many convolution kernels, and each kernel performs a convolution over its input pixels to produce the input of the next layer. As the number of layers increases, the convolution kernels gradually expand their receptive field while the spatial size of the image (feature map) is reduced.
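
As a quick illustration of how stacking convolution and pooling layers shrinks the spatial size (a minimal sketch; the 32×32 single-channel input and the channel widths are arbitrary choices):

import torch
import torch.nn as nn

x = torch.randn(1, 1, 32, 32)              # one 32x32 single-channel image
conv1 = nn.Conv2d(1, 8, kernel_size=3)     # 3x3 kernel, stride 1, no padding
pool = nn.MaxPool2d(2)                     # 2x2 max pooling
conv2 = nn.Conv2d(8, 16, kernel_size=3)

feat1 = pool(conv1(x))                     # 32 -> 30 -> 15
feat2 = pool(conv2(feat1))                 # 15 -> 13 -> 6
print(feat1.shape, feat2.shape)            # torch.Size([1, 8, 15, 15]) torch.Size([1, 16, 6, 6])

Each value in feat2 is influenced by a larger patch of the original image than each value in feat1, which is the growing receptive field described above.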


A CNN is a hierarchical model whose input is the raw pixel data. It is built from convolution, pooling, non-linear activation functions and fully connected layers.

The following figure shows the LeNet network structure, a very classic character recognition model. It consists of two convolutional layers, two pooling layers and two fully connected layers. The convolution kernels are all 5×5 with stride = 1, and the pooling layers use max pooling.
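
A minimal PyTorch sketch of such a LeNet-style network (assuming 32×32 single-channel inputs, 10 output classes and the classic 6/16 channel widths; exact details vary between LeNet variants):

import torch.nn as nn

class LeNet(nn.Module):
    def __init__(self, num_classes=10):
        super(LeNet, self).__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, stride=1),    # 32x32 -> 28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5, stride=1),   # 14x14 -> 10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),                  # first fully connected layer
            nn.ReLU(),
            nn.Linear(120, num_classes),                 # second fully connected layer
        )

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.shape[0], -1)
        return self.classifier(x)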


Through repeated convolution and pooling, the last layer of the CNN maps the input image pixels to a specific output. In a classification task, for example, the output is converted into a probability distribution over the different categories; the difference between the true label and the CNN's prediction is then computed, the parameters of each layer are updated through backpropagation, another forward pass is run after the update, and so on until training is complete.
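
For a single classification output, the step from the CNN's raw scores to probabilities and to a loss value can be sketched as follows (a toy snippet with random scores and labels standing in for real model outputs):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 11, requires_grad=True)   # stand-in for the CNN output: batch of 4, 11 classes
labels = torch.randint(0, 11, (4,))               # ground-truth class indices
probs = F.softmax(logits, dim=1)                  # probability output over the categories
loss = F.cross_entropy(logits, labels)            # difference between prediction and true label
loss.backward()                                   # gradients used in the backpropagation update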

Compared with traditional machine learning models, a CNN follows an end-to-end idea: training goes directly from the image pixels to the final output, without a separate hand-crafted feature extraction and model building step, and therefore with little human involvement.

2) Understanding CNNs

After a 3×3 patch of source pixels passes through a 3×3 convolution kernel, the features of those source pixels are mapped into a single 1×1 destination pixel.
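
A concrete version of this mapping with F.conv2d (the pixel values and the all-ones kernel are chosen arbitrarily):

import torch
import torch.nn.functional as F

src = torch.arange(9, dtype=torch.float32).view(1, 1, 3, 3)   # 3x3 source pixels: 0..8
kernel = torch.ones(1, 1, 3, 3)                               # 3x3 convolution kernel
dst = F.conv2d(src, kernel)                                   # 1x1 destination pixel
print(dst)                                                    # tensor([[[[36.]]]]), the sum 0+1+...+8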

This is consistent with the way the human eye recognizes an image, but it raises two questions:

Problems:
1. The convolutional layer can only extract a limited number of features, while an image may contain far more;
2. Are the features selected by the final sampling layer really the important ones?

This introduces a new concept: the cascade of classifiers.

Roughly speaking, from a pool of weak classifiers we pick the one that best meets the requirements and use it to discard the unwanted data while keeping the desired data.

Then, from the remaining weak classifiers, we pick the next one that best meets the requirements, apply it to the data kept by the previous stage, and again discard the unwanted data while keeping the desired data.

Finally, by chaining several weak classifiers in series and filtering the data stage by stage, we end up with exactly the data we want.
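
A toy sketch of the cascade idea (pure illustration, not CNN code; the classifiers here are just made-up threshold rules on numbers standing in for feature scores):

# Each stage keeps only the samples its weak classifier accepts and passes them on.
def cascade(samples, weak_classifiers):
    for clf in weak_classifiers:
        samples = [s for s in samples if clf(s)]   # discard what this stage rejects
    return samples                                 # only data accepted by every stage survives

kept = cascade(range(100), [lambda s: s > 10, lambda s: s % 2 == 0, lambda s: s < 50])
print(kept)   # [12, 14, ..., 48]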

A CNN is mainly composed of three kinds of modules:

  • Convolutional layer
  • Sampling layer
  • Fully connected layer

It can be roughly understood as:

  • The first convolutional layer extracts the initial features and outputs a feature map.
  • The first sampling layer performs feature selection on this feature map, removes redundant features, and builds a new feature map.
  • The second convolutional layer performs a second round of feature extraction on the feature map output by the previous sampling layer.
  • The second sampling layer again performs feature selection on the output of the previous layer.
  • The fully connected layer performs classification based on the resulting features.

Reference: Introduction to Convolutional Neural Network (CNN): https://zhuanlan.zhihu.com/p/31249821

3. Building a CNN model with PyTorch

import torch
torch.manual_seed(0)
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True

import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data.dataset import Dataset

# Define the model
class SVHN_Model1(nn.Module):
    def __init__(self):
        super(SVHN_Model1, self).__init__()
        # CNN feature extraction module
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2)),
            nn.ReLU(),  
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=(3, 3), stride=(2, 2)),
            nn.ReLU(), 
            nn.MaxPool2d(2),
        )
        # Six parallel fully connected classifiers, one per character position (11 classes each)
        self.fc1 = nn.Linear(32*3*7, 11)
        self.fc2 = nn.Linear(32*3*7, 11)
        self.fc3 = nn.Linear(32*3*7, 11)
        self.fc4 = nn.Linear(32*3*7, 11)
        self.fc5 = nn.Linear(32*3*7, 11)
        self.fc6 = nn.Linear(32*3*7, 11)
    
    def forward(self, img):        
        feat = self.cnn(img)
        feat = feat.view(feat.shape[0], -1)
        c1 = self.fc1(feat)
        c2 = self.fc2(feat)
        c3 = self.fc3(feat)
        c4 = self.fc4(feat)
        c5 = self.fc5(feat)
        c6 = self.fc6(feat)
        return c1, c2, c3, c4, c5, c6
    
model = SVHN_Model1()
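
A quick sanity check of the output shapes (assuming, as in the competition baseline, that the input images are resized to 64×128, which is what makes the flattened feature size 32*3*7):

dummy = torch.zeros(2, 3, 64, 128)    # a fake batch of 2 RGB images
outputs = model(dummy)
print([o.shape for o in outputs])     # six tensors of shape torch.Size([2, 11])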

Training code:

# Loss function
criterion = nn.CrossEntropyLoss()
# Optimizer
optimizer = torch.optim.Adam(model.parameters(), 0.005)

loss_plot, c0_plot = [], []
# Train for 10 epochs
for epoch in range(10):
    for data in train_loader:
        c0, c1, c2, c3, c4, c5 = model(data[0])
        loss = criterion(c0, data[1][:, 0]) + \
                criterion(c1, data[1][:, 1]) + \
                criterion(c2, data[1][:, 2]) + \
                criterion(c3, data[1][:, 3]) + \
                criterion(c4, data[1][:, 4]) + \
                criterion(c5, data[1][:, 5])
        loss /= 6
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        loss_plot.append(loss.item())
        c0_plot.append((c0.argmax(1) == data[1][:, 0]).sum().item()*1.0 / c0.shape[0])
        
    print(epoch)
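
The train_loader above is assumed to come from the data-reading step of the previous task. For completeness, here is a minimal sketch of how such a loader could be built; the 64×128 resize, the fixed label length of 6 with filler class 10, and the train_paths / train_labels placeholders are assumptions that should match your own data pipeline:

from PIL import Image

class SVHNDataset(Dataset):
    def __init__(self, img_paths, img_labels, transform):
        self.img_paths = img_paths      # list of image file paths
        self.img_labels = img_labels    # list of digit-label sequences, e.g. [1, 9]
        self.transform = transform

    def __getitem__(self, index):
        img = self.transform(Image.open(self.img_paths[index]).convert('RGB'))
        # Pad every label sequence to a fixed length of 6 with the filler class 10
        label = list(self.img_labels[index]) + [10] * (6 - len(self.img_labels[index]))
        return img, torch.tensor(label, dtype=torch.long)

    def __len__(self):
        return len(self.img_paths)

# train_paths / train_labels are placeholders for your own file lists and labels
train_loader = torch.utils.data.DataLoader(
    SVHNDataset(train_paths, train_labels,
                transforms.Compose([transforms.Resize((64, 128)),
                                    transforms.ToTensor()])),
    batch_size=40, shuffle=True)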

To pursue higher accuracy, a model pre-trained on the ImageNet dataset can also be used as the feature extractor. The specific approach is as follows:

class SVHN_Model2(nn.Module):
    def __init__(self):
        super(SVHN_Model2, self).__init__()
                
        model_conv = models.resnet18(pretrained=True)
        model_conv.avgpool = nn.AdaptiveAvgPool2d(1)
        model_conv = nn.Sequential(*list(model_conv.children())[:-1])
        self.cnn = model_conv
        
        self.fc1 = nn.Linear(512, 11)
        self.fc2 = nn.Linear(512, 11)
        self.fc3 = nn.Linear(512, 11)
        self.fc4 = nn.Linear(512, 11)
        self.fc5 = nn.Linear(512, 11)
    
    def forward(self, img):        
        feat = self.cnn(img)
        # print(feat.shape)
        feat = feat.view(feat.shape[0], -1)
        c1 = self.fc1(feat)
        c2 = self.fc2(feat)
        c3 = self.fc3(feat)
        c4 = self.fc4(feat)
        c5 = self.fc5(feat)
        return c1, c2, c3, c4, c5
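
When fine-tuning on limited data, it is also possible (optional) to freeze the pre-trained backbone at first and train only the classification heads, for example:

model = SVHN_Model2()
for param in model.cnn.parameters():
    param.requires_grad = False    # freeze the pretrained ResNet-18 backbone
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), 0.001)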

Origin: https://blog.csdn.net/qq_40463117/article/details/106355226