Introduction

This project builds a deep neural network with PyTorch. The network consists of a convolutional layer, a pooling layer, and fully connected layers, and it is used to recognize handwritten digits in the MNIST dataset. Working through the code shows, in principle, the whole process of handwritten digit recognition, including backpropagation and gradient descent.

1 Introduction to Convolutional Neural Networks

1.1 What is a Convolutional Neural Network

A convolutional neural network (CNN) is a multi-layer, feed-forward neural network. Functionally, it can be divided into two stages: a feature extraction stage and a classification stage.

The feature extraction stage automatically extracts features from the input data as the basis for classification. It consists of several stacked feature layers, each made up of a convolutional layer and a pooling layer. The earlier feature layers capture local details in the image, while the later ones capture higher-level, more abstract information.

1.1.1 Convolution Kernel

In a convolutional layer, a neuron is connected only to some of the neurons in the neighboring layer. A convolutional layer usually contains several feature maps. Each feature map consists of neurons arranged in a rectangle, and the neurons of the same feature map share weights; these shared weights are the convolution kernel. The kernel is generally initialized as a matrix of small random values and learns reasonable weights during training. The direct benefit of sharing weights (convolution kernels) is fewer connections between the layers of the network, which also reduces the risk of overfitting.
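To make the saving from weight sharing concrete, the sketch below counts the parameters of a 5×5 convolution with 32 feature maps on a 28×28 single-channel image and compares them with a fully connected layer that would produce an output of the same size. The layer sizes are assumptions chosen to match the network built later in this article, and the fully connected layer is purely hypothetical.

```python
import torch.nn as nn

# Convolutional layer: each of the 32 feature maps shares one 5x5 kernel (plus a bias).
conv = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=5, padding=2)
conv_params = sum(p.numel() for p in conv.parameters())

# Hypothetical fully connected layer with the same input and output sizes:
# every output neuron would need its own weight for every input pixel.
n_inputs = 28 * 28
n_outputs = 32 * 28 * 28
dense_params = n_inputs * n_outputs + n_outputs  # weights + biases

print("conv parameters: ", conv_params)
print("dense parameters:", dense_params)
```

The convolutional layer needs 32 * 5 * 5 weights plus 32 biases, while the dense alternative needs tens of millions of parameters for the same output shape.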

1.1.2 Receptive Field

Definition: the receptive field of a pixel on the feature map output by a layer of a convolutional neural network is the region of the input image that this pixel maps to. In a typical CNN structure, the value of each output node of a fully connected (FC) layer depends on all inputs of that layer, while the value of each output node of a convolutional (CONV) layer depends only on one region of the layer's input. Input values outside this region do not affect the output value, and this region is the receptive field. The following figure is a schematic diagram of the receptive field:

Convolution kernels of different sizes differ mainly in the size of their receptive field. For this reason, several layers of small kernels are often used in place of one layer with a large kernel, which reduces the number of parameters and the amount of computation while keeping the receptive field the same. For example, it is very common to replace one layer of 5*5 convolution kernels with two layers of 3*3 convolution kernels.
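The trade-off can be checked with a few lines of arithmetic. This is a minimal sketch that assumes stride 1 and no dilation; the helper names `receptive_field` and `weights_per_channel_pair` are made up for this illustration.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked convolutions, assuming stride 1 and no dilation."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # each stride-1 layer widens the field by (k - 1)
    return rf


def weights_per_channel_pair(kernel_sizes):
    """Kernel weights needed per (input channel, output channel) pair, summed over layers."""
    return sum(k * k for k in kernel_sizes)


two_small = [3, 3]  # two stacked 3*3 layers
one_large = [5]     # one 5*5 layer

print(receptive_field(two_small), receptive_field(one_large))                    # 5 5
print(weights_per_channel_pair(two_small), weights_per_channel_pair(one_large))  # 18 25
```

Both options see a 5*5 region of the input, but the stacked version uses 18 weights per channel pair instead of 25.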

1.1.3 Standardization (Batch Normalization)

Before batch normalization (BN) was introduced, model training suffered from systematic problems that made many networks converge very slowly or not train at all, especially with the sigmoid activation function. In machine learning, input features are usually standardized or normalized, because the features may have different units and very different value ranges, which prevents the model from learning well from every feature. The same issue appears inside the network: when the output of the previous layer is too large or too small, it falls into the saturation region of the sigmoid activation function, and the gradient vanishes during backpropagation.
Batch Normalization: standardize a mini-batch of data so that it has a mean of 0 and a standard deviation of 1.

The Batch Normalization layer is usually placed between a neural network layer and its activation layer. It adjusts the distribution of that layer's output so that it has a mean of 0 and a variance of 1, which keeps the values in the unsaturated region of the activation function. This alleviates the vanishing gradient problem and speeds up convergence.
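The standardization step can be written out by hand and compared with PyTorch's layer. The following is a sketch under stated assumptions: the mini-batch is random and hypothetical, and the layer is left at its initial state, where its learnable scale and shift are 1 and 0, so it should match the plain standardization.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Hypothetical mini-batch: 8 samples, 3 channels, 4x4 feature maps, far from zero mean.
x = torch.randn(8, 3, 4, 4) * 5 + 10

# Manual standardization: statistics are taken per channel over batch, height and width.
eps = 1e-5
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = ((x - mean) ** 2).mean(dim=(0, 2, 3), keepdim=True)
x_hat = (x - mean) / torch.sqrt(var + eps)

# PyTorch layer in training mode, so it also uses the statistics of this batch.
bn = nn.BatchNorm2d(3, eps=eps)
bn.train()
y = bn(x).detach()

print(x_hat.mean(dim=(0, 2, 3)))           # close to 0 for every channel
print((x_hat ** 2).mean(dim=(0, 2, 3)))    # close to 1 for every channel
print(torch.allclose(x_hat, y, atol=1e-4))
```

After training starts, the layer's scale and shift are learned, so its output is no longer restricted to exactly zero mean and unit variance.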

1.1.4 Pooling layer (Pooling)

Pooling is used to reduce the dimension of the feature map (Feature Map) in the neural network. In convolutional neural networks, the pooling operation usually follows the convolution operation and reduces the spatial size of the feature map. The basic idea is to divide the feature map into several sub-regions (usually rectangular) and compute a summary statistic for each sub-region. Pooling usually takes one of two forms: mean pooling and max pooling. Pooling can be seen as a special convolution process. Together, convolution and pooling greatly reduce the complexity of the model and its number of parameters.

Max pooling tends to preserve texture features of the image
Mean pooling tends to preserve background features
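The two forms of pooling can be compared on a small example. This sketch uses a hypothetical 4x4 feature map and 2x2 sub-regions; the values are chosen only to make the result easy to read.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4x4 feature map, shaped (batch, channels, height, width).
x = torch.tensor([[1., 2., 5., 6.],
                  [3., 4., 7., 8.],
                  [9., 10., 13., 14.],
                  [11., 12., 15., 16.]]).reshape(1, 1, 4, 4)

max_out = F.max_pool2d(x, kernel_size=2)  # largest value of each 2x2 sub-region
avg_out = F.avg_pool2d(x, kernel_size=2)  # mean value of each 2x2 sub-region

print(max_out.squeeze())  # values: [[4, 8], [12, 16]]
print(avg_out.squeeze())  # values: [[2.5, 6.5], [10.5, 14.5]]
```

Both operations halve the height and width of the feature map and have no learnable parameters.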

1.2 Calculation process of convolution

Suppose the input is a 5*5*1 image, and the 3*3*1 matrix in the middle is a convolution kernel we have defined (put simply, it can be regarded as a matrix operator). The green result is obtained by applying the convolution kernel to the original input image. What kind of operation is it? It is actually very simple. Look at the dark region in the left picture: the number in the middle of each cell is the image pixel, and the number in the lower right corner is the corresponding value of the convolution kernel. Multiply each pair of corresponding numbers and add up the products to get the result. For example, '3*0+1*1+2*2+2*2+0*2+0*0+2*0+0*1+0*2=9' in the picture.
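The multiply-and-add step can be reproduced with NumPy. In this sketch the 3x3 window and kernel values are read off the arithmetic example above, while the 5x5 image and the helper name `conv2d_valid` are hypothetical. As in most deep learning libraries, the kernel is applied without flipping.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and multiply-add."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            window = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(window * kernel)
    return out


# Pixel values and kernel values taken from the worked example in the text.
window = np.array([[3, 1, 2],
                   [2, 0, 0],
                   [2, 0, 0]])
kernel = np.array([[0, 1, 2],
                   [2, 2, 0],
                   [0, 1, 2]])
print(conv2d_valid(window, kernel))  # a single value: 9

# Hypothetical 5x5 image: the 3x3 kernel fits in 3x3 positions.
image = np.arange(25).reshape(5, 5)
print(conv2d_valid(image, kernel).shape)  # (3, 3)
```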

The calculation process is as follows:

The three input matrices on the far left of the figure correspond to an input depth of d=3, that is, three input channels. Each input channel has its own convolution kernel. There are two output feature maps, which means the output depth is set to d=2. There are as many filters as output channels (Filter W0 and Filter W1 in the figure), so the total number of convolution kernels is the input depth multiplied by the output depth (2*3=6 in the figure). Each channel is computed in the same way as the single-channel case described above, and the per-channel results are then added together to give the green output value.
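The kernel count and the per-channel summation can be verified in PyTorch. This is a sketch with assumptions: the input is a random, hypothetical 3-channel 5x5 tensor with stride 1 and no padding, which may differ from the sizes in the figure, and `nn.Conv2d` adds a bias term per output channel by default.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)
x = torch.randn(1, 3, 5, 5)  # hypothetical 3-channel 5x5 input

# Weight layout is (output channels, input channels, kernel height, kernel width),
# so there are 2 * 3 = 6 kernels of size 3x3.
print(conv.weight.shape)

with torch.no_grad():
    full = conv(x)  # shape (1, 2, 3, 3)
    # Rebuild output channel 0: convolve each input channel with its own 3x3
    # kernel, sum the three results, then add the bias of that output channel.
    manual = sum(
        F.conv2d(x[:, c:c + 1], conv.weight[0:1, c:c + 1])
        for c in range(3)
    ) + conv.bias[0]

print(torch.allclose(full[:, 0:1], manual, atol=1e-5))
```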

Stride (step size): the distance the convolution kernel moves at each step

Output feature size calculation: once the convolution computation is understood, the size of the output feature map can be calculated. As shown in the figure below, convolving a 5×5 image with a 3×3 convolution kernel (stride 1, no padding) gives a 3×3 output feature map.
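Stated as a formula, with $N$ the input side length, $K$ the kernel size, $S$ the stride, and $O$ the output side length (symbols introduced here only for illustration), the size without padding is:

```latex
O = \left\lfloor \frac{N - K}{S} \right\rfloor + 1,
\qquad \text{e.g. } N = 5,\ K = 3,\ S = 1 \;\Rightarrow\; O = \frac{5 - 3}{1} + 1 = 3 .
```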

Zero padding

When the convolution kernel size is greater than 1, the output feature map is smaller than the input image, and after several convolutions the output keeps shrinking. To avoid this, the border of the image is usually padded, as shown in the figure below.

All-zero padding (padding): to keep the output the same size as the input image, zeros are often padded around the input image. As shown below, if zeros are padded around the 5×5 input image, the output feature size is also 5×5.

The cases padding=1 and padding=2 are shown in the figure below:
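The effect of padding on the output size can be computed directly. This is a small sketch of the standard size rule for square inputs and kernels; the function name `conv_output_size` is made up for this illustration.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Side length of the output feature map for an n x n input and a k x k kernel."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(5, 3))              # 3: no padding, the output shrinks
print(conv_output_size(5, 3, padding=1))   # 5: one ring of zeros keeps the size
print(conv_output_size(5, 3, padding=2))   # 7: two rings make the output larger
print(conv_output_size(28, 5, padding=2))  # 28: a 5x5 kernel needs padding=2
```

The last line corresponds to a 28×28 MNIST image convolved with a 5×5 kernel and padding=2, which leaves the spatial size unchanged.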

2 Using CNN to realize MNIST handwritten digit recognition

The process of machine image recognition: a machine does not recognize a complex picture all at once. Instead, it divides the complete picture into many small parts, extracts the features of each part, and then combines these features to recognize the whole image.

2.1 Introduction to MNIST data

The MNIST dataset is a large database of handwritten digits built from data collected by the National Institute of Standards and Technology (NIST). It contains a training set of 60,000 examples and a test set of 10,000 examples, and each image is 28*28 pixels. Sample data is displayed as follows:

2.2 Code implementation based on PyTorch

import torch
import torch.nn as nn
import torchvision.datasets as dataset
import torchvision.transforms as transforms
import torch.utils.data as data_utils
import matplotlib.pyplot as plt
import numpy as np


# Load the dataset
train_data=dataset.MNIST(root="./data",
                         train=True,
                         transform=transforms.ToTensor(),
                         download=True
                         )
test_data=dataset.MNIST(root="./data",
                         train=False,
                         transform=transforms.ToTensor(),
                         download=False
                         )
train_loader=data_utils.DataLoader(dataset=train_data, batch_size=64, shuffle=True)
test_loader=data_utils.DataLoader(dataset=test_data, batch_size=64, shuffle=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Build the network
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # 1 input channel -> 32 feature maps; padding=2 keeps the 28x28 size
        self.conv=nn.Conv2d(1, 32, kernel_size=5, padding=2)
        self.bat2d=nn.BatchNorm2d(32)
        self.relu=nn.ReLU()
        # 2x2 max pooling halves the feature maps to 14x14
        self.pool=nn.MaxPool2d(2)
        self.linear=nn.Linear(14 * 14 * 32, 70)
        self.tanh=nn.Tanh()
        self.linear1=nn.Linear(70,30)
        self.linear2=nn.Linear(30, 10)
    def forward(self,x):
        y=self.conv(x)
        y=self.bat2d(y)
        y=self.relu(y)
        y=self.pool(y)
        # Flatten to (batch, 14*14*32) for the fully connected layers
        y=y.view(y.size()[0],-1)
        y=self.linear(y)
        y=self.tanh(y)
        y=self.linear1(y)
        y=self.tanh(y)
        y=self.linear2(y)
        return y
cnn=Net()
cnn = cnn.to(device)

# Loss function
los=torch.nn.CrossEntropyLoss()

# Optimizer
optime=torch.optim.Adam(cnn.parameters(), lr=0.001)

# Train the model
accuracy_rate = [0]
num_epochs = 10
for epo in range(num_epochs):
    cnn.train()  # BatchNorm uses batch statistics while training
    for i, (images,lab) in enumerate(train_loader):
        images=images.to(device)
        lab=lab.to(device)
        out = cnn(images)
        loss=los(out,lab)
        optime.zero_grad()
        loss.backward()
        optime.step()
    # Loss of the last training batch in this epoch
    print("epo:{},i:{},loss:{}".format(epo+1,i,loss.item()))

    # Evaluate the model on the test set
    cnn.eval()  # BatchNorm switches to its running statistics
    loss_test=0
    accuracy=0
    with torch.no_grad():
        for j, (images_test,lab_test) in enumerate(test_loader):
            images_test = images_test.to(device)
            lab_test=lab_test.to(device)
            out1 = cnn(images_test)
            loss_test+=los(out1,lab_test).item()
            _,p=out1.max(1)
            accuracy += (p==lab_test).sum().item()

        # Average the per-batch losses once, after the loop
        loss_test=loss_test/len(test_loader)
        accuracy=accuracy/len(test_data)
        accuracy_rate.append(accuracy)
        print("loss_test:{},accuracy:{}".format(loss_test,accuracy))


accuracy_rate = np.array(accuracy_rate)
times = np.linspace(0, num_epochs, num_epochs+1)
plt.xlabel('times')
plt.ylabel('accuracy rate')
plt.plot(times, accuracy_rate)
plt.show()
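The input size 14 * 14 * 32 of the first fully connected layer in Net follows from the layer shapes. The sketch below traces a dummy batch through stand-alone copies of the convolution and pooling layers; it is an illustration only and is not part of the training script.

```python
import torch
import torch.nn as nn

# Stand-alone copies of the first layers of Net, used only for tracing shapes.
conv = nn.Conv2d(1, 32, kernel_size=5, padding=2)
pool = nn.MaxPool2d(2)

x = torch.zeros(64, 1, 28, 28)  # dummy batch with the MNIST image shape
with torch.no_grad():
    after_conv = conv(x)
    after_pool = pool(after_conv)
    flat = after_pool.view(after_pool.size(0), -1)

print(after_conv.shape)  # torch.Size([64, 32, 28, 28])
print(after_pool.shape)  # torch.Size([64, 32, 14, 14])
print(flat.shape)        # torch.Size([64, 6272]), since 14 * 14 * 32 = 6272
```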

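Running result. Note that these loss_test values were logged with the test loss divided by a constant on every batch, so they are much smaller than an average per-batch loss: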

epo:1,i:937,loss:0.2277517020702362
loss_test:0.0017883364344015718,accuracy:0.9729
epo:2,i:937,loss:0.01490325853228569
loss_test:9.064914047485217e-05,accuracy:0.9773
epo:3,i:937,loss:0.0903361514210701
loss_test:0.0003304268466308713,accuracy:0.9791
epo:4,i:937,loss:0.003910894505679607
loss_test:0.00019427068764343858,accuracy:0.9845
epo:5,i:937,loss:0.011963552795350552
loss_test:3.232352901250124e-05,accuracy:0.983
epo:6,i:937,loss:0.04549657553434372
loss_test:0.0001462855434510857,accuracy:0.9859
epo:7,i:937,loss:0.02365218661725521
loss_test:3.670657861221116e-06,accuracy:0.9867
epo:8,i:937,loss:0.00040980291669256985
loss_test:1.4913265658833552e-05,accuracy:0.9872
epo:9,i:937,loss:0.024399513378739357
loss_test:7.590289897052571e-05,accuracy:0.9865
epo:10,i:937,loss:0.0012365489965304732
loss_test:0.00014759502664674073,accuracy:0.9869

3 Summary

This article introduces the key concepts of convolutional neural networks, including the convolution kernel, pooling, batch normalization, and the receptive field, and builds a CNN recognition model on the MNIST dataset. After 10 epochs of training, the test accuracy reaches about 98.7%, which demonstrates the effectiveness of convolutional neural networks in image recognition.

Deep Learning: MNIST Handwritten Digit Recognition Using Convolutional Neural Network CNN