Handwritten Digit Recognition Based on a Convolutional Neural Network (CNN) in PyTorch

Table of contents

1.1 Introduction to Convolutional Neural Networks

1.2 Neural Network

1.2.1 Neuron model

 1.2.2 Neural network model

1.3 Convolutional Neural Networks

1.3.1 The concept of convolution

1.3.2 Calculation process of convolution

1.3.3 Receptive field

1.3.4 Step size

1.3.5 Calculation of output feature size

 1.3.6 All zero padding

1.3.7 Standardization

1.3.8 Pooling layer

 1.4 The whole process of convolutional neural network

 1.5 PyTorch's convolutional neural network (cnn) handwritten digit recognition

1.5.1 Code


1.1 Introduction to Convolutional Neural Networks

A convolutional neural network (CNN) is one of the most important network architectures in deep learning. It is used mainly in image processing, video processing, audio processing, and natural language processing.
The concept of the convolutional neural network was proposed as early as the 1980s, but it only rose to prominence in the 21st century. As deep learning theory matured and improvements in computer hardware made far more computing power available, convolutional neural network algorithms finally found room for practical application. Well-known systems such as AlphaGo and face recognition on mobile phones use convolutional neural networks, so it is fair to say that they play a pivotal role in today's deep learning field.

Before studying convolutional neural networks, we need to know what a neural network is. This was already covered in the second part of the introduction to deep learning, so I won't go into detail here. With that foundation in place, we can ask: what is a convolutional neural network, and what does the word "convolution" mean?
 

1.2 Neural Network

1.2.1 Neuron model


Research on artificial neural networks began long ago. Today, "neural network" is a large, interdisciplinary field, and the related disciplines define the term in different ways. One common definition is: "a neural network is an extensively parallel, interconnected network composed of simple units, whose organization can simulate the interactive response of the biological nervous system to real-world objects".
The most basic component of a neural network is the neuron model, the "simple unit" in the definition above. In a biological neural network, each neuron is connected to other neurons. When a neuron is "excited", it sends chemical substances to the neurons it is connected to, changing the potential in those neurons. If the potential of a neuron exceeds a "threshold", it is activated, that is, it becomes "excited" and in turn sends chemicals to other neurons. In the neuron model, a neuron receives input signals from n other neurons. These signals are transmitted through weighted connections, the total input received by the neuron is compared with the neuron's threshold, and the result is passed through an activation function to produce the neuron's output.
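The neuron model described above can be sketched in a few lines of plain Python. This is only an illustration: the sigmoid activation, the function names, and the numeric weights and threshold are assumptions chosen for the example, not values from this article.

```python
import math

def sigmoid(z):
    """Logistic activation: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, threshold, activation=sigmoid):
    """Weighted sum of the inputs, compared with the threshold, then activated."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return activation(total - threshold)

# Three input signals with hypothetical weights and a hypothetical threshold.
y = neuron_output([1.0, 0.5, -1.0], [0.4, 0.6, 0.2], threshold=0.5)
print(round(y, 4))  # weighted sum equals the threshold, so the sigmoid gives 0.5
```

Raising a weight on a positive input pushes the total above the threshold and the output toward 1; lowering it pushes the output toward 0.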

 

 1.2.2 Neural network model

A neural network is a computational model made up of a large number of interconnected nodes (neurons). Each node represents a specific output function, called the activation function. Each connection between two nodes carries a weighting value for the signal passing through it, called the weight; the weights act as the memory of the artificial neural network. The output of the network depends on how the network is connected, on the weight values, and on the activation function. The network itself is usually an approximation of some algorithm or function found in nature, or an expression of a logical strategy.
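To make the idea of nodes, weights, and activation functions concrete, here is a minimal sketch of a forward pass through a tiny fully connected network. The layer sizes, the tanh activation, and all weight values are hypothetical and serve only as an illustration.

```python
import math

def tanh_layer(inputs, weights, biases):
    """One fully connected layer: each node takes a weighted sum, then tanh."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Pass the input through each (weights, biases) layer in turn."""
    for weights, biases in layers:
        x = tanh_layer(x, weights, biases)
    return x

# Hypothetical network: 2 inputs -> 3 hidden nodes -> 1 output node.
layers = [
    ([[0.5, -0.5], [1.0, 1.0], [0.0, 2.0]], [0.0, 0.0, 0.0]),
    ([[1.0, -1.0, 0.5]], [0.0]),
]
print(forward([1.0, 2.0], layers))
```

Changing any weight, the connection pattern, or the activation function changes the output, which is exactly the dependence described above.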
 

1.3 Convolutional Neural Networks

1.3.1 The concept of convolution

The difference between a convolutional neural network and an ordinary neural network is that a CNN contains a feature extractor composed of convolutional layers and sub-sampling (pooling) layers. In a convolutional layer, a neuron is connected only to some of the neurons in the adjacent layer. A convolutional layer usually contains several feature maps. Each feature map consists of neurons arranged in a rectangle, and the neurons of the same feature map share weights; these shared weights are the convolution kernel. The kernel is generally initialized as a matrix of small random decimals and learns reasonable weights as the network is trained. The direct benefit of sharing weights (the convolution kernel) is fewer connections between the layers of the network, which also reduces the risk of overfitting. Sub-sampling is also called pooling and usually takes one of two forms: mean pooling and max pooling. It can be seen as a special kind of convolution. Together, convolution and sub-sampling greatly reduce model complexity and the number of model parameters.
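A rough way to see how much weight sharing saves is to count parameters. The sketch below compares a fully connected layer with a convolutional layer that produces the same number of output values. The helper names are hypothetical; the sizes (a 28*28 single-channel input, 32 feature maps, a 5*5 kernel) are borrowed from the code in section 1.5.1.

```python
def dense_params(in_features, out_features):
    """Fully connected layer: one weight per input/output pair plus one bias per output."""
    return in_features * out_features + out_features

def conv_params(in_channels, out_channels, kernel_size):
    """Convolutional layer: one shared kernel per input/output channel pair plus one bias per output channel."""
    return in_channels * out_channels * kernel_size * kernel_size + out_channels

# Map a 28x28 single-channel image to 32 feature maps of 28x28.
dense = dense_params(28 * 28, 32 * 28 * 28)
conv = conv_params(1, 32, 5)
print(dense)  # 19694080
print(conv)   # 832
```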

1.3.2 Calculation process of convolution

Suppose the input is a 5*5*1 image and we define a 3*3*1 convolution kernel (put simply, a matrix operator). Applying the kernel to the input image produces the output shown in green in the figure. The operation itself is simple. Look at the dark region in the left picture: the number in the middle of each cell is the image pixel, and the number in the lower right corner is the corresponding kernel value. Multiply each pixel by its kernel value and add up the products to get the result. For example, in the picture, 3*0+1*1+2*2+2*2+0*2+0*0+2*0+0*1+0*2=9.
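The multiply-and-add procedure can be written out directly. The sketch below slides a 3*3 kernel over a 5*5 image with a step size of 1 and no padding. The concrete image and kernel values are assumptions: the figure's full contents are not reproduced in the text, so the numbers were chosen such that the bottom-left window reproduces the 3*0+1*1+...=9 example.

```python
def conv2d_single(image, kernel):
    """Slide the kernel over the image (step size 1, no padding), multiplying and summing."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(cols)]
            for i in range(rows)]

# Hypothetical 5x5 input and 3x3 kernel (see the note above).
image = [
    [3, 3, 2, 1, 0],
    [0, 0, 1, 3, 1],
    [3, 1, 2, 2, 3],
    [2, 0, 0, 2, 2],
    [2, 0, 0, 0, 1],
]
kernel = [
    [0, 1, 2],
    [2, 2, 0],
    [0, 1, 2],
]
for row in conv2d_single(image, kernel):
    print(row)
# [12, 12, 17]
# [10, 17, 19]
# [9, 6, 14]
```

Strictly speaking, this sliding multiply-and-add without flipping the kernel is cross-correlation, which is also what deep learning frameworks compute under the name "convolution".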

The calculation process is as follows:

The three input matrices on the far left of the figure are the three channel maps of the input, that is, the input depth is d=3. Each channel map has a convolution kernel of its own. There are only two output feature maps, which means the output depth is set to d=2. There is one filter for every output channel (Filter W0 and Filter W1 in the figure), and each filter contains one kernel per input channel, so the total number of convolution kernels is the input depth multiplied by the output depth (2*3=6 in the figure). Each channel map is convolved exactly as in the single-layer case above, and the per-channel results are then added together to give the green output values.
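The multi-channel case can be sketched by reusing the single-channel routine once per kernel and summing across input channels. The input and kernel values below are made up for illustration, and any bias term the figure may show is left out because the text does not describe one.

```python
def conv2d_single(image, kernel):
    """Single-channel convolution (step size 1, no padding)."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(cols)]
            for i in range(rows)]

def conv2d_multi(channels, filters):
    """channels: list of input maps; filters: one list of per-channel kernels per output map."""
    outputs = []
    for kernels in filters:
        per_channel = [conv2d_single(ch, k) for ch, k in zip(channels, kernels)]
        rows, cols = len(per_channel[0]), len(per_channel[0][0])
        # Add the per-channel results element by element to get one output map.
        outputs.append([[sum(m[i][j] for m in per_channel) for j in range(cols)]
                        for i in range(rows)])
    return outputs

# Hypothetical data: input depth 3 (3x3 maps of ones), output depth 2 (2x2 kernels).
channels = [[[1, 1, 1] for _ in range(3)] for _ in range(3)]
filter_w0 = [[[1, 1], [1, 1]] for _ in range(3)]
filter_w1 = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]
filters = [filter_w0, filter_w1]

outputs = conv2d_multi(channels, filters)
print(len(outputs))                   # 2 output feature maps
print(sum(len(f) for f in filters))   # 6 kernels in total (2 * 3)
print(outputs[0])                     # [[12, 12], [12, 12]]
```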

1.3.3 Receptive field

Receptive field: the size of the region of the original image that each pixel of a given output layer of the convolutional neural network maps back to.
The following figure is a schematic diagram of the receptive field:

Convolution kernels of different sizes differ mainly in the size of their receptive field. For this reason, several layers of small kernels are often used in place of one layer of large kernels, which reduces the number of parameters and the amount of computation while keeping the receptive field the same.
For example, it is very common to replace one layer of 5*5 convolution kernels with two layers of 3*3 convolution kernels, as shown in the figure below.
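The trade-off can be checked with a small calculation. The sketch below computes the receptive field of a stack of convolution layers and counts kernel weights, assuming a single input and output channel and a step size of 1 unless stated otherwise. Both helper functions are hypothetical names introduced for this example.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field on the original image of a stack of convolution layers."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    field, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        field += (k - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= s                # distance on the original image between adjacent outputs
    return field

def kernel_weights(kernel_sizes):
    """Number of kernel weights, assuming one input and one output channel per layer."""
    return sum(k * k for k in kernel_sizes)

print(receptive_field([3, 3]), kernel_weights([3, 3]))  # 5 18
print(receptive_field([5]), kernel_weights([5]))        # 5 25
```

Both options see a 5*5 region of the original image, but the two 3*3 layers need 18 weights instead of 25.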

1.3.4 Step size

Step size (stride): the number of pixels the convolution kernel moves at each step.

1.3.5 Calculation of output feature size

Output feature size calculation: once the convolution calculation is understood, the size of the output feature map can be calculated. With a step size of 1 and no padding, an input of size W and a kernel of size K give an output of size W-K+1. As shown in the figure below, convolving a 5×5 image with a 3×3 convolution kernel gives a 3×3 output feature map (5-3+1=3).

 1.3.6 All zero padding

When the convolution kernel is larger than 1×1, the output feature map is smaller than the input image, and after several convolutions the size keeps shrinking. To keep the image from shrinking after convolution, padding is usually added around the border of the image, as shown in the figure below.

All-zero padding: to keep the output size the same as the input size, the input image is often surrounded with zeros. As shown below, if a 5×5 input image is padded with a border of zeros, the output feature size after a 3×3 convolution is also 5×5.
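Step size, kernel size, and padding together determine the output size. A minimal sketch of the standard formula, floor((W + 2P - K) / S) + 1, is given below; the function name is hypothetical. The last two lines apply it to the layer sizes used in the code of section 1.5.1.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling window."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(5, 3))              # 3: 5x5 image, 3x3 kernel, no padding
print(conv_output_size(5, 3, padding=1))   # 5: zero padding keeps the size
print(conv_output_size(28, 5, padding=2))  # 28: the 5x5 convolution on MNIST
print(conv_output_size(28, 2, stride=2))   # 14: 2x2 max pooling with step size 2
```

The final value explains the 14 * 14 * 32 input size of the first linear layer in the network below.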

The cases padding=1 and padding=2 are shown in the figure below:

1.3.7 Standardization

Standardization: fit the data to a distribution with a mean of 0 and a standard deviation of 1.
Batch Normalization: standardize a small batch of data (a batch).

Batch Normalization adjusts the input of each layer of the neural network so that it has a mean of 0 and a variance of 1. Its purpose is to alleviate the vanishing gradient problem in the neural network.

Another important step in the BN operation is scaling and shifting. Note that both the scaling factor γ and the shifting factor β are trainable parameters.
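The two BN steps, standardizing and then scaling and shifting, can be sketched for a single feature over one batch. This is a simplified illustration of the training-time computation only: the small constant eps (added for numerical stability) and the sample values are assumptions, and the running statistics a real BN layer keeps for inference are omitted.

```python
import math

def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

batch = [1.0, 2.0, 3.0, 4.0]
normalized = batch_norm(batch)
print(normalized)                               # mean close to 0, variance close to 1
print(batch_norm(batch, gamma=2.0, beta=3.0))   # same shape, rescaled and shifted
```

During training, γ and β are updated by gradient descent like any other weight, so the network can undo the standardization where that helps.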

1.3.8 Pooling layer

Pooling is used to reduce the amount of feature data.
Max pooling tends to extract image texture, while mean pooling tends to preserve background features.
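Both forms of pooling can be sketched with one helper that splits the feature map into non-overlapping windows. The function name and the 4*4 sample values are hypothetical.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling: reduce each size x size window to a single value."""
    reduce = max if mode == "max" else (lambda window: sum(window) / len(window))
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    out = []
    for i in range(rows):
        row = []
        for j in range(cols):
            window = [feature_map[i * size + di][j * size + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out

feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(pool2d(feature_map, mode="max"))   # [[6, 4], [8, 9]]
print(pool2d(feature_map, mode="mean"))  # [[3.75, 2.25], [4.5, 4.0]]
```

A 2*2 window halves the width and height, so the amount of feature data drops to a quarter.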

 1.4 The whole process of convolutional neural network

 1.5 PyTorch's convolutional neural network (cnn) handwritten digit recognition

The framework used is PyTorch.

Dataset: the MNIST dataset, with 60,000 training images and 10,000 test images; each image is 28*28 pixels.

Available at http://yann.lecun.com/exdb/mnist/

1.5.1 Code

import torch
import torch.nn as nn
import torchvision.datasets as dataset
import torchvision.transforms as transforms
import torch.utils.data as data_utils

# Load the dataset (downloaded into the root directory on first use)
train_data = dataset.MNIST(root="D",
                           train=True,
                           transform=transforms.ToTensor(),
                           download=True)
test_data = dataset.MNIST(root="D",
                          train=False,
                          transform=transforms.ToTensor(),
                          download=False)
train_loader = data_utils.DataLoader(dataset=train_data, batch_size=100, shuffle=True)
# Shuffling is not needed for evaluation
test_loader = data_utils.DataLoader(dataset=test_data, batch_size=100, shuffle=False)

# Build the network
class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # 1 input channel -> 32 feature maps; padding=2 keeps the 28x28 size
        self.conv = nn.Conv2d(1, 32, kernel_size=5, padding=2)
        self.bat2d = nn.BatchNorm2d(32)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)  # 28x28 -> 14x14
        self.linear = nn.Linear(14 * 14 * 32, 70)
        self.tanh = nn.Tanh()
        self.linear1 = nn.Linear(70, 30)
        self.linear2 = nn.Linear(30, 10)  # 10 digit classes

    def forward(self, x):
        y = self.conv(x)
        y = self.bat2d(y)
        y = self.relu(y)
        y = self.pool(y)
        y = y.view(y.size(0), -1)  # flatten to (batch, 14 * 14 * 32)
        y = self.linear(y)
        y = self.tanh(y)
        y = self.linear1(y)
        y = self.tanh(y)
        y = self.linear2(y)
        return y

cnn = Net()
cnn = cnn.cuda()  # requires a CUDA-capable GPU

# Loss function
los = torch.nn.CrossEntropyLoss()

# Optimizer
optime = torch.optim.Adam(cnn.parameters(), lr=0.01)

# Train the model
cnn.train()  # BatchNorm uses batch statistics during training
for epo in range(10):
    for i, (images, lab) in enumerate(train_loader):
        images = images.cuda()
        lab = lab.cuda()
        out = cnn(images)
        loss = los(out, lab)
        optime.zero_grad()
        loss.backward()
        optime.step()
        print("epo:{},i:{},loss:{}".format(epo + 1, i, loss.item()))

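# Test the model
cnn.eval()  # BatchNorm uses its running statistics during evaluation
loss_test = 0.0
correct = 0
with torch.no_grad():
    for j, (images_test, lab_test) in enumerate(test_loader):
        images_test = images_test.cuda()
        lab_test = lab_test.cuda()
        out1 = cnn(images_test)
        loss_test += los(out1, lab_test).item()
        _, p = out1.max(1)  # index of the largest score is the predicted digit
        correct += (p == lab_test).sum().item()
# Average only once, after all batches have been accumulated
loss_test = loss_test / len(test_loader)
accuracy = correct / len(test_data)
print("loss_test:{},accuracy:{}".format(loss_test, accuracy))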

Origin blog.csdn.net/m0_53675977/article/details/128240310