Paddle-Based Computer Vision Introductory Tutorial - Lecture 7 Practice: Handwritten Digit Recognition

Bilibili tutorial address

https://www.bilibili.com/video/BV18b4y1J7a6/

Task introduction

Handwritten digit recognition is a classic computer vision project. Because handwriting varies so much from person to person, it is hard to find common features of the digits with traditional computer vision techniques, and in the early days of the field handwritten digit recognition was a genuinely hard problem.

From the perspective of the visual task taxonomy we covered earlier, handwritten digit recognition is a typical classification task: a picture goes in and one of ten classes comes out. It also has many real-world applications. As shown in the figure below, recognizing postal codes on envelopes greatly advanced industrial automation, and the accuracy reached with convolutional neural networks can even surpass that of humans.

The task is to build a model that, given a picture of a handwritten digit, outputs the correct classification result. Working through such a project helps us consolidate operations we covered before, such as convolution and pooling, and review the basic workflow of deep learning.

Data preparation

Handwritten digit recognition has a standard dataset, MNIST, which contains tens of thousands of labeled handwritten digits and is already split into a training set and an evaluation set. If we visualize one of the pictures, it looks like this:

The shape of the image is **(1, 28, 28), which means it is a single-channel image**. The image is only 28×28 pixels, and its label is 7.
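To check this yourself, here is a minimal sketch using Paddle's built-in MNIST dataset and the ToTensor transform to inspect a single sample:

from paddle.vision.datasets import MNIST
from paddle.vision.transforms import ToTensor

train_set = MNIST(mode='train', transform=ToTensor())
img, label = train_set[0]          # first sample of the training set
print(img.shape)                   # [1, 28, 28], a single-channel 28x28 image
print(label)                       # the digit annotation for this sample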

Usually, for a real project, you need to write a DataLoader yourself that loads the data in order, returns images and annotations, and feeds them to the training interface. Since this is an introductory tutorial, we use the ready-made API directly. Interested students can try downloading the archive and writing a DataLoader by hand instead of using the high-level API.

train_loader = paddle.io.DataLoader(MNIST(mode='train', transform=ToTensor()), batch_size=10, shuffle=True)
valid_loader = paddle.io.DataLoader(MNIST(mode='test', transform=ToTensor()), batch_size=10)

Through the API above, we have loaded the training set and evaluation set, which the training loop can then consume.
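As a quick sanity check (a small sketch, assuming the two loaders defined above), you can pull one batch and look at its shapes:

for imgs, labels in train_loader():
    print(imgs.shape, labels.shape)   # [10, 1, 28, 28] and [10, 1] with batch_size=10
    break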

Network construction

After preparing the data, the second step is to build the convolutional neural network. The network directly determines the accuracy of the model, so this is the most critical step. In this hands-on project we use LeNet. LeNet is one of the earliest convolutional neural networks, born in 1998, and was highly successful on handwritten digit recognition tasks.

Its network structure is also very simple: basically a convolutional layer followed by a pooling layer, and finally two fully connected layers that output a [1, 10] matrix. We have not introduced the fully connected layer before; it is commonly used to fit data, for example to fit a curve through a set of scattered points. Its structure is as follows:

That is to say, each output is connected to all outputs of the previous layer, and mathematically it simply multiplies the input by a transformation matrix and adds a bias to obtain the output matrix. Why are convolutional layers used extensively on images while fully connected layers are rarely used there? I leave that for you to think about after class.
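To make the "multiply by a matrix and add a bias" point concrete, here is a minimal sketch with Paddle's Linear layer (the sizes are chosen arbitrarily for illustration):

import paddle
from paddle.nn import Linear

fc = Linear(in_features=120, out_features=64)
x = paddle.randn([1, 120])
y = fc(x)                                        # [1, 64]
# the same computation by hand: multiply by the weight matrix, then add the bias
y_manual = paddle.matmul(x, fc.weight) + fc.bias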

The code to reproduce LeNet with Paddle is as follows:

import paddle
import numpy as np
from paddle.nn import Conv2D, MaxPool2D, Linear
import paddle.nn.functional as F

# define the LeNet network structure
class LeNet(paddle.nn.Layer):
    def __init__(self, num_classes=1):
        super(LeNet, self).__init__()
        self.conv1 = Conv2D(in_channels=1, out_channels=6, kernel_size=5)
        self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
        self.conv2 = Conv2D(in_channels=6, out_channels=16, kernel_size=5)
        self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
        self.conv3 = Conv2D(in_channels=16, out_channels=120, kernel_size=4)
        self.fc1 = Linear(in_features=120, out_features=64)
        self.fc2 = Linear(in_features=64, out_features=num_classes)
    def forward(self, x):                        #[N,1,28,28] 
        x = self.conv1(x)                        #[N,6,24,24]
        x = F.sigmoid(x)                         #[N,6,24,24]
        x = self.max_pool1(x)                    #[N,6,12,12]
        x = F.sigmoid(x)                         #[N,6,12,12]
        x = self.conv2(x)                        #[N,16,8,8]
        x = self.max_pool2(x)                    #[N,16,4,4]
        x = self.conv3(x)                        #[N,120,1,1]
        x = paddle.reshape(x, [x.shape[0], -1])  #[N,120]
        x = self.fc1(x)                          #[N,64]
        x = F.sigmoid(x)                         #[N,64]
        x = self.fc2(x)                          #[N,10]
        return x

The dynamic-graph style that Paddle uses is very clear: define a class and declare the layers you need in the initialization function. Pay special attention to the numbers of input and output channels and to the kernel sizes, otherwise dimension errors will occur. After that, we write the forward function, which is the actual computation that runs when we pass in an image.
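One convenient way to catch such dimension errors early is to print a layer-by-layer summary; a small sketch, assuming the LeNet class above:

model = LeNet(num_classes=10)
paddle.summary(model, (1, 1, 28, 28))   # prints every layer's output shape and parameter count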

To help everyone understand, let me walk through the execution process in detail. First we instantiate the class.

model = LeNet(num_classes=10)

When the class is instantiated, its __init__() function runs automatically, and __init__() in turn instantiates Conv2D and MaxPool2D. These are themselves classes that, like LeNet, have their own __init__() and forward functions and get instantiated inside the initialization function. Instantiation does not start any computation; it only declares the layers we want to use.

output = model(img)

When we run the line above, we are calling the object with img as its input. At this point the object's __call__() method runs automatically, so why does the forward function execute? The reason is that all of these layers inherit from the parent class paddle.nn.Layer, whose __call__() invokes forward, so calling the LeNet object automatically calls its forward function. That is when the real computation begins.
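The mechanism is easy to see in a minimal sketch of the pattern (this is an illustration, not Paddle's actual source code):

class Layer:
    def __call__(self, *inputs):
        # the real framework also does bookkeeping (hooks, etc.) here
        return self.forward(*inputs)   # __call__ dispatches to forward

    def forward(self, *inputs):
        raise NotImplementedError      # subclasses such as LeNet override this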

I hope everyone goes over this process repeatedly until it is thoroughly understood. It is not hard to see that networks built in this form can be nested arbitrarily, which keeps the code very clear; this advantage will show when we look at more complex models later.

Model training

# -*- coding: utf-8 -*-
# LeNet for handwritten digit recognition
import paddle
import numpy as np
from model import LeNet
from paddle.vision.transforms import ToTensor
from paddle.vision.datasets import MNIST


def train(model, opt, train_loader, valid_loader):
    use_gpu = True
    paddle.device.set_device('gpu:0' if use_gpu else 'cpu')
    print('start training ... ')
    model.train()
    for epoch in range(EPOCH_NUM):
        for batch_id, data in enumerate(train_loader()):
            img = data[0]              #[10,1,28,28]
            label = data[1]            #[10,1]
            # compute the model output
            logits = model(img)
            # compute the loss
            loss_func = paddle.nn.CrossEntropyLoss(reduction='none')
            loss = loss_func(logits, label)
            avg_loss = paddle.mean(loss)

            if batch_id % 500 == 0:
                print("epoch: {}, batch_id: {}, loss is: {:.4f}".format(epoch+1, batch_id, float(avg_loss.numpy())))
            avg_loss.backward()
            opt.step()
            opt.clear_grad()

        model.eval()
        accuracies = []
        losses = []
        for batch_id, data in enumerate(valid_loader()):
            img = data[0]
            label = data[1] 
            # compute the model output
            logits = model(img)
            # compute the loss
            loss_func = paddle.nn.CrossEntropyLoss(reduction='none')
            loss = loss_func(logits, label)
            acc = paddle.metric.accuracy(logits, label)
            accuracies.append(acc.numpy())
            losses.append(loss.numpy())
        print("[validation] accuracy/loss: {:.4f}/{:.4f}".format(np.mean(accuracies), np.mean(losses)))
        model.train()

    # save model parameters
    paddle.save(model.state_dict(), 'mnist.pdparams')


model = LeNet(num_classes=10)
EPOCH_NUM = 5
opt = paddle.optimizer.Momentum(learning_rate=0.001, parameters=model.parameters())
train_loader = paddle.io.DataLoader(MNIST(mode='train', transform=ToTensor()), batch_size=10, shuffle=True)
valid_loader = paddle.io.DataLoader(MNIST(mode='test', transform=ToTensor()), batch_size=10)
train(model, opt, train_loader, valid_loader)

With what we have learned, the training code is straightforward: fetch batches from the dataset interface, feed the images into the model, the model produces a prediction, the CrossEntropyLoss loss function computes the loss between the prediction and the ground-truth label, the loss is back-propagated to the network parameters, and finally the optimizer updates the parameters to reduce the loss.

Note that the CrossEntropyLoss loss function includes softmax. At the end of a classification problem, a softmax activation is needed to map the [1, 10] output into the range [0, 1] with the 10 numbers summing to 1, which then represents the probability that the picture is each digit 0-9.
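This is also why the forward function above returns raw logits without a softmax. A small sketch of the equivalence (the tensors here are made up purely for illustration):

import paddle
import paddle.nn.functional as F

logits = paddle.randn([1, 10])                  # raw model output, no softmax applied
label = paddle.to_tensor([[7]])

# CrossEntropyLoss applies softmax internally, so it takes the raw logits
loss = paddle.nn.CrossEntropyLoss()(logits, label)

# roughly the same thing by hand: softmax, then negative log-likelihood of the true class
probs = F.softmax(logits)
manual = -paddle.log(probs[0, 7])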

Model prediction

import numpy as np
import paddle
from model import LeNet
from paddle.vision.datasets import MNIST
from paddle.vision.transforms import ToTensor
import paddle.nn.functional as F

valid_loader = MNIST(mode='test', transform=ToTensor())
img = np.array(valid_loader[0][0])

# import matplotlib.pyplot as plt
# plt.imshow(img.squeeze(), cmap='gray')
# plt.show()

model = LeNet(num_classes=10)
model_dict = paddle.load("mnist.pdparams")
model.set_state_dict(model_dict)
model.eval()
x = valid_loader[0][0].reshape((1,1,28,28)).astype('float32')
result = F.softmax(model(x))
print(result.numpy()[0])

After training the model, we need to load it and make a prediction. Here we pick one picture from the evaluation set and check whether the output is correct.

model = LeNet(num_classes=10)
model_dict = paddle.load("mnist.pdparams")
model.set_state_dict(model_dict)

We load the model in this way, and finally print the predicted output:

[7.3181213e-06 1.4578840e-05 3.3818762e-04 2.1557527e-04 2.6723552e-05 
 6.7271581e-06 1.3456239e-08 9.9840504e-01 4.1231990e-05 9.4459485e-04]

These numbers are the probabilities for the digits 0-9; the probability of 7 is as high as 99.84%, so the model's output is correct!
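If you only want the predicted digit rather than the whole probability vector, a short addition to the prediction script above is:

pred = paddle.argmax(result, axis=1).numpy()[0]
print(pred)   # expected to print 7 for this sample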

References

https://www.paddlepaddle.org.cn/tutorials/projectdetail/2227103
