Computer Vision - PaddlePaddle Deep Learning Practice - Deep Learning Network Model

The overall architecture of a deep learning network model consists of three main parts: the data set, model networking, and the learning and optimization process. This chapter introduces the algorithm architecture and common models in detail. It starts with the classic deep learning network models, represented by CNN and RNN, moves on to lightweight network designs that address problems such as limited memory and real-time performance, and ends with the cutting-edge Transformer and MLP models that have been applied to major computer vision tasks in recent years. To analyze the process of building a deep learning network model more closely, the chapter closes with a network-building case that implements the LeNet model under the PaddlePaddle deep learning framework. After studying this chapter, readers should have mastered the following knowledge points:

  1. Understand the classic network models (CNN and RNN);
  2. Be familiar with the cutting-edge network models (Transformer and MLP);
  3. Master the use of PaddlePaddle to build a deep learning network model (LeNet).

In the previous study, we should have gained a rough understanding of the current state and history of computer vision at home and abroad, the basics of deep learning algorithms, and deep learning frameworks in general. If not, it is worth reading up on those topics yourself; they are not too difficult. This article focuses on deep learning network models. It covers both the classic and the cutting-edge network models, and uses a simple example to give a general impression of how a deep learning network is built.

With the support of deep learning development frameworks, deep learning network models are constantly updated and iterated. Model architectures range from the classic convolutional neural network (CNN) and recurrent neural network (RNN) to today's Transformer and multi-layer perceptron (MLP). All of them can be viewed as building a deep learning network model through a series of choices, such as network components, activation functions, and optimization strategies, and then using a complex nonlinear mapping to transform the raw data into higher-level abstract representations.

The overall architecture of a deep learning network model consists of three main parts: the data set, model networking, and the learning and optimization process. Training a deep learning network model is an optimization process. Its most direct purpose is to find, through repeated iterative updates, the model parameters that make the loss function as small as possible. The optimization of a neural network can generally be divided into two stages. The first stage obtains the model's predicted value through forward propagation, compares it with the ground-truth label, and computes the difference between the two as the loss value. The second stage computes the gradient of the loss function with respect to each parameter through backpropagation, and updates each parameter according to the preset learning rate and momentum.
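The two stages described above can be sketched on a toy problem. The following is a minimal illustration in plain Python, not the framework's implementation: it fits a one-dimensional linear model with mean squared error, and the function name `train_linear`, the data, and the hyperparameter values are all assumptions chosen for the example.

```python
def train_linear(xs, ys, lr=0.1, momentum=0.9, steps=500):
    """Fit y = w * x + b by gradient descent with momentum (illustrative only)."""
    w, b = 0.0, 0.0      # model parameters
    vw, vb = 0.0, 0.0    # momentum (velocity) terms, one per parameter
    n = len(xs)
    losses = []
    for _ in range(steps):
        # Stage 1: forward propagation, then compare predictions with labels.
        preds = [w * x + b for x in xs]
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        losses.append(loss)
        # Stage 2: gradient of the loss with respect to each parameter.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Parameter update using the learning rate and momentum.
        vw = momentum * vw - lr * grad_w
        vb = momentum * vb - lr * grad_b
        w += vw
        b += vb
    return w, b, losses


# Toy data generated from y = 2x + 1 (hypothetical).
w, b, losses = train_linear([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(round(w, 3), round(b, 3))
```

A real network repeats the same loop, except that the gradients are obtained by automatic differentiation instead of being written out by hand.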

In short, a good network model usually has the following characteristics: 1. It is easy to train: the training steps are simple and the model converges easily; 2. It has high accuracy: it captures the intrinsic nature of the data and extracts useful key features; 3. It generalizes well: it performs well not only on known data but also on unseen data sets whose distribution is consistent with the known data.

Case:

1. Task introduction

Handwritten digit recognition is a branch of optical character recognition (OCR). It is an entry-level task for new researchers, and it also holds an important place in industry. Its main research question is how to use computers and image classification techniques to automatically identify Arabic numerals (0-9) handwritten on paper. The experimental task is illustrated in the figure:

2. Model Principle

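In recent years, neural network models have emerged one after another, flourishing across all computer vision tasks. To give developers a clearer understanding of how a network model is built, and to lay the foundation for the hands-on vision subtasks that follow, this section takes MNIST handwritten digit recognition as an example and builds a LeNet network model on the PaddlePaddle deep learning platform, with a detailed explanation.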

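LeNet was the first algorithm model to bring convolutional neural networks onto the computer vision stage. It was proposed by LeCun in 1998 and was applied early on to handwritten digit image recognition. The model uses a sequential structure with 7 layers (2 convolutional layers, 2 pooling layers, and 3 fully connected layers), with the convolutional and pooling layers alternating. Here we build a LeNet-5 model for MNIST handwritten digit classification. Each handwritten digit image is 28 pixels wide and 28 pixels high, and the sample labels are 0-9, representing the ten digits 0 through 9.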

The following is a detailed analysis of the network structure and principle of the LeNet-5 model.

Figure 1 LeNet-5 overall network model

(1) Convolutional layer L1

The input data of layer L1 lies in $\mathbb{R}^{m \times 1 \times 28 \times 28}$, which means that the batch size is m, the number of channels is 1, and the height and width are both 28. The output data of layer L1 lies in $\mathbb{R}^{m \times 6 \times 24 \times 24}$, which means that the batch size is m, the number of channels is 6, and the height and width are both 24.

There are two key questions here. First, why does the number of channels change from 1 to 6? The reason is that convolutional layer L1 has 6 convolution kernels. Each kernel operates on the input data, so 6 sets of output data (feature maps) are obtained. Second, why do the height and width change from 28 to 24? The reason is that each convolution kernel is 5×5. The kernel slides over the input data (28×28) with a stride of 1 and no padding, so the height and width of the output data are each 28-5+1=24.
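The size calculation above is an instance of the general convolution output-size formula. The sketch below is a small helper written for illustration; the name `conv_output_size` and the `padding` argument are assumptions, since the text only discusses the unpadded case.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output height (or width) of a convolution along one spatial dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# Layer L1: 28x28 input, 5x5 kernel, stride 1, no padding.
print(conv_output_size(28, 5))  # 24
```

The same helper gives 8 for a 12×12 input with a 5×5 kernel, which matches the output size of convolutional layer L3.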

(2) Pooling layer L2

The input data of layer L2 lies in $\mathbb{R}^{m \times 6 \times 24 \times 24}$, which means that the batch size is m, the number of channels is 6, and the height and width are both 24. The output data of layer L2 lies in $\mathbb{R}^{m \times 6 \times 12 \times 12}$, which means that the batch size is m, the number of channels is 6, and the height and width are both 12.

Why do the height and width change from 24 to 12? The filter in the pooling layer is 2×2. It slides over the input data (24×24) with a stride of 2, and each time the maximum of the 4 numbers in the 2×2 window is taken as the output, so the height and width of the output data are each 24÷2=12.
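To make the pooling step concrete, here is a minimal pure-Python sketch of 2×2 max pooling with stride 2 on a single channel. The function name `max_pool2d` and the 4×4 sample values are assumptions for illustration; the framework's own layer operates on whole batches of tensors.

```python
def max_pool2d(image, kernel_size=2, stride=2):
    """Max pooling over a 2-D list of numbers (one channel, no padding)."""
    rows = (len(image) - kernel_size) // stride + 1
    cols = (len(image[0]) - kernel_size) // stride + 1
    pooled = []
    for r in range(rows):
        pooled_row = []
        for c in range(cols):
            # Collect the values inside the current window and keep the largest.
            window = [image[r * stride + i][c * stride + j]
                      for i in range(kernel_size)
                      for j in range(kernel_size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


sample = [[1, 3, 2, 0],
          [4, 2, 1, 5],
          [7, 0, 3, 1],
          [2, 6, 4, 8]]
print(max_pool2d(sample))  # [[4, 5], [7, 8]]
```

Applied to a 24×24 input, the same function returns a 12×12 result, as in layer L2.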

(3) Convolutional layer L3

The input data of layer L3 lies in $\mathbb{R}^{m \times 6 \times 12 \times 12}$, which means that the batch size is m, the number of channels is 6, and the height and width are both 12. The output data of layer L3 lies in $\mathbb{R}^{m \times 16 \times 8 \times 8}$, which means that the batch size is m, the number of channels is 16, and the height and width are both 8. As in L1, the number of channels becomes 16 because L3 has 16 convolution kernels, and the 5×5 kernels with a stride of 1 give an output size of 12-5+1=8.

(4) Pooling layer L4

The input data of layer L4 lies in $\mathbb{R}^{m \times 16 \times 8 \times 8}$, which means that the batch size is m, the number of channels is 16, and the height and width are both 8. The output data of layer L4 lies in $\mathbb{R}^{m \times 16 \times 4 \times 4}$, which means that the batch size is m, the number of channels is 16, and the height and width are both 4. The filter in pooling layer L4 is 2×2. It slides over the input data (8×8) with a stride of 2, and each time the maximum of the 4 numbers in the 2×2 window is taken as the output, so the output size is 8÷2=4.

(5) Linear layer L5

The input data of layer L5 lies in $\mathbb{R}^{m \times 256}$, which means that the batch size is m and the number of input features is 256. The 256 features come from flattening the L4 output: 16×4×4=256. The output data of layer L5 lies in $\mathbb{R}^{m \times 120}$, which means that the batch size is m and the number of output features is 120.

(6) Linear layer L6

The input data of layer L6 lies in $\mathbb{R}^{m \times 120}$, which means that the batch size is m and the number of input features is 120. The output data of layer L6 lies in $\mathbb{R}^{m \times 84}$, which means that the batch size is m and the number of output features is 84.

(7) Linear layer L7

The input data of layer L7 lies in $\mathbb{R}^{m \times 84}$, which means that the batch size is m and the number of input features is 84. The output data of layer L7 lies in $\mathbb{R}^{m \times 10}$, which means that the batch size is m and the number of output features is 10, one per digit class.
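The layer-by-layer shapes described above can be checked with a short trace. This is only an illustrative sketch in plain Python: the helper names `out_size` and `lenet_shapes` are assumptions, the batch dimension m is left out, and the layer settings are taken from the descriptions of L1 to L7.

```python
def out_size(size, kernel, stride):
    """Spatial output size of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def lenet_shapes(height=28, width=28):
    """Return (layer name, per-sample shape) pairs for the LeNet-5 described here."""
    # (name, output channels or None for pooling, kernel size, stride)
    feature_layers = [("L1 conv", 6, 5, 1), ("L2 pool", None, 2, 2),
                      ("L3 conv", 16, 5, 1), ("L4 pool", None, 2, 2)]
    channels = 1
    shapes = [("input", (channels, height, width))]
    for name, out_channels, kernel, stride in feature_layers:
        if out_channels is not None:   # pooling keeps the channel count
            channels = out_channels
        height = out_size(height, kernel, stride)
        width = out_size(width, kernel, stride)
        shapes.append((name, (channels, height, width)))
    # Flatten the feature maps before the linear layers.
    shapes.append(("flatten", (channels * height * width,)))
    for name, out_features in [("L5 linear", 120), ("L6 linear", 84), ("L7 linear", 10)]:
        shapes.append((name, (out_features,)))
    return shapes


for name, shape in lenet_shapes():
    print(name, shape)
```

The flatten step yields 256 features, which is why the first linear layer takes 256 inputs.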

3. MNIST data set

3.1 Introduction to data sets

The handwritten digit classification data comes from the MNIST data set, which is publicly available and free of charge. It contains 60,000 training samples and 10,000 test samples. Each sample is a matrix of 28×28 pixels, and each pixel is a scalar value ranging from 0 to 255, so the data set can be regarded as having 1 color channel. The data is divided into images and labels: each image is a 28×28 pixel matrix, and each label is one of the 10 digits from 0 to 9.

3.2 Data reading

(1) The transform function normalizes and standardizes the data

(2) train_dataset and val_dataset

mode='train' and mode='test' in paddle.vision.datasets.MNIST() are used to obtain the MNIST training set and test set respectively.

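# Compose combines the dataset preprocessing interfaces, given as a list, into one pipeline.
# Normalize normalizes the image. It supports two modes: 1. normalize every channel with the same mean and standard deviation; 2. normalize each channel with its own mean and standard deviation.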
import paddle
from paddle.vision.transforms import Compose, Normalize
import os
import matplotlib.pyplot as plt
transform = Compose([Normalize(mean=[127.5],std=[127.5],data_format='CHW')])
# Normalize the dataset using transform
print('Downloading and loading the training data')
train_dataset = paddle.vision.datasets.MNIST(mode='train', transform=transform)
val_dataset = paddle.vision.datasets.MNIST(mode='test', transform=transform)
print('Loading complete')
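The effect of Normalize(mean=[127.5], std=[127.5]) can be worked out by hand: each pixel value x becomes (x - mean) / std. The sketch below reproduces only that arithmetic in plain Python; the helper name `normalize_pixel` is an assumption for illustration and is not part of the paddle API.

```python
def normalize_pixel(value, mean=127.5, std=127.5):
    """Apply (value - mean) / std to a single pixel value."""
    return (value - mean) / std


# The darkest, middle, and brightest pixel values of an 8-bit image.
print(normalize_pixel(0), normalize_pixel(127.5), normalize_pixel(255))  # -1.0 0.0 1.0
```

With these settings the original 0 to 255 pixel range is mapped to the range -1 to 1.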

Let’s take a look at what the images in the dataset look like

train_data0, train_label_0 = train_dataset[0][0],train_dataset[0][1]
train_data0 = train_data0.reshape([28,28])
plt.figure(figsize=(2,2))
print(plt.imshow(train_data0, cmap=plt.cm.binary))
print('The label of train_data0 is: ' + str(train_label_0))
AxesImage(18,18;111.6x108.72) 
The label of train_data0 is: [5]

<Figure size 144x144 with 1 Axes>

Let’s take another look at what the data looks like.

print(train_data0)

4. LeNet model construction

Constructing the LeNet-5 model for MNIST handwritten digit classification

# Import the required packages
import paddle
import paddle.nn.functional as F
from paddle.vision.transforms import Compose, Normalize

# Define the model
class LeNetModel(paddle.nn.Layer):
    def __init__(self):
        super(LeNetModel, self).__init__()
        # Build the convolution and pooling blocks; each convolutional layer is followed by a 2x2 pooling layer
        # Convolutional layer L1
        self.conv1 = paddle.nn.Conv2D(in_channels=1,
                                      out_channels=6,
                                      kernel_size=5,
                                      stride=1)
        # Pooling layer L2
        self.pool1 = paddle.nn.MaxPool2D(kernel_size=2,
                                         stride=2)
        # Convolutional layer L3
        self.conv2 = paddle.nn.Conv2D(in_channels=6,
                                      out_channels=16,
                                      kernel_size=5,
                                      stride=1)
        # Pooling layer L4
        self.pool2 = paddle.nn.MaxPool2D(kernel_size=2,
                                         stride=2)
        # Linear layer L5 (256 = 16 x 4 x 4 flattened features)
        self.fc1 = paddle.nn.Linear(256, 120)
        # Linear layer L6
        self.fc2 = paddle.nn.Linear(120, 84)
        # Linear layer L7 (10 outputs, one per digit class)
        self.fc3 = paddle.nn.Linear(84, 10)

    # Forward propagation
    def forward(self, x):
        x = self.conv1(x)
        x = F.sigmoid(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = F.sigmoid(x)
        x = self.pool2(x)
        # Flatten every axis except the batch axis
        x = paddle.flatten(x, start_axis=1, stop_axis=-1)
        x = self.fc1(x)
        x = F.sigmoid(x)
        x = self.fc2(x)
        x = F.sigmoid(x)
        out = self.fc3(x)
        return out

model=paddle.Model(LeNetModel())

5. Model optimization process

5.1 Loss function

Since this is a classification problem, we choose the cross-entropy loss function. Cross entropy measures the gap between the predicted values and the true values. The smaller the cross-entropy value, the better the model's predictions.

$$E(y^i, \hat{y}^i) = -\sum_{j=1}^{q} y_j^i \ln(\hat{y}_j^i)$$

Here $y^i \in \mathbb{R}^q$ is the true value, $y_j^i$ is an element of $y^i$ (its value is 0 or 1), and $j = 1, \ldots, q$. $\hat{y}^i \in \mathbb{R}^q$ is the predicted value (the probability of the sample for each category). The API corresponding to cross-entropy loss in paddle is paddle.nn.CrossEntropyLoss().
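The formula can be evaluated by hand for a single sample. The sketch below is a plain-Python illustration, not the paddle implementation: the helper names `softmax` and `cross_entropy` and the sample numbers are assumptions. Softmax is included because the network's last layer outputs raw scores, which have to be turned into probabilities before the formula applies.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """E(y, y_hat) = -sum_j y_j * ln(y_hat_j) for one sample."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


# A 3-class example whose true class is the second one (hypothetical numbers).
print(round(cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]), 4))  # 0.3567

# Equal scores for 10 classes give a loss of ln(10).
uniform = softmax([0.0] * 10)
print(round(cross_entropy([1] + [0] * 9, uniform), 4))  # 2.3026
```

The second value, ln(10) ≈ 2.30, is close to the loss reported at the very start of the training log, where the untrained model is still guessing almost uniformly among the 10 digits.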

5.2 Parameter optimization

After the forward propagation process is defined and the parameters are randomly initialized, the output of each layer can be computed. Each forward pass yields an m×10 matrix as the prediction result, where m is the number of samples in the mini-batch. Backpropagation follows: the predictions inevitably differ from the true labels, so the gradients of the model parameters are computed with the goal of reducing that difference. Over many iterations the model is optimized so that its predictions move closer to the true results.

6. Model training and evaluation

Training configuration: Set training hyperparameters

1. The batch size batch_size is set to 64, which means 64 images are input each time;

2. The number of epochs is set to 5, which means 5 full passes over the training set;

3. Log display verbose=1, indicating output log information with a progress bar.

model.prepare(paddle.optimizer.Adam(parameters=model.parameters()),
              paddle.nn.CrossEntropyLoss(),
              paddle.metric.Accuracy())

model.fit(train_dataset,
          epochs=5,
          batch_size=64,
          verbose=1)

model.evaluate(val_dataset,verbose=1)
The loss value printed in the log is the current step, and the metric is the average value of previous step.
Epoch 1/5
step  10/938 [..............................] - loss: 2.3076 - acc: 0.1062 - ETA: 21s - 23ms/step
step  20/938 [..............................] - loss: 2.3023 - acc: 0.1023 - ETA: 18s - 20ms/step
step 938/938 [==============================] - loss: 0.1927 - acc: 0.7765 - 16ms/step         
Epoch 2/5
step 938/938 [==============================] - loss: 0.0913 - acc: 0.9584 - 17ms/step        
Epoch 3/5
step 938/938 [==============================] - loss: 0.0232 - acc: 0.9700 - 17ms/step         
Epoch 4/5
step 938/938 [==============================] - loss: 0.0057 - acc: 0.9763 - 18ms/step        
Epoch 5/5
step 938/938 [==============================] - loss: 0.0907 - acc: 0.9798 - 17ms/step         
Eval begin...
The loss value printed in the log is the current batch, and the metric is the average value of previous step.
step 10000/10000 [==============================] - loss: 7.5607e-04 - acc: 0.9794 - 2ms/step         
Eval samples: 10000
{'loss': [0.00075607264], 'acc': 0.9794}

After 5 epochs of training, the accuracy of the LeNet-5 model on the MNIST image classification task reached about 98% on the test set.
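The step counts in the log follow from the data set size and the batch size. The short sketch below shows the arithmetic; the helper name `steps_per_epoch` is an assumption for illustration.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of mini-batches needed to cover the data set once."""
    return math.ceil(num_samples / batch_size)


# 60,000 training images with batch_size=64: the last batch is only partly filled.
print(steps_per_epoch(60000, 64))  # 938
```

The 10000 evaluation steps in the log are consistent with one image per step, which is what would happen if model.evaluate processes a single sample per batch when no batch_size is passed (an assumption here, based on the log).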

7. Model visualization

model.summary((1,1,28,28))
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #    
===========================================================================
   Conv2D-1       [[1, 1, 28, 28]]      [1, 6, 24, 24]          156      
  MaxPool2D-1     [[1, 6, 24, 24]]      [1, 6, 12, 12]           0       
   Conv2D-2       [[1, 6, 12, 12]]      [1, 16, 8, 8]          2,416     
  MaxPool2D-2     [[1, 16, 8, 8]]       [1, 16, 4, 4]            0       
   Linear-1          [[1, 256]]            [1, 120]           30,840     
   Linear-2          [[1, 120]]            [1, 84]            10,164     
   Linear-3          [[1, 84]]             [1, 10]              850      
===========================================================================
Total params: 44,426
Trainable params: 44,426
Non-trainable params: 0
---------------------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.04
Params size (MB): 0.17
Estimated Total Size (MB): 0.22
---------------------------------------------------------------------------

{'total_params': 44426, 'trainable_params': 44426}
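The parameter counts in the summary can be reproduced by hand: each convolution kernel or linear output unit has one weight per input value plus one bias. The sketch below shows that arithmetic in plain Python; the helper names `conv2d_params` and `linear_params` are assumptions for illustration.

```python
def conv2d_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias for each output channel."""
    return out_channels * (in_channels * kernel_size * kernel_size + 1)


def linear_params(in_features, out_features):
    """Weights plus one bias for each output feature."""
    return out_features * (in_features + 1)


counts = [
    conv2d_params(1, 6, 5),    # Conv2D-1
    conv2d_params(6, 16, 5),   # Conv2D-2
    linear_params(256, 120),   # Linear-1
    linear_params(120, 84),    # Linear-2
    linear_params(84, 10),     # Linear-3
]
print(counts, sum(counts))  # [156, 2416, 30840, 10164, 850] 44426
```

The pooling layers contribute no parameters, so the total matches the 44,426 reported by model.summary.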

Origin blog.csdn.net/m0_63309778/article/details/133490586