Object Detection (7): Understand Convolutional Neural Networks in 5 Minutes - Training an AlexNet Network with TensorFlow 2.0

1. What is a Convolutional Neural Network

1. Definition

Convolutional Neural Networks (CNNs) are a class of deep learning models, related to the multi-layer perceptron, that are commonly used to analyze visual images. A pioneer of convolutional neural networks is Yann LeCun, the famous computer scientist currently working at Facebook, who was among the first to solve handwritten digit recognition on the MNIST dataset with a convolutional network.

[Figure: CNN architecture compared with a conventional neural network, ending in fully connected layers]

As the figure above shows, the convolutional neural network architecture is very similar to a conventional artificial neural network, especially in the last layers, which are fully connected. Also notice that a convolutional neural network accepts multi-channel images (feature maps) as input rather than flat vectors.

2. Layer structure of a convolutional network

A convolutional neural network is mainly composed of the following five layer types:

  • Data input layer / Input layer
  • Convolution calculation layer / CONV layer
  • ReLU activation layer / ReLU layer
  • Pooling layer / Pooling layer
  • Fully connected layer / FC layer

2.1 Data input layer

The work done at this layer is mainly preprocessing of the raw image data, including:

Mean subtraction : center every dimension of the input data at 0, as shown in the figure below; the purpose is to pull the center of the sample distribution back to the origin of the coordinate system.

Normalization : scale the amplitudes to the same range, as shown below, to reduce the interference caused by differing value ranges across dimensions. For example, suppose we have two features A and B, where A ranges from 0 to 10 and B ranges from 0 to 10000. Using these two features directly is problematic; good practice is to normalize them so that both A and B fall in the range 0 to 1.

PCA / whitening : use PCA to reduce dimensionality; whitening normalizes the amplitude along each feature axis of the data.
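
As a concrete illustration, here is a minimal NumPy sketch of the first two steps (the data and variable names are made up for illustration, following the A/B example above):

import numpy as np

# hypothetical data: 100 samples, feature A in [0, 10], feature B in [0, 10000]
X = np.random.rand(100, 2) * np.array([10.0, 10000.0])

# mean subtraction: center every dimension at 0
X_demeaned = X - X.mean(axis=0)

# min-max normalization: map every dimension into [0, 1]
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))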

The effect of mean subtraction and normalization on two-dimensional data is shown below:

[Figure: effect of mean subtraction and normalization on two-dimensional data]

The effect of decorrelation and whitening on two-dimensional data is as follows:

[Figure: effect of decorrelation (PCA) and whitening on two-dimensional data]

2.2 Convolution Computing Layer

This layer is the most important layer in a convolutional neural network, and it is the source of the network's name. Two key operations happen in the convolutional layer:

Local connectivity : each neuron acts as a filter

Sliding window (receptive field) : the filter computes over local patches of the data

First, a few terms you will encounter in the convolutional layer:

[Figure: filter, stride, and padding in a convolutional layer]

As shown in the figure above, besides the filter size there are two main parameters we can adjust: the stride and the padding. The stride controls how the filter moves across the input: it is the number of units the filter shifts each time it convolves. The stride is usually chosen so that the output size works out to an integer rather than a fraction.

What about padding? When you apply a 5 x 5 x 3 filter to a 32 x 32 x 3 input, each filter produces a 28 x 28 output map (three such filters would give 28 x 28 x 3): the spatial dimensions shrink. If we keep stacking convolutional layers, the size shrinks faster than we would like. In the early layers of the network we want to retain as much information about the original input as possible, so that low-level features can be extracted. Say we want the output spatial size to stay at 32 x 32; to achieve this, we can apply zero padding of size 2 to this layer, which pads the borders of the input with zeros.
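
These sizes follow from the standard output-size formula: output = (W - F + 2P) / S + 1, where W is the input width, F the filter size, P the padding, and S the stride. A quick sanity check in Python (a sketch added here for illustration, not from the original post):

def conv_output_size(w, f, p, s):
    # output width of a convolution: (W - F + 2P) / S + 1
    return (w - f + 2 * p) // s + 1

print(conv_output_size(32, 5, 0, 1))  # 28: a 5x5 filter shrinks a 32x32 input
print(conv_output_size(32, 5, 2, 1))  # 32: zero padding of 2 preserves the size
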
I would also like to mention a misunderstanding I held for a long time while learning convolutional neural networks: that the elements of a kernel must be 0s and 1s, so a 3*3 kernel has at most 2 to the 9th power, i.e. 512, distinct possibilities. This understanding is wrong.
The correct understanding is that each element of a convolution kernel is a real-valued weight, and these weights are exactly the parameters trained during training. So regardless of whether the kernel is 1x1, 3x3, or 5x5, different weight values yield effectively unlimited distinct kernels. The process of continuously adjusting the weights during training is the process of learning kernels that extract effective features.
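
To see this directly, you can inspect the kernel of a freshly built Keras Conv2D layer (a small illustrative sketch; the layer configuration here is arbitrary):

import tensorflow as tf

layer = tf.keras.layers.Conv2D(filters=1, kernel_size=(3, 3))
layer.build(input_shape=(None, 32, 32, 3))
print(layer.kernel.shape)    # (3, 3, 3, 1): height, width, input channels, output channels
print(layer.kernel.numpy())  # arbitrary real-valued weights, updated during training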

2.3 Nonlinear layer (activation layer)

The nonlinear layer applies a nonlinear mapping to the output of the convolutional layer. The activation function used by CNNs is generally ReLU (the Rectified Linear Unit). It converges quickly and its gradient is simple to compute, but it can be fragile: neurons whose inputs stay negative can stop updating (the "dying ReLU" problem). It is shown below.

[Figure: the ReLU activation function]

Besides ReLU, there are the sigmoid activation function, the softmax activation function, the ReLU variant Leaky ReLU, and others. For more information, see https://blog.csdn.net/Dby_freedom/article/details/88946229
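
For reference, here are minimal NumPy definitions of the activation functions mentioned above (an illustrative sketch):

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # like ReLU, but with a small slope for negative inputs
    return np.where(x > 0, x, alpha * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.     1.5  ]
print(sigmoid(x))     # [0.119 0.378 0.5   0.818]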

2.4 Pooling layer

The pooling layer is sandwiched between consecutive convolutional layers to compress the amount of data and the number of parameters, and to reduce overfitting. In short, if the input is an image, the main function of the pooling layer is to compress it.

The methods used in the pooling layer are max pooling and average pooling; in practice, max pooling is the more common choice. The principle of max pooling is as follows:

[Figure: max pooling over 2x2 windows with stride 2]

For each 2*2 window, select the largest number as the value of the corresponding element of the output matrix. For example, the largest number in the first 2*2 window of the input matrix is 6, so the first element of the output matrix is 6, and so on.
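
A minimal NumPy sketch of this 2x2 max pooling with stride 2 (the input matrix is made up for illustration):

import numpy as np

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 2],
              [3, 1, 0, 7],
              [2, 8, 4, 3]])

# split the 4x4 input into non-overlapping 2x2 windows, then take each window's max
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6 5]
               #  [8 7]]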

Here is a description of the specific role of the pooling layer:

Feature invariance : the scale invariance of features that we often mention in image processing. A pooling operation is like resizing an image: a dog photo shrunk to half its size is still recognizably a dog, which means the image still retains the dog's most important features. The information removed by compression is largely irrelevant detail; what remains are scale-invariant features, the ones that best express the image.

Feature dimensionality reduction : an image contains a lot of information and many features, but some of it is not very useful, or is redundant, for the image task at hand. We can remove this redundancy and keep the most important features; this is another major role of pooling.

Overfitting mitigation : pooling prevents overfitting to a certain extent and makes optimization easier.

2.5 Fully connected layer

In a fully connected layer, every neuron in one layer is connected by a weight to every neuron in the next. Fully connected layers usually appear at the end of a convolutional neural network, using the same connection scheme as a traditional neural network:

[Figure: structure of a fully connected layer]

A convolutional neural network is essentially an input-to-output mapping. It can learn a large number of mappings between inputs and outputs without requiring any precise mathematical expression relating the two: as long as the convolutional network is trained on known samples, it acquires the ability to map inputs to outputs. Through training, the network automatically learns and extracts the classification features of images. Compared with traditional object detection algorithms that rely on hand-designed feature representations, such as Haar-like and HOG, this is a great advantage.

2. A Brief Look at the Development History

The LeNet model (LeCun et al., 1998) is the earliest proposed convolutional neural network model, used mainly for handwritten digit recognition on the MNIST (Modified NIST) dataset. The model structure is shown in Figure 2-8. It contains 3 convolutional layers, 2 pooling layers, and 2 fully connected layers; each convolutional and fully connected layer has trainable parameters. It laid the foundation for the development of deep convolutional neural networks.

[Figure 2-8: structure of the LeNet model]

Although LeNet achieved good results on the small-scale MNIST dataset, complex image classification tasks require large-scale datasets and network models with stronger learning capacity. In 2012, Krizhevsky et al. (2012) proposed the AlexNet structure shown in Figure 2-9. The network contains 5 convolutional layers and 3 fully connected layers; the input image passes through the convolution and fully connected operations and finally feeds a 1000-way softmax classifier to complete the classification. The network uses the rectified linear unit (ReLU) as its activation function and introduces local response normalization (LRN) to alleviate the vanishing-gradient problem; it uses data augmentation and Dropout to greatly alleviate overfitting; and it trains on two GPUs in parallel to improve training speed. AlexNet won the classification task of the 2012 ImageNet competition by a wide margin over the runner-up.

[Figure 2-9: structure of the AlexNet network]

The early layers of AlexNet use large convolution kernels, which results in a large number of parameters. Simonyan and Zisserman (2015) proposed VGGNet (from the Visual Geometry Group), which inherits the framework of AlexNet and LeNet; its main contribution is the use of stacked 3 × 3 small convolution kernels, increasing the depth of the network and improving performance. VGGNet has 5 configurations, of which VGGNet-16 and VGGNet-19 are the most commonly used.

Generally speaking, the most direct way to improve network performance is to increase depth, but as depth grows the parameter count grows with it: the network becomes prone to overfitting and the demand for computing resources rises significantly. GoogLeNet, proposed by Szegedy et al. (2015), uses the Inception-v1 module, which exploits sparse connections to reduce the number of parameters while keeping computation efficient, improving performance at a depth of 22 layers.

I haven't organized the various convolutional network architectures systematically here, so consider this a placeholder: a review post of all the commonly used CNN architectures will follow!!! I've recently started working on embedded drivers, so I don't know whether I'll get another chance to play with AI algorithms, hahahaha.

3. Practice: Image Classification with the AlexNet Network

1. Dataset introduction

CIFAR-10 is a small dataset for general object recognition compiled by Hinton's students Alex Krizhevsky and Ilya Sutskever. It contains RGB color images in 10 categories: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The images are 32×32, and the dataset has 50,000 training images and 10,000 test images in total. Sample CIFAR-10 images are shown in the figure.

[Figure: sample images from the CIFAR-10 dataset]

Compared with the MNIST dataset, CIFAR-10 has the following differences:

  • CIFAR-10 consists of 3-channel RGB color images, while MNIST images are grayscale.
  • CIFAR-10 images are 32×32, slightly larger than MNIST's 28×28.
  • Unlike handwritten digits, CIFAR-10 contains real-world objects, which not only carry much more noise but also vary widely in proportion and appearance, making recognition much harder. Simple linear models such as a softmax classifier perform poorly on CIFAR-10. (The dataset's basic shape and class count can be verified with the quick check below.)
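
These basic properties are easy to verify with the built-in Keras loader (a quick illustrative check, not part of the original post):

from tensorflow.keras.datasets import cifar10

# downloads the dataset on first use
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, x_test.shape)   # (50000, 32, 32, 3) (10000, 32, 32, 3)
print(y_train.min(), y_train.max())  # 0 9: ten classes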

2. Detailed explanation of AlexNet network structure

The AlexNet network has 60 million parameters and 650,000 neurons: 5 convolutional layers (some followed by a pooling layer), 3 fully connected layers, and a final softmax output layer. The authors of AlexNet proposed using Dropout to randomly deactivate some neurons during forward propagation. In the figure below, the left side is a normal fully connected forward pass, where each node is fully connected to the nodes of the next layer; with Dropout, a portion of the neurons in each layer is randomly deactivated. One way to think about it is that Dropout reduces the number of parameters effectively trained in each pass, thereby mitigating overfitting.

[Figure: fully connected forward propagation with and without Dropout]
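
As a concrete illustration of this behavior, here is a tiny tf.keras sketch (note that Keras implements inverted dropout: the surviving activations are scaled by 1/(1 - rate) during training, so nothing needs rescaling at inference time):

import tensorflow as tf

drop = tf.keras.layers.Dropout(rate=0.5)
x = tf.ones((1, 8))
print(drop(x, training=True))   # roughly half the entries zeroed, the rest scaled to 2.0
print(drop(x, training=False))  # unchanged at inference time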

Let's start to explain the network structure of AlexNet in detail:

The network in the original paper is drawn as upper and lower halves because the authors trained on two GPUs in parallel. For ease of understanding, we only need to look at one half.

[Figure: layer-by-layer structure of AlexNet]

  • The first convolutional layer uses kernels of size [11,11] with stride=4. The figure shows 48 kernels per half, and since there are upper and lower halves there are 96 kernels in total. The figure does not mark the padding, only the size of the feature map after convolution, [55, 55, 96]; working backwards with the output-size formula sketched earlier, the input is padded with one column of zeros on the left, two columns on the right, one row on top, and two rows on the bottom;
  • The second layer is max-pooling downsampling. The figure gives neither the pooling kernel size nor the stride, but other references give kernel_size=3, padding=0, stride=2. Note that pooling only changes the width and height of the feature map, never its depth;
  • Next comes another convolutional layer. From the labels in the figure, the number of kernels is 128×2=256 and the kernel size is 5. From references and published source code, padding=[2,2] and stride=1, and the formula gives an output of [27,27,256];
  • Then another max-pooling downsampling with kernel size 3, padding=0, stride=2, giving an output of [13,13,256];
  • The third convolutional layer: from the figure, the number of kernels is 192×2=384 and the kernel size is 3. With padding=[1,1] and stride=1, the formula gives an output of [13,13,384];
  • The fourth convolutional layer: configured exactly like the third, so its input and output are both [13,13,384];
  • The fifth convolutional layer: the number of kernels is 128×2=256, kernel size=3, padding=[1,1], stride=1, output [13,13,256];
  • The last max-pooling downsampling layer. The figure gives no information about it, but references and source code give kernel_size=3, padding=0, stride=2, so the final output is [6,6,256];
  • Finally come three fully connected layers, which need no special analysis: the pooled output only needs to be flattened before the full connection. One thing worth mentioning is the last layer, drawn with 1000 nodes in the figure: the paper's dataset has 1000 categories, hence 1000 nodes. To apply the network to your own dataset, simply change this to your own number of classes. The layer-by-layer sizes are traced in the sketch below.
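
Putting the list above together, this small sketch traces the feature-map sizes through the single-GPU view of AlexNet, using the common 227 × 227 input convention (equivalent to the asymmetric padding described for the first layer):

def out_size(w, k, s, p=0):
    # output width of a conv/pool layer: floor((W - K + 2P) / S) + 1
    return (w - k + 2 * p) // s + 1

w = 227                   # input: [227, 227, 3]
w = out_size(w, 11, 4)    # conv1, 96 kernels  -> [55, 55, 96]
w = out_size(w, 3, 2)     # maxpool1           -> [27, 27, 96]
w = out_size(w, 5, 1, 2)  # conv2, 256 kernels -> [27, 27, 256]
w = out_size(w, 3, 2)     # maxpool2           -> [13, 13, 256]
w = out_size(w, 3, 1, 1)  # conv3, 384 kernels -> [13, 13, 384]
w = out_size(w, 3, 1, 1)  # conv4, 384 kernels -> [13, 13, 384]
w = out_size(w, 3, 1, 1)  # conv5, 256 kernels -> [13, 13, 256]
w = out_size(w, 3, 2)     # maxpool3           -> [6, 6, 256]
print(w, 6 * 6 * 256)     # 6 9216: flattened and fed into the first FC layer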

3. Code Analysis

3.1 Environment configuration

The environment configuration for running the experimental code is as follows:

Python=3.7.1

Tensorflow-gpu=2.3.1

Numpy=1.18.5

Matplotlib=3.5.1

The full source code for this experiment has been uploaded to GitHub; interested readers can download and inspect it.

Link: https://github.com/ilovecooker/cifar-10

3.2 Code display

(1) Data preprocessing

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.datasets import cifar10
class get_data():
    def get_cifar10_data(self):
        # x_train_original/y_train_original are the training images and labels;
        # x_test_original/y_test_original are the test images and labels
        (x_train_original, y_train_original), (x_test_original, y_test_original) = cifar10.load_data()
        # carve a validation set out of the test set
        # (the training set is not large enough to spare samples)
        x_val = x_test_original[:5000]
        y_val = y_test_original[:5000]
        x_test = x_test_original[5000:]
        y_test = y_test_original[5000:]
        x_train = x_train_original
        y_train = y_train_original
        # convert the data from uint8 to float32 for training
        x_train = x_train.astype('float32')
        x_val = x_val.astype('float32')
        x_test = x_test.astype('float32')
        # raw pixel values are 0-255; normalize them into 0-1
        # to improve training accuracy
        x_train = x_train / 255
        x_val = x_val / 255
        x_test = x_test / 255
        # the labels cover 10 classes (0-9); convert them to one-hot vectors
        y_train = to_categorical(y_train)
        y_val = to_categorical(y_val)
        y_test = to_categorical(y_test)
        return x_train, y_train, x_val, y_val, x_test, y_test

(2) Build AlexNet network structure

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten,
                                     Dense, Dropout, BatchNormalization, UpSampling2D)
"""
Define the AlexNet network model
"""
class model_set():
    def alexnet(self):
        model = Sequential()
        # upsample the 32x32 CIFAR-10 images by 7x to 224x224,
        # close to AlexNet's original input size
        model.add(UpSampling2D(input_shape=(32, 32, 3), size=(7, 7)))
        model.add(Conv2D(96, (11, 11), strides=(4, 4), padding='same',
                         activation='relu', kernel_initializer='uniform'))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(BatchNormalization())
        model.add(Conv2D(256, (5, 5), strides=(1, 1), padding='same',
                         activation='relu', kernel_initializer='uniform'))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(BatchNormalization())
        model.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same',
                         activation='relu', kernel_initializer='uniform'))
        model.add(Conv2D(384, (3, 3), strides=(1, 1), padding='same',
                         activation='relu', kernel_initializer='uniform'))
        model.add(Conv2D(256, (3, 3), strides=(1, 1), padding='same',
                         activation='relu', kernel_initializer='uniform'))
        model.add(MaxPooling2D(pool_size=(3, 3), strides=(2, 2)))
        model.add(Flatten())
        model.add(Dense(4096, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(4096, activation='relu'))
        model.add(Dropout(0.5))
        model.add(Dense(10, activation='softmax'))
        model.summary()
        return model

(3) Model training

import tensorflow as tf
from model import model_set
from data_get import get_data
import matplotlib.pyplot as plt
"""
Compile and train the network
"""
data = get_data()
x_train, y_train, x_val, y_val, x_test, y_test = data.get_cifar10_data()
model_fun = model_set()
model = model_fun.alexnet()

# compile the network (loss function, optimizer, evaluation metric)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
# early-stopping condition
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', min_delta=0.005,
                                                  patience=7, verbose=0, mode='auto',
                                                  baseline=None, restore_best_weights=False)
# train the network (training/validation data, number of epochs, batch size)
train_history = model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=100,
                          batch_size=100, verbose=1, callbacks=[early_stopping])
# save the model
model.save('alexnet_cifar10.h5')
# visualize the training history (train/validation loss and accuracy)
def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='best')
    plt.show()
show_train_history(train_history, 'accuracy', 'val_accuracy')
show_train_history(train_history, 'loss', 'val_loss')

A screenshot from the training run is shown in the figure.

[Figure: training log screenshot]

The saved .h5 model file produced after training is shown in the figure below.

[Figure: the saved alexnet_cifar10.h5 model file]

(4) Test model and result visualization

from tensorflow.keras import models
from data_get import get_data
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import cifar10
data = get_data()
x_train, y_train, x_val, y_val, x_test, y_test = data.get_cifar10_data()
model = models.load_model("alexnet_cifar10.h5")
# report the network's loss and accuracy on the test set
score = model.evaluate(x_test, y_test)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
# predict on the test set
predictions = model.predict(x_test)
predictions = np.argmax(predictions, axis=1)
print('Predictions for the first 20 images:', predictions[:20])
# visualize the prediction results
(x_train_original, y_train_original), (x_test_original, y_test_original) = cifar10.load_data()
def cifar10_visualize_multiple_predict(start, end, length, width):
    for i in range(start, end):
        plt.subplot(length, width, 1 + i)
        plt.imshow(x_test_original[5000 + i], cmap=plt.get_cmap('gray'))
        title_true = 'true=' + str(y_test_original[5000 + i])     # ground-truth label
        title_prediction = ',prediction=' + str(predictions[i])   # predicted label
        title = title_true + title_prediction
        plt.title(title)
        plt.xticks([])
        plt.yticks([])
    plt.show()
cifar10_visualize_multiple_predict(start=0, end=9, length=3, width=3)

The results of running the model test are shown in the figure.

[Figure: visualization of predictions on test images]

Original article: https://blog.csdn.net/qq_40959462/article/details/127467602