Deep Learning 04-CNN Classic Model

Introduction

Convolutional neural networks (CNNs) are a very important network structure in deep learning and can process various types of data such as images, text, and speech. The following introduces four classic CNN models:

  1. LeNet-5

LeNet-5 was proposed by Yann LeCun and others in 1998 and was the first convolutional neural network successfully applied to handwritten digit recognition. It consists of 7 layers of neural network, including 2 convolutional layers, 2 pooling layers and 3 fully connected layers. Among them, the convolutional layer extracts image features, the pooling layer reduces the dimension of the feature map, and the fully connected layer maps the features to the corresponding categories.

The main features of LeNet-5 are the use of Sigmoid activation function, average pooling and no zero padding after convolutional layers. It is widely used in fields such as handwritten digit recognition and face recognition.

  2. AlexNet

AlexNet was proposed by Alex Krizhevsky and others in 2012. It was the first convolutional neural network to achieve significant results in large-scale image recognition tasks. It consists of 5 convolutional layers, 3 fully connected layers and 1 Softmax output layer, which uses ReLU activation function, maximum pooling and Dropout technology.

The main feature of AlexNet is the use of technologies such as GPU accelerated training, data enhancement and randomized Dropout, which greatly improves the generalization ability and robustness of the model. It achieved excellent results far exceeding other models in the ImageNet large-scale image recognition competition.

  3. VGGNet

VGGNet was proposed by Karen Simonyan and Andrew Zisserman in 2014. It is a very deep convolutional neural network with 16 or 19 layers. Each convolutional layer of VGGNet uses a 3x3 convolution kernel and ReLU activation function, making its network structure very clear and easy to understand.

The main feature of VGGNet is the use of a deeper network structure, small convolution kernels and a small number of parameters, which further improves the model's feature extraction capabilities. It also achieved very good results in the ImageNet competition.

  4. GoogLeNet

GoogLeNet was proposed by the Google team in 2014. It is a very deep convolutional neural network with 22 layers. It uses a structure called Inception modules that can reduce the number of parameters while maintaining network depth.

The main feature of GoogLeNet is the use of Inception module, 1x1 convolution kernel and global average pooling technology, which greatly reduces the computational complexity of the model. It achieved very good results in the ImageNet competition and is widely used in other fields.

CNN review

Let’s review several features of CNN: local perception, parameter sharing, and pooling.

Local perception

Human beings' cognition of the outside world generally goes from local to global, from one-sided to comprehensive. Similarly, when a machine recognizes an image, it is not necessary to connect every pixel of the entire image to the neural network. In an image, nearby pixels are closely correlated, while pixels far apart are only weakly correlated, so a locally connected mode can be used (connecting the image block by block), which greatly reduces the number of parameters in the model, as shown in the following figure:
Insert image description here

Parameter (weight) sharing

Every natural image (people, landscapes, buildings, etc.) has its own inherent characteristics; that is, the statistical properties of one part of the image are similar to those of other parts. This also means that features learned on one part can be used on another part. Therefore, with local connections, each hidden-layer neuron is connected to a local patch of the image (for example, 5×5) through a set of weight parameters, and these weight parameters are shared with all the other neurons. Then, no matter how many neurons the hidden layer has, the parameters to be trained are just the weights for this local patch (for example, 5×5), i.e. the size of the convolution kernel. This greatly reduces the number of training parameters, as shown below:
Insert image description here

The weights of a convolution kernel are the parameters inside the kernel, used to compute a weighted sum of the pixels at each position when the kernel is convolved with the input data. In a convolutional neural network, the weights of a kernel are shared across all positions in the same layer; that is, a kernel uses the same weights no matter where it is applied. Shared weights reduce the number of parameters that need to be learned, which lowers the complexity of the model. At the same time, they improve the model's generalization ability, because shared weights make the model more stable and help avoid overfitting. Weight sharing is implemented by applying the same convolution kernel across the whole input.
Insert image description here
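To make weight sharing concrete, here is a minimal TensorFlow/Keras sketch (an added illustration, not from the original article): a convolutional layer's parameter count depends only on the kernel size and number of channels, not on the image size.

import tensorflow as tf

# A 5x5 kernel shared across the whole image: parameters = (5*5*1 + 1) * 6 = 156
conv = tf.keras.layers.Conv2D(filters=6, kernel_size=(5, 5))
conv.build(input_shape=(None, 32, 32, 1))        # LeNet-style 32x32 grayscale input
print(conv.count_params())                       # 156

# The same layer built for a much larger image still has 156 parameters
conv_big = tf.keras.layers.Conv2D(filters=6, kernel_size=(5, 5))
conv_big.build(input_shape=(None, 256, 256, 1))
print(conv_big.count_params())                   # 156: the weights are shared across positions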

Pooling

As the network deepens, there are more and more convolution kernels, and still many parameters to train. Moreover, training directly on the features extracted by the convolution kernels is prone to overfitting. Recall that the reason convolution can extract features from images is that images have a "static" property, so a natural idea is to extract representative features by aggregate statistics (such as the maximum or average value). This aggregation operation is called pooling, and the pooling process is also often called feature mapping (feature dimensionality reduction), as shown below:
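As a small illustration of the aggregation just described, the following TensorFlow/Keras sketch (added here, not part of the original article) applies 2x2 max and average pooling to a 4x4 feature map:

import tensorflow as tf

x = tf.constant([[ 1.,  2.,  3.,  4.],
                 [ 5.,  6.,  7.,  8.],
                 [ 9., 10., 11., 12.],
                 [13., 14., 15., 16.]])
x = tf.reshape(x, (1, 4, 4, 1))                  # (batch, height, width, channels)

max_pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=(2, 2))

print(tf.squeeze(max_pool(x)))                   # [[ 6.  8.] [14. 16.]]
print(tf.squeeze(avg_pool(x)))                   # [[ 3.5  5.5] [11.5 13.5]]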

LeNet-5

Overview
LeNet5 was born in 1994 and is one of the earliest convolutional neural networks. Created by Yann LeCun, it helped drive the development of deep learning. At the time there were no GPUs to help train models and even CPUs were very slow, so LeNet5 relied on a clever design that used convolution, parameter sharing, pooling and other operations to extract features while avoiding heavy computation, and finally used a fully connected neural network for classification and recognition. This network is also the starting point of a large number of later neural network architectures and has brought a lot of inspiration to the field.
The network structure diagram of LeNet5 is as follows:
Insert image description here
LeNet5 consists of 7 layers (excluding the input layer). The original input image in the figure above is 32×32 pixels. Convolutional layers are denoted Ci, subsampling (pooling) layers are denoted Si, and fully connected layers are denoted Fi. Below is a layer-by-layer introduction to their functions and to the meaning of the numbers in the schematic diagram.

C1 layer (convolutional layer): 6@28×28

This layer uses 6 convolution kernels, and the size of each convolution kernel is 5×5, thus obtaining 6 feature maps.
(1) Feature map size
Each convolution kernel (5×5) is convolved with the original input image (32×32), so the size of the resulting feature map is (32-5+1)×(32-5+1) = 28×28.
The convolution process is shown in the figure below (a 4×4 input is used just for demonstration):
Insert image description here
The convolution kernel is matched against the input image region by region according to the kernel size. After this matching, the output is smaller than the original input, because the kernel cannot go out of bounds and each border region can only be matched once. As shown in the figure above, the size after the matching computation becomes Cr×Cc = (Ir-Kr+1)×(Ic-Kc+1), where Cr, Cc, Ir, Ic, Kr, and Kc are the row and column sizes of the convolution result, the input image, and the convolution kernel, respectively (Cr is the number of result rows and Cc the number of result columns).
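A small helper (plain Python, added for illustration; it assumes no padding, as in LeNet-5) that implements the size formula above:

def conv_output_size(input_size, kernel_size, stride=1):
    # output = (input - kernel) // stride + 1, with no padding
    return (input_size - kernel_size) // stride + 1

print(conv_output_size(32, 5))       # 28 -> C1 feature maps are 28x28
print(conv_output_size(28, 2, 2))    # 14 -> S2 pooling halves the size
print(conv_output_size(14, 5))       # 10 -> C3 feature maps are 10x10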
(2) Number of parameters
Because of parameter (weight) sharing, all neurons produced by the same convolution kernel use the same parameters. Therefore, the number of parameters is (5×5+1)×6 = 156, where 5×5 is the kernel weights and 1 is the bias.
(3) Number of connections
The image size after convolution is 28×28, so each feature map has 28×28 neurons, and each output neuron is connected to a 5×5 patch plus a bias. Therefore, the number of connections in this layer is (5×5+1)×6×28×28 = 122304.

S2 layer (downsampling layer, also called pooling layer): 6@14×14

(1) Feature map size
This layer performs pooling, i.e. feature mapping (feature dimensionality reduction). The pooling unit is 2×2, so after pooling the 6 feature maps shrink to 14×14. Recall the pooling operation described at the beginning of this article: pooling units do not overlap, and each pooling region produces one new feature value from its aggregate statistics. With 2×2 pooling, every two rows and two columns are reduced to a single value, which halves the image size, so the 28×28 convolved maps become 14×14.
The computation in this layer is: add the four values in each 2×2 unit, multiply the sum by a trainable parameter w, add a bias b (each feature map shares the same w and b), and pass the result through the sigmoid function (S-shaped, output in the 0-1 interval) to obtain the value of that unit. A schematic of the convolution and pooling operations is shown below:
Insert image description here
(2) Number of parameters
Since each feature map of the S2 layer shares the same two parameters w and b, this layer requires 2×6 = 12 parameters.
(3) Number of connections
The image size after subsampling is 14×14, so each feature map of the S2 layer has 14×14 neurons, and each pooling unit has 2×2+1 connections (the 1 is the bias). Therefore, the number of connections in this layer is (2×2+1)×14×14×6 = 5880.

C3 layer (convolutional layer): 16@10×10

The C3 layer has 16 convolution kernels, and the convolution template size is 5×5.
(1) Feature map size
Similar to the analysis of C1 layer, the feature map size of C3 layer is (14-5+1) × (14-5+1) = 10 × 10
(2) Number of parameters
Note that C3 and S2 are not fully connected but partially connected: some C3 feature maps are connected to 3 of the S2 feature maps, some to 4, and one to all 6. In this way more varied features are extracted. The connection rules are shown in the following table:
Insert image description here
For example, the first column indicates that the 0th feature map of C3 is connected only to the 0th, 1st, and 2nd feature maps of S2. Its computation is: convolve 3 convolution templates with those 3 S2 feature maps respectively, sum the convolution results, add a bias, and take the sigmoid to obtain the corresponding C3 feature map. The other columns are similar (some use 3 convolution templates, some 4, and one uses 6). Therefore, the number of parameters of layer C3 is (5×5×3+1)×6 + (5×5×4+1)×9 + (5×5×6+1)×1 = 1516.

(3) Number of connections:
The size of the feature map after convolution is 10×10 and the number of parameters is 1516, so the number of connections is 1516×10×10 = 151600.

S4 (downsampling layer, also called pooling layer): 16@5×5

(1) Feature map size
Similar to the analysis of S2, the pooling unit size is 2×2. Therefore, this layer has a total of 16 feature maps like C3, and the size of each feature map is 5×5.
(2) Number of parameters
Similar to the calculation of S2, the number of parameters required is 16×2 = 32
(3) Number of connections
The number of connections is (2×2+1)×5×5×16 = 2000

C5 layer (convolutional layer): 120

(1) Feature map size
This layer has 120 convolution kernels, each of size 5×5 (×16), so there are 120 feature maps. Since the S4 output is 5×5 and the kernel size is also 5×5, the feature map size is (5-5+1)×(5-5+1) = 1×1, so this layer happens to be fully connected. This is just a coincidence: if the original input image were larger, this layer would not be fully connected.
(2) Number of parameters
Similar to the previous analysis, the number of parameters in this layer is 120×(5×5×16+1) = 48120
(3) Number of connections
Since the feature map size of this layer is exactly 1×1, the number of connections is 48120×1×1 = 48120.

F6 layer (fully connected layer): 84

(1) Feature map size
The F6 layer has 84 units. This number comes from the design of the output layer: it corresponds to a 7×12 bitmap, as shown in the figure below, where -1 means white and 1 means black, so that the black-and-white bitmap of each symbol corresponds to a code.
Insert image description here
This layer has 84 feature maps, the size of the feature map is the same as C5, 1×1, and it is fully connected with the C5 layer.
(2) Number of parameters
Since it is fully connected, the number of parameters is (120+1)×84=10164. Like a classic neural network, the F6 layer calculates the dot product between the input vector and the weight vector, adds a bias, and then passes it to the sigmoid function to get the result.
(3) Number of connections
Since it is a full connection, the number of connections is the same as the number of parameters, which is also 10164.

OUTPUT layer (output layer): 10

The Output layer is also a fully connected layer with 10 nodes, representing the digits 0 to 9. The node whose output is closest to 0 indicates the digit recognized by the network.
(1) Feature map size
This layer uses radial basis function (RBF) connections. Let x be the input from the previous layer and y the RBF output; each output is computed as y_i = Σ_j (x_j - w_ij)², where the value of w_ij is determined by the bitmap encoding of digit i, with i ranging from 0 to 9 and j from 0 to 7×12-1. The closer the RBF output y_i is to 0, the closer the current input is to the bitmap of character i, i.e. the more likely the recognition result is i.

(2) Number of parameters.
Because it is a full connection, the number of parameters is 84×10=840.
(3) Number of connections.
Because it is a full connection, the number of connections is the same as the number of parameters, which is also 840.
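A hedged sketch of the RBF output described above, y_i = Σ_j (x_j - w_ij)². The bitmap codes and the F6 output used here are random placeholders, not the original LeNet-5 values.

import numpy as np

rng = np.random.default_rng(0)
W = rng.choice([-1.0, 1.0], size=(10, 84))   # one +/-1 bitmap code per digit (placeholder)
x = rng.standard_normal(84)                  # output of the F6 layer (placeholder)

y = np.sum((x - W) ** 2, axis=1)             # one RBF output per digit
print("recognized digit:", np.argmin(y))     # the class whose output is closest to 0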

Through the above introduction, we have learned about the structure, feature map size, number of parameters, number of connections and other information of each layer of the LeNet network. The following figure is the process of identifying the number 3. You can review the functions of each layer one by one according to the above introduction:
Insert image description here

Programming implementation

import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt
import numpy as np
# Enable NumPy behavior on TensorFlow tensors (astype below is a NumPy-style method)
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
ori_x_test1 = x_test

# Pad the images from 28*28 to 32*32
x_train = tf.pad(x_train, [[0,0], [2,2], [2,2]], mode='constant')
x_test = tf.pad(x_test, [[0,0], [2,2], [2,2]], mode='constant')

# Scale pixel values to the 0-1 range
x_train, x_test = x_train.astype('float32') / 255.0, x_test.astype('float32') / 255.0

# Add the channel dimension expected by Conv2D: (N, 32, 32) -> (N, 32, 32, 1)
x_train = x_train.reshape(-1, 32, 32, 1)
x_test = x_test.reshape(-1, 32, 32, 1)

# Define the LeNet-5 model (ReLU is used here in place of the original sigmoid activations)
model = models.Sequential([
    # First convolutional layer: 6 kernels of size 5*5
    layers.Conv2D(6, (5, 5), activation='relu', input_shape=(32, 32, 1)),
    # First pooling layer: 2*2
    layers.MaxPooling2D((2, 2)),
    # Second convolutional layer: 16 kernels of size 5*5
    layers.Conv2D(16, (5, 5), activation='relu'),
    # Second pooling layer: 2*2
    layers.MaxPooling2D((2, 2)),
    # Third convolutional layer: 120 kernels of size 5*5
    layers.Conv2D(120, (5, 5), activation='relu'),
    # Flatten the convolutional output
    layers.Flatten(),
    # Fully connected layer with 84 units
    layers.Dense(84, activation='relu'),
    # Output layer: 10 units for digits 0-9, softmax activation
    layers.Dense(10, activation='softmax')
])
])

# Compile the model
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])



# Take one test sample and run a prediction on it
testdata = ori_x_test1[100]
testdata = testdata.reshape(-1, 28, 28)
testdata = tf.pad(testdata, [[0,0], [2,2], [2,2]], mode='constant')
testdata = testdata.reshape(-1, 32, 32, 1)
# Scale pixel values to the 0-1 range
testdata = testdata.astype('float32') / 255.0
predictions = model.predict(testdata)
print("Prediction:", np.argmax(predictions))

# Plot the test image (index 100)
plt.imshow(ori_x_test1[100], cmap=plt.cm.binary)
plt.show()

Output:
Test loss: 0.03826029598712921
Test accuracy: 0.9879999756813049
Prediction result: 6
Insert image description here

Reference: https://my.oschina.net/u/876354/blog/1632862

AlexNet

In 2012, Alex Krizhevsky and Ilya Sutskever, working in Geoff Hinton's lab at the University of Toronto, designed the deep convolutional neural network AlexNet. It won the 2012 ImageNet LSVRC championship, with an accuracy far ahead of the runner-up (top-5 error rate of 15.3% versus 26.2%), which caused a sensation. AlexNet can be regarded as a network structure of historical significance: before it, deep learning had been quiet for a long time, and since AlexNet's debut in 2012 every subsequent ImageNet champion has used a convolutional neural network (CNN), with the networks getting deeper and deeper. This made CNNs the core algorithm for image recognition and classification and set off the explosion of deep learning.
In previous articles on this blog, we have introduced the technical principles of convolutional neural networks (CNN) (Dahua convolutional neural network), and also reviewed the three important characteristics of convolutional neural networks (CNN) (Dahua CNN classic model: LeNet), interested students can open the link and review it again. I will not repeat the introduction of the basic knowledge of CNN here. The following will first introduce the characteristics of AlexNet, and then decompose and analyze the AlexNet network structure layer by layer.

Features of AlexNet model

The success of AlexNet is related to the characteristics of this model design, mainly including:

  • Used a non-linear activation function: ReLU
  • Methods to prevent overfitting: Dropout, data augmentation
  • Others: multi-GPU implementation, use of LRN normalization layer

1. Use ReLU activation function.
Traditional neural networks generally use nonlinear functions such as sigmoid or tanh as activation functions, but these are prone to gradient vanishing and gradient saturation. Taking sigmoid as an example, when the input is very large or very small, the gradient of these neurons is close to 0 (gradient saturation). During backpropagation the gradient is repeatedly multiplied by the sigmoid derivative, so it becomes smaller and smaller, making it difficult for the network to learn. (For details, see the article on our blog about activation functions commonly used in deep learning.)
AlexNet uses the ReLU (Rectified Linear Units) activation function, defined as f(x) = max(0, x): when the input is less than 0 the output is 0, and when the input is greater than 0 the output equals the input, as shown in the figure below:

Insert image description here
Replacing sigmoid/tanh with ReLU greatly reduces the amount of computation, since ReLU is piecewise linear with a derivative of 1 for positive inputs, and convergence is much faster than with sigmoid/tanh, as shown in the following figure:
Insert image description here

2. Data augmentation

There is a saying that neural networks are fed by data: if we can provide more training data, even massive amounts, the accuracy of the algorithm can be effectively improved, because this helps avoid overfitting and in turn allows the network structure to be made larger and deeper. When training data is limited, new data can be generated from the existing training set through transformations, quickly expanding the training data.
The simplest and most common image transformations are horizontal flipping, random cropping, translation, and color and lighting changes of the original image, as shown in the following figure:
Insert image description here
During training, AlexNet performs the following data augmentation:
(1) Random cropping: randomly crop the 256×256 image to 224×224 and then flip it horizontally, which is equivalent to multiplying the number of samples by ((256-224)^2)×2 = 2048;
(2) At test time, crop the top-left, top-right, bottom-left, bottom-right, and center regions (5 crops) and flip each of them, for a total of 10 crops, then average the results. The authors note that without random cropping, large networks basically overfit;
(3) Perform PCA (principal component analysis) in RGB space and add a Gaussian perturbation with mean 0 and standard deviation 0.1 along the principal components, i.e. change the color and lighting; this reduced the error rate by another 1%. A sketch of (1) is given below.
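A minimal TensorFlow sketch (added here as an illustration; the PCA color jitter is omitted) of the random cropping and horizontal flipping listed above:

import tensorflow as tf

def augment(image):
    image = tf.image.random_crop(image, size=(224, 224, 3))  # random 224x224 crop of a 256x256 image
    image = tf.image.random_flip_left_right(image)           # random horizontal flip
    return image

image = tf.random.uniform((256, 256, 3))   # stand-in for a 256x256 RGB training image
print(augment(image).shape)                # (224, 224, 3)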

3. Overlapping Pooling.
Ordinary pooling does not overlap: the pooling window size equals the stride, as shown in the figure below:
Insert image description here
The pooling used in AlexNet can overlap, i.e. the stride of each move is smaller than the pooling window. AlexNet uses a 3×3 pooling window with a stride of 2, so adjacent windows overlap. Overlapping pooling helps avoid overfitting; this strategy reduced the Top-5 error rate by 0.3%. A quick size check is sketched below.
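A quick check (TensorFlow/Keras assumed, not from the original article) of overlapping pooling: a 3x3 window moved with stride 2, as in AlexNet, versus the non-overlapping 2x2/stride-2 case.

import tensorflow as tf

x = tf.random.uniform((1, 55, 55, 96))      # e.g. the 55x55x96 output of AlexNet's first convolution

overlap = tf.keras.layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2))
non_overlap = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))

print(overlap(x).shape)       # (1, 27, 27, 96): (55 - 3)/2 + 1 = 27, windows overlap
print(non_overlap(x).shape)   # (1, 27, 27, 96) here as well, but the windows do not overlap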
4. Local Response Normalization (LRN for short)
In neurobiology there is a concept called lateral inhibition: activated neurons suppress their neighbors. The purpose of normalization is exactly this kind of suppression, and local response normalization borrows the idea of lateral inhibition to achieve local suppression. This is especially useful with ReLU, because ReLU's response is unbounded (it can be very large), so normalization is needed. Local normalization helps improve generalization.
The LRN formula is shown below. The core idea is to normalize using neighboring data. This strategy reduced the Top-5 error rate by 1.2%.
Insert image description here
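A hedged sketch of LRN using tf.nn.local_response_normalization. The hyperparameters roughly follow the AlexNet paper (k=2, n=5, alpha=1e-4, beta=0.75); TensorFlow's depth_radius=2 gives a window of 2*2+1 = 5 channels.

import tensorflow as tf

x = tf.random.uniform((1, 27, 27, 96))   # e.g. pooled feature maps from the first layer
y = tf.nn.local_response_normalization(x, depth_radius=2, bias=2.0,
                                        alpha=1e-4, beta=0.75)
print(y.shape)                           # (1, 27, 27, 96): normalization keeps the size unchanged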
5. Dropout
Dropout is introduced mainly to prevent overfitting. It is implemented by modifying the structure of the neural network itself: each neuron in a given layer is set to 0 with a certain probability, so that it takes no part in forward or backward propagation, as if it had been deleted from the network, while the numbers of input and output neurons stay unchanged. The parameters are then updated in the usual way, and in the next iteration a different random set of neurons is dropped (set to 0), and so on until training ends.
Dropout should be regarded as a major innovation of AlexNet, so much so that Hinton, the "father of neural networks", talked about Dropout in his lectures for a long time afterwards. Dropout can also be viewed as a form of model ensembling: the network structure generated in each iteration is different, and combining multiple models effectively reduces overfitting. Dropout achieves an effect similar to model averaging while requiring only about twice the training time, which is very efficient.
As shown in the figure below:
Insert image description here
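A small Keras sketch (added illustration) showing that Dropout only zeroes activations during training; at inference time it passes values through unchanged (Keras rescales the kept units during training instead).

import tensorflow as tf

drop = tf.keras.layers.Dropout(rate=0.5)
x = tf.ones((1, 10))

print(drop(x, training=True))    # roughly half the values are zeroed, the rest are scaled by 2
print(drop(x, training=False))   # unchanged: all ones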
6. Multi-GPU training
AlexNet was trained on GTX 580 GPUs. Since a single GTX 580 has only 3GB of memory, which limits the maximum size of the network that can be trained on it, the authors split the network across two GPUs, placing half of the kernels (or neurons) on each GPU for parallel computation, which greatly sped up training.

Layer-by-layer analysis of AlexNet network structure

The following figure is the network structure diagram of AlexNet:
Insert image description here
The AlexNet network structure has 8 layers in total: the first 5 are convolutional layers and the last 3 are fully connected layers; the output of the last fully connected layer is fed to a 1000-way softmax, which produces a distribution over the 1000 class labels.
Since AlexNet was trained on two GPUs, the network structure diagram has two parts: one GPU runs the layers at the top of the diagram and the other runs the layers at the bottom. The two GPUs communicate only at specific layers: the kernels of the second, fourth, and fifth convolutional layers are connected only to the kernel feature maps of the previous layer on the same GPU, the third convolutional layer is connected to all kernel feature maps of the second layer, and the neurons in the fully connected layers are connected to all neurons in the previous layer.

The AlexNet structure is analyzed layer by layer below:

The first layer (convolutional layer)

Insert image description here
The processing flow of this layer is: Convolution-->ReLU-->Pooling-->Normalization. The flow chart is as follows:
Insert image description here
(1) Convolution
The original input image is 224×224×3 (an RGB image); during training it is preprocessed to 227×227×3. This layer uses 96 convolution kernels of size 11×11×3 to compute new pixels. Since two GPUs run in parallel, the upper and lower halves of the structure diagram are each responsible for 48 of the kernels.
The convolution kernel moves across the image along the x and y axes with a certain stride, computing a convolution at each position and producing a new feature map of size floor((img_size - filter_size)/stride) + 1 = new_feature_size, where floor means rounding down, img_size is the input size, filter_size is the kernel size, stride is the step size, and new_feature_size is the size of the feature map after convolution.
In AlexNet this layer uses a stride of 4 pixels, so the feature map produced by each kernel has size (227-11)/4+1 = 55, i.e. 55×55.
(2) ReLU
The 55×55 convolved pixel layers are passed through ReLU units, producing activation layers whose size is still 2 groups of 55×55×48.
(3) Pooling
The ReLU output then goes through a pooling operation with a 3×3 window and stride 2, so the pooled size is (55-3)/2+1 = 27, i.e. 27×27×96.
(4) Normalization
The pooled pixel layers are then normalized with a 5×5 normalization window; the size remains 27×27×96. These 96 feature maps are split into two groups of 48, each processed on a separate GPU. A quick numeric check of these sizes is sketched below.
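A minimal plain-Python check (added here, not from the original article) of the sizes computed above:

def out_size(size, kernel, stride):
    # floor((size - kernel) / stride) + 1, with no padding
    return (size - kernel) // stride + 1

conv1 = out_size(227, 11, 4)   # 55 -> 55x55x96 after the first convolution
pool1 = out_size(conv1, 3, 2)  # 27 -> 27x27x96 after 3x3/stride-2 overlapping pooling
print(conv1, pool1)            # 55 27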

The second layer (convolutional layer)

Insert image description here
This layer is similar to the first layer. The processing flow is: Convolution-->ReLU-->Pooling-->Normalization. The flow chart is as follows:
Insert image description here
(1) Convolution
The input of the second layer is the output of the first layer, a 27×27×96 pixel layer (split into two groups of 27×27×48, each on a different GPU). To facilitate subsequent processing, each pixel layer is padded with 2 pixels of zeros on the top, bottom, left and right edges, so the size becomes (27+2+2)×(27+2+2). The convolution kernel size of the second layer is 5×5 with a stride of 1 pixel; using the same formula as in point (1) of the first layer, the resulting size is (27+2+2-5)/1+1 = 27, i.e. 27×27 after convolution.
This layer uses 256 convolution kernels of size 5×5×48, split into two groups of 128 and assigned to the two GPUs. The result is two groups of 27×27×128 convolved pixel layers.
(2) ReLU
These pixel layers are processed by the ReLU unit to generate an activation pixel layer, and the size is still two sets of 27×27×128 pixel layers.
(3) Pooling
The output then goes through a pooling operation with a 3×3 window and stride 2; the pooled size is (27-3)/2+1 = 13, i.e. 2 groups of 13×13×128 pixel layers.
(4) Normalization
The result is then normalized with a 5×5 normalization window. The normalized output is still 2 groups of 13×13×128 pixel layers, processed by the 2 GPUs respectively.

The third layer (convolutional layer)

Insert image description here
The processing flow of the third layer is: Convolution-->ReLU
Insert image description here
(1) Convolution
The input of the third layer is the 2 groups of 13×13×128 pixel layers output by the second layer. To facilitate subsequent processing, each pixel layer is padded with 1 pixel on every edge, giving (13+1+1)×(13+1+1)×128, distributed across the two GPUs.
Each GPU in this layer has 192 convolution kernels, each of size 3×3×256, so every kernel convolves over all the data of both groups of 13×13×128 pixel layers. In the structure diagram, the two GPUs are connected by crossing dotted lines at this layer, meaning each GPU processes the input from all GPUs of the previous layer.
The convolution stride is 1 pixel, and the size after convolution is (13+1+1-3)/1+1 = 13, i.e. each GPU produces a 13×13×192 output, for a total of 13×13×384 convolved pixel layers across the 2 GPUs.
(2) ReLU
The convolved pixel layers are passed through ReLU units, producing activation layers whose size is still 2 groups of 13×13×192, processed by the two GPUs.

The fourth layer (convolutional layer)

Insert image description here
Similar to the third layer, the processing flow of the fourth layer is: Convolution-->ReLU
Insert image description here
(1) Convolution
The input of the fourth layer is the 2 groups of 13×13×192 pixel layers output by the third layer. As in the third layer, each pixel layer is padded with 1 pixel on every edge, giving (13+1+1)×(13+1+1)×192, distributed across the two GPUs.
Each GPU in this layer has 192 convolution kernels, each of size 3×3×192 (unlike the third layer, there are no dotted-line connections between the GPUs here, i.e. the GPUs do not communicate). The convolution stride is 1 pixel, and the size after convolution is (13+1+1-3)/1+1 = 13, so each GPU produces a 13×13×192 output, and the 2 GPUs together generate 13×13×384 pixel layers.
(2) ReLU
The convolved pixel layers are passed through ReLU units, producing activation layers whose size is still 2 groups of 13×13×192, processed by the two GPUs.

The fifth layer (convolutional layer)

Insert image description here
The processing flow of the fifth layer is: Convolution-->ReLU-->Pooling
Insert image description here
(1) Convolution
The input of the fifth layer is the 2 groups of 13×13×192 pixel layers output by the fourth layer. To facilitate subsequent processing, each pixel layer is padded with 1 pixel on every edge, giving (13+1+1)×(13+1+1); the 2 groups of pixel layers are sent to 2 different GPUs for computation.
Each GPU in this layer has 128 convolution kernels, each of size 3×3×192, with a stride of 1 pixel. The size after convolution is (13+1+1-3)/1+1 = 13, so each GPU produces a 13×13×128 output, and the 2 GPUs together generate 13×13×256 pixel layers.
(2) ReLU
The convolved pixel layers are passed through ReLU units, producing activation layers whose size is still 2 groups of 13×13×128, processed by the two GPUs.
(3) Pooling
The 2 groups of 13×13×128 pixel layers are pooled on the two GPUs with a 3×3 window and stride 2; the pooled size is (13-3)/2+1 = 6, i.e. two groups of 6×6×128 pixel layers, for a total of 6×6×256.

The sixth layer (fully connected layer)

Insert image description here
The processing flow of the sixth layer is: Convolution (fully connected)-->ReLU-->Dropout
Insert image description here
(1) Convolution (fully connected)
The input of the sixth layer is the output of the fifth layer, of size 6×6×256. This layer has 4096 convolution kernels, each of size 6×6×256. Since the kernel size is exactly the same as the size of the feature map to be processed, each coefficient of a kernel is multiplied by exactly one pixel of the feature map, in one-to-one correspondence, which is why this layer is called a fully connected layer. Because the kernel and the feature map have the same size, each convolution produces a single value, so the output is 4096×1×1, i.e. 4096 neurons.
(2) ReLU
The 4096 results are passed through the ReLU activation function, producing 4096 values.
(3) Dropout
These then go through a Dropout operation, which outputs 4096 values.

Layer 7 (fully connected layer)

Insert image description here
The processing flow of the seventh layer is: fully connected-->ReLU-->Dropout.
Insert image description here
The 4096 values output by the sixth layer are fully connected to the 4096 neurons of the seventh layer, passed through ReLU to generate 4096 values, and then through Dropout, which outputs 4096 values.

Layer 8 (fully connected layer)

Insert image description here
The processing flow of the eighth layer is: fully connected.
Insert image description here
The 4096 values output by the seventh layer are fully connected to the 1000 neurons of the eighth layer, which output 1000 float values; these are the prediction results.

The above is a layer-by-layer analysis of the AlexNet network structure. It looks quite complicated, so here is a simplified diagram, which is much easier to read:
Insert image description here
From the introduction above, we can summarize the characteristics and innovations of AlexNet as follows:
Insert image description here

Programming implementation

Downloading the ImageNet dataset
The full ImageNet dataset is very large, containing millions of high-resolution images, and it is not bundled with Keras's keras.datasets module, so training on it usually requires distributed computing or GPUs. Here we use the much smaller CIFAR-10 dataset instead.

The CIFAR-10 dataset is a commonly used image classification dataset containing 10 categories. Each category has 6,000 32x32-pixel color images, 60,000 images in total, of which 50,000 are used for training and 10,000 for testing. The 10 categories are:

  1. airplane
  2. automobile
  3. bird
  4. cat
  5. deer
  6. dog
  7. frog
  8. horse
  9. ship
  10. truck

The label of each image is an integer between 0 and 9, corresponding to one of the 10 categories mentioned above. Therefore, we can use these labels to train and test image classification models.
You can use the following code to load this small sample dataset:

from tensorflow.keras.datasets import cifar10
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape)

After execution, the log shows a download, which is very slow. Instead, you can manually download
https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz, rename it to cifar-10-batches-py.tar.gz, and place it in the ~/.keras/datasets directory (no need to decompress it; the program will unpack it locally). On Windows the directory is C:\Users\<your user>\.keras\datasets.
Insert image description here
Running again outputs
(50000, 32, 32, 3)
Next, randomly select 100 images to check the data:

import numpy as np
import matplotlib.pyplot as plt

# Randomly pick 100 images to display
indices = np.random.choice(len(x_train), size=100, replace=False)
images = x_train[indices]
labels = y_train[indices]

# Plot the images in a 10x10 grid, titled with their class indices
fig = plt.figure(figsize=(10, 10))
for i in range(10):
    for j in range(10):
        index = i * 10 + j
        ax = fig.add_subplot(10, 10, index + 1)
        ax.imshow(images[index])
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_title(labels[index][0])
plt.show()

The display looks like this:
Insert image description here
The dataset has 60,000 images of size 32×32. To feed them to the AlexNet model, each image must be resized to 224×224 with 3 RGB channels and converted to float32. Loading everything at once would need about 60000×224×224×3×4 bytes, i.e. over 30GB of memory, far more than typical GPU memory, so the data has to be loaded and trained incrementally (in batches).

import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.python.ops.numpy_ops import np_config
np_config.enable_numpy_behavior()
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
"""
在Python中,我们可以使用TensorFlow或Keras等深度学习框架来加载CIFAR-10数据集。为了有效地处理大量图像数据,我们可以使用生成器函数和yield语句来逐批加载数据。
生成器函数是一个Python函数,它使用yield语句来产生一个序列的值。当函数执行到yield语句时,它会将当前的值返回给调用者,并暂停函数的执行。当函数再次被调用时,它会从上一次暂停的位置继续执行,并返回下一个值。
"""
def cifar10_generator(x, y, batch_size):
    """
    CIFAR-10 data generator.
    """
    while True:
        for i in range(0, len(x), batch_size):
            x_batch = x[i:i+batch_size]
            y_batch = y[i:i+batch_size]
            x_batch = tf.image.resize_with_pad(x_batch, target_height=224, target_width=224)
            x_batch = x_batch.astype('float32') / 255.0
            yield x_batch, y_batch

from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def alexnet(input_shape, num_classes):
    model = tf.keras.Sequential([
        Conv2D(96, (11,11), strides=(4,4), activation='relu', input_shape=input_shape),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(256, (5,5), strides=(1,1), padding='same', activation='relu'),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(384, (3,3), strides=(1,1), padding='same', activation='relu'),
        Conv2D(384, (3,3), strides=(1,1), padding='same', activation='relu'),
        Conv2D(256, (3,3), strides=(1,1), padding='same', activation='relu'),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(0.5),
        Dense(4096, activation='relu'),
        Dropout(0.5),
        Dense(num_classes, activation='softmax')
    ])
    return model

# Define some hyperparameters
batch_size = 256
epochs = 5
learning_rate = 0.001

# Define the data generators
train_generator = cifar10_generator(x_train, y_train, batch_size)
test_generator = cifar10_generator(x_test, y_test, batch_size)

# Define the model
input_shape = (224,224,3)
num_classes = 10
model = alexnet(input_shape, num_classes)

# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Define the ModelCheckpoint callback
checkpoint = tf.keras.callbacks.ModelCheckpoint('./AlexNet.h5', save_best_only=True, save_weights_only=False, monitor='val_loss')

# Train the model
model.fit(train_generator,
          epochs=epochs,
          steps_per_epoch=len(x_train)//batch_size,
          validation_data=test_generator,
          validation_steps=len(x_test)//batch_size,
          callbacks=[checkpoint]
          )
test_loss, test_acc = model.evaluate(test_generator, steps=len(x_test)//batch_size)
print('Test accuracy:', test_acc)

Prediction results and display images

import numpy as np
import matplotlib.pyplot as plt

# Load the saved model and run recognition on one test image
model = tf.keras.models.load_model('./AlexNet.h5')
srcImage = x_test[105]
p_test = np.array([srcImage])
p_test = tf.image.resize_with_pad(p_test, target_height=224, target_width=224)
p_test = p_test.astype('float32') / 255.0
predictions = model.predict(p_test)
print("Prediction:", np.argmax(predictions))
# Plot the test image (index 105)
plt.imshow(srcImage, cmap=plt.cm.binary)
plt.show()

Output: 1
Insert image description here

Reference: https://my.oschina.net/u/876354/blog/1633143

VGGNet

In 2014, the Oxford University Visual Geometry Group and researchers from Google DeepMind developed a new deep convolutional neural network, VGGNet, which took second place in the ILSVRC 2014 classification task (first place went to GoogLeNet, also proposed that year) and first place in the localization task.
VGGNet explored the relationship between the depth of a convolutional neural network and its performance, successfully constructing 16- to 19-layer networks and showing that increasing depth can, to a certain extent, improve final performance and significantly reduce the error rate. It also scales well and generalizes well when transferred to other image data; to this day, VGG is still used to extract image features.
VGGNet can be regarded as a deepened version of AlexNet, which is composed of two parts: convolutional layer and fully connected layer.

Features of VGG

Let's first take a look at the structure diagram of VGG:
Insert image description here
1. Simple structure
VGG consists of 5 groups of convolutional layers, 3 fully connected layers, and a softmax output layer; max pooling separates the groups, and all hidden layers use the ReLU activation function.
2. Small convolution kernels and multiple convolutional sub-layers
VGG uses several stacked convolutional layers with small 3x3 kernels instead of a single layer with a larger kernel. On the one hand this reduces the number of parameters; on the other hand it is equivalent to performing more non-linear mappings, which increases the fitting/expressive power of the network.
The small convolution kernel is an important feature of VGG. Although VGG imitates the overall layout of AlexNet, it does not use AlexNet's larger kernel sizes (such as 7x7); instead it shrinks the kernels to 3x3 and increases the number of convolutional sub-layers to achieve the same effect (VGG: 1 to 4 sub-layers per group; AlexNet: 1 sub-layer).
The VGG authors argue that the receptive field of two stacked 3x3 convolutions is equivalent to that of one 5x5 convolution, and three stacked 3x3 convolutions are equivalent to one 7x7 convolution. This adds non-linear mappings while reducing parameters (for example, a 7x7 kernel has 49 weights, while three 3x3 kernels have 27; see the parameter check sketched after point 3), as shown in the figure below:
Insert image description here
3. Small pooling kernels
Compared with AlexNet's 3x3 pooling kernels, VGG uses 2x2 pooling kernels throughout.
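To make the parameter comparison in point 2 concrete, here is a small TensorFlow/Keras check (added illustration, not from the original article) of three stacked 3x3 convolutions versus one 7x7 convolution, both mapping 256 channels to 256 channels (biases disabled to match the 27*C^2 vs 49*C^2 counts):

import tensorflow as tf
from tensorflow.keras import layers, models

C = 256
three_3x3 = models.Sequential([
    layers.Conv2D(C, 3, padding='same', use_bias=False, input_shape=(None, None, C)),
    layers.Conv2D(C, 3, padding='same', use_bias=False),
    layers.Conv2D(C, 3, padding='same', use_bias=False),
])
one_7x7 = models.Sequential([
    layers.Conv2D(C, 7, padding='same', use_bias=False, input_shape=(None, None, C)),
])
print(three_3x3.count_params())  # 3 * 3*3 * 256 * 256 = 1,769,472
print(one_7x7.count_params())    # 7*7 * 256 * 256     = 3,211,264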
4. Large number of channels
The first convolutional group of VGG has 64 channels, and each subsequent group doubles this up to a maximum of 512 channels. More channels allow more information to be extracted.
5. Deeper layers and wider feature maps.
Since the convolution kernel focuses on expanding the number of channels and the pooling focuses on reducing the width and height, the model architecture is made deeper and wider while controlling the increase in calculation amount.
6. Fully connected to convolution (test phase)
This is another feature of VGG: at test time, the three fully connected layers used during training are replaced by three convolutional layers, so the resulting network no longer has the fixed-input-size restriction of fully connected layers and can accept inputs of any width and height, which matters during the testing phase.
As shown in the first picture of this section, the input image is 224x224x3. If the last three layers were fully connected, then at test time every test image would have to be scaled to 224x224x3 to satisfy the input size required by those fully connected layers, which is inconvenient for testing.
As for "fully connected to convolution", the replacement process is as follows:
Insert image description here
For example, to connect a 7x7x512 feature map fully to a layer of 4096 neurons, the fully connected layer is replaced by a convolution over the 7x7x512 map with 4096 output channels and a 7x7 kernel, and the subsequent fully connected layers become 1x1 convolutions. A sketch of this conversion follows the figure below.
The idea of this "fully connected to convolution" comes from OverFeat. As the picture below shows, after replacing the fully connected layers with convolutions, OverFeat can compute the convolutions over the entire image at any resolution, without having to rescale the original image.
Insert image description here
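A hedged Keras sketch of this "fully connected to convolution" conversion (an added illustration based on the description above, not code from the paper): the first 4096-unit fully connected layer over a 7x7x512 feature map becomes a 7x7 convolution with 4096 output channels, and the later fully connected layers become 1x1 convolutions. With a larger input, the same head then produces a spatial map of class scores instead of a single vector.

import tensorflow as tf
from tensorflow.keras import layers, models

conv_head = models.Sequential([
    layers.Conv2D(4096, (7, 7), activation='relu', input_shape=(None, None, 512)),
    layers.Conv2D(4096, (1, 1), activation='relu'),
    layers.Conv2D(1000, (1, 1), activation='softmax'),
])

print(conv_head(tf.random.uniform((1, 7, 7, 512))).shape)     # (1, 1, 1, 1000): same as the fully connected head
print(conv_head(tf.random.uniform((1, 14, 14, 512))).shape)   # (1, 8, 8, 1000): dense predictions over a larger input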

VGG network structure

The figure below shows the VGG network structures from the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition", the paper in which VGG was proposed:
Insert image description here
In this paper, six network configurations, A, A-LRN, B, C, D, and E, were tested. These six configurations are similar: each consists of 5 groups of convolutional layers followed by 3 fully connected layers. The difference is the number of sub-layers in each convolutional group, which increases from A to E (from 1 to 4 sub-layers), giving total network depths of 11 to 19 layers (the added layers are shown in bold in the table). Convolutional layer parameters are written as conv<receptive field size>-<number of channels>; for example, conv3-128 means a 3x3 convolution kernel with 128 channels. For brevity, the ReLU activation function is not shown in the table.
Among them, network structure D is the famous VGG16, and network structure E is the famous VGG19.

Taking network structure D (VGG16) as an example, the processing steps are introduced below. Please compare the table above with the picture below and follow the changes in the numbers, which will help you understand VGG16's processing:
Insert image description here
1. Input a 224x224x3 image. After two convolutions + ReLU with 64 3x3 convolution kernels, the size becomes 224x224x64.
2. Perform max pooling with a 2x2 pooling unit (which halves the image size); the size becomes 112x112x64
3. After two convolutions + ReLU with 128 3x3 convolution kernels, the size becomes 112x112x128
4. After 2x2 max pooling, the size becomes 56x56x128
5. After three convolutions + ReLU with 256 3x3 convolution kernels, the size becomes 56x56x256
6. After 2x2 max pooling, the size becomes 28x28x256
7. After three convolutions + ReLU with 512 3x3 convolution kernels, the size becomes 28x28x512
8. After 2x2 max pooling, the size becomes 14x14x512
9. After three convolutions + ReLU with 512 3x3 convolution kernels, the size becomes 14x14x512
10. After 2x2 max pooling, the size becomes 7x7x512
11. Pass through three fully connected layers: two of 1x1x4096 (with ReLU) and one of 1x1x1000
12. Output 1000 prediction results through softmax

The above describes the layer-by-layer processing of VGG16 (network structure D); the other configurations A, A-LRN, B, C, and E work similarly. The execution process is as follows (taking VGG16 as an example):
Insert image description here
From this process we can see that the VGG structure is quite simple, consisting of small convolution kernels, small pooling kernels, and ReLU. A simplified diagram is as follows (again for VGG16):
Insert image description here
Although the depth of the six configurations A, A-LRN, B, C, D, and E increases from 11 to 19 layers, the number of parameters (in millions) does not change much, because small 3x3 convolution kernels (only 9 weights each) are used throughout and the parameters are mainly concentrated in the fully connected layers.
Insert image description here
The authors evaluated the six configurations A, A-LRN, B, C, D, and E at a single scale; the error rates are shown below:
Insert image description here
From the table above, we can see:
1. The LRN layer brings no performance gain (A-LRN)
With configuration A-LRN, the VGG authors found that the LRN (local response normalization) layer used by AlexNet did not improve performance, so LRN does not appear in the other configurations.
2. As the depth increases, the classification performance gradually improves (A, B, C, D, E).
From the 11-layer configuration A to the 19-layer E, the top-1 and top-5 error rates decrease noticeably as the network depth increases.
3. Multiple small convolution kernels perform better than a single large convolution kernel (B)
The VGG authors compared configuration B with a shallower network (not listed in the table) in which each pair of 3x3 convolutions in B was replaced by a single 5x5 convolution; the results show that multiple small convolution kernels outperform a single larger one.

Finally, let’s make a summary:
1. Performance can be effectively improved by increasing depth;
2. The best model: VGG16, which has only 3x3 convolution and 2x2 pooling from beginning to end, is simple and beautiful;
3. Convolution can replace full connection, allowing the network to handle images of various sizes

Programming implementation

Downloading the ILSVRC2014 dataset from image-net currently requires registration and approval, which is troublesome. You can instead download the ILSVRC2017 version from the Alibaba Cloud Tianchi dataset site (log in with DingTalk or Alipay real-name authentication; many large datasets can then be downloaded directly), address: https://tianchi.aliyun.com/dataset/92252. Download imagenet_object_localization_patched2019 (1).tar.gz; the dataset is about 155GB.
Because the data set is too large, I still use cifar10 here.

VGGNet and AlexNet are both deep neural network models, and VGGNet is deeper than AlexNet, so it requires more computing resources and time to train. Specifically, VGGNet has 16 or 19 layers, while AlexNet only has 8 layers. This means that VGGNet needs to process more parameters and data, and requires longer training time. In addition, VGGNet uses smaller convolution kernels, which also results in more calculations. Therefore, it is normal for VGGNet training to be much slower than AlexNet.

import tensorflow as tf
from tensorflow.keras.datasets import cifar10
from tensorflow.python.ops.numpy_ops import np_config
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Sequential

np_config.enable_numpy_behavior()
(x_train, y_train), (x_test, y_test) = cifar10.load_data()


def cifar10_generator(x, y, batch_size):
    while True:
        for i in range(0, len(x), batch_size):
            x_batch = x[i:i+batch_size]
            y_batch = y[i:i+batch_size]
            x_batch = tf.image.resize_with_pad(x_batch, target_height=224, target_width=224)
            x_batch = x_batch.astype('float32') / 255.0
            yield x_batch, y_batch


def vggnet(input_shape, num_classes):
    # Define VGGNet
    model = Sequential([
        # First group of convolution and pooling
        Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=input_shape),
        Conv2D(64, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((2, 2)),

        # Second group of convolution and pooling
        Conv2D(128, (3, 3), activation='relu', padding='same'),
        Conv2D(128, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((2, 2)),

        # Third group of convolution and pooling
        Conv2D(256, (3, 3), activation='relu', padding='same'),
        Conv2D(256, (3, 3), activation='relu', padding='same'),
        Conv2D(256, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((2, 2)),

        # Fourth group of convolution and pooling
        Conv2D(512, (3, 3), activation='relu', padding='same'),
        Conv2D(512, (3, 3), activation='relu', padding='same'),
        Conv2D(512, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((2, 2)),

        # Fifth group of convolution and pooling
        Conv2D(512, (3, 3), activation='relu', padding='same'),
        Conv2D(512, (3, 3), activation='relu', padding='same'),
        Conv2D(512, (3, 3), activation='relu', padding='same'),
        MaxPooling2D((2, 2)),

        # Flatten the feature maps and connect the fully connected layers
        Flatten(),
        Dense(4096, activation='relu'),
        Dense(4096, activation='relu'),
        Dense(num_classes, activation='softmax')
    ])

    return model

# Define some hyperparameters
batch_size = 128
epochs = 5
learning_rate = 0.001

# Define the data generators
train_generator = cifar10_generator(x_train, y_train, batch_size)
test_generator = cifar10_generator(x_test, y_test, batch_size)

# Define the model
input_shape = (224,224,3)
num_classes = 10
model = vggnet(input_shape, num_classes)
model.summary()
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Define the ModelCheckpoint callback
checkpoint = tf.keras.callbacks.ModelCheckpoint('./VGGNet.h5', save_best_only=True, save_weights_only=False, monitor='val_loss')

# Train the model
model.fit(train_generator,
          epochs=epochs,
          steps_per_epoch=len(x_train)//batch_size,
          validation_data=test_generator,
          validation_steps=len(x_test)//batch_size,
          callbacks=[checkpoint]
          )
test_loss, test_acc = model.evaluate(test_generator, steps=len(x_test)//batch_size)
print('Test accuracy:', test_acc)



Reference: https://my.oschina.net/u/876354/blog/1634322

GoogLeNet

In 2014, GoogLeNet and VGG were the two stars of that year's ImageNet challenge (ILSVRC14): GoogLeNet took first place and VGG second. A common feature of the two models is that they are deeper than their predecessors. VGG inherits much of the framework of LeNet and AlexNet, while GoogLeNet makes a bolder attempt at the network structure: although it is 22 layers deep, its size is much smaller than AlexNet and VGG. GoogLeNet has about 5 million parameters, AlexNet has roughly 12 times as many, and VGGNet has about 3 times as many parameters as AlexNet. So when memory or computing resources are limited, GoogLeNet is the better choice, and in terms of results its performance is also superior.

A bit of trivia: GoogLeNet is a deep network structure developed by Google. Why is it spelled "GoogLeNet" rather than "GoogleNet"? Reportedly as a tribute to LeNet, hence the name "GoogLeNet".

So, how does GoogLeNet further improve performance?
Generally speaking, the most direct way to improve network performance is to increase its depth and width, where depth is the number of layers and width the number of neurons per layer. However, this approach has the following problems:
(1) More parameters: with a limited training set, a larger network overfits easily;
(2) More computation: the larger the network and the more parameters it has, the higher the computational cost, which makes it hard to apply;
(3) Vanishing gradients: the deeper the network, the more the gradients tend to vanish as they propagate backward, making the model hard to optimize.
Hence the joke that "deep learning" is really "deep parameter tuning".
The way to solve these problems is, of course, to reduce the number of parameters while still increasing depth and width. A natural idea is to replace full connections with sparse connections. In practice, however, this does not reduce the actual computation much, because most hardware is optimized for dense matrix computation: a sparse matrix holds less data, but its computation time is hard to bring down.

So, is there a way to keep the sparsity of the network structure while exploiting the high computational efficiency of dense matrices? A large body of literature shows that sparse matrices can be clustered into denser sub-matrices to improve computing performance, much as the human brain can be viewed as a repeated accumulation of neurons. The GoogLeNet team therefore proposed the Inception module: a "basic neuron"-like building block used to assemble a network that is sparse in structure yet efficient to compute.
[Here comes the question] What is Inception?
Inception has evolved through multiple versions, V1, V2, V3, V4 and so on, each improving on the previous one. They are introduced one by one below.

Inception V1

The idea is to design a sparse network structure that can nevertheless produce dense data, increasing the performance of the neural network while keeping the use of computing resources efficient. Google proposed the following basic structure for the original Inception:
Insert image description here
This structure places the convolutions commonly used in CNNs (1x1, 3x3, 5x5) and a 3x3 pooling operation side by side; the outputs keep the same spatial size after convolution and pooling, so they can be concatenated along the channel dimension. On the one hand this widens the network; on the other it improves the network's adaptability to different scales.
The smaller convolutions in the module extract fine-grained details of the input, while the 5x5 filter covers a larger part of the receptive field. A pooling branch is also included to reduce spatial extent and overfitting. A ReLU follows each convolutional layer to add non-linearity to the network.
However, in this original version of Inception, every convolution kernel operates on all output channels of the previous layer, and the 5x5 convolutions in particular require too much computation, producing a very thick feature map. To avoid this, a 1x1 convolution is added before the 3x3 and 5x5 convolutions and after the max pooling to reduce the channel depth of the feature maps. This yields the Inception V1 module, shown in the following figure:
Insert image description here
What is the use of the 1x1 convolution kernel?
The main purpose of the 1x1 convolution is dimensionality reduction; it is also followed by ReLU, adding non-linearity. For example, suppose the previous layer's output is 100x100x128. Passing it through a 5x5 convolutional layer with 256 channels (stride=1, pad=2) gives an output of 100x100x256, and the convolution has 128x5x5x256 = 819,200 parameters. If the output instead first goes through a 1x1 convolutional layer with 32 channels and then the 5x5 convolutional layer with 256 outputs, the result is still 100x100x256, but the number of convolution parameters drops to 128x1x1x32 + 32x5x5x256 = 4,096 + 204,800 = 208,896, roughly a 4x reduction.
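As a quick sanity check of the arithmetic above, here is a minimal Keras sketch (assuming a 100x100x128 input, as in the example) that builds both variants and compares their parameter counts; the counts reported by Keras include biases, so they are slightly above the weight-only numbers quoted above.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

# Variant A: a single 5x5 convolution with 256 output channels
a = Sequential([
    Conv2D(256, (5, 5), padding='same', input_shape=(100, 100, 128)),
])

# Variant B: a 1x1 "bottleneck" with 32 channels before the same 5x5 convolution
b = Sequential([
    Conv2D(32, (1, 1), padding='same', input_shape=(100, 100, 128)),
    Conv2D(256, (5, 5), padding='same'),
])

print('5x5 only     :', a.count_params())  # 819,456 (819,200 weights + 256 biases)
print('1x1 then 5x5 :', b.count_params())  # 209,184 (208,896 weights + 288 biases)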

The network structure of GoogLeNet based on Inception is as follows (22 layers in total):
Insert image description here
The figure above is explained as follows:
(1) GoogLeNet adopts a modular structure (the Inception module), which makes it easy to extend and modify;
(2) At the end, the network uses average pooling instead of a fully connected layer, an idea taken from NIN (Network in Network); this was found to improve accuracy by about 0.6%. A fully connected layer is in fact still added at the very end, mainly to make it easy to adapt the output flexibly;
(3) Although the fully connected layers were removed, Dropout is still used in the network;
(4) To mitigate vanishing gradients, the network adds two auxiliary softmax classifiers that inject gradients during training. Each auxiliary classifier takes the output of an intermediate layer, classifies it, and contributes to the final loss with a small weight (0.3). This acts like a form of model fusion, provides an extra back-propagated gradient signal, and adds regularization, all of which benefits training of the whole network. At inference time these two auxiliary softmax branches are removed. A minimal sketch of such an auxiliary head follows below.
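As a rough illustration of point (4), the sketch below builds one auxiliary softmax head with the Keras functional API, using the layer sizes reported for GoogLeNet (5x5 average pooling with stride 3, a 1x1 convolution with 128 filters, a 1024-unit fully connected layer and 70% dropout); the names intermediate_output, num_classes and the commented loss-weight snippet are illustrative placeholders, not part of the original text.

from tensorflow.keras import layers

def auxiliary_classifier(intermediate_output, num_classes, name):
    """Auxiliary softmax head attached to an intermediate feature map (training only)."""
    x = layers.AveragePooling2D((5, 5), strides=3)(intermediate_output)
    x = layers.Conv2D(128, (1, 1), activation='relu')(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation='relu')(x)
    x = layers.Dropout(0.7)(x)
    return layers.Dense(num_classes, activation='softmax', name=name)(x)

# When compiling a multi-output model, the auxiliary losses are down-weighted (0.3 each), e.g.:
# model.compile(optimizer='adam',
#               loss='sparse_categorical_crossentropy',
#               loss_weights={'main': 1.0, 'aux1': 0.3, 'aux2': 0.3})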

The details of the GoogLeNet network structure diagram are as follows:
Insert image description here
Note: "#3x3 reduce" and "#5x5 reduce" in the above table indicate the number of 1x1 convolutions used before the 3x3 and 5x5 convolution operations.

The detailed GoogLeNet network structure is analyzed layer by layer as follows:
0. Input
The original input image is 224x224x3 and is zero-mean preprocessed (the mean is subtracted from every pixel of the image).
1. First layer (convolutional layer)
A 7x7 convolution kernel (stride 2, padding 3) with 64 channels; the output is 112x112x64. The convolution is followed by ReLU and then 3x3 max pooling (stride 2), whose output is ((112 - 3 + 1) / 2) + 1 = 56, i.e. 56x56x64, followed by another ReLU.
2. Second layer (convolutional layer)
A 3x3 convolution kernel (stride 1, padding 1) with 192 channels; the output is 56x56x192. After convolution and ReLU, 3x3 max pooling (stride 2) gives ((56 - 3 + 1) / 2) + 1 = 28, i.e. 28x28x192, followed by another ReLU.
3a. Third layer (Inception 3a)
The layer is split into four branches that use convolution kernels of different sizes:
(1) 64 1x1 convolution kernels, followed by ReLU; output 28x28x64.
(2) 96 1x1 convolution kernels as dimensionality reduction before the 3x3 convolution, giving 28x28x96, then ReLU, then 128 3x3 convolutions (padding 1); output 28x28x128.
(3) 16 1x1 convolution kernels as dimensionality reduction before the 5x5 convolution, giving 28x28x16, then ReLU, then 32 5x5 convolutions (padding 2); output 28x28x32.
(4) A 3x3 max pooling layer (padding 1), output 28x28x192, followed by 32 1x1 convolutions; output 28x28x32.
The four results are concatenated along the channel dimension: 64 + 128 + 32 + 32 = 256, so the final output is 28x28x256.
3b. Third layer (Inception 3b)
(1) 128 1x1 convolution kernels, followed by ReLU; output 28x28x128.
(2) 128 1x1 convolution kernels as dimensionality reduction before the 3x3 convolution, giving 28x28x128, then ReLU, then 192 3x3 convolutions (padding 1); output 28x28x192.
(3) 32 1x1 convolution kernels as dimensionality reduction before the 5x5 convolution, giving 28x28x32, then ReLU, then 96 5x5 convolutions (padding 2); output 28x28x96.
(4) A 3x3 max pooling layer (padding 1), output 28x28x256, followed by 64 1x1 convolutions; output 28x28x64.
The four results are concatenated along the channel dimension: 128 + 192 + 96 + 64 = 480, so the final output is 28x28x480.

The fourth layer (4a, 4b, 4c, 4d, 4e) and the fifth layer (5a, 5b) follow the same pattern as 3a and 3b and are not repeated here; a minimal code sketch of the stem and the Inception module is given below.
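As referenced above, here is a minimal Keras sketch (an illustration under simplifying assumptions, not the full 22-layer network) of the stem from steps 1-2 and of an Inception V1 module parameterized with the 3a and 3b channel counts listed above; running it should reproduce the 28x28x256 and 28x28x480 output shapes.

from tensorflow.keras import layers, Model

def inception_module(x, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
    """Inception V1 module: four parallel branches concatenated along the channel axis."""
    b1 = layers.Conv2D(c1, (1, 1), padding='same', activation='relu')(x)

    b2 = layers.Conv2D(c3_reduce, (1, 1), padding='same', activation='relu')(x)
    b2 = layers.Conv2D(c3, (3, 3), padding='same', activation='relu')(b2)

    b3 = layers.Conv2D(c5_reduce, (1, 1), padding='same', activation='relu')(x)
    b3 = layers.Conv2D(c5, (5, 5), padding='same', activation='relu')(b3)

    b4 = layers.MaxPooling2D((3, 3), strides=1, padding='same')(x)
    b4 = layers.Conv2D(pool_proj, (1, 1), padding='same', activation='relu')(b4)

    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

inputs = layers.Input(shape=(224, 224, 3))
# Simplified stem, roughly steps 1-2 above: 7x7/2 conv -> 3x3/2 pool -> 3x3 conv -> 3x3/2 pool
x = layers.Conv2D(64, (7, 7), strides=2, padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D((3, 3), strides=2, padding='same')(x)   # 56x56x64
x = layers.Conv2D(192, (3, 3), padding='same', activation='relu')(x)
x = layers.MaxPooling2D((3, 3), strides=2, padding='same')(x)   # 28x28x192

x = inception_module(x, 64, 96, 128, 16, 32, 32)    # Inception 3a -> 28x28x256
x = inception_module(x, 128, 128, 192, 32, 96, 64)  # Inception 3b -> 28x28x480

Model(inputs, x).summary()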

Judging from GoogLeNet's experimental results, the improvement is clear: its error rate is lower than that of MSRA, VGG and other models. The comparison is shown in the following table:
Insert image description here

Inception V2

GoogLeNet has been studied and used by many researchers due to its excellent performance. Therefore, the GoogLeNet team further explored and improved it and produced an upgraded version of GoogLeNet.
GoogLeNet was originally designed to be both accurate and fast. Simply stacking more layers can improve accuracy, but it significantly reduces computational efficiency, so the question became how to increase the network's expressive power without increasing the computational cost too much.
The solution for the Inception V2 version is to modify the internal calculation logic of Inception and propose a relatively special "convolution" calculation structure.

1. Convolution decomposition (Factorizing Convolutions)
A large convolution kernel brings a larger receptive field, but it also means more parameters: a 5x5 kernel has 25 parameters while a 3x3 kernel has 9, so the former has 25/9, about 2.78, times as many. The GoogLeNet team therefore proposed replacing a single 5x5 convolutional layer with a small network of two consecutive 3x3 convolutional layers, which keeps the receptive field while reducing the parameter count, as shown below:
Insert image description here
Does this substitution weaken the network's expressive power? Extensive experiments showed that it does not cause a loss of expressiveness.
Since a large convolution kernel can be replaced by a series of 3x3 convolutions, can the kernels be decomposed into even smaller ones? The GoogLeNet team considered nx1 convolution kernels; as shown in the figure below, a 3x3 convolution is replaced by a 3x1 convolution followed by a 1x3 convolution:
Insert image description here
In general, any nxn convolution can be replaced by a 1xn convolution followed by an nx1 convolution. The GoogLeNet team found that this decomposition does not work well in the early layers of the network; it works better on medium-sized feature maps (sizes between roughly 12 and 20 are recommended). A minimal parameter-count comparison of both factorizations is given after the figure below.
Insert image description here
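Here is a minimal sketch comparing parameter counts for the two factorizations described above: a 5x5 convolution versus two stacked 3x3 convolutions, and a 3x3 convolution versus a 3x1 followed by a 1x3. The channel count of 64 and the 28x28 input size are arbitrary assumptions that only serve the comparison.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

channels = 64  # arbitrary, for the parameter comparison only

# One 5x5 convolution vs. two stacked 3x3 convolutions (same receptive field)
five = Sequential([Conv2D(channels, (5, 5), padding='same', input_shape=(28, 28, channels))])
two_threes = Sequential([
    Conv2D(channels, (3, 3), padding='same', activation='relu', input_shape=(28, 28, channels)),
    Conv2D(channels, (3, 3), padding='same', activation='relu'),
])

# One 3x3 convolution vs. an asymmetric 3x1 followed by 1x3 (same receptive field)
three = Sequential([Conv2D(channels, (3, 3), padding='same', input_shape=(28, 28, channels))])
asym = Sequential([
    Conv2D(channels, (3, 1), padding='same', activation='relu', input_shape=(28, 28, channels)),
    Conv2D(channels, (1, 3), padding='same', activation='relu'),
])

print('5x5:', five.count_params(), 'vs two 3x3:', two_threes.count_params())
print('3x3:', three.count_params(), 'vs 3x1 + 1x3:', asym.count_params())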
2. Reduce the size of the feature map.
Generally, if you want to reduce the size of the image, there are two ways:
Insert image description here
pooling first and then applying the Inception convolution, or applying the Inception convolution first and then pooling. The first approach (left figure) makes the feature representation hit a bottleneck (information is lost); the second (right figure) preserves the representation but is computationally very expensive. To keep the representational power while reducing computation, the module was redesigned as shown below: two parallel branches (convolution and pooling) are executed side by side and their outputs are then concatenated.
Insert image description here
Using Inception V2, an improved version of GoogLeNet was built; its network structure is as follows:
Insert image description here
Note: "Figure 5" in the table above refers to the original, unfactorized Inception module, "Figure 6" to the small-convolution version (3x3 kernels replacing the 5x5 kernel), and "Figure 7" to the asymmetric version (1xn and nx1 kernels replacing the nxn kernel).
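To make the parallel reduction idea from point 2 above concrete, here is a minimal, illustrative sketch of such a module: a stride-2 convolution branch and a stride-2 pooling branch are computed side by side and concatenated, so the spatial size is halved without first forcing the features through a pooling bottleneck. The filter count is a placeholder, not a value from the paper.

from tensorflow.keras import layers

def reduction_module(x, conv_filters):
    """Parallel feature-map reduction: strided convolution and pooling branches concatenated."""
    conv_branch = layers.Conv2D(conv_filters, (3, 3), strides=2,
                                padding='same', activation='relu')(x)
    pool_branch = layers.MaxPooling2D((3, 3), strides=2, padding='same')(x)
    return layers.Concatenate(axis=-1)([conv_branch, pool_branch])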

In experiments, the model's results improved considerably over the old GoogLeNet, as shown in the following table:
Insert image description here

Inception V3

The most important improvement in Inception V3 is factorization: a 7x7 convolution is decomposed into two one-dimensional convolutions (1x7 and 7x1), and likewise 3x3 into 1x3 and 3x1. Besides speeding up computation, splitting one convolution into two deepens the network further and adds non-linearity (a ReLU follows each additional layer).
In addition, the network input size changed from 224x224 to 299x299.

Inception V4

Inception V4 studies the combination of the Inception module with residual connections. The ResNet structure allows networks to be made much deeper, greatly speeds up training, and also improves performance (for an introduction to the principles of ResNet, see the earlier post on this blog: Dahua Deep Residual Network ResNet).
Inception V4 mainly uses residual connections to improve the V3 structure, resulting in the Inception-ResNet-v1, Inception-ResNet-v2, and Inception-v4 networks.
The residual structure of ResNet is as follows:
Insert image description here
Combining this structure with Inception gives the module shown in the following figure:
Insert image description here
By combining 20 such modules, Inception-ResNet is constructed as follows:
Insert image description here
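To illustrate the combination described above, here is a minimal sketch of an Inception-style residual block: several parallel branches are concatenated, projected back to the input width with a 1x1 convolution, scaled down, and added to the shortcut. The branch widths and the scaling factor are illustrative assumptions, not the exact numbers from the paper.

from tensorflow.keras import layers

def inception_resnet_block(x, scale=0.1):
    """Illustrative Inception-style residual block: branches, 1x1 projection, scaled shortcut add."""
    in_channels = int(x.shape[-1])

    b1 = layers.Conv2D(32, (1, 1), padding='same', activation='relu')(x)

    b2 = layers.Conv2D(32, (1, 1), padding='same', activation='relu')(x)
    b2 = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(b2)

    b3 = layers.Conv2D(32, (1, 1), padding='same', activation='relu')(x)
    b3 = layers.Conv2D(48, (3, 3), padding='same', activation='relu')(b3)
    b3 = layers.Conv2D(64, (3, 3), padding='same', activation='relu')(b3)

    mixed = layers.Concatenate(axis=-1)([b1, b2, b3])
    up = layers.Conv2D(in_channels, (1, 1), padding='same')(mixed)  # match the input width
    up = layers.Lambda(lambda t: t * scale)(up)                     # scale the residual branch

    out = layers.Add()([x, up])                                     # shortcut (residual) connection
    return layers.Activation('relu')(out)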

Programming implementation

To be added later.

Reference: https://my.oschina.net/u/876354/blog/1637819

Origin blog.csdn.net/liaomin416100569/article/details/130677530