Deep Learning 03-Convolutional Neural Network (CNN)

Introduction

CNN, or Convolutional Neural Network, is a deep learning model commonly used for image and video processing. Compared with traditional neural networks, CNN has a better ability to process images and sequence data because it can automatically learn features in images and extract the most useful information.

A core feature of CNN is the convolution operation, which slides filters (also called convolution kernels) over the image in a sliding-window fashion to extract features. Convolution greatly reduces the number of weights and the amount of computation while preserving the spatial structure of the image. The pooling layer (for example, max pooling) further reduces computation and improves the robustness of the model by shrinking the spatial size of the feature map while leaving the number of channels unchanged.

Typical structures of CNN include convolutional layers, pooling layers, fully connected layers, etc. At the same time, in order to prevent over-fitting, CNN will also add some regularization techniques, such as Dropout and L2 regularization.

CNN is widely used in image classification, object detection, speech recognition and other fields. In image classification tasks, classic CNN models include LeNet-5, AlexNet, VGG and GoogleNet/Inception. Their design ideas and network structures differ, but all of them made important contributions to the development of convolutional neural networks.

Development path

Convolutional neural network (CNN) is a deep learning model that is widely used in image recognition, computer vision and other fields. In the development process of CNN, many classic models have emerged. Here is a brief introduction to several famous models.

  1. LeNet-5

LeNet-5 was proposed by Yann LeCun and others in 1998 and was the first widely used convolutional neural network model. It is mainly used for handwritten digit recognition and contains convolutional layers, pooling layers and fully connected layers. Its design allowed it to achieve very good performance on the MNIST handwritten digit recognition task. It is characterized by a small number of convolution kernels and parameters: the first convolutional layer uses 6 convolution kernels of size 5×5, and the second uses 16 kernels of size 5×5. This design keeps the number of parameters small. LeNet-5 is regarded as the originator of convolutional neural networks and laid the foundation for subsequent models.

  2. AlexNet

AlexNet was proposed by Alex Krizhevsky and others in 2012 and was the first convolutional neural network model to achieve outstanding results in the ImageNet image classification competition. It uses multiple convolutional and pooling layers, the ReLU activation function, and Dropout regularization. AlexNet's design put it far ahead of the other entries in the ImageNet competition, kicking off a new wave of work on convolutional neural networks. It is characterized by far more convolution kernels and parameters than earlier models (on the order of 60 million parameters), while still offering good accuracy and efficiency.

  3. VGG

VGG was proposed by Karen Simonyan and Andrew Zisserman in 2014. Its main contribution is the use of small 3×3 convolution kernels in place of larger ones, which allows the network to be deeper while keeping the number of parameters per layer manageable, improving both efficiency and accuracy. VGG comes in 16- and 19-layer variants (counting convolutional and fully connected layers), and all of its convolutional layers use the same kernel size and stride. VGG achieved very good results in the ImageNet image classification competition and provided inspiration for ResNet and other later models.

  4. GoogleNet/Inception

GoogleNet was proposed by the Google team in 2014. Its main contribution is the Inception module, which increases the depth and width of the network without a large increase in the number of parameters. The Inception module applies convolution kernels of several different sizes (and a pooling branch) in parallel and concatenates their outputs into a single module. GoogleNet also replaces the fully connected layer with a global average pooling layer, further reducing the number of parameters. It achieved very good results in the ImageNet image classification competition and inspired later models such as ResNet and DenseNet.

  5. ResNet

ResNet was proposed by the Microsoft Research Asia team in 2015. Its main contribution is residual learning, which addresses the degradation problem of deep convolutional neural networks: the phenomenon that accuracy drops as the network gets deeper. Residual learning introduces shortcut (skip) connections that add the input of a block directly to its output, preventing the loss of information. ResNet uses a much deeper structure (up to 152 layers) yet achieves better accuracy. Its design ideas were inherited by later models such as DenseNet and MobileNet.

  6. DenseNet

DenseNet was proposed by Gao Huang et al. in 2017. Its main contribution is to propose dense connections, which can increase the depth and width of the network, thus improving efficiency and accuracy. Dense connection refers to connecting the output of each layer to the inputs of all subsequent layers to form a dense connection structure. This design makes the network more compact, has fewer parameters, and can also improve the reusability of features. DenseNet achieved very good results in the ImageNet image classification competition, and also provided inspiration for subsequent models such as ShuffleNet and EfficientNet.

  7. MobileNet

MobileNet was proposed by the Google team in 2017. Its main contribution is the depthwise separable convolution, which reduces the number of parameters while maintaining good accuracy. A depthwise separable convolution splits a standard convolution into two steps, a depthwise convolution followed by a pointwise (1×1) convolution, reducing both computation and parameters. MobileNet stacks multiple depthwise separable convolution and pooling layers to achieve efficient image classification and object detection in resource-constrained environments such as mobile devices. Its design ideas were inherited by later models such as ShuffleNet and EfficientNet.

  8. ShuffleNet

ShuffleNet was proposed by researchers at Megvii (Face++) and presented in 2018. Its main contributions are channel shuffle and group convolution, which significantly reduce the number of parameters and the amount of computation while maintaining accuracy. Group convolution splits the input channels into groups and convolves each group independently, cutting computation and parameters; channel shuffle then rearranges the channels so that information can flow between the groups. ShuffleNet stacks multiple group convolution and channel shuffle layers to achieve efficient image classification and object detection in resource-constrained environments.

  9. EfficientNet

EfficientNet was proposed by the Google team in 2019. Its main contribution is compound scaling: a compound coefficient is used to scale the depth, width, and input resolution of the network together, yielding higher accuracy with far fewer parameters and computations. EfficientNet achieved very good results in the ImageNet image classification competition and inspired much of the subsequent work on model scaling.

  10. RegNet

RegNet was proposed by the Facebook AI Research team in 2020. Its main contribution is to derive simple design rules for network structures (regular design spaces): instead of hand-crafting a single architecture, the hyperparameters of the network structure are found by analyzing and optimizing whole populations of networks, yielding efficient models that maintain accuracy with fewer parameters and computations. RegNet achieved very good results on ImageNet image classification and inspired subsequent work on model design.

The above are several well-known convolutional neural network models. Their design ideas and network structures are different, but they have all made important contributions to the development of convolutional neural networks.

Schematic principle

Convolutional neural networks shine in image recognition, achieving unprecedented accuracy and having a wide range of applications. Next, image recognition will be used as an example to introduce the principles of convolutional neural networks.

Case

Suppose we are given a picture that contains either the letter X or the letter O. How can a CNN be used to decide which one it is?

Image input

If a classical fully connected neural network were used, the entire image would have to be fed into the model as input (i.e. every pixel connected to every neuron). As the image grows larger, the number of connection weights grows very quickly, leading to a huge amount of computation.
Human perception of the outside world generally proceeds from the local to the global: we first form an impression of local parts and gradually build up an understanding of the whole. Spatial relationships in an image behave similarly: pixels within a local neighborhood are strongly related, while pixels far apart are only weakly correlated. Therefore each neuron does not need to perceive the whole image; it only needs to perceive a local region, and the local information can then be combined at higher layers to obtain global information. This idea, the local receptive field, is an important device for reducing the number of parameters in convolutional neural networks.

Feature extraction

Suppose the letter X is deformed in various ways, for example by translation, scaling, rotation, or slight distortion. Our goal is for the CNN to accurately recognize X and O under all of these morphological changes, so the key question becomes how to extract effective features for recognition.
Recall the "local receptive field" idea mentioned earlier. For a CNN, the comparison is done piece by piece: it looks for rough features (small image patches) at roughly the same positions in the two images. Compared with matching the entire image pixel by pixel, this small-patch matching captures the similarity between two images much better. Taking the letter X as an example, three important features can be taken from it: small patches covering its two diagonal strokes and the crossing at its center.

The feature extraction above is an idealized illustration. In practice, when images are given as input, the convolutional neural network extracts features from each of them: the input image passes through the first convolutional layer, where the convolution kernels slide over the image and extract low-level features such as edges and corners.

In a convolutional neural network, when several different convolution kernels are used, each kernel has a local receptive field of the same size, but the kernels have different weights, so each kernel learns a different feature.

For example, suppose three different convolution kernels are used in a convolutional layer: the weights of the first kernel detect edges, the weights of the second detect texture, and the weights of the third detect the shape of the target. The local receptive fields of the three kernels are all the same size, but because their weights differ, each kernel learns a different feature.

Note that the size and stride of a convolution kernel also affect its local receptive field. A larger kernel has a correspondingly larger receptive field, and a larger stride means the kernel moves further at each step, which changes which positions it samples.
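
For reference, the output size of a convolution along one dimension can be computed from the input size, kernel size, stride, and padding. A minimal helper (my own sketch, not part of the original text):

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # spatial size of the convolution output along one dimension
    return (input_size + 2 * padding - kernel_size) // stride + 1

print(conv_output_size(28, 3))              # 26: 3x3 kernel, stride 1, no padding
print(conv_output_size(28, 3, stride=2))    # 13: a larger stride samples fewer positions
print(conv_output_size(28, 5, padding=2))   # 28: 'same'-style padding keeps the size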

For example, consider the convolution kernel
[[-1, 0, 1],
 [-1, 0, 1],
 [-1, 0, 1]]

This matrix is a convolution kernel, the Prewitt operator (a simplified form of the Sobel filter). It can be used to detect vertical edges in images.
In computer vision, edges are regions of an image where the gray value changes sharply. A vertical edge is a place where the gray value changes from the left side of a region to the right side (or from right to left).
The kernel works by being convolved with the pixels of the image to extract features. In this example, the middle column of the kernel is 0, so the result does not depend on the pixels in the center column of the window; the left column of -1s subtracts the pixels on the left, and the right column of 1s adds the pixels on the right, so the kernel measures the horizontal difference in gray values across the window.

When the kernel is convolved with the pixels of the image, vertical edges produce an obvious change in the convolution result. Specifically, on one side of a vertical edge the result is a large positive value, while on the other side it is a large negative value. Vertical edges can therefore be identified by thresholding the convolution result (for example, setting the negative part directly to 0, as ReLU does).

For example, let's say we have an image where part of it is a vertical edge. We apply the convolution kernel to the vertical edge part of this image, and the convolution result will show positive and negative values, so that we can extract the position of the vertical edge by thresholding the convolution result.

Hopefully this example helps explain why this kernel can be used to detect vertical edges.
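
To make this concrete, here is a small NumPy/SciPy sketch (the toy image is my own, not from the article) that applies this vertical-edge kernel to a synthetic image whose left half is dark and whose right half is bright:

import numpy as np
from scipy.signal import correlate2d

# synthetic 6x6 image: left half dark (0), right half bright (10) -> a vertical edge in the middle
img = np.zeros((6, 6))
img[:, 3:] = 10

kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

response = correlate2d(img, kernel, mode='valid')
print(response)                    # large positive values around the dark-to-bright edge
print(np.maximum(0, response))     # thresholding: negative responses are set to 0

If the image were flipped so that brightness decreased from left to right, the same positions would produce large negative values instead.
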
Another example is the kernel
[[-0.1111, -0.1111, -0.1111],
 [-0.1111,  1.0000, -0.1111],
 [-0.1111, -0.1111, -0.1111]]
which is a sharpening (high-pass) filter closely related to the Laplacian operator. It can be used to enhance edges in images.
In this matrix, the center weight of 1 keeps the contribution of the center pixel, while the surrounding weights of -0.1111 subtract a fraction of each neighboring pixel.

When this kernel is convolved with the image, the center pixel is emphasized while the surrounding pixels are suppressed. In flat regions the contributions largely cancel out, but at edges, where pixel values change sharply, the convolution result has large positive and negative values, which enhances the contrast of the edges.

For example, let's say we have an image that contains some edges. We apply this convolution kernel to the image, and the convolution result will enhance the contrast of the edges, making them clearer.

Therefore, this convolution kernel can detect edges and make them more obvious by enhancing the contrast of the edges.

Edges

The edge is the place where the gray value of the pixel in the image changes significantly. It usually represents information such as the edge, contour or texture of the object in the image. In image processing and computer vision, edge detection is a commonly used technique that can be used to segment images, extract features, etc.
For example, applying edge extraction to an image produces an edge map that outlines the objects in it.

Corner points

Corner points are special points in local regions of the image where the direction of the edges changes sharply. They are usually formed by the intersection of edges in different directions, have high local curvature, and are one of the important features in images. Corner detection is a commonly used technique in image registration, object tracking, image matching, etc. Commonly used corner detection algorithms include Harris corner detection and Shi-Tomasi corner detection.

OpenCV

OpenCV (Open Source Computer Vision Library) is an open source computer vision and machine learning software library. It can help developers quickly build computer vision applications such as image processing, object detection, face recognition, video analysis, etc.

OpenCV was originally initiated by Intel Corporation and has now become a cross-platform open source project that supports multiple programming languages, including C++, Python, Java, etc., and can run on operating systems such as Windows, Linux, macOS, etc.

OpenCV is used here to extract the edges and corners of a sample picture.

OpenCV itself is not covered in detail here; it will be explained later in the article.
Code

#%%
import cv2  # requires OpenCV (e.g. pip install opencv-python)
import numpy as np
import matplotlib.pyplot as plt

# Read the input image
img = cv2.imread('d:/9.png')
# Convert BGR to RGB so that matplotlib displays the colors correctly
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
gray_ori = gray
# Detect edges with the Canny edge detector
edges = cv2.Canny(gray, 100, 200)

# Create a SIFT object (in OpenCV >= 4.4, cv2.SIFT_create() can be used instead)
sift = cv2.xfeatures2d.SIFT_create()
# Detect keypoints in the image
keypoints = sift.detect(gray, None)
# Draw the keypoints on the image
img_sift = cv2.drawKeypoints(img, keypoints, None, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)

# Detect corners with the Harris corner detector
dst = cv2.cornerHarris(gray, 2, 3, 0.04)
# Mark the corners in red
img_corner = img.copy()
img_corner[dst > 0.01 * dst.max()] = [255, 0, 0]

# Create a matplotlib figure and display the image together with its features
fig, axs = plt.subplots(2, 2, figsize=(10, 10))
axs[0, 0].imshow(img)
axs[0, 0].set_title('Original image')
axs[0, 1].imshow(edges, cmap='gray')
axs[0, 1].set_title('Edges')
axs[1, 0].imshow(img_sift)
# SIFT (Scale-Invariant Feature Transform) features are invariant to rotation, scale and
# illumination changes, which makes them very stable local features.
axs[1, 0].set_title('SIFT features')
axs[1, 1].imshow(img_corner)
axs[1, 1].set_title('Corner features')
plt.show()

Output: a 2×2 grid showing the original image, its Canny edges, its SIFT keypoints, and its Harris corners.

Feature extraction principle

Please read this section after reading the [Convolution] chapter.
Commonly used convolution kernels include the following:

  1. Gaussian filter: used for image smoothing, which can reduce image noise.
  2. High-pass filter: used to highlight high-frequency information in the image, such as edges, corners, etc.
  3. Low-pass filter: used to highlight low-frequency information in the image, such as blur, smoothing, etc.
  4. Sobel filter: used to detect edge information in images.
  5. Laplacian filter: used to enhance high-frequency information of images, such as edges, details, etc.
  6. Scharr filter: Similar to Sobel filter, but more responsive to edges.
  7. Prewitt filter: Similar to the Sobel filter, but has a smoother response to edges.

These convolution kernels can be used for different tasks in image processing, such as edge detection, image smoothing, image enhancement, etc. You can choose appropriate convolution kernels to process images based on different tasks.

The convolution kernel defined below can be regarded as a high-pass filter: its center element has a large weight and the surrounding elements have small (negative) weights. This weight distribution lets the kernel respond to high-frequency information in the image, such as edges and corners. In the convolution operation, the kernel is multiplied element-wise with each window of the image and the results are summed to produce a new pixel value. If the pixels around the center differ greatly from the center pixel, the result is large, indicating that the pixel may be an edge point. This kernel therefore highlights edge information in the image.

kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])

Take a sample picture (the code below assumes ./images/z.png), load it with OpenCV, and convolve it with this kernel.

import cv2
import numpy as np
import matplotlib.pyplot as plt

# Small display helper (originally imported from myutils.common)
def show(image, title, cmap=None, debug=False):
    if debug:
        plt.title(title)
        plt.imshow(image, cmap=cmap)
        plt.show()

# Read the picture
img = cv2.imread('./images/z.png')

# Convert the image to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Define the convolution kernel
kernel = np.array([[-1,-1,-1], [-1,9,-1], [-1,-1,-1]])
# kernel = np.array([[-1,-1,-1,-1,-1],[-1,-1,-1,-1,-1], [-1,-1,20,-1,-1],[-1,-1,-1,-1,-1], [-1,-1,-1,-1,-1]])
# kernel = cv2.getGaussianKernel(5, 1)

# Convolve the grayscale image; CV_32F keeps negative values
# (with an unsigned output type they would be clipped to 0)
edges = cv2.filter2D(gray, cv2.CV_32F, kernel)
print(edges[edges < 0])
# Apply ReLU to the convolution result
edges_relu = np.maximum(0, edges)
show(img, 'Original Image', cmap="gray", debug=True)
show(edges, 'Edges Image', cmap="gray", debug=True)
show(edges_relu, 'Edges ReLU Image', cmap="gray", debug=True)


Why is convolution said to extract only linear features (and why is ReLU needed)?

Let us take a simple example to illustrate that the convolution operation itself cannot extract nonlinear features.

Suppose we have an input matrix X containing the following values:

X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

Now, we use a convolution kernel K of size 2x2 to convolve X. The value of the convolution kernel is as follows:

K = [[1, 1], [1, 1]]

We slide K over X and, at each position, multiply the overlapping elements and sum them (a "valid" convolution), obtaining an output matrix Y:

Y = [[12, 16], [24, 28]]

It can be seen that the output matrix Y is a linear combination of the input matrix X, so the convolution operation itself can only extract linear features of the input matrix X, such as edges and textures.

However, when a nonlinear activation function such as ReLU is applied to the output matrix Y, the linear response is turned into a nonlinear feature. In this example all the values in Y happen to be positive, so ReLU leaves them unchanged:

ReLU(Y) = [[12, 16], [24, 28]]

but any negative responses would have been set to zero, and it is exactly this thresholding that makes the mapping nonlinear.

Therefore, the convolution operation by itself can only extract linear features of the input matrix, but when combined with a nonlinear activation function, nonlinear features can be extracted.
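
Here is a minimal NumPy/SciPy sketch (assuming SciPy is available) that reproduces the numbers above and then applies ReLU:

import numpy as np
from scipy.signal import correlate2d

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
K = np.ones((2, 2))

# slide K over X and sum the element-wise products at each position
Y = correlate2d(X, K, mode='valid')
print(Y)                     # [[12. 16.] [24. 28.]] -- a purely linear combination of X
print(np.maximum(0, Y))      # ReLU: negative responses (none here) would be set to 0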

Convolution

So how are these features calculated for matching? (Don’t tell me that pixels are matched one by one, sweat!)
At this time, we have to invite today’s important guest: convolution. So what is convolution? Don’t worry, I’ll explain it slowly below.
When a new image is given, CNN does not know exactly which parts of the original image these features need to match, so it will try every possible position in the original image, which is equivalent to turning this feature into a filter. This matching process is called a convolution operation, which is also the origin of the name of the convolutional neural network.
The convolution operation is shown in the figure below:

The convolution kernel here is the feature extracted earlier:
[[1, 0, 1],
 [0, 1, 0],
 [1, 0, 1]]
It is compared with every possible 3×3 patch of the image: the values at corresponding positions are multiplied and then summed (and, in this example, divided by 9, the number of elements in the patch). Each result is a single number placed at the position of the patch center, so the output is a new matrix with the outermost border removed. The detailed calculation logic is as follows.

In this example, to compute the match between a feature and a particular small patch of the original image, multiply the pixel values at corresponding positions in the two patches, add up all of the products, and finally divide by the total number of pixels in the patch (note: the division is not strictly necessary).
If both pixels are white (both 1), then 1 × 1 = 1; if both are black, then (-1) × (-1) = 1. Every pair of matching pixels therefore contributes 1 to the sum, while every pair of non-matching pixels contributes -1.
First, take one of the three features extracted earlier and convolve it with the image. For example, when the first feature is placed over the corresponding region of the original X, the two patches match exactly, and the convolution calculation described above gives a result of 1. Matching at other positions (such as the middle of the image) works in the same way.

Repeating this process for all three features produces, for each feature, a new two-dimensional array called a feature map. The closer a value in the feature map is to 1, the more completely that position matches the feature; the closer it is to -1, the more completely it matches the inverse of the feature; and values near 0 indicate little or no match.
It can be seen that as the image gets larger, the number of additions, multiplications and divisions grows very quickly, and larger or more numerous filters add further cost, so the amount of computation can easily become very large.
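
To make the patch-matching computation concrete, here is a small NumPy sketch (the 5×5 letter X is my own toy example, with stroke pixels set to 1 and background pixels set to -1 as described above) that slides a 3×3 feature over the image and normalizes by 9:

import numpy as np

def feature_map(image, feature):
    # slide `feature` over `image`: multiply element-wise, sum, and divide by the patch size
    fh, fw = feature.shape
    oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + fh, j:j + fw]
            out[i, j] = np.sum(patch * feature) / feature.size
    return out

# toy 5x5 letter X: strokes are 1, background is -1
X_img = -np.ones((5, 5))
X_img[range(5), range(5)] = 1
X_img[range(5), range(4, -1, -1)] = 1

# the "center crossing" feature of the X
feature = np.array([[ 1, -1,  1],
                    [-1,  1, -1],
                    [ 1, -1,  1]])

print(feature_map(X_img, feature))   # the value is 1.0 where the feature matches perfectly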

Pooling

To reduce the amount of computation effectively, CNN uses another tool called "pooling". Pooling shrinks the input feature map, reducing the amount of pixel information while keeping the important information.
Pooling is also very simple: typically a 2×2 pooling region is used, and each region is converted into a single value according to some rule, such as taking the maximum value in the region (max-pooling) or the mean (mean-pooling); that value becomes the corresponding pixel of the result.
For example, for the 2×2 pooling region in the upper-left corner of a feature map, max-pooling takes the maximum of that region, max(0.77, -0.11, -0.11, 1.00) = 1.00, as the pooled result. Moving the pooling window to the next 2×2 block, the maximum max(0.11, 0.33, -0.11, 0.33) = 0.33 becomes the pooled value for that block. The remaining regions are handled in the same way, each contributing its maximum to the pooled output, and the same operation is applied to every feature map.
Max-pooling keeps the maximum value of each small block, which is equivalent to keeping the best match found in that block (since values closer to 1 mean better matches). In other words, it does not record exactly where in the window the match occurred, only whether a match occurred somewhere in it.
By adding a pooling layer, the image is reduced, which can greatly reduce the amount of calculation and reduce the machine load.
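
A minimal NumPy sketch of 2×2 max-pooling (the feature-map values are illustrative; only the upper-left block matches the numbers in the text):

import numpy as np

def max_pool_2x2(fmap):
    # 2x2 max pooling with stride 2 (assumes even height and width)
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.array([[ 0.77, -0.11,  0.11,  0.33],
                 [-0.11,  1.00, -0.11,  0.33],
                 [ 0.10, -0.11,  1.00, -0.33],
                 [ 0.33,  0.10, -0.11,  0.77]])
print(max_pool_2x2(fmap))
# [[1.   0.33]
#  [0.33 1.  ]]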

Activation function ReLU (Rectified Linear Units)

Commonly used activation functions include sigmoid, tanh, relu, etc. The first two, sigmoid/tanh, are more common in fully connected layers, and the latter, ReLU, are common in convolutional layers.
Looking back at the perceptron mentioned earlier, the perceptron receives each input, then sums it up, and then outputs it after passing through the activation function. The function of the activation function is to add nonlinear factors and perform nonlinear mapping of the output results of the convolution layer.
In convolutional neural networks, the activation function is generally ReLU (Rectified Linear Unit), which converges quickly and has a simple gradient. Its formula is also very simple: max(0, x), i.e. negative inputs are output as 0 and positive inputs are passed through unchanged.
Let’s take a look at the operation process of the ReLU activation function in this case:
The first value becomes max(0, 0.77) = 0.77, the second value becomes max(0, -0.11) = 0, and so on: every negative value in the feature map is replaced by 0, while positive values pass through unchanged. The same ReLU operation is then applied to all of the feature maps.

Deep neural network

Stacking the convolution, activation function, and pooling operations described above gives one basic building block of the network. By increasing the depth of the network, that is, adding more such layers, a deep neural network is obtained.

Fully connected layers

The fully connected layer plays the role of a "classifier" in the entire convolutional neural network, that is, after passing through deep networks such as convolution, activation function, and pooling, the results are identified and classified through the fully connected layer.
First, the outputs of the deep network (after convolution, activation, and pooling) are flattened and strung together into a single vector. Since the neural network is trained with supervised learning, the weights of the fully connected layer, such as the weights of all the connections used to predict the letter X, are learned from the training samples during model training. At prediction time, the flattened values are weighted and summed with these weights to obtain a score for each class, and the class with the largest score is taken as the recognition result. For example, if the final score for the letter X is 0.92 and the score for the letter O is 0.51, the image is recognized as the letter X.

Convolutional Neural Networks

Putting all of the pieces above together forms the complete "Convolutional Neural Network" (CNN) structure. To review and summarize: a convolutional neural network mainly consists of two parts, feature extraction (convolution, activation function, pooling) and classification (fully connected layers). The well-known handwritten digit recognition network LeNet-5 follows exactly this structure.

Reference for the content of this chapter: https://my.oschina.net/u/876354/blog/1620906

Convolution API

Conv2D

Conv2D is one of the core layers in convolutional neural networks. It is a layer used for convolution processing of images or other two-dimensional data. The function of Conv2D is to perform a series of convolution operations on the input two-dimensional image or data through the convolution kernel, thereby extracting the features in the image or data.

The input of the Conv2D layer is a tensor. The shape of the tensor is usually (batch_size, height, width, channel), where batch_size represents the number of input data, height and width represent the height and width of the input data, and channel represents the number of channels of the input data. (For example, the number of channels of an RGB image is 3).

The output of the Conv2D layer is also a tensor, which represents the feature map obtained after the convolution operation. The shape of the output tensor is usually (batch_size, conv_height, conv_width, filters), where conv_height and conv_width represent the height and width of the feature map after the convolution kernels have been applied, and filters represents the number of convolution kernels, i.e. the number of channels of the output feature map.

During the convolution process, the Conv2D layer applies the convolution kernel to the input data, and obtains the convolved output feature map by calculating the convolution operation between each convolution kernel and the input data one by one. During the convolution process, parameters such as the size, step size, and filling method of the convolution kernel can be freely set to adapt to different application scenarios.
In TensorFlow 2.0 and Keras, you can create a Conv2D layer with the following code:

from tensorflow.keras.layers import Conv2D

conv_layer = Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), padding='same', activation='relu', input_shape=(height, width, channel))
  • filters: The number of convolution kernels, that is, the number of output feature maps.
  • kernel_size: The size of the convolution kernel, which can be an integer, representing the side length of a square convolution kernel, or a tuple, representing convolution kernels with different lengths and widths.
  • strides: the stride, i.e. the distance the convolution kernel moves over the input feature map at each step. It can be a single integer (the same stride in both directions) or a tuple giving different strides along the height and width.
  • padding: the padding method, either 'same' or 'valid'. 'same' pads the input so that (with stride 1) the output feature map has the same spatial size as the input; 'valid' applies no padding, so the output feature map shrinks according to the kernel size and stride.
  • activation: activation function, used to add nonlinear transformation to the feature map. Common activation functions include 'relu', 'sigmoid', 'tanh', etc.
  • input_shape: The shape of the input feature map, which can be a triplet representing height, width and number of channels. This parameter needs to be specified in the first convolutional layer.
  • kernel_regularizer: In deep learning, in order to prevent model overfitting, regularization technology is usually used to constrain the model. One of the commonly used regularization methods is L2 regularization. L2 regularization refers to adding an L2 norm penalty term to the loss function of the model to limit the size of the model weight.
    In Keras, use regularizers.l2(0.001) to add an L2 regularization penalty. Among them, 0.001 is the regularization parameter, which controls the strength of the regularization. The larger the regularization parameter, the greater the impact of the penalty term on the weight, and the complexity of the model will be reduced, thus effectively preventing overfitting.
    Specifically, regularizers.l2(0.001) can be applied to any weight matrix in neural networks, such as fully connected layers, convolutional layers, etc. In the definition of the network, we can add L2 regularization using the kernel_regularizer parameter in the corresponding layer. For example, the code to add a fully connected layer with L2 regularization in Keras is as follows:
layers.Conv2D(32, (3, 3), activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape=(28, 28, 1)),
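
To see how filters, kernel_size, strides, and padding interact, here is a small sketch (the shapes and layer settings are just for illustration) that compares 'same' and 'valid' padding on a dummy 28×28 grayscale input:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 1))   # a dummy batch containing one 28x28 grayscale image

same_conv = layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), padding='same')
valid_conv = layers.Conv2D(filters=32, kernel_size=(3, 3), strides=(1, 1), padding='valid')

print(same_conv(x).shape)    # (1, 28, 28, 32): 'same' keeps the spatial size at stride 1
print(valid_conv(x).shape)   # (1, 26, 26, 32): 'valid' shrinks it by kernel_size - 1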

Convolution example

Take 10 MNIST pictures and convolve them with 10 convolution kernels, output the feature maps, and display them.
Because each picture is convolved with 10 kernels, each picture produces 10 feature maps, for a total of 100 feature maps.

#%%
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Take the first 10 training images
images = train_images[:10]

# Convert the images to float and scale the pixel values to [0, 1]
images = images.astype('float32') / 255.0
# Reshape into a 4D tensor of shape (10, 28, 28, 1): 10 images, each 28 rows x 28 columns
# with 1 (grayscale) channel
images = np.expand_dims(images, axis=3)
# Number of convolution kernels
num_filters = 10

# Define the convolutional layer
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(num_filters, (3, 3), activation='relu', input_shape=(28, 28, 1)),
])

# Compute the feature maps produced by the (randomly initialized) convolution kernels
features = model.predict(images)

# Plot the feature maps: one row per kernel, one column per image
fig, axs = plt.subplots(nrows=num_filters, ncols=10, figsize=(10, num_filters))
for i in range(num_filters):
    for j in range(10):
        axs[i][j].imshow(features[j, :, :, i], cmap='gray')
        axs[i][j].axis('off')
plt.show()

Output: a 10×10 grid of feature maps, one row per convolution kernel and one column per input image.

The np.expand_dims function is used to expand the dimensions of an array on the specified axis. In this example, images is an array of shape (10, 28, 28), representing 10 28x28 grayscale images. However, machine learning models usually need to input a 4-dimensional array, namely (number of samples, image height, image width, number of channels). Therefore, we need to expand the last dimension (number of channels) of the images array by one dimension and turn it into an array of shape (10, 28, 28, 1).
Specifically, axis=3 means expanding the dimensions along the 3rd axis of the array (counting from 0), which appends one dimension to each image and turns it into a three-dimensional array of shape (28, 28, 1). The shape of the images array therefore becomes (10, 28, 28, 1): 10 grayscale images of 28×28 pixels, each with a single channel. In this form, images can be passed as input to the machine learning model.

As can be seen from the output picture above, the output of some convolution kernels is biased towards edges, some corners, and some textures.

MaxPooling2D

keras.layers.MaxPooling2D((2, 2)) is a layer in Keras that is used for maximum pooling operations.
Max pooling is a commonly used operation in convolutional neural networks. It reduces the spatial size of the feature maps (and has no trainable parameters of its own), thereby reducing the amount of computation and memory consumption. The max-pooling operation divides the input into non-overlapping blocks and takes the maximum value of each block as the output. In convolutional neural networks, max-pooling layers are usually alternated with convolutional layers to extract the spatial features of images.

The parameter of the MaxPooling2D layer is a tuple (2, 2), indicating that the size of the pooling window is 2x2. This means that the input image will be divided into multiple blocks of size 2x2, and the maximum value of each block will be taken as the output. If the pooling window size is set to (3, 3), then the input image will be divided into multiple blocks of size 3x3, and the maximum value of each block will be taken as the output.

In short, the MaxPooling2D layer can help the convolutional neural network extract the spatial features of the image while reducing the amount of calculation and memory consumption.
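
A minimal sketch (illustrative shapes) showing that a 2×2 max-pooling layer halves the height and width while leaving the number of channels unchanged:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 28, 28, 32))          # dummy feature maps: 28x28 with 32 channels
pooled = layers.MaxPooling2D((2, 2))(x)
print(pooled.shape)                            # (1, 14, 14, 32)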

Flatten

keras.layers.Flatten() is a layer in Keras that is used to "flatten" the input into a one-dimensional vector.

In a convolutional neural network, convolutional layers and pooling layers are usually used to extract features of images, and then fully connected layers are used for classification. The input of the fully connected layer is a one-dimensional vector, so the previous feature map needs to be "flattened" into a one-dimensional vector. This is what the Flatten layer does.

The Flatten layer has no parameters; it simply unfolds the input tensor into a one-dimensional vector in order. For example, if the input tensor has shape (batch_size, 7, 7, 64), the output shape of the Flatten layer is (batch_size, 7 × 7 × 64) = (batch_size, 3136).

When building a convolutional neural network, a Flatten layer is usually added after the convolutional layer and the pooling layer to flatten the feature map into a one-dimensional vector, and then connect it to the fully connected layer for classification.
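
A minimal sketch (illustrative shape) of the flattening described above:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 7, 7, 64))    # dummy feature maps
flat = layers.Flatten()(x)
print(flat.shape)                      # (1, 3136), i.e. 7 * 7 * 64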

Dense|Dropout

See multilayer perceptron

Handwritten digit recognition

CNN on the MNIST dataset

We will load the MNIST dataset, preprocess it by scaling the pixel values to between 0 and 1, and use the provided training and test splits.
For a detailed explanation of data processing, please refer to the multi-layer perceptron.

import tensorflow as tf
import tensorflow.keras
from tensorflow.keras import layers
from tensorflow.keras.datasets import mnist
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import regularizers
(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = x_train.astype('float32') / 255.
x_test = x_test.astype('float32') / 255.

x_train = x_train[..., tf.newaxis]
x_test = x_test[..., tf.newaxis]

num_classes = 10

y_train = tf.keras.utils.to_categorical(y_train, num_classes)
y_test = tf.keras.utils.to_categorical(y_test, num_classes)

Next, we will define a convolutional neural network model. We will use two convolutional layers and two pooling layers, followed by two fully connected layers and an output layer. We will also use dropout and L2 regularization to prevent overfitting.

model = tf.keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', kernel_regularizer=regularizers.l2(0.001), input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu', kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

model.summary() is a method of the model object in Keras, used to print out the structural information of the model, including the name of each layer, output shape, number of parameters, etc. This is useful for debugging, optimizing the model, and understanding the model structure.

model.summary()

We will then compile the model and use data augmentation techniques to further prevent overfitting. Data augmentation technology will apply a series of random transformations, such as rotation, translation, scaling, etc., to generate new training samples. This makes the model more robust and prevents overfitting.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

datagen = ImageDataGenerator(
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1
)

Next, we will use the training set to train the model and the test set to evaluate the model's performance.

datagen.fit(x_train)
batch_size = 1024
epochs = 10
checkpoint = tf.keras.callbacks.ModelCheckpoint('./model.h5', save_best_only=True, save_weights_only=False, monitor='val_loss')
history = model.fit(datagen.flow(x_train, y_train, batch_size=batch_size),
                    epochs=epochs,
                    validation_data=(x_test, y_test),
                    steps_per_epoch=len(x_train) // batch_size,callbacks=[checkpoint])

score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

The difference between steps_per_epoch and batch_size: batch_size is the number of samples in each training batch. In deep learning the training set is usually divided into multiple batches, each containing several samples; this allows matrix operations to speed up the computation, and the sample order can be shuffled during training to help avoid overfitting.

steps_per_epoch refers to the number of batches that the model needs to be trained in one epoch. Since each epoch contains multiple batches, steps_per_epoch needs to be set to specify how many batches need to be passed in an epoch. Usually, the value of steps_per_epoch can be calculated from the training set size and batch_size. For example, if the training set size is 1000 and batch_size is 32, then 1000 / 32 = 31 batches need to be trained in one epoch, so steps_per_epoch should be set to 31.

It should be noted that steps_per_epoch is not necessarily equal to the training set size divided by batch_size. If the training set size is not divisible by batch_size, then the last batch may contain fewer than batch_size samples. In order to avoid this situation, you can use the downward rounding operation // to calculate steps_per_epoch to ensure that the entire training set can be processed in each epoch.
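
A minimal arithmetic sketch of the relationship, using the numbers from the example above:

train_size = 1000
batch_size = 32
steps_per_epoch = train_size // batch_size   # floor division: 31 full batches per epoch
print(steps_per_epoch)                       # 31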

Fine-tuning

Fine-tuning refers to the method of fine-tuning an already trained model for specific tasks or specific data sets to achieve better performance. Usually, we will use a model pre-trained on a large-scale data set, such as ImageNet and other data sets. This model has learned many common features and patterns during the training process. We can fine-tune this model, adjust some parameters or add some new layers to make this model more suitable for new tasks or new data sets. This approach is usually more efficient than training a model from scratch because the pre-trained model already has good initial weights and feature extraction capabilities.

MNIST-C dataset

MNIST-C is a variant of the MNIST data set, which is a MNIST data set with artificial noise added. The MNIST dataset is a handwritten digit recognition dataset containing 60,000 training samples and 10,000 test samples, each sample is a 28 x 28 pixel grayscale image. The MNIST-C dataset is created by adding random noise to the images of the MNIST dataset. These noises include blur, distortion, brightness changes, etc., making the model more robust.

The MNIST-C dataset is very useful for testing the robustness of machine learning models because it can test the robustness of the model to different types of noise. Each image in the MNIST-C dataset contains a label indicating the number it represents. These labels are identical to the corresponding labels in the MNIST dataset, so you can use the same training and testing process to train and test your model.

Download the dataset from https://github.com/google-research/mnist-c/. That GitHub address hosts the source code; the actual download address is given in the README: https://zenodo.org/record/3239543#.ZF2rzXZByUl. Download and unzip it.
These folders are all numpy array exports in npy format.
Read the first 10 pictures of each folder to display

# Dataset source: https://github.com/google-research/mnist-c/
import os
import numpy as np
import matplotlib.pyplot as plt
# Load the dataset and display the first 10 images from each subfolder
data_root = './mnist_c'
dirlist=os.listdir(data_root)
fig, axs = plt.subplots(len(dirlist), 10, figsize=(10, 10))

for i, folder_name in enumerate(dirlist):
    folder_path = os.path.join(data_root, folder_name)
    if os.path.isdir(folder_path):
        file_path = os.path.join(folder_path, 'train_images.npy')
        data = np.load(file_path)
        for j in range(0,10):
            axs[i, j].imshow(data[j].reshape(28,28), cmap='gray')
            axs[i, j].axis('off')
plt.tight_layout()
plt.show()

Output: a grid showing the first 10 images from each corruption subfolder.

Training with fine-tuning

Assume the model trained on MNIST in the previous section has been saved to ./model.h5. We load that model and continue training (fine-tuning) it on the MNIST-C data.

#%%

import os
import numpy as np
import tensorflow as tf
import datetime

TARGET_MODEL_DIR = "./"
MODEL_NAME = "model.h5"
epochs_count = 5
"""
The training log printed in Jupyter makes the .ipynb file very slow to open,
so the same code is also run as a plain .py script.
"""
def againTrain(x_train, y_train, x_test, y_test):
    targetModel = os.path.join(TARGET_MODEL_DIR, MODEL_NAME)
    # Load the previously trained CNN model
    model = tf.keras.models.load_model(targetModel)
    """
    When fine-tuning a pre-trained model, the first few layers are usually frozen and only
    the later layers are adjusted, because:
    1. The first few layers of a pre-trained model are generic feature extractors whose
       features are useful across tasks and datasets, so they can be kept as-is.
    2. The later layers are tuned to the specific task, so their parameters need to be
       adjusted to fit the new task and dataset.
    3. Fine-tuning every layer takes longer and is more prone to overfitting; freezing the
       first few layers saves computation and training time and improves generalization.
    In short, freezing the early layers saves resources while helping the model adapt to
    the new task and dataset.
    """
    model.layers[0].trainable = False
    model.layers[1].trainable = False
    # Preprocess the input images
    x_train = x_train.reshape(-1, 28, 28, 1)
    x_train = x_train.astype('float32') / 255.0
    x_test = x_test.reshape(-1, 28, 28, 1)
    x_test = x_test.astype('float32') / 255.0
    y_train = tf.keras.utils.to_categorical(y_train, num_classes=10)
    y_test = tf.keras.utils.to_categorical(y_test, num_classes=10)
    now = datetime.datetime.now()  # current time
    format_time = now.strftime("%Y-%m-%d%H-%M-%S")  # formatted timestamp (not used below)
    checkpoint = tf.keras.callbacks.ModelCheckpoint(targetModel, save_best_only=True, save_weights_only=False, monitor='val_loss')
    # Continue training the model
    history = model.fit(x_train, y_train, batch_size=128, epochs=epochs_count, validation_data=(x_test, y_test),
                        callbacks=[checkpoint])
    test_loss, test_acc = model.evaluate(x_test, y_test)
    print('Test accuracy:', test_acc)
"""
Loading all of MNIST-C at once is very slow and memory-hungry, so each subfolder is
loaded and trained on in turn to save memory.
"""
def loadDataMnistC(data_root, func):
    dirlist = os.listdir(data_root)
    for i, folder_name in enumerate(dirlist):
        folder_path = os.path.join(data_root, folder_name)
        if os.path.isdir(folder_path):
            print("Loading: " + folder_path)
            train_images = np.load(os.path.join(folder_path, 'train_images.npy'))
            train_labels = np.load(os.path.join(folder_path, 'train_labels.npy'))
            test_images = np.load(os.path.join(folder_path, 'test_images.npy'))
            test_labels = np.load(os.path.join(folder_path, 'test_labels.npy'))
            print("Training on: " + folder_path)
            func(train_images, train_labels, test_images, test_labels)
            print("Finished: " + folder_path)
# Load the MNIST-C dataset
data_root = './mnist_c'
loadDataMnistC(data_root, againTrain)
print("All training finished")

Each corruption type is read in turn, the model is fine-tuned on it and saved back to ./model.h5, and the final model is obtained once all of the subfolders have been processed.


Origin blog.csdn.net/liaomin416100569/article/details/130597944