Introduction to deep learning - an overview of the deep convolutional neural network (Deep Convolutional Neural Network, DCNN)

This article mainly summarizes what I have learned so far; the latest techniques are still under research...

1 Introduction

Machine learning is a means of realizing artificial intelligence. It studies how computers can simulate or implement human learning behavior in order to acquire new knowledge and skills, and how they can reorganize existing knowledge structures to continuously improve their own performance. As a research direction within artificial intelligence, computer vision has advanced together with machine learning. In the past 10 years in particular, machine learning techniques represented by deep learning have set off a revolution in computer vision. This article introduces a typical deep learning technique, the deep convolutional neural network, focusing on its basic concepts.

2 Basics of deep convolutional neural networks

With the continuous development of information technology, the volume of video and image data of all kinds has grown dramatically. Extracting the hidden information in this mass of video and image data and tapping its potential value is of great significance.

With the development of artificial intelligence, computer vision technology has found increasingly wide use: face recognition and identity analysis in video surveillance, analysis and recognition of medical images in diagnosis, fine-grained visual classification, facial attribute recognition, fingerprint recognition, scene recognition, and more. Computer vision has gradually penetrated people's daily lives and applications. However, the traditional computer vision approach of manually extracting image features and then applying machine learning is increasingly inadequate for these applications. Deep learning entered the public eye around 2006, and especially after AlexNet won the ImageNet Large Scale Visual Recognition Challenge in 2012, it has developed remarkably within artificial intelligence, achieving great success in computer vision, speech recognition, natural language processing, multimedia, and many other fields. The biggest difference between deep learning and traditional pattern recognition methods is that it automatically learns features from big data rather than relying on manually designed features, and good features can greatly improve the performance of a pattern recognition system. In computer vision, deep convolutional neural networks have become a hot research topic, and they play a vital role in tasks such as image classification, object detection, and image segmentation. This section introduces the basics of deep convolutional neural networks.

2.1 Artificial Neural Networks

The human nervous system is composed of hundreds of billions of neurons, each consisting of dendrites, an axon, and a cell body. Humans receive countless pieces of visual and auditory information every day, and the processing of this information is carried out by the nervous system: dendrites receive information and transmit it to the cell body, while the axon transmits information outward. An artificial neural network is an information processing system that simulates the structure and function of the human nervous system.

Figure 2-1 Artificial neuron structure

Figure 2-1 shows the structure of a single artificial neuron. The neuron has $n$ inputs $X_1, \dots, X_n$, and the value on each connection is the weight of the corresponding input, $W_1, \dots, W_n$. With activation function $f$ and bias term $b$, the output of the neuron is $y = f\left(\sum_{i} W_i X_i + b\right)$.
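As a minimal sketch of this computation (the input values and weights below are made-up examples, and sigmoid is used as the activation $f$), a single neuron in NumPy:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Made-up example: a neuron with n = 3 inputs
x = np.array([0.5, -1.2, 3.0])   # inputs X1, X2, X3
w = np.array([0.4, 0.1, -0.6])   # weights W1, W2, W3
b = 0.2                          # bias term

# y = f(sum_i Wi * Xi + b)
y = sigmoid(np.dot(w, x) + b)
print(y)
```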

An artificial neural network is a computing model composed of a large number of interconnected artificial neurons (also called nodes). Each connection between two neurons carries a weight that scales the signal passing through it. The feedforward neural network is a simple artificial neural network model consisting of an input layer, hidden layers, and an output layer: there is one input layer and one output layer, there may be multiple hidden layers, and each layer contains several nodes. Nodes within the same layer are not connected; the relationships between nodes of adjacent layers are expressed by the weights. This kind of feedforward network is also called a fully connected network. Its structure is more complex than a single neuron, but each output is still the activation function applied to the weighted sum of the inputs.
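To make this concrete, here is a small illustrative sketch of a two-layer fully connected forward pass in NumPy (the layer sizes are arbitrary choices, not from the article):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(0)

# Arbitrary layer sizes: 4 inputs -> 5 hidden nodes -> 3 outputs
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

x = rng.normal(size=4)    # one input sample

h = relu(W1 @ x + b1)     # hidden layer: weighted sum, then activation
y = W2 @ h + b2           # output layer
print(y)
```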

The output of an artificial neural network is affected by many factors, such as the network structure, the input $X$, the weights $W$, and the activation functions. Once the network has been constructed, its structure and activation functions are fixed; to change the output, the weights $W$ must be changed. The training process of a neural network is therefore a process of continuously optimizing its parameters. Training consists of two processes: forward propagation and backpropagation. First, the network is built and the training data is fed in, and the network computes an output; this is forward propagation. Then the difference between the output and the true label of the input data (i.e., the loss function) is calculated, and the network uses this difference to update the values of the parameters $W$; this is backpropagation. Training cycles through forward propagation and backpropagation, adjusting the weight parameters of each neuron to fit the nonlinear relationship in the data and ultimately obtain better model accuracy.
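As an illustrative sketch of this forward/backward cycle (the model and toy data here are made up, not the article's code), a minimal PyTorch training loop:

```python
import torch
import torch.nn as nn

# Made-up toy data: 64 samples, 10 features, 3 classes
X = torch.randn(64, 10)
labels = torch.randint(0, 3, (64,))

model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 3))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(100):
    outputs = model(X)               # forward propagation
    loss = loss_fn(outputs, labels)  # difference between output and true labels
    optimizer.zero_grad()
    loss.backward()                  # backpropagation: compute gradients w.r.t. W
    optimizer.step()                 # update the weights W
```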

The degree to which a model fits the data is closely related to the activation function. The role of an activation function is to apply a nonlinear computation to the input signal and pass the result on to the next node. Three activation functions are common: Sigmoid, Tanh, and ReLU. The Sigmoid function is shown in Figure 2-2: when the input is very large or very small, the gradient of the function is very small. Since the backpropagation algorithm uses the chain rule to compute the gradients of the parameters, a model using the Sigmoid function easily runs into the vanishing gradient problem for the weights $W$. The Tanh function is similar to Sigmoid: when the input is very large or very small, its gradient is small, which also invites vanishing gradients and hinders weight updates. The ReLU function has clear advantages: when the input is positive, the gradient is constant and never zero, and it is fast to compute. But it also has a serious shortcoming: when the input is negative, the gradient vanishes entirely. The choice of activation function therefore has to be made according to actual needs.

Figure 2-2 Sigmoid activation function; Figure 2-3 Tanh activation function; Figure 2-4 ReLU activation function
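For reference, the three activations can be written directly in NumPy (a small sketch, not code from the article):

```python
import numpy as np

def sigmoid(z):
    # Saturates for large |z|, so the gradient vanishes there
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Also saturates for large |z|, like sigmoid
    return np.tanh(z)

def relu(z):
    # Gradient is 1 for z > 0 but exactly 0 for z < 0
    return np.maximum(0.0, z)
```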

In backpropagation, the loss function is very important. Commonly used loss functions include the mean squared error loss (MSE), custom loss functions, and the cross-entropy loss. The most commonly used loss function in image classification is cross entropy. In a classification problem with $n$ categories in total, the output of the classifier is the probability that the input belongs to each of the $n$ categories, i.e., the output is $n$ probabilities. As shown in formula (1), cross entropy describes the difference between two probability distributions: the smaller its value, the closer the two distributions; the larger its value, the more they differ. Here $q$ represents the true label distribution and $p$ represents the predicted distribution.

                                                         H(p,q) = -\sum_{x} q(x) \log p(x)                    (1)
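A small NumPy sketch of formula (1), using made-up probabilities:

```python
import numpy as np

def cross_entropy(q, p, eps=1e-12):
    # H(p, q) = -sum_x q(x) * log(p(x)); eps guards against log(0)
    return -np.sum(q * np.log(p + eps))

q = np.array([0.0, 1.0, 0.0])        # one-hot true label, 3 classes
p_good = np.array([0.1, 0.8, 0.1])   # prediction close to the label
p_bad = np.array([0.7, 0.2, 0.1])    # prediction far from the label

print(cross_entropy(q, p_good))  # smaller: distributions are close
print(cross_entropy(q, p_bad))   # larger: distributions differ more
```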

2.2 Convolutional Neural Networks (CNNs)

The convolutional neural network is an improvement on the artificial neural network: it adds convolutional layers and pooling layers. Its training process is the same as that of an artificial neural network, consisting of forward propagation and backpropagation. The convolutional layer and the pooling layer are introduced in detail below.

(1) Convolutional layer

A convolutional neural network is a weight-sharing network. Compared with an ordinary neural network, its model complexity is lower and its number of parameters is greatly reduced. This is due to two important characteristics of convolution: local receptive fields and parameter sharing.

In a traditional neural network, every input neuron is fully connected to every output neuron, and each such connection is described by one parameter. In a convolutional neural network, the convolution kernel (filter) is much smaller than the input, so the connections are sparse. Nearby pixels in an image are closely related while distant pixels are only weakly correlated, so a neuron does not need to perceive the whole image: it only needs to perceive a local region, and local information is combined at higher levels to obtain global information. A convolution kernel slides over the image and detects locally meaningful features. The region of the input connected to a neuron, the same size as the convolution kernel, is called the receptive field; the more convolutional layers are stacked, the larger the receptive field becomes, and the larger the area of the original image it corresponds to. The local receptive field greatly reduces the number of parameters, as the following example illustrates. Suppose the input image is $1000 \times 1000$ pixels and the hidden layer has 1 million neurons. With full connectivity there are $1000 \times 1000 \times 10^{6} = 10^{12}$ connections, i.e., $10^{12}$ parameters. If instead the convolution kernel is $10 \times 10$, the local receptive field is $10 \times 10$ and each hidden neuron connects only to a $10 \times 10$ region, giving $10 \times 10 \times 10^{6} = 10^{8}$ connections in total, i.e., $10^{8}$ parameters.

In the example above, the local receptive field greatly reduces the number of parameters, but the count is still very large: each neuron is connected to a $10 \times 10$ image region, so each neuron has $10 \times 10 = 100$ parameters. If the parameters of every neuron are the same, that is, every neuron uses the same convolution kernel, then the convolution needs only 100 parameters in total; this is parameter sharing. However, one convolution kernel can only extract one kind of feature, so multiple kernels are used during convolution. The parameters of each kernel (filter) differ, meaning that different features of the input image are extracted: with several kernels, several features can be extracted, and these features are arranged to form feature maps. Local receptive fields and parameter sharing greatly reduce the number of model parameters, saving memory while keeping model performance from degrading.
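The parameter counts in this worked example can be checked with a quick sketch (the shapes mirror the example above; this is an illustration, not the article's code):

```python
import torch.nn as nn

# One 10x10 kernel shared across the whole image: parameter sharing
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=10, bias=False)
print(sum(p.numel() for p in conv.parameters()))  # 100 = 10 * 10, regardless of image size

# A fully connected layer from a 1000x1000 image to 10^6 hidden neurons would
# instead need 1000 * 1000 * 10**6 = 10**12 parameters -- far too many to train.
```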

(2) Pooling layer

Figure 2-5 Schematic diagram of max pooling

The pooling layer is also called a downsampling layer. Generally speaking, pooling operates on each feature map separately, with the goal of reducing the size of the feature map. Pooling comes in two kinds: max pooling and average pooling. Figure 2-5 shows max pooling with a $2 \times 2$ pooling kernel and a stride of 2. In layman's terms, the max pooling operation overlaps the $2 \times 2$ pooling kernel with a region of the feature map at a given depth and takes the maximum value of the overlapped area as the downsampled value; the pooling kernel is then moved according to the stride, producing the output feature map. Average pooling works the same way, except that the average of the pixels in the overlapped area is taken.
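As a small sketch of Figure 2-5 (the feature map values are made up), max pooling with a 2×2 kernel and stride 2 in PyTorch:

```python
import torch
import torch.nn as nn

# A made-up 1x1x4x4 feature map (batch, channels, height, width)
x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [7., 2., 9., 0.],
                    [1., 8., 3., 4.]]]])

pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x))
# Each 2x2 block is replaced by its maximum:
# tensor([[[[6., 4.],
#           [8., 9.]]]])
```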

2.3 Optimization of deep convolutional neural networks

To achieve better performance, the layers of convolutional neural networks have grown deeper and deeper. The deeper a deep convolutional neural network is, the more parameters must be learned and the harder the network is to optimize. Without good optimization methods, overfitting or underfitting problems will occur.

Overfitting means that the model generalizes poorly: it fits the training set well but fits the validation set poorly. In layman's terms, the model has learned the training data too well; it recognizes the images in the training set very accurately but cannot recognize images outside the training set. There are two main causes: too little data in the training set, and too many training iterations. There are four main methods to alleviate overfitting, as follows (a code sketch combining two of them appears after the list):

(1) Early stopping. After each iteration (epoch), the error rate on the validation set (validation error) is calculated; if it no longer decreases, training is terminated. This cuts losses in time: once the model's generalization ability stops improving, continuing to train is a waste of time. However, relying on the error rate after a single iteration is unreliable, because the error rate may rise or fall from one epoch to the next. Instead, one can decide whether to stop training based on the validation error over the past 10, 20, or more iterations.

(2) Data set expansion. This is the most direct and effective way to reduce overfitting; without data of good quality and sufficient quantity, a good model cannot be trained. The data set can be expanded in two ways. One is adding data at the source, for example directly adding images to the training set when classifying images; this is hard to do in practice because it is unclear how much data will be needed. The other is transforming the original data to obtain more of it, for example by rotating the original images, adding noise to them, or cropping out parts of them.

(3) Regularization. Regularization includes L0, L1, and L2 regularization; L2 regularization is the most commonly used in machine learning. The L2 regularization term pushes the parameters toward smaller values, and smaller parameters mean lower model complexity, so the model fits the training data just well enough, improving its generalization ability.

(4) Dropout. As a kind of implicit model ensemble, dropout makes neurons stop working with a certain probability, which effectively reduces test error. For each input, the network samples a different structure, and all of these structures share one set of parameters. Because a neuron can no longer depend on particular other neurons, dropout reduces complex co-adaptations between neurons and enhances the robustness of the network.
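As mentioned above, here is a small PyTorch sketch combining L2 regularization (via the optimizer's weight_decay option) and dropout; the architecture is a made-up example:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # each hidden unit is zeroed with probability 0.5 during training
    nn.Linear(64, 10),
)

# weight_decay adds an L2 penalty on the parameters to the update rule
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()  # dropout active during training
# ... training loop as in Section 2.1 ...
model.eval()   # dropout disabled at evaluation time
```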

Underfitting is the opposite case: the model fits poorly even on the training set, and consequently on the validation set as well. The reason is that the model has not learned enough from the training data: its feature learning is insufficient and its representational ability is poor. Underfitting can be alleviated by the following methods:

(1) Add other feature items. Insufficient features lead to underfitting, and underfitting caused by a lack of features can be well addressed by adding them. Methods for adding features include feature combination, generalization, and correlation, and they apply in many scenarios.

(2) Add polynomial features. For example, adding quadratic or cubic terms to a linear model can enhance the model's capacity to fit the data, as the sketch below illustrates.
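A quick illustration of this idea with scikit-learn (the data is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with a quadratic relationship that a plain linear model underfits
x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = x.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.2, 50)

# Adding the x^2 term as a feature lets the linear model capture the curve
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.score(x, y))  # R^2 close to 1.0 once the quadratic term is available
```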
