Introduction to Convolutional Neural Networks (CNNs): Network Structure and Model Principles

Overview

This article covers only the convolutional layer, the pooling layer, related network structures and construction principles, and some prerequisite knowledge about convolution. The fully connected layer, the construction and optimization of the classification model, and the loss function are the same as in a fully connected neural network and are not explained here.

For an introduction to neural network model construction and algorithms, see: https://blog.csdn.net/stephon_100/article/details/125452961

A convolutional neural network is a deep feed-forward neural network. Convolving the same image with different convolution kernels amounts to filtering the image with those kernels to extract different features.

The convolutional neural network is therefore a model that automatically extracts features, and it can also perform classification.

Assume the convolutional layer has M input neurons, the convolution kernel size is K, the stride is S, and P zeros are padded at each end of the input. The number of neurons in the convolutional layer is then (M - K + 2P)/S + 1.

This can usually be made an integer by choosing an appropriate kernel size and stride.

 

Main Content

A convolutional neural network (CNN or ConvNet) is a deep feedforward neural network with local connections and weight sharing.

Convolutional neural networks were originally designed mainly to process image information. Processing images with a fully connected feedforward network raises two problems:

(1) Too many parameters: if the input image size is 100 × 100 × 3 (height 100, width 100, and 3 RGB color channels), then in a fully connected feedforward network each neuron in the first hidden layer has 100 × 100 × 3 = 30,000 independent connections to the input layer, each with its own weight parameter. As the number of hidden neurons grows, the number of parameters increases sharply. This makes training the network very inefficient and prone to overfitting.

(2) Local invariance features: objects in natural images have locally invariant features; operations such as scaling, translation, and rotation do not change their semantic information. However, a fully connected feedforward network has difficulty extracting these locally invariant features, and data augmentation is generally needed to improve performance.

The fully connected layer generally appears at the end of a convolutional network. Convolutional neural networks have three structural properties: local connections, weight sharing, and pooling. These properties make the network invariant to translation, scaling, and rotation to a certain extent. Compared with a fully connected feedforward network, a convolutional neural network has far fewer parameters.

Prerequisite Knowledge

Definition of convolution: Convolution is an important operation in analytic mathematics. One-dimensional and two-dimensional convolutions are often used in signal processing and image processing.

One-dimensional convolution: given a signal x and a filter w_1, ..., w_K, the convolution output is y_t = Σ_{k=1}^{K} w_k · x_{t-k+1}.

Two-dimensional convolution: given an image X and a filter W of size U × V, the convolution output is y_{ij} = Σ_{u=1}^{U} Σ_{v=1}^{V} w_{uv} · x_{i-u+1, j-v+1}.
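The one-dimensional definition can be sketched directly in NumPy. This is an illustrative implementation of the definition, not an optimized one: flipping the filter reduces convolution to a sliding dot product.

```python
import numpy as np

def conv1d(x, w):
    """1D convolution by definition: y_t = sum_k w_k * x_{t-k+1}.
    Flipping w turns the convolution into a sliding dot product."""
    K = len(w)
    w_flipped = w[::-1]              # reverse the filter once
    T = len(x) - K + 1               # 'valid' output length (no padding)
    return np.array([np.dot(w_flipped, x[t:t + K]) for t in range(T)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
print(conv1d(x, w))   # [2. 2. 2.], identical to np.convolve(x, w, mode='valid')
```

The result agrees with NumPy's built-in `np.convolve` in `'valid'` mode, which also follows the flipped-kernel definition.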

According to the definition of convolution, the calculation above requires flipping the convolution kernel. Flipping means reversing the kernel in both dimensions (top to bottom and left to right), i.e., rotating it by 180 degrees. After flipping, the dot product can be computed directly: multiply corresponding positions and sum.

The mean filter (Mean Filter) commonly used in image processing is a two-dimensional convolution that sets the pixel value at the current position to the average of all pixels within the filter window, i.e., a convolution whose kernel entries are all 1/(U·V).
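A minimal NumPy sketch of the mean filter as a two-dimensional convolution (the all-constant kernel is unchanged by flipping, so convolution and sliding dot product coincide here; borders are handled in `'valid'` style for simplicity):

```python
import numpy as np

def mean_filter(img, size=3):
    """Mean filter: each output pixel is the average of the size x size
    window at the corresponding input position ('valid' borders)."""
    kernel = np.full((size, size), 1.0 / size**2)  # all entries 1/(U*V)
    H, W = img.shape
    out = np.empty((H - size + 1, W - size + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + size, j:j + size] * kernel)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
print(mean_filter(img))   # 3x3 map of 3x3-window averages
```

For this smoothly increasing test image, each window average equals the window's center pixel, which makes the result easy to check by hand.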


In image processing, convolution is often used as an effective method for feature extraction. The result of applying a convolution to an image is called a feature map (Feature Map). Figure 5.3 shows several filters commonly used in image processing and their corresponding feature maps: the top filter is a common Gaussian filter, which smooths and denoises the image; the middle and bottom filters extract edge features.

 

In other words, convolving an image with a convolution kernel is filtering the image with that kernel to extract a specific feature.

(3) Cross-correlation

In machine learning and image processing, the main use of convolution is to slide a convolution kernel (i.e., a filter) over an image (or over some feature map) and obtain a new set of features. Computing a convolution requires flipping the kernel. In practice, convolution is generally replaced by the cross-correlation operation, which avoids this unnecessary step and its overhead.

Cross-correlation (Cross-Correlation) is a function that measures the correlation between two sequences, usually implemented as a sliding-window dot product. Given an image X and a convolution kernel W of size U × V, their cross-correlation is y_{ij} = Σ_{u=1}^{U} Σ_{v=1}^{V} w_{uv} · x_{i+u-1, j+v-1}.

Comparing with formula (5.7), the only difference between cross-correlation and convolution is whether the kernel is flipped. Cross-correlation can therefore also be called non-flipped convolution.
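The relationship can be sketched in NumPy: true convolution equals cross-correlation with the kernel rotated 180 degrees. This is an illustrative sketch in `'valid'` mode, not an optimized implementation.

```python
import numpy as np

def cross_correlate2d(X, K):
    """2D cross-correlation ('valid'): slide K over X, dot product at each position, no flipping."""
    kh, kw = K.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(X[i:i + kh, j:j + kw] * K)
    return out

def convolve2d(X, K):
    """True 2D convolution = cross-correlation with the kernel rotated 180 degrees."""
    return cross_correlate2d(X, K[::-1, ::-1])

X = np.arange(16, dtype=float).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])
print(cross_correlate2d(X, K))  # differs from convolve2d(X, K) ...
print(convolve2d(X, K))         # ... because this K is not symmetric under 180-degree rotation
```

For a kernel that is symmetric under 180-degree rotation (like the mean filter) the two operations coincide; for this asymmetric K they differ.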

 

Convolution and cross-correlation comparison (extended): 

It can be seen that the relationship between the convolution operation and the cross-correlation operation is: Y = W * X = rot180(W) ⊗ X, where rot180(·) denotes a 180-degree rotation.

 


Therefore, cross-correlation is usually implemented directly as a sliding-window dot product, rather than as a strict mathematical convolution, which would flip the window before taking the dot product.

Convolution is used in neural networks for feature extraction, and whether the kernel is flipped is irrelevant to its feature-extraction ability. In particular, when the kernel is a learnable parameter, convolution and cross-correlation are equivalent in capability. Therefore, for convenience of implementation (and description), cross-correlation is used in place of convolution. In fact, the "convolution" operations in many deep learning frameworks are actually cross-correlation operations.

In the following, unless otherwise stated, "convolution" refers to cross-correlation, denoted by ⊗ (non-flipped convolution). True convolution is denoted by *.

On top of the standard definition of convolution, a sliding stride and zero padding can also be introduced to increase the variety of convolutions and extract features more flexibly.

Stride is the number of pixels the convolution kernel moves at each sliding step.

Zero padding pads both ends of the input vector with zeros.

Assume the convolutional layer has M input neurons, the convolution kernel size is K, the stride is S, and P zeros are padded at each end of the input. The number of neurons in the convolutional layer is then (M - K + 2P)/S + 1.

This can usually be made an integer by choosing an appropriate kernel size and stride.
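The output-size formula is easy to express as a small helper. This sketch simply encodes (M - K + 2P)/S + 1 and rejects combinations that would not give an integer:

```python
def conv_output_size(M, K, S=1, P=0):
    """Number of output neurons for input length M, kernel size K,
    stride S, and P zeros padded at each end of the input."""
    assert (M - K + 2 * P) % S == 0, "choose K, S, P so the output size is an integer"
    return (M - K + 2 * P) // S + 1

print(conv_output_size(5, 3))              # 3: 5-pixel input, 3-wide kernel, no padding
print(conv_output_size(100, 3, S=1, P=1))  # 100: padding of 1 keeps the size unchanged
```

The second call illustrates the common "same" setup: with K = 3, S = 1, P = 1 the output length equals the input length.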

 

Convolutional Neural Network

According to the definition of convolution, the convolutional layer has two very important properties: local connection and weight sharing.

 

 

Take the figure above as an example:

With a fully connected layer, each neuron connects to all 25 input pixels x, requiring 25 weights w and 1 bias b.

With a convolution, the entire 5×5 image is covered by sliding a single kernel window, so the weights amount to only the 3×3 = 9 kernel parameters plus one bias b. The figure shows the convolution result for a stride of 1, with no zeros added at the image borders.

Since the kernel is usually much smaller than the input image being convolved, the convolution operation needs far fewer trainable parameters than a fully connected layer.
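The parameter counts from the example above can be computed explicitly. This sketch only counts parameters; the layer sizes (5×5 input, 3×3 kernel) are taken from the example in the text:

```python
def fc_params(n_inputs, n_outputs):
    """Fully connected: every output neuron has one weight per input, plus one bias."""
    return n_inputs * n_outputs + n_outputs

def conv_params(kernel_h, kernel_w, in_channels=1, n_kernels=1):
    """Convolution: kernel weights are shared across all sliding positions; one bias per kernel."""
    return kernel_h * kernel_w * in_channels * n_kernels + n_kernels

# The 5x5-image example: one output neuron vs. one shared 3x3 kernel
print(fc_params(5 * 5, 1))   # 26 parameters (25 weights + 1 bias) per neuron
print(conv_params(3, 3))     # 10 parameters (9 shared weights + 1 bias) total
```

The same arithmetic reproduces the 30,000-connection figure from the earlier 100 × 100 × 3 example: `fc_params(100 * 100 * 3, 1)` gives 30,001 parameters for a single hidden neuron.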

(1) Convolutional layer

The convolutional layer extracts a filtered version of the input image, i.e., a feature of the input. Different convolution kernels act as different feature extractors. The discussion above concerns two-dimensional images, but a color image is composed of multiple color channels and is therefore represented as three-dimensional data, so the convolution kernels also operate in three dimensions.

The three-dimensional structure is expressed as height × width × depth, i.e., M × N × D.

At the input layer, the feature map is the image itself. A grayscale image has one feature map, so the input depth is D = 1; a color image has feature maps for the three RGB color channels, so the input depth is D = 3.

A feature map is a feature extracted by an image (or other feature map) after convolution, and each feature map can be used as a class of extracted image features. In order to improve the representation ability of the convolutional network, multiple different convolution kernels can be stacked in each convolutional layer to obtain multiple different feature maps of an image to better represent the features of the image.

Explanation: each convolutional layer contains P convolution kernels, so the layer produces P feature maps. The input image has size M × N and depth D; for a color image the three RGB channels give D = 3. Each convolution kernel has size U × V and depth D.
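A minimal NumPy sketch of this multi-channel setup: an input of depth D convolved with P kernels, each of size U × V × D, yields P feature maps. Channel-first layout and `'valid'` borders are assumptions of this sketch, not requirements from the text:

```python
import numpy as np

def conv_layer(X, kernels, biases):
    """Multi-channel convolutional layer.
    X: input of shape (D, M, N); kernels: (P, D, U, V); biases: (P,).
    Each of the P kernels spans all D input channels and yields one feature map."""
    P, D, U, V = kernels.shape
    _, M, N = X.shape
    oh, ow = M - U + 1, N - V + 1
    out = np.empty((P, oh, ow))
    for p in range(P):
        for i in range(oh):
            for j in range(ow):
                # sum over all D channels and the U x V window at once
                out[p, i, j] = np.sum(X[:, i:i + U, j:j + V] * kernels[p]) + biases[p]
    return out

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8, 8))           # RGB-like input: D=3, M=N=8
kernels = rng.standard_normal((4, 3, 3, 3))  # P=4 kernels, each 3x3 with depth 3
out = conv_layer(X, kernels, np.zeros(4))
print(out.shape)                             # (4, 6, 6): 4 feature maps of size 6x6
```

The output shape follows the earlier formula per spatial dimension: (8 - 3)/1 + 1 = 6, repeated for each of the P = 4 kernels.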

(2) Pooling layer

The pooling layer (Pooling Layer), also called the subsampling layer (Subsampling Layer), performs feature selection and reduces the number of features, thereby reducing the number of parameters and helping to avoid overfitting.

(1) Max pooling (Maximum Pooling or Max Pooling): for a region, take the maximum activation of all neurons in the region as the region's representation.

(2) Mean pooling (Mean Pooling): take the average activation of all neurons in the region.

The pooling layer not only reduces the number of neurons effectively, but also makes the network invariant to small local shape changes and gives it a larger receptive field.

A typical pooling layer divides each feature map into non-overlapping 2 × 2 regions and downsamples each region with max pooling. A pooling layer can also be viewed as a special convolutional layer with kernel size K × K, stride S × S, and a max or mean function as the kernel. An overly large pooling region drastically reduces the number of neurons and may cause excessive information loss.
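The typical 2 × 2 non-overlapping max pooling can be sketched in a few lines of NumPy (this reshape trick assumes the feature map dimensions are divisible by the pooling size):

```python
import numpy as np

def max_pool(fmap, k=2):
    """Non-overlapping k x k max pooling on one feature map (H, W divisible by k)."""
    H, W = fmap.shape
    # reshape into a grid of k x k blocks, then take the max inside each block
    return fmap.reshape(H // k, k, W // k, k).max(axis=(1, 3))

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [9., 1., 2., 3.],
                 [0., 5., 4., 4.]])
print(max_pool(fmap))   # [[4. 8.]
                        #  [9. 4.]]
```

Each 2 × 2 block of the 4 × 4 feature map collapses to its maximum, halving both spatial dimensions, which is exactly the downsampling described above.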

 


Source: blog.csdn.net/stephon_100/article/details/125453364