The role of each layer in a convolutional neural network (CNN)

A convolutional neural network (CNN) consists of an input layer, a convolutional layer, an activation function, a pooling layer, and a fully connected layer.

1. Convolutional layer

1.1 Function

Used for feature extraction.

The input image is 32x32x3, where 3 is the depth (i.e., the R, G, B channels), and the convolutional layer uses a 5x5x3 filter (receptive field). Note that the depth of the filter must be the same as the depth of the input image. Convolving one filter with the input image yields one 28x28x1 feature map; with two filters, two feature maps are obtained.

1.2 Calculation process

The filter is slid across the input image; at each position, the input elements and the corresponding filter elements are multiplied and summed, and finally the bias b is added, giving one element of the feature map.

For example, suppose the first depth slice of filter w0, multiplied element-wise by the corresponding elements in the current window (the blue box) of the input image and summed, gives 0, and the other two depth slices give 2 and 0; with a bias of 1, we get 0 + 2 + 0 + 1 = 3, which is the first element of the feature map. The window then slides by the stride (here stride = 2) and the computation repeats for the next element.
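A minimal sketch of this per-position computation (with illustrative random values, not the numbers from the example above):

```python
import numpy as np

# Illustrative 3x3x3 input window (the "blue box") and a 3x3x3 filter w0;
# the values are made up, not the ones from the original example.
window = np.random.randint(-1, 2, size=(3, 3, 3))
w0 = np.random.randint(-1, 2, size=(3, 3, 3))
b0 = 1  # bias

# One element of the feature map: element-wise products summed over
# all three depth slices, plus the bias.
out_element = np.sum(window * w0) + b0
print(out_element)
```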


When the sliding is complete, a 3x3x1 feature map is obtained. One more thing to note is zero padding: a border of zeros is added around the image, which has no effect on the original input. Common choices are:

F = 3 => zero pad with 1

F = 5 => zero pad with 2

F = 7 => zero pad with 3

The border width is an empirical value; the zero padding is usually chosen so that the convolved feature map has the same spatial size as the input image. For example, if the input is 5x5x3, the filter is 3x3x3, and the zero padding is 1, then the padded input is 7x7x3 and the feature map after convolution is 5x5x1 ((7 - 3)/1 + 1 = 5), the same spatial size as the input image;
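A small helper (a sketch, not from the original post) that applies the output-size formula (W - F + 2P)/S + 1 used above:

```python
def conv_output_size(w, f, p, s):
    """Spatial output size for input size w, filter size f, zero padding p, stride s."""
    assert (w - f + 2 * p) % s == 0, "the filter does not tile the input evenly"
    return (w - f + 2 * p) // s + 1

# The padding example above: 5x5 input, 3x3 filter, zero pad 1, stride 1 -> 5
print(conv_output_size(5, 3, 1, 1))
# The first example: 32x32 input, 5x5 filter, no padding, stride 1 -> 28
print(conv_output_size(32, 5, 0, 1))
```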

1.3 Weight sharing principle

Another feature of the convolutional layer is the principle of "weight sharing".
Suppose the output consists of 10 feature maps of size 32x32x1, i.e., 1024 neurons per feature map, and each neuron corresponds to a 5x5x3 region of the input image, so each neuron has 75 connections (75 weight parameters) to that region. Without weight sharing, there would be 75 x 1024 x 10 = 768,000 weight parameters in total, which is far too many. The convolutional neural network therefore introduces the weight sharing principle: all neurons in the same feature map share the same 75 weight parameters, so only 75 x 10 = 750 weights are needed. The bias of each feature map is also shared, adding 10 biases, for a total of 750 + 10 = 760 parameters.
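The arithmetic above can be reproduced directly (a small sketch using the numbers from this example):

```python
neurons_per_map = 32 * 32        # 1024 neurons in each feature map
weights_per_neuron = 5 * 5 * 3   # 75 weights for one 5x5x3 region of the input
num_maps = 10

no_sharing = weights_per_neuron * neurons_per_map * num_maps   # 768000
with_sharing = weights_per_neuron * num_maps + num_maps        # 750 weights + 10 biases = 760
print(no_sharing, with_sharing)
```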

Supplementary notes:

(1) For a multi-channel image, a 1*1 convolution simply multiplies each channel of the input by a coefficient and adds the results together, which is equivalent to mixing the originally independent channels of the image together (see the sketch after this list);

(2) Weight sharing is not applied across all dimensions: the weights are shared across spatial positions, while each channel of each filter keeps its own set of weights.
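A minimal sketch of point (1): a 1x1 convolution over a multi-channel image is just a weighted sum across channels at every pixel (the shapes and coefficients below are illustrative, not from the original post):

```python
import numpy as np

x = np.random.rand(6, 6, 3)      # a 6x6 image with 3 channels
w = np.array([0.5, 0.3, 0.2])    # one 1x1x3 filter: one coefficient per channel

# Multiply each channel by its coefficient and sum across channels:
# every output pixel is a weighted mix of the three input channels.
y = np.tensordot(x, w, axes=([2], [0]))
print(y.shape)  # (6, 6)
```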

2. Activation layer

The so-called activation is a nonlinear mapping applied to the output of the convolutional layer.
  If no activation function is used (which is equivalent to the activation function being f(x) = x), then the output of each layer is a linear function of the input of the previous layer. It is easy to see that no matter how many layers the network has, the output is still a linear combination of the inputs, which is the same as having no hidden layers at all. This is the most primitive perceptron.
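A quick numerical check of this point (a sketch with random weights): two stacked linear layers are always equivalent to a single linear layer.

```python
import numpy as np

np.random.seed(0)
x = np.random.rand(4)       # input vector
W1 = np.random.rand(5, 4)   # first layer, no activation
W2 = np.random.rand(3, 5)   # second layer, no activation

two_layers = W2 @ (W1 @ x)  # output of the stacked linear layers
one_layer = (W2 @ W1) @ x   # a single layer with weight matrix W2 @ W1
print(np.allclose(two_layers, one_layer))  # True: stacking adds no expressive power
```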


Commonly used activation functions are:

  • Sigmoid function
  • Tanh function
  • ReLU
  • Leaky ReLU
  • ELU
  • Maxout

Suggestions for the activation layer: try ReLU first, because it converges quickly, although the result is not guaranteed to be good. If ReLU fails, try Leaky ReLU or Maxout, which handles most cases. The Tanh function works well in text and audio processing.
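For reference, minimal NumPy versions of the most common choices above (a sketch; the leaky slope 0.01 is just a typical default):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]
```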

3. Pooling layer

Pooling: also known as subsampling or downsampling. It is mainly used for feature dimensionality reduction, compressing the number of data and parameters, reducing overfitting, and improving the fault tolerance of the model. Common types are:

(1) Max Pooling: maximum pooling. We define a spatial neighborhood (for example, a 2x2 window) and take the largest element within that window of the rectified feature map. Max pooling has been shown to work better in practice.
(2) Average Pooling: average pooling. We define a spatial neighborhood (e.g., a 2x2 window) and compute the mean of the elements within that window of the rectified feature map.
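A minimal sketch of 2x2 max pooling and average pooling with stride 2 on a single-channel feature map (illustrative values only):

```python
import numpy as np

def pool2x2(fmap, mode="max"):
    """2x2 pooling with stride 2 on a 2D feature map whose sides are even."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)  # group into 2x2 windows
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.array([[1., 3., 2., 4.],
                 [5., 6., 1., 2.],
                 [7., 2., 9., 0.],
                 [1., 4., 3., 8.]])
print(pool2x2(fmap, "max"))  # [[6. 4.] [7. 9.]]
print(pool2x2(fmap, "avg"))  # [[3.75 2.25] [3.5  5.  ]]
```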


4. Fully connected layer

After several rounds of convolution + activation + pooling, the model finally reaches the fully connected layer (output layer), by which point it has learned high-quality feature maps. Before the fully connected layer, if the number of neurons is too large and the learning capacity too strong, overfitting may occur. The dropout operation, which randomly drops some neurons from the network, can be introduced to address this; local response normalization (LRN), data augmentation, and similar operations can also be applied to increase robustness.
  The fully connected layer itself can be understood as a simple multi-class neural network (such as a BP neural network), and the final output is obtained through the softmax function, completing the model.
  Every neuron in one layer is connected by a weight to every neuron in the next, the same connection pattern as in a traditional neural network; the fully connected layer is usually placed at the end of the convolutional neural network.
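A minimal sketch of that final stage (flatten -> fully connected layer -> softmax), with made-up sizes:

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

features = np.random.rand(4, 4, 16)    # pooled feature maps from the last conv stage
x = features.reshape(-1)               # flatten: every value becomes an input neuron

W = np.random.rand(10, x.size) * 0.01  # every input connects to every one of 10 outputs
b = np.zeros(10)

scores = W @ x + b
probs = softmax(scores)                # class probabilities
print(probs.sum())                     # 1.0
```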


The general structure of a convolutional neural network is therefore: input layer -> (convolutional layer -> activation layer -> pooling layer) x N -> fully connected layer(s) -> output.

In addition: in a CNN, the convolutional layers at the front hold a small proportion of the parameters but account for most of the computation, while the fully connected layers at the back are the opposite; most CNN architectures share this property. Therefore, when optimizing for computational speed we focus on the convolutional layers, and when optimizing the number of parameters and weights we focus on the fully connected layers.
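A rough back-of-the-envelope illustration of this contrast (made-up layer sizes; cost measured in multiply-accumulates):

```python
# Convolutional layer: 128 filters of size 3x3x64 applied to a 56x56x64 input
# (padding keeps the spatial size at 56x56).
conv_params = 3 * 3 * 64 * 128 + 128      # ~74 thousand parameters
conv_macs = 3 * 3 * 64 * 128 * 56 * 56    # ~231 million multiply-accumulates

# Fully connected layer: 4096 inputs -> 4096 outputs
fc_params = 4096 * 4096 + 4096            # ~16.8 million parameters
fc_macs = 4096 * 4096                     # ~16.8 million multiply-accumulates

print(conv_params, conv_macs)  # few parameters, heavy computation
print(fc_params, fc_macs)      # many parameters, light computation
```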

5. Summary

(1) Convolutional layer: local perception. When the human brain recognizes a picture, it does not perceive the whole image at once; it first perceives each feature locally and then combines the local information at a higher level to obtain the global information.
(2) Activation layer: Do a nonlinear mapping on the output of the convolutional layer.
(3) Pooling layer: mainly used for feature dimension reduction, compressing the number of data and parameters, reducing over-fitting, and improving the fault tolerance of the model.
(4) Fully connected layer: connect all the features, and send the output value to the classifier (such as softmax classifier).
