Convolutional Neural Network - Input Layer, Convolutional Layer, Activation Function, Pooling Layer, Fully Connected Layer


Reprinted from: https://blog.csdn.net/yjl9122/article/details/70198357

A Convolutional Neural Network (CNN) consists of an input layer, convolutional layers, activation functions, pooling layers, and fully connected layers, i.e. INPUT (input layer) - CONV (convolutional layer) - RELU (activation function) - POOL (pooling layer) - FC (fully connected layer).

convolutional layer

The convolutional layer is used for feature extraction, as follows: 


The input image is 32*32*3, where 3 is its depth (i.e. the R, G, B channels), and the convolutional layer uses a 5*5*3 filter (receptive field). Note that the depth of the filter must be the same as the depth of the input image. Convolving one filter with the input image yields a 28*28*1 feature map; the figure above uses two filters and therefore obtains two feature maps.

We usually stack multiple convolutional layers to obtain deeper feature maps, as follows: 


 

The process of convolution is illustrated as follows: 


 
The input image and the corresponding positions of the filter are multiplied element-wise and summed, and finally the bias b is added to obtain an element of the feature map. As shown in the figure, the first depth slice of filter w0 is multiplied element-wise with the corresponding elements in the blue box of the input image and summed, giving 0; the other two depth slices give 2 and 0; so 0+2+0+1 = 3 (the 1 being the bias b) is the first element of the feature map on the right side of the figure. After this step, the blue box slides over the input image with stride (step size) = 2, as follows:
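The following is a minimal NumPy sketch of this sliding-window computation, using the shapes from the figure (a 5*5*3 input zero-padded to 7*7*3, one 3*3*3 filter w0 with bias b0 = 1, stride 2); the values are random placeholders rather than the numbers in the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=(5, 5, 3))        # 5*5*3 input image, depth 3 (R, G, B)
w0 = rng.integers(-1, 2, size=(3, 3, 3))      # 3*3*3 filter; its depth matches the input depth
b0 = 1                                        # bias term b
stride, pad = 2, 1

x_pad = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))       # zero pad -> 7*7*3

out_size = (x_pad.shape[0] - w0.shape[0]) // stride + 1   # (7 - 3) / 2 + 1 = 3
feature_map = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        # element-wise multiply the 3*3*3 window with the filter,
        # sum over all positions and depths, then add the bias
        window = x_pad[i*stride:i*stride+3, j*stride:j*stride+3, :]
        feature_map[i, j] = np.sum(window * w0) + b0

print(feature_map.shape)   # (3, 3) -> one 3*3*1 feature map per filter
```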

As shown in the figure above, after completing the convolution a 3*3*1 feature map is obtained. Another point to note is the zero pad, i.e. adding a border of zeros around the image (this has no effect on the original input). Commonly:

F=3 => zero pad with 1 
F=5 => zero pad with 2 
F=7 => zero pad with 3 
The border width is an empirical value; zero padding is added so that the feature map after convolution has the same spatial size as the input image. For example:

The input is 5*5*3, the filter is 3*3*3, and the zero pad is 1, so the padded input image is 7*7*3 and the feature map after convolution is 5*5*1 ((7-3)/1+1 = 5), the same spatial size as the input image. 
In general, the size of the output feature map is calculated as: output size = (W - F + 2P)/S + 1, where W is the input size, F is the filter (receptive field) size, P is the zero padding, and S is the stride.
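As a quick check of this formula, here is a minimal helper (the function name conv_output_size is just illustrative) that reproduces the two examples above:

```python
def conv_output_size(W, F, P, S):
    """Spatial size of the feature map: (W - F + 2P) / S + 1."""
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=32, F=5, P=0, S=1))  # 28 -> the 32*32*3 input with a 5*5*3 filter
print(conv_output_size(W=5, F=3, P=1, S=1))   # 5  -> the 5*5*3 input with zero pad 1
```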

Another feature of the convolutional layer is the principle of "weight sharing", as shown below: 


Without this principle, suppose the output consists of 10 feature maps of size 32*32*1; then each feature map has 1024 neurons, and each neuron corresponds to a 5*5*3 region of the input image, i.e. 75 connections and therefore 75 weight parameters per neuron. That would give 75*1024*10 = 768,000 weight parameters in total, which is far too complex. Convolutional neural networks therefore introduce the "weight sharing" principle: the 75 weight parameters of each neuron on a feature map are shared by all neurons on that map, so only 75*10 = 750 weight parameters are needed. The threshold (bias) of each feature map is also shared, so 10 thresholds are needed, for a total of 750 + 10 = 760 parameters.

The so-called weight sharing means that, given an input image, a single filter is used to scan the whole image. The numbers in the filter are called the weights. Since every position of the image is scanned by the same filter, the weights are identical everywhere, i.e. shared.
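A quick back-of-the-envelope check of the counts above (plain arithmetic, no framework needed):

```python
# Parameter count for the example above: 10 feature maps of 32*32,
# each neuron looking at a 5*5*3 region of the input.
neurons_per_map = 32 * 32            # 1024
weights_per_neuron = 5 * 5 * 3       # 75

no_sharing = weights_per_neuron * neurons_per_map * 10
print(no_sharing)                    # 768000 weights without weight sharing

with_sharing = weights_per_neuron * 10 + 10   # 750 shared weights + 10 shared thresholds
print(with_sharing)                  # 760 parameters with weight sharing
```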

activation function

If a very small change in the input produced a completely different output, that would be something we do not want. To model more subtle changes, the input and output values should not be restricted to just 0 or 1, but can be any number between 0 and 1.

The activation function is used to introduce nonlinearity, because the expressive power of a linear model alone is not enough. 
This sentence is easy to understand literally, but what does it mean when dealing with images? We know that in a neural network we mainly use convolution to process images, i.e. each pixel is assigned a weight, which is obviously a linear operation. But our samples are not necessarily linearly separable. To solve this problem, we can either apply linear transformations or introduce nonlinear factors to handle problems that linear models cannot solve. 
A remark comparing the activation functions below: because the mathematics of neural networks requires differentiability everywhere, the chosen activation function must guarantee that the mapping from input to output is differentiable, and training is an iterative, cyclic computation, so the value of each neuron keeps changing during each training cycle. 
As a result, tanh works very well when the feature differences are obvious, and it keeps amplifying the feature effect during the iterations. However, when the features are more complex or the differences are not particularly large and finer classification judgments are required, sigmoid works better. 
One more thing to note: if sigmoid or tanh is used as the activation function, the input must be normalized, otherwise the activation values fall into the flat (saturated) region and the outputs of the hidden layer all converge to the same values; ReLU does not need the input to be normalized to prevent saturation.

Constructing a sparse matrix, i.e. sparsity: this property removes redundancy in the data while retaining its characteristics as much as possible, giving a matrix in which most entries are 0. This property mainly concerns ReLU, which is max(0, x). Because the neural network keeps recomputing, it is in effect constantly trying to express the data with a matrix that is mostly 0; thanks to this sparsity, the method becomes faster and more effective. That is why most current convolutional neural networks basically use the ReLU function.
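As a tiny illustration of this sparsity (a toy example with random zero-mean pre-activations, not data from any real network), roughly half of the outputs of max(0, x) are exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal(1000)        # toy zero-mean pre-activations
a = np.maximum(0, z)                 # ReLU: max(0, x)

print(np.mean(a == 0))               # fraction of exactly-zero outputs, roughly 0.5 here
```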

Commonly used activation functions 
An activation function should have the following properties: 
(1) Nonlinear. A linear activation layer has no effect in a deep neural network, because its output is still just a linear transformation of the input.
(2) Continuously differentiable. This is required by gradient descent. 
(3) Preferably no saturated range. When there is a saturated interval and the optimization enters it, the gradient is approximately 0 and network learning stops. 
(4) Monotonic. When the activation function is monotonic, the error function of a single-layer neural network is convex and easy to optimize. 
(5) Approximately linear near the origin, so that when the weights are initialized to small random values close to 0 the network can learn quickly without tuning the initial values. 
At present, the commonly used activation functions have only some of the above properties; none of them has all of them.

  • Sigmoid function 


    Rarely used nowadays. 
    Disadvantages: 
    ∙ The gradient is very small in the saturated regions. Since back-propagation passes the gradient from later layers to earlier layers multiplicatively, when the number of layers is large the gradient reaching the early layers becomes very small and the network weights cannot be updated effectively; this is the gradient-vanishing problem. If the weights are initialized so that f(x) is saturated, the network can hardly update its weights at all. 
    ∙ The output is not zero-centered.

  • Tanh function 


    tanh(x) = 2σ(2x) − 1

    where σ(x) is the sigmoid function; tanh still has the saturation problem.

  • ReLU function 


    f(x) = max(0, x)

    An activation function popularized by Alex Krizhevsky's network in 2012. Its adoption largely solved the gradient-vanishing problem that the BP algorithm encounters when optimizing deep neural networks. 

    Advantages: 
    ∙ When x > 0 the gradient is always 1, so there is no gradient-vanishing problem and convergence is fast; 
    ∙ Increased network sparsity. When x < 0 the output of the unit is 0. The more neurons that output 0 after training, the greater the sparsity, the more representative the extracted features and the stronger the generalization ability. In other words, for the same effect, the fewer neurons that actually fire, the better the network generalizes. 
    ∙ The amount of computation is small. 
    Disadvantage: 
    If a gradient flowing back from a later layer is particularly large, W becomes particularly large after the update, so the input to this unit becomes < 0 and its output becomes 0. The unit then 'dies' and is never updated again. With a relatively large learning rate, 40% of the neurons may 'die' at the start of training, so the learning rate must be set carefully. 
    From these advantages and disadvantages, max(0, x) is a double-edged sword: it creates sparsity in the network, but it may also leave many neurons permanently 'dead', so a trade-off is required.

  • Leaky ReLU function 


    f(x) = x if x > 0, otherwise αx (α is a small constant such as 0.01) 
    It mitigates the dying-ReLU problem, but loses some sparsity and adds a hyperparameter. Its benefits are currently not clearly established.

  • Maxout function 


    f(x) = max(w1·x + b1, w2·x + b2)

A generalization of ReLU and Leaky ReLU that also mitigates the dying-neuron problem, but it loses some sparsity, and each such nonlinearity doubles the number of parameters.

In practice, ReLU is the most commonly used activation function. Pay attention to the learning-rate setting and monitor the proportion of dead nodes.
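For reference, here is a minimal NumPy sketch of the activation functions discussed above (the function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                    # equivalently 2*sigmoid(2*x) - 1

def relu(x):
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for f in (sigmoid, tanh, relu, leaky_relu):
    print(f.__name__, f(x))
```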

pooling layer

The pooling layer compresses the input feature map: on the one hand it makes the feature map smaller and reduces the computational complexity of the network; on the other hand it compresses the features and extracts the main ones, as follows: 



There are generally two types of pooling operations, average pooling (Avg Pooling) and max pooling, as follows: 



Here, too, a 2*2 filter is used; max pooling takes the maximum value in each region, with stride = 2, so the main features are extracted from the original feature map to obtain the image on the right. 
(Average pooling is not used much now; it sums the elements of each 2*2 region and divides by 4.) In general the filter is 2*2 (at most 3*3) with stride 2, compressing the feature map to 1/4 of its original size. 
Note: the pooling operation here shrinks the feature map, which may reduce the accuracy of the network, so this can be compensated by increasing the depth of the feature map (here the depth becomes twice the original). 
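A minimal sketch of this pooling step (the pool2d helper below is illustrative, written for a single square 2-D feature map; with the default size=2, stride=2 it matches the non-overlapping 2*2 pooling described above):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a single 2-D feature map with a size*size window."""
    out = (x.shape[0] - size) // stride + 1
    y = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            window = x[i*stride:i*stride+size, j*stride:j*stride+size]
            y[i, j] = window.max() if mode == "max" else window.mean()
    return y

fmap = np.array([[1., 1., 2., 4.],
                 [5., 6., 7., 8.],
                 [3., 2., 1., 0.],
                 [1., 2., 3., 4.]])
print(pool2d(fmap, mode="max"))   # maxima of each 2*2 region
print(pool2d(fmap, mode="avg"))   # averages of each 2*2 region (sum / 4)
```

Setting stride smaller than size in the same helper gives the overlapping pooling discussed further below.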


In convolutional neural networks we often encounter pooling operations, and a pooling layer usually follows a convolutional layer. Pooling reduces the size of the feature maps output by the convolutional layer while improving the results (less prone to overfitting).

Why is it possible to reduce the dimension? 
Because images have a "stationarity" property: a feature that is useful in one region of the image is likely to be just as useful in another. Therefore, to describe large images, a natural idea is to aggregate statistics of the features at different locations. For example, one can compute the average (or maximum) value of a particular feature over a region of the image to represent that region.

  • General Pooling

    Pooling operates on non-overlapping regions of the image (this is different from the convolution operation); the process is as follows.

    We define the size of the pooling window as sizeX (the side length of the red square in the figure below), and define the horizontal/vertical displacement between two adjacent pooling windows as stride. In general pooling, the pooling windows do not overlap, so sizeX = stride. 



    The most common pooling operations are average (mean) pooling and max pooling: 
    Average pooling: take the average value of the image region as the pooled value of that region. 
    Max pooling: take the maximum value of the image region as the pooled value of that region.

  • Overlapping Pooling 
    As the name suggests, adjacent pooling windows overlap, i.e. sizeX > stride. 
    In the paper A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," NIPS, 2012, the authors used overlapping pooling; with all other settings unchanged, the top-1 and top-5 error rates were reduced by 0.4% and 0.3%, respectively.

  • Spatial Pyramid Pooling 
    Spatial pyramid pooling can convert the convolutional features of an image of any scale into features of the same dimension. This not only allows a CNN to process images of any scale, but also avoids the information loss caused by cropping and warping operations, which is of great significance. 
    A general CNN requires the input image size to be fixed, because the fully connected layer needs a fixed input dimension, whereas the convolution operation places no restriction on the image scale. So the authors propose spatial pyramid pooling: first convolve the image, then convert the result into features of the same dimension before feeding them to the fully connected layer; this extends the CNN to images of any size. 

    The idea of spatial pyramid pooling comes from the spatial pyramid model: it turns a single pooling into pooling at multiple scales. Applying pooling windows of different sizes to the convolutional features gives 1x1, 2x2, and 4x4 pooling results. Since conv5 has 256 filters, we obtain one 256-dimensional feature, four 256-dimensional features, and sixteen 256-dimensional features; these 21 256-dimensional features are then concatenated and fed into the fully connected layer. In this way, images of different sizes are converted into features of the same dimension. 


    To obtain pooling results of the same size for images of different sizes, the size and stride of the pooling window must be computed dynamically from the size of the image. Assuming the output of conv5 has size a*a and we want an n*n pooling result, we can set the window size sizeX = ceil(a/n) and the stride = floor(a/n), taking a conv5 output of 13*13 as an example (see the sketch below).
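A minimal sketch of this window/stride computation (assuming the ceil/floor rule stated above), applied to the 13*13 conv5 output and the 1x1, 2x2, 4x4 pyramid levels:

```python
import math

def spp_window_and_stride(a, n):
    """Window size and stride that pool an a*a conv5 map into an n*n grid."""
    return math.ceil(a / n), math.floor(a / n)

for n in (1, 2, 4):                          # the 1x1, 2x2, 4x4 pyramid levels
    print(n, spp_window_and_stride(13, n))   # conv5 output of 13*13 as in the text
```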

fully connected layer

The fully connected layer concatenates all the features and sends the output to a classifier (such as a softmax classifier).
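A minimal sketch of this final step, assuming (purely for illustration) that the last pooling layer produced a 4*4*16 feature volume and that there are 10 output classes:

```python
import numpy as np

rng = np.random.default_rng(0)

features = rng.standard_normal((4, 4, 16))       # hypothetical output of the last pooling layer

x = features.reshape(-1)                         # concatenate ("flatten") all features: 256 values
W = rng.standard_normal((10, x.size)) * 0.01     # fully connected weights for 10 classes
b = np.zeros(10)

logits = W @ x + b                               # fully connected layer
probs = np.exp(logits - logits.max())            # softmax classifier
probs /= probs.sum()
print(probs.argmax(), probs.sum())               # predicted class; probabilities sum to 1
```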

The overall structure is roughly as follows: 
(figure: the overall network structure, INPUT - CONV - RELU - POOL - FC)
