Deep Learning TF-7: Convolutional Neural Network (CNN)

1. Overview of Convolutional Neural Networks

    A Convolutional Neural Network (CNN) is a feedforward neural network composed of several convolutional layers and pooling layers. Its basic structure consists of an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer. Usually several convolutional and pooling layers are used and arranged alternately: a convolutional layer is followed by a pooling layer, the pooling layer is followed by another convolutional layer, and so on. Each neuron of an output feature map in a convolutional layer is locally connected to its input, and its value is obtained by taking the weighted sum of the corresponding connection weights and the local input and then adding a bias. Because this process is equivalent to convolution, the network gets its name.
    The convolutional neural network evolved from the multilayer perceptron (MLP). For high-dimensional input, it is impractical to connect every neuron to all neurons in the previous layer, so partial connections (local receptive fields) are used instead. As the figure below shows, this greatly reduces the number of connections and therefore the number of parameters.
[Figure: fully connected layers vs. locally connected layers]
    The convolutional neural network has three structural characteristics: local connections, weight sharing, and downsampling, which make it perform well in image processing. Compared with other neural networks, its particularity lies mainly in two aspects: weight sharing and local connections. Weight sharing makes the network structure of a CNN more similar to a biological neural network. With local connections, unlike a traditional neural network in which every neuron in layer n-1 is connected to all neurons in layer n, each neuron in layer n-1 is connected only to some of the neurons in layer n. Together, these two features reduce the complexity of the network model and the number of weights.
    A sliding window scans the entire input, fusing local information into global information; because the weights are shared (w stays the same) while the window slides, the number of parameters does not grow. In a convolutional neural network, the convolution kernel (or filter) in a convolutional layer acts like such a sliding window: it slides across the entire input image with a specific stride, and the convolution operation produces a feature map of the input image. This feature map contains the local features extracted by the convolutional layer, and the convolution kernel shares its parameters across all positions. During training, the weights of the convolution kernel are updated until training is complete.
    Why does a convolutional layer have multiple convolution kernels? Because weight sharing means that each convolution kernel can extract only one kind of feature, multiple kernels are needed to increase the expressive power of the CNN. The number of convolution kernels in each convolutional layer is a hyperparameter.
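As a small illustration of weight sharing, here is a minimal sketch using tf.keras (the shapes and the choice of 16 kernels are hypothetical): each 3×3 kernel spans the 3 input channels and has one bias, and the parameter count does not depend on the spatial size of the input.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical layer: 16 kernels of size 3x3 applied to a 32x32 RGB input.
conv = layers.Conv2D(filters=16, kernel_size=3)
conv.build(input_shape=(None, 32, 32, 3))

# 3*3*3 shared weights per kernel, 16 kernels, plus 16 biases = 448 parameters,
# no matter how large the input image is -- this is the effect of weight sharing.
print(conv.count_params())  # 448
```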

2. The structure of the convolutional neural network

    Convolutional neural networks generally include convolutional layers, pooling layers and fully connected layers.
    The combination of convolutional layers and pooling layers enables a CNN to extract good features from an image (feature extraction). The role of the convolutional layer is to extract image features; the role of the pooling layer is to subsample those features, which lowers the resolution, reduces the number of training parameters, and also reduces overfitting of the network model. Convolutional and pooling layers generally appear alternately in the network. One convolutional layer plus one pooling layer forms one feature-extraction stage, but not every convolutional layer is followed by a pooling layer, and many networks have only about three pooling layers. The end of the network usually has one or two fully connected layers, which combine the extracted feature maps, and the final classification result is obtained through a classifier, as in the sketch below.
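As a concrete illustration of this alternating structure, here is a minimal sketch assuming the tf.keras Sequential API, a 28×28 grayscale input, and 10 output classes (all hypothetical choices):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Alternating convolution/pooling stages followed by fully connected layers.
model = tf.keras.Sequential([
    layers.Conv2D(32, 3, padding='same', activation='relu',
                  input_shape=(28, 28, 1)),   # feature-extraction stage 1
    layers.MaxPool2D(2),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPool2D(2),                      # feature-extraction stage 2
    layers.Flatten(),
    layers.Dense(128, activation='relu'),     # fully connected layer
    layers.Dense(10, activation='softmax'),   # classifier output
])
model.summary()
```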

1. Convolutional layer

    Convolution is the process of extracting features. As the depth of the convolutional neural network increases, increasingly high-level features can be extracted, which are closer to the essence of the objects. Generally speaking, high-level features are useful for distinguishing different categories of objects, while low-level features are useful for distinguishing similar objects.
    A convolutional layer usually contains multiple learnable convolution kernels. The feature maps output by the previous layer are convolved with the kernels, that is, a dot-product operation is performed between each input patch and the kernel, and the result is fed into an activation function to obtain the output feature map. Each output feature map may combine the convolutions of several input feature maps. The output value $a_j^l$ of the j-th feature map of convolutional layer $l$ is computed as

$$a_j^l = f\Big(\sum_{i \in M_j} a_i^{l-1} * k_{ij}^l + b_j^l\Big)$$

where $M_j$ denotes the set of selected input feature maps, $a_i^{l-1}$ is the i-th feature map of the previous layer, $k_{ij}^l$ is a learnable convolution kernel, $b_j^l$ is a bias, and $f$ is the activation function. The figure below shows the convolution process.
[Figure: convolution of a 5×5 input with a 3×3 kernel, stride 1]
The convolution kernel k is usually regarded as a sliding window that moves forward with a set stride. Here the input image is 5×5 (M = 5), the convolution kernel is 3×3 (k = 3), and the stride is 1 (s = 1). According to the output-size formula of the convolutional layer without padding,

$$N = \frac{M - k}{s} + 1,$$

the output image size is N = (5 - 3)/1 + 1 = 3. The convolution process therefore convolves the 5×5 input image with the 3×3 kernel to obtain a 3×3 output image.
There are two disadvantages of such convolution:

  • Each convolution makes the image smaller. If the image is small and many convolutions are applied, only one pixel may be left in the end.
  • The edge pixels of the input image take part in only one calculation, while the central pixels are covered by many convolutions, which means the edge information of the image is lost. To solve these two problems, the input image needs to be padded (Padding).
1.1 Padding

    A layer of pixels is padded around the input image matrix; usually the padded elements are 0, and here the padding width is one pixel, i.e. P = 1. After padding, the original edge pixels are no longer on the edge and take part in multiple calculations, and the edge pixels of the output image are influenced by the edge pixels of the input, which alleviates the loss of edge information. In addition, according to the output-size formula of the convolutional layer, the output feature map becomes 5×5, the same size as the input image, which solves the problem that convolution shrinks the image.
Common padding modes are Valid padding and Same padding:

  • Valid padding
    No padding is used: an M×M image is convolved with a k×k kernel, and with stride 1 the output is (M-k+1)×(M-k+1).
  • Same padding
    Padding is chosen so that the output feature map has the same size as the input image. In this case the padding width is P = (k-1)/2, which requires the kernel size k to be odd.

    In computer vision, k is usually odd. On the one hand, this ensures that the padding width P is an integer when Same padding is used, so the original image is padded symmetrically; on the other hand, a kernel with odd width has a central pixel, which can be used to indicate the position of the kernel. A short comparison of the two modes is sketched below.
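A minimal sketch of the two padding modes, assuming tf.keras and a hypothetical 5×5 single-channel input:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 5, 5, 1))  # batch of one 5x5 single-channel image

# Valid padding: no padding, the output shrinks to (5 - 3 + 1) = 3.
valid = layers.Conv2D(1, 3, padding='valid')(x)
print(valid.shape)  # (1, 3, 3, 1)

# Same padding: zeros are added (P = (3 - 1) / 2 = 1) so the output stays 5x5.
same = layers.Conv2D(1, 3, padding='same')(x)
print(same.shape)   # (1, 5, 5, 1)
```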

1.2 Stride

[Figure: sliding the kernel with stride 1 (red frame) and stride 2 (blue frame)]
    The stride is the distance the convolution kernel moves each time it slides over the input, as illustrated in the figure above: moving by the red frame corresponds to stride = 1, while moving by the blue frame corresponds to stride = 2. Taking the stride into account, suppose the input size is M×M, the kernel size is k×k, stride = s, and padding = p; then the output size of the convolutional layer is

$$N = \left\lfloor \frac{M - k + 2p}{s} \right\rfloor + 1.$$
The stride can also be used to perform dimensionality reduction (downsampling).
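A small helper in plain Python (the function name and the test values are hypothetical) that evaluates this formula and checks it against the examples above:

```python
def conv_output_size(m, k, s=1, p=0):
    """Output size of a convolution: N = floor((M - k + 2p) / s) + 1."""
    return (m - k + 2 * p) // s + 1

print(conv_output_size(5, 3))            # 3 -> 5x5 input, 3x3 kernel, no padding
print(conv_output_size(5, 3, p=1))       # 5 -> Same padding keeps the size
print(conv_output_size(7, 3, s=2, p=1))  # 4 -> stride 2 roughly halves the resolution
```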

1.3 Multi-channel calculation

    Besides length and width, a convolution kernel also has a number of channels. The first thing to be clear about is that the number of channels of a single kernel equals the number of channels of the input image: if the image is in RGB mode, the kernel has size h×w×3. When there is only one kernel, the output of the convolution has a single channel, and the computation is simple: the corresponding positions of each channel are multiplied elementwise, and the results of the different channels are added together.
    Generally there is more than one convolution kernel, and the multi-kernel case is not complicated: each kernel performs the single-kernel operation described above, and the resulting feature maps are stacked together, as in the sketch below.
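A minimal sketch with hypothetical shapes (tf.keras) showing that each kernel spans all input channels and that the number of kernels becomes the number of output channels:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 32, 32, 3))        # one RGB image: 3 input channels
conv = layers.Conv2D(filters=8, kernel_size=3, padding='same')
y = conv(x)

print(conv.kernel.shape)  # (3, 3, 3, 8): each 3x3 kernel has 3 channels, 8 kernels in total
print(y.shape)            # (1, 32, 32, 8): one output channel per kernel
```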

1.4 layers.Conv2D

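A minimal sketch of how `layers.Conv2D` is typically used in tf.keras; the batch size, image size, and layer parameters here are hypothetical:

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((4, 28, 28, 3))   # a batch of 4 RGB images

# 16 kernels of size 3x3, stride 1, Same padding, ReLU activation.
conv = layers.Conv2D(filters=16, kernel_size=3, strides=1,
                     padding='same', activation='relu')
out = conv(x)
print(out.shape)                       # (4, 28, 28, 16)

# The layer's trainable variables are the shared kernel weights and the biases.
print(conv.kernel.shape, conv.bias.shape)  # (3, 3, 3, 16) (16,)
```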

1.5 tf.nn.conv2d

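A minimal sketch of the lower-level `tf.nn.conv2d` interface, where the kernel tensor is created explicitly (the shapes are hypothetical):

```python
import tensorflow as tf

x = tf.random.normal((1, 5, 5, 3))   # [batch, height, width, in_channels]
w = tf.random.normal((3, 3, 3, 4))   # [k_height, k_width, in_channels, out_channels]

# Valid padding, stride 1: output spatial size is (5 - 3) / 1 + 1 = 3.
y_valid = tf.nn.conv2d(x, w, strides=1, padding='VALID')
print(y_valid.shape)                 # (1, 3, 3, 4)

# Same padding, stride 1: the output keeps the 5x5 spatial size.
y_same = tf.nn.conv2d(x, w, strides=1, padding='SAME')
print(y_same.shape)                  # (1, 5, 5, 4)
```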

2. Pooling layer

    The main role of the pooling layer is dimensionality reduction: the result of the convolution is downsampled. Common variants are max pooling, average pooling, min pooling, and stochastic pooling. The pooling layer has far fewer connections than the convolutional layer; by reducing the dimensionality of the features it helps avoid overfitting, and it also gives the pooled features a degree of translation invariance. Pooling exploits the local correlation of images to downsample them, reducing the amount of data while retaining useful information.
    Mean (average) pooling takes the mean of the feature points in the window, while max pooling takes their maximum. Stochastic pooling lies somewhere in between: each pixel is assigned a probability according to its value, and the output is sampled according to these probabilities, so on average it behaves like mean pooling, while locally it follows the rule of max pooling.
    According to Boureau's analysis, during feature extraction mean pooling reduces the increase in estimation variance caused by the limited neighborhood size and preserves more of the image's background information, while max pooling reduces the shift in the estimated mean caused by convolutional-layer parameter errors and preserves more texture information. Stochastic pooling can retain the information of mean pooling, but its random probabilities are introduced artificially, and their setting has a large, hard-to-estimate influence on the result.

2.1 Pooling in practice


layers.MaxPool2D(2, strides=2)
The first 2 means the pooling window size is 2×2, and strides=2 means the stride is 2.
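A minimal runnable sketch comparing max and average pooling in tf.keras (the 4×4 input values are hypothetical):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.reshape(tf.range(16, dtype=tf.float32), (1, 4, 4, 1))  # a 4x4 single-channel "image"

max_pool = layers.MaxPool2D(pool_size=2, strides=2)(x)
avg_pool = layers.AveragePooling2D(pool_size=2, strides=2)(x)

print(max_pool[0, :, :, 0].numpy())  # [[ 5.  7.] [13. 15.]] -- maximum of each 2x2 block
print(avg_pool[0, :, :, 0].numpy())  # [[ 2.5  4.5] [10.5 12.5]] -- mean of each 2x2 block
```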

2.2 Upsampling (upsample)

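A minimal sketch of upsampling with `layers.UpSampling2D`, which repeats rows and columns (the 2×2 input is hypothetical):

```python
import tensorflow as tf
from tensorflow.keras import layers

x = tf.constant([[1., 2.],
                 [3., 4.]])
x = tf.reshape(x, (1, 2, 2, 1))       # [batch, height, width, channels]

up = layers.UpSampling2D(size=2)(x)   # nearest-neighbor upsampling by a factor of 2
print(up[0, :, :, 0].numpy())
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```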

2.3 ReLU layer

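A minimal sketch of applying ReLU, which zeroes out negative activations (the input values are hypothetical):

```python
import tensorflow as tf

x = tf.constant([[-2.0, -1.0, 0.0, 1.0, 2.0]])

print(tf.nn.relu(x).numpy())          # [[0. 0. 0. 1. 2.]] -- negatives are clipped to zero

# The same operation as a Keras layer, e.g. inside a Sequential model:
relu_layer = tf.keras.layers.ReLU()
print(relu_layer(x).numpy())          # [[0. 0. 0. 1. 2.]]
```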

3. Fully connected layer

    The fully connected layer flattens the output of the convolutional and pooling layers into one-dimensional form. It is essentially an ordinary neural network: the convolutional and pooling layers only extract features and reduce the number of parameters, and to produce the final output a fully connected layer is needed to perform the classification or regression task.
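A minimal sketch of the fully connected head (the feature-map shape and the choice of 10 classes are hypothetical):

```python
import tensorflow as tf
from tensorflow.keras import layers

features = tf.random.normal((4, 7, 7, 64))    # feature maps from the conv/pool stages

x = layers.Flatten()(features)                # (4, 7*7*64) = (4, 3136): one long vector per image
x = layers.Dense(128, activation='relu')(x)   # ordinary fully connected layer
logits = layers.Dense(10)(x)                  # classification output for 10 classes
print(logits.shape)                           # (4, 10)
```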

