An intuitive understanding of convolutional neural networks

First, let us look at a familiar example of a convolutional neural network:


Before starting, it is necessary to introduce some of the terminology used in convolutional neural networks.

Convolution layer: the feature maps produced by applying the convolution kernels (filters). In the figure, these are the C layers.

Sampling layer: also known as the pooling layer; its main purpose is to reduce the number of features from the previous layer. If the filter size is 2, pooling reduces the number of features by a factor of 2*2 = 4.

Local receptive field: the input image is divided into many small square patches, and each such patch is called a local receptive field.

Kernel: A matrix containing weights.
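To make the kernel and local receptive field concrete, here is a minimal NumPy sketch (the matrix values are made up for illustration): the kernel is multiplied element-wise with one local receptive field and the products are summed, giving a single value of the feature map.

```python
import numpy as np

# one 3*3 local receptive field taken from the input image (illustrative values)
receptive_field = np.array([[1, 0, 1],
                            [0, 1, 0],
                            [1, 0, 1]], dtype=float)

# a 3*3 kernel: just a matrix containing weights (illustrative values)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

# element-wise product summed up -> one entry of the output feature map
feature_value = np.sum(receptive_field * kernel)
print(feature_value)
```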

Getting back to the main topic: how does this compare with an ordinary artificial neural network?

  The biggest difference between a convolutional network and an ordinary artificial neural network is the convolutional stage added at the front of the network. This stage consists mainly of alternating convolution layers and sampling layers.

1) Input layer to C1 layer (convolution operation)

  In the input layer there is an image of size 32*32, and the first convolution operation produces the C1 layer. Each neuron in C1 is connected to a 5*5 local receptive field in the input layer. There are (32-5+1)*(32-5+1) = 28*28 different positions for such a receptive field in the 32*32 input, and every one of these positions contributes one neuron, so each feature map of C1 has size 28*28.


There are 5 feature maps in the C1 layer, so the number of weights is (5*5+1)*5 = 130 (25 kernel weights plus 1 bias per map). There is no fixed formula for the number of feature maps; it is specified manually.
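As a quick sanity check of these numbers (assuming a "valid" convolution with stride 1, which is what the text describes), the following sketch recomputes the feature-map size and the weight count:

```python
input_size, kernel_size, num_maps = 32, 5, 5

# output side length of a valid, stride-1 convolution
output_size = input_size - kernel_size + 1                # 32 - 5 + 1 = 28

# each feature map has 5*5 kernel weights plus 1 bias
num_weights = (kernel_size * kernel_size + 1) * num_maps  # (5*5 + 1) * 5 = 130

print(output_size, num_weights)                           # 28 130
```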

2) From C1 layer to S2 layer (sampling or pooling operation)

  Downsampling: the main purpose of this process is to reduce the number of features from the previous layer. With a filter size of 2, pooling reduces the number of features by a factor of 2*2 = 4. The feature maps in the previous layer have size 28*28, so the feature maps obtained after sampling in this layer have size 14*14.

Note: The sampling process generally does not increase the number of feature maps, only the convolution process increases the number of feature maps.


  The number of weights required for this process is (1+1)*5 = 10: with mean pooling, each local receptive field is averaged and the result is multiplied by one trainable coefficient and shifted by one trainable bias, so each feature map has only two weight variables. With 5 feature maps in total, the number of weights required is 10.
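A minimal NumPy sketch of this pooling step, assuming 2*2 mean pooling with one trainable coefficient and one trainable bias per feature map (the variable names and initial values are only illustrative):

```python
import numpy as np

feature_map = np.random.rand(28, 28)    # one 28*28 feature map from C1
coeff, bias = 1.0, 0.0                  # the two trainable parameters of this map

# average every non-overlapping 2*2 block: 28*28 -> 14*14
pooled = feature_map.reshape(14, 2, 14, 2).mean(axis=(1, 3))

s2_map = coeff * pooled + bias          # apply the (1 + 1) trainable weights
print(s2_map.shape)                     # (14, 14)
```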

3) From layer S2 to layer C3

  This process is similar to the step from the input layer to C1. We set the filter size to 5*5, so the feature maps of the C3 layer have size 10*10: from each feature map in S2 there are (14-5+1)*(14-5+1) = 10*10 possible receptive-field positions. However, the neurons of a single feature map in C3 take their local receptive fields from several different feature maps in S2, rather than each C3 feature map being built from only one feature map in S2. Simply put, each feature map in C3 extracts features from different feature maps in S2, so that different types of features can be extracted and the number of connections can be controlled (a small sketch of this connection idea follows the weight count below).

  The C3 layer has 16 feature maps, so the number of weights required at this time is (5*5+1)*16 = 416.
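The following sketch only illustrates the idea of the partial connections described above; the connection table itself is hypothetical (in practice it is chosen by the network designer), and the sizes follow the text:

```python
s2_size, kernel_size = 14, 5
c3_size = s2_size - kernel_size + 1     # 14 - 5 + 1 = 10

# hypothetical connection table: C3 feature-map index -> S2 feature maps it reads
connections = {
    0: [0, 1, 2],
    1: [1, 2, 3],
    2: [0, 3, 4],
    # ... one entry per C3 feature map (16 in total)
}

for c3_map, s2_maps in connections.items():
    print(f"C3 map {c3_map} ({c3_size}*{c3_size}) extracts features from S2 maps {s2_maps}")
```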

4) From layer C3 to layer S4

  This layer is a typical pooling (sampling) process, basically the same as the step from C1 to S2. This layer has 16 feature maps and the filter size is 2*2. Through this sampling, each feature map shrinks by a factor of 2*2 = 4 (from 10*10 to 5*5), and the number of weights required for this process is (1+1)*16 = 32.
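A quick check of the numbers in this step, under the same assumption as in step 2 (one trainable coefficient and one bias per feature map):

```python
num_maps, map_size, pool = 16, 10, 2

s4_size = map_size // pool              # 10 / 2 = 5: each 10*10 map shrinks to 5*5
num_weights = (1 + 1) * num_maps        # (1 + 1) * 16 = 32

print(s4_size, num_weights)             # 5 32
```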

5) From layer S4 to layer C5

  The C5 layer has 120 feature maps in total, each of size 1*1, and the filter size is 5*5. Since the feature maps in S4 are themselves 5*5, this is equivalent to each neuron in C5 being fully connected to the feature maps of the S4 layer.

  So far, we have transformed the original 32*32 image into a vector of only 120 features, which can then be processed with an ordinary artificial neural network.
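A minimal NumPy sketch of why C5 behaves like a full connection: convolving a 5*5 feature map with a 5*5 kernel (valid, stride 1) produces a single 1*1 value, i.e. a weighted sum over the whole map. The random values are only illustrative, and each C5 neuron is assumed to read all 16 S4 maps:

```python
import numpy as np

s4_maps = np.random.rand(16, 5, 5)      # 16 S4 feature maps, each 5*5
kernels = np.random.rand(16, 5, 5)      # one 5*5 kernel per S4 map, for a single C5 neuron
bias = 0.0

# the 5*5 kernel covers the whole 5*5 map, so the "convolution" is one weighted sum
c5_neuron = np.sum(s4_maps * kernels) + bias   # a single 1*1 output value
print(c5_neuron)

# repeating this with 120 different sets of kernels gives the 120-feature vector
```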

6) From layer C5 to layer F6

  This process works like the weighted connections of an ordinary artificial neural network: F6 is fully connected to C5.
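A minimal sketch of this fully connected step. The text does not give the size of F6, so the 84 units used below are an assumption, and the tanh activation is likewise illustrative:

```python
import numpy as np

c5_vector = np.random.rand(120)         # the 120-feature vector from C5
weights = np.random.rand(84, 120)       # assumed 84 F6 units, each connected to all 120 inputs
biases = np.zeros(84)

f6 = np.tanh(weights @ c5_vector + biases)  # ordinary fully connected layer
print(f6.shape)                             # (84,)
```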

7) From the F6 layer to the output layer

  The output layer has 10 nodes in total. Each output node is an RBF (Radial Basis Function) unit, and each RBF unit computes the Euclidean distance between the input vector and its parameter vector: the farther the input is from the parameter vector, the larger the RBF output. The output of an RBF unit can be understood as a penalty term that measures how well the input pattern matches the model of the class associated with that unit.
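A minimal NumPy sketch of these RBF output units: each of the 10 units outputs the Euclidean distance between the incoming vector and its own parameter vector, so a smaller output means a better match. The 84-dimensional input follows the assumption made for F6 above, and the parameter values are only illustrative:

```python
import numpy as np

f6 = np.random.rand(84)                 # input vector coming from F6 (assumed size)
rbf_params = np.random.rand(10, 84)     # one parameter vector per output class

# Euclidean distance to each class's parameter vector (10 penalty values)
outputs = np.linalg.norm(f6 - rbf_params, axis=1)

predicted_class = int(np.argmin(outputs))   # the class whose parameters are closest
print(outputs.shape, predicted_class)
```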



