[Deep Learning] Basic CNN Operators

The CNN is a classic deep learning network. Starting from the operators used in LeNet-5, this article decomposes the CNN into its basic operators. Breaking the network down in this way before analyzing any particular architecture helps us see the details more clearly and grasp the pattern of model evolution more intuitively.

Because the classic LeNet-5 covers the most basic layers of a convolutional neural network (the convolutional layer, activation layer, pooling layer and fully connected layer), this article takes it as an example and analyzes these basic layer operators first, in order to grasp the core of CNN networks. As convolutional neural networks have developed, layers such as the deconvolution layer, the batch normalization (BN) layer and the shortcut connection have been proposed; these newer operators allow deep networks to solve image processing and other complex problems faster and more efficiently, but they are not covered here.

[Figure: LeNet-5 network structure, from "LeNet-5 Network Details"]

In a CNN, the feature values of a given layer are connected only to a local spatial region of the previous layer, and share the same set of parameters across the whole spatial extent of the image. This is the idea of local perception and parameter sharing, which mimics how the human brain recognizes objects: a single cortical neuron responds only to stimuli in a restricted region of space (its receptive field).


Convolution operator

The purpose of the convolution operation is to extract information or features from an image. Any image can be regarded as a matrix of numbers, and a particular set of numbers in that matrix can constitute a feature. The convolution operation scans this matrix and tries to mine features that are relevant to, or interpretable for, the image. Because the feature map is obtained by multiplying the kernel element-wise with the corresponding pixels of the input image and summing, changing the values of the kernel produces a different feature map.
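As a minimal sketch of this element-wise multiply-and-sum (assuming a single-channel input, stride 1 and no padding; the array values below are made up purely for illustration):

```python
import numpy as np

def conv2d_naive(image, kernel):
    """Direct sliding-window 2D convolution, stride 1, no padding.
    As in CNN frameworks, the kernel is applied without flipping."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            # multiply corresponding elements and sum -> one output pixel
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[0, 1, 0],
                   [1, -4, 1],
                   [0, 1, 0]], dtype=float)        # a Laplacian-like filter
print(conv2d_naive(image, kernel).shape)           # (3, 3)
```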


The important concepts in the convolution operator are:

  1. Convolution kernel (Kernel). In image processing, each output pixel is computed as a weighted average of a small region of pixels in the input image. The weights are defined by a function called the convolution kernel, which can be intuitively understood as a filter matrix. Commonly used kernel sizes are 3×3, 5×5, etc.
  2. Padding. Padding refers to how the boundaries of the input feature map (Feature Map) are handled. In order not to discard information from the original image, and so that deeper layers still receive a large enough amount of information, the boundary of the input feature map is usually padded first (generally with zeros) before the convolution is performed. Padding keeps the output feature map from becoming too small, or keeps it the same size as the input feature map.
  3. Stride. The stride is the number of pixels the convolution kernel moves at each step as it traverses the input feature map. With a stride of 1 it moves 1 pixel at a time; with a stride of 2 it moves 2 pixels at a time (that is, it skips 1 pixel).
  4. Output feature map size. Let the input feature map size be $I$, the convolution kernel size be $K$, the stride be $S$, and the number of padded pixels be $P$; then the output feature map size $O$ is

$$O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$$
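A quick sanity check of this formula as a small helper function (the example values match LeNet-5's first convolutional layer, which maps a 32×32 input through 5×5 kernels to 28×28):

```python
def conv_output_size(i, k, s=1, p=0):
    """O = floor((I - K + 2P) / S) + 1"""
    return (i - k + 2 * p) // s + 1

# LeNet-5 C1: 32x32 input, 5x5 kernel, stride 1, no padding -> 28x28 output
print(conv_output_size(32, 5))        # 28
# the same input with padding 2 keeps the spatial size unchanged
print(conv_output_size(32, 5, p=2))   # 32
```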


The following takes a 3×3 data matrix $I$ and a 3×3 convolution kernel $K$ as an example to illustrate the specific operation of the convolution operator:

The following shows the process of using one convolution kernel to obtain one feature map; in practical applications, multiple convolution kernels are used to obtain multiple output feature maps. Boxes of different colors indicate blocks of data, the same size as the convolution kernel, extracted in turn from the input feature map. To improve the computational efficiency of the convolution, an img2col (image-to-column) operation is applied: the data in each colored box is vectorized into a column of 3×3 elements, giving 9 column vectors that together form the vectorized data matrix. At the same time, the convolution kernel is unrolled into a row, ready to be multiplied with the columns of this matrix, as shown below.
[Figure: img2col, with input patches unrolled into columns and the kernel unrolled into a row]

Through matrix multiplication, the output feature matrix is obtained; finally, a col2img (column-to-feature-map) operation converts the resulting column vector back into a matrix, giving the output feature map, as shown below.
[Figure: matrix multiplication followed by col2img, producing the output feature map]


The above is the process by which the convolution operator performs feature map convolution via img2col. If the convolution is carried out step by step as in its mathematical definition, the memory reads during the operation are discontinuous, which increases time cost. The multiply-and-add of corresponding elements in convolution is essentially a vector inner product, so converting the convolution into a matrix multiplication via img2col can greatly improve its speed.

Only one convolution kernel is shown here, and one kernel generates only one feature map. In practice, to enhance the expressive power of the convolutional layer, many kernels are used to obtain multiple feature maps (for example, LeNet-5 uses 6 kernels in its first convolutional layer). When there are many kernels, computing them one by one wastes time and memory; but if all the kernels are arranged as rows of a matrix and multiplied with the column matrix converted from the input feature map, the output feature maps of all kernels are obtained at once, which greatly improves the speed of the convolution, as sketched below.
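The following NumPy sketch illustrates the img2col idea under simplifying assumptions (single-channel input, stride 1, no padding; the function and variable names are illustrative): every patch of the input becomes a column, every kernel becomes a row, and a single matrix multiplication yields all output feature maps at once.

```python
import numpy as np

def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one column."""
    ih, iw = image.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    cols = np.zeros((kh * kw, oh * ow))
    for r in range(oh):
        for c in range(ow):
            cols[:, r * ow + c] = image[r:r + kh, c:c + kw].ravel()
    return cols, (oh, ow)

image = np.random.rand(5, 5)
kernels = np.random.rand(6, 3, 3)        # e.g. 6 kernels, as in LeNet-5 C1

cols, (oh, ow) = im2col(image, 3, 3)     # patch matrix, shape (9, 9)
kernel_rows = kernels.reshape(6, -1)     # each kernel unrolled into a row: (6, 9)
out = kernel_rows @ cols                 # one matmul -> all feature maps: (6, 9)
feature_maps = out.reshape(6, oh, ow)    # col2img: back to 6 maps of size 3x3
print(feature_maps.shape)                # (6, 3, 3)
```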


It is also worth noting, regarding the visual interpretation of how convolutions extract features layer by layer, that the paper "Visualizing and Understanding Convolutional Networks" by Matthew D. Zeiler and Rob Fergus gives a good explanation. The following picture is from the CNN part of Fei-Fei Li's CS231n course: it reflects how convolution proceeds layer by layer from shallow features (such as edges) to deep features (such as categories).

[Figure: features learned layer by layer, from shallow edge-like features to deep category-level features (CS231n)]


Activation operator

The activation operator generally follows a convolutional layer or a fully connected layer. It activates some of the neurons in the network and passes the activation information on to the next layer. The significance of the activation operator is to introduce nonlinear operations into an otherwise highly linear neural network, giving the network strong fitting ability. The activation operator does not change the dimensions of the data; the input and output have the same shape. An activation operator is generally expressed by a specific function, called an activation function. There are many kinds of activation functions; some common basic ones are listed below.


| Function | Formula |
| --- | --- |
| Sigmoid | $S(x)=\frac{1}{1+e^{-x}}$ |
| Tanh | $\tanh(x)=\frac{1-e^{-2x}}{1+e^{-2x}}$ |
| ReLU | $\mathrm{ReLU}(x)=\max(x,0)$ |
| Softmax | $\sigma(z)_j=\frac{e^{z_j}}{\sum_{i}e^{z_i}}$, where $z_i$ denotes the $i$-th output value |

Among them, the ReLU function is the activation function most often used in neural networks, and a series of variants such as LeakyReLU, ELU and SELU are based on it. Softmax, as a special activation function, is often used in the output layer of a classification network, where it outputs the probability that an input sample belongs to each class.
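A small NumPy sketch of the four activation functions in the table above (softmax is written in the numerically stable form that subtracts the maximum before exponentiating):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)                      # equivalent to (1 - e^{-2x}) / (1 + e^{-2x})

def relu(x):
    return np.maximum(x, 0)

def softmax(z):
    e = np.exp(z - np.max(z))              # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(sigmoid(x), relu(x), softmax(x))
```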


Factors to be considered in activation function design:

  1. Nonlinearity: when the activation function is nonlinear, a two-layer neural network can be proven to be a universal function approximator. If nonlinearity is lost, the entire network is equivalent to a single-layer linear model.
  2. Continuous differentiability: this property is necessary for gradient-based optimization methods. If a function with isolated non-differentiable points is chosen (such as ReLU at 0), the derivative at those points has to be defined by convention.
  3. Boundedness: if the activation function is bounded, gradient-based training tends to be more stable; if it is unbounded, training is usually more efficient but can easily diverge, in which case the learning rate can be reduced appropriately.
  4. Monotonicity: If the activation function is monotonic, the loss function associated with a single-layer model is convex .
  5. Smoothness: Smooth functions with monotonic derivatives have been shown to generalize better in some cases .

Pooling operator

The pooling operator is the operator of the pooling layer in a neural network. As a downsampling operator, it aims to obtain spatially invariant features by reducing the resolution of the feature map. The pooling operator plays the role of secondary feature extraction in the overall network structure: each of its neurons pools over a local receptive field, thereby reducing the size of the feature maps and the computational cost of the network model. The commonly used pooling operators are

  1. Average pooling operator: takes the average of all values in the local receptive field as the sampled value
  2. Max pooling operator: takes the maximum value in the local receptive field as the sampled value

Assuming the pooling window size is 2, with stride = 2 and pad = 0, the figure below shows the difference between the two.


[Figure: comparison of max pooling and average pooling on the same input]
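A minimal sketch of 2×2 max pooling and average pooling with stride 2 and no padding (assuming the input height and width are divisible by the window size):

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Max or average pooling over non-overlapping windows (pad = 0)."""
    h, w = x.shape
    oh, ow = (h - size) // stride + 1, (w - size) // stride + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            window = x[r * stride:r * stride + size, c * stride:c * stride + size]
            out[r, c] = window.max() if mode == "max" else window.mean()
    return out

x = np.array([[1, 3, 2, 4],
              [5, 7, 6, 8],
              [1, 2, 3, 4],
              [5, 6, 7, 8]], dtype=float)
print(pool2d(x, mode="max"))   # [[7. 8.] [6. 8.]]
print(pool2d(x, mode="avg"))   # [[4. 5.] [3.5 5.5]]
```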


Fully connected operator

Convolutional layers, pooling layers and activation layers (such as ReLU layers) are usually interleaved in a neural network to improve its expressiveness. Activation layers usually follow convolutional layers, in the same way that nonlinear activation functions follow linear dot products in traditional neural networks. As a result, convolutional and activation layers are usually glued together one after the other, and some architectures (such as AlexNet) do not show activation layers explicitly because they are assumed to always follow a linear convolutional layer. After two or three groups of convolution and activation layers there may be a max pooling layer; after repeating this pattern about three times, a fully connected layer is added at the end, giving a network shaped roughly like

CRCRP-CRCRP-CRCRP-F

where C denotes a convolutional layer, R an activation (ReLU) layer, P a pooling layer and F a fully connected layer.


Similar to a multi-layer perceptron, each neuron in a fully connected layer is connected to all neurons in the previous layer, so fully connected operators can integrate class-discriminative local information. In a fully connected network, all the two-dimensional feature maps are flattened into a one-dimensional feature vector that serves as the input of the fully connected layer; the fully connected layer then performs a weighted summation on this input and sends the result to an activation layer.

The following shows the process of the fully connected layer

[Figure: the fully connected layer computing Y = W × X]

X is the input of the fully connected layer, that is, the features; W is the parameter of the fully connected layer, also called the weights. The features X are obtained from the convolutional and pooling layers in front of the fully connected layer. Suppose the convolutional layer in front of the fully connected layer outputs 100 features (that is, the feature map has 100 channels), and each feature has size 4×4. The Flatten layer before the fully connected layer flattens these 100 features into a one-dimensional vector with N rows and 1 column, where $N = 100 \times 4 \times 4 = 1600$, so the feature vector X is a one-dimensional vector with 1600 rows and 1 column.


The parameter W of the fully connected layer holds the optimal weights learned during training of the deep neural network. It can be expressed as a two-dimensional matrix with T rows and N columns, where T is the number of categories. For example, for a 7-class problem T = 7, and likewise for other numbers of categories. Computing $W \times X = Y$ yields a one-dimensional vector with T rows and 1 column, which is the output of the fully connected layer, as sketched below.
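A minimal sketch of the flatten-plus-fully-connected computation described above, using the same shapes (100 feature maps of size 4×4 and a 7-class output); all values are random, and the bias term is an assumed but common addition:

```python
import numpy as np

feature_maps = np.random.rand(100, 4, 4)   # output of the last conv/pool layer
X = feature_maps.reshape(-1, 1)            # Flatten: (1600, 1) column vector
W = np.random.rand(7, 1600)                # fully connected weights: T x N
b = np.zeros((7, 1))                       # bias, usually added before the activation
Y = W @ X + b                              # (7, 1) output, one score per class
print(Y.shape)                             # (7, 1)
```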



[1] "Neural Network and Deep Learning", Chapter 8 Convolutional Neural Network, Machinery Industry Press [
2] "Deep Neural Network FPGA Design and Implementation", Chapter 3 Introduction to the Basic Layer Operator of Deep Neural Network, Xidian University Publishing House
[3] Chapter 7 Convolutional Neural Networks of "The Essence of Deep Learning Cases", People's Posts and Telecommunications Press
