Convolution operations, pooling operations, and activation functions in convolutional neural networks (CNNs)

Foreword: Convolutional neural networks are an important part of deep learning and play a key role in deep learning applications such as image recognition. Both convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are related to the traditional fully connected neural network (also called a deep neural network, or DNN): a CNN is a DNN that encodes spatial correlation, while an RNN is a DNN that encodes temporal correlation. The layers of a CNN vary slightly with the image task, but convolutional layers, pooling layers, and nonlinear layers are almost always used. To deepen the understanding of the underlying theory, this article explains the convolution operation, the pooling operation, and the activation function in CNNs from several angles.

Table of contents

1. Convolution layer

1.1 Convolution calculation

1.2 Features of the convolutional layer

1.3 Commonly used convolution operations

2. Pooling layer

2.1 The role of pooling

2.2 Commonly used pooling operations

3. Nonlinear layer

3.1 The role of the activation function

3.2 Commonly used activation functions


1. Convolution layer

The function of the convolutional layer is to extract information from the input image. This information is called image features. Features such as the texture and color of the image are expressed by its pixels, either in combination or individually.

1.1 Convolution calculation

Before explaining the specific convolution calculation, let's first get an intuitive feel for convolution operations of different dimensions through a few animations.

The one-dimensional convolution operation is shown in the following figure:

The two-dimensional convolution operation is shown in the following figure:

The three-dimensional convolution operation is shown in the following figure:

The convolution kernel performs a linear operation: each value in the kernel is multiplied by the value at the corresponding position it slides over, and the products are then summed. Taking two-dimensional convolution as an example, let's explain how Conv2d performs the convolution calculation. Before doing so, we need to understand a few important concepts that are very helpful for understanding the convolution operation:

① The "two-dimensional" in two-dimensional convolution does not mean that the convolution kernel is two-dimensional; it has nothing to do with the dimensionality of the kernel. It means that the kernel slides along two dimensions. Similarly, one-dimensional and three-dimensional convolution mean that the kernel slides along one or three dimensions, as shown in the three figures above;

② The number of channels of a convolution kernel (i.e., how many single-channel kernels one kernel group contains) = the number of input channels;

③ The number of output channels (i.e., the number of feature map channels, or the number of feature maps) = the number of kernel groups. In other words, convolving the input with one group of kernels produces exactly one feature map, so a feature map with n channels requires n groups of kernels to be convolved with the input.
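A quick way to check ② and ③ in PyTorch is to inspect the weight shape of a Conv2d layer (a minimal sketch; the sizes here are arbitrary):

import torch
import torch.nn as nn

# 2 kernel groups, each with 3 channels to match a 3-channel (RGB) input
conv = nn.Conv2d(in_channels=3, out_channels=2, kernel_size=3)
print(conv.weight.shape)    # torch.Size([2, 3, 3, 3]): (kernel groups, channels per group, H, W)

x = torch.randn(1, 3, 8, 8) # a dummy 3-channel input
print(conv(x).shape)        # torch.Size([1, 2, 6, 6]): one feature map per kernel group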

After understanding the above concepts, let's work through an example of how two-dimensional convolution is calculated.

As shown in the figure above, suppose the input is a two-dimensional image with 3 RGB channels. A kernel group then contains three two-dimensional kernels, i.e., the kernel group must also have three channels. These 3 kernels slide over the 3 channels of the input image respectively. On the R channel, for example, each slide multiplies corresponding elements and sums them to produce one number; one slide of the 3 kernels thus yields 3 numbers, and adding these 3 numbers plus a bias gives one value on the feature map. Once the kernel group has slid over the entire input image, a complete feature map is obtained. This feature map represents one feature extracted from the input image. Extracting only one feature is generally not enough; we often need more feature information from the input, i.e., multiple feature maps, which requires multiple kernel groups to be convolved with the image. In the figure below, two kernel groups are used to compute two feature maps.
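To make this arithmetic concrete, the following sketch (with made-up random values) computes one feature-map value by hand, exactly as described above, and checks it against PyTorch's functional convolution:

import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 5, 5)   # a batch of one 3-channel "image"
w = torch.randn(1, 3, 3, 3)   # one kernel group: 3 channels, each 3x3
b = torch.randn(1)            # one bias per kernel group

# One output value: multiply-and-add on each channel, sum the 3 per-channel
# results, then add the bias
manual = (x[0, :, 0:3, 0:3] * w[0]).sum() + b[0]

out = F.conv2d(x, w, b)       # shape (1, 1, 3, 3): a single feature map
print(torch.allclose(manual, out[0, 0, 0, 0]))   # True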

The two-dimensional convolution operation is implemented in Pytorch by the following function:

# 2D convolution
torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True, padding_mode='zeros', device=None, dtype=None)
# Parameters:
# in_channels: number of input channels
# out_channels: number of output channels
# kernel_size: size of the convolution kernel
# stride: stride of the sliding kernel, default 1
# padding: how to pad the input image; the type can be int, tuple, or str, optional. Default padding=0, i.e., no padding.
# dilation: dilation rate, i.e., the spacing between kernel elements, default 1. With kernel_size=3 and dilation=1 the effective kernel size is 3x3; with kernel_size=3 and dilation=2 it is 5x5
# groups: how many groups the convolution is split into, default 1, i.e., an ordinary convolution, in which case kernel channels = input channels
# bias: whether to add a bias, default True
# padding_mode: the value used for padding; default 'zeros', i.e., pad with 0. Options: 'zeros', 'reflect', 'replicate', or 'circular'

Assuming that the input size is (N, C_{in}, H_{in}, W_{in}) and the output size after convolution is (N, C_{out}, H_{out}, W_{out}), then:

H_{out}=\left\lfloor\frac{H_{in}+2\times padding[0]-dilation[0]\times(kernel\_size[0]-1)-1}{stride[0]}+1\right\rfloor

W_{out}=\left\lfloor\frac{W_{in}+2\times padding[1]-dilation[1]\times(kernel\_size[1]-1)-1}{stride[1]}+1\right\rfloor
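These formulas are easy to verify in code (the layer configuration below is arbitrary):

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1, dilation=1)
x = torch.randn(1, 3, 32, 32)
# H_out = floor((32 + 2*1 - 1*(3-1) - 1)/2 + 1) = floor(16.5) = 16
print(conv(x).shape)   # torch.Size([1, 8, 16, 16])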

1.2 Features of the convolutional layer

① Weight sharing

Weight sharing means using the same set of parameters to traverse the entire image, extracting feature information that is common across the image, such as texture. Different convolution kernels extract different kinds of shared features, i.e., each feature map obtained by the convolution operation represents one extracted image feature. Weight sharing is an important idea in deep learning: it reduces network parameters while maintaining good network capacity. Convolutional neural networks share weights in space, while recurrent neural networks share weights in time.

② Local connection

Convolutional layers evolved from fully connected layers, in which each output is connected to all inputs through weights. In visual recognition tasks, key image features such as edges and corners occupy only a small part of the image, and two pixels far apart are unlikely to affect each other. Therefore, in a convolutional layer, each output neuron remains fully connected channel-wise but is spatially connected only to a small neighborhood of input neurons.
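The savings from local connectivity are easy to quantify. A back-of-the-envelope comparison, with sizes chosen only for illustration:

# Mapping a 3x224x224 input to 64 maps of size 224x224:
# fully connected: every output neuron connects to every input neuron
fc_params = (3 * 224 * 224) * (64 * 224 * 224)   # ~4.8e11 weights
# 3x3 convolution: each output connects to a 3x3 neighborhood, weights shared over positions
conv_params = 3 * 3 * 3 * 64                     # 1728 weights
print(fc_params, conv_params)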

1.3 Commonly used convolution operations

1) Group convolution

The difference between group convolution and ordinary convolution is that ordinary convolution keeps a full connection across channels and is only locally connected to a small spatial neighborhood of input neurons, whereas group convolution is locally connected in both channel and space. It is therefore easy to see that group convolution further reduces parameters relative to ordinary convolution, though its accuracy may be somewhat worse. The figure below shows the difference between group convolution and ordinary convolution.

For example, suppose the input shape is (1, 12, 24, 24), the kernel_size is 3, and 64 output channels are required. With ordinary convolution, the number of weight parameters is 3×3×12×64 = 6912. With group convolution of, say, 4 groups, the number of weight parameters is 3×3×3×16×4 = 1728, one quarter of the ordinary convolution. The code for group convolution is also very simple: just set the groups parameter of torch.nn.Conv2d() to the desired number of groups.
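A minimal sketch verifying these two parameter counts (bias disabled so only weights are counted):

import torch.nn as nn

ordinary = nn.Conv2d(12, 64, kernel_size=3, bias=False)
grouped = nn.Conv2d(12, 64, kernel_size=3, groups=4, bias=False)

print(ordinary.weight.numel())   # 3*3*12*64 = 6912
print(grouped.weight.numel())    # 3*3*3*16*4 = 1728 (each group: 3 input channels -> 16 output channels)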

2) Depth separable convolution

Depthwise separable convolution splits the convolution into a Depthwise Convolution followed by a Pointwise Convolution. Depthwise Convolution is in fact a group convolution with groups = the number of input channels. This completely isolates the channels from one another and does not exploit feature information from different channels at the same spatial position, which is why Pointwise Convolution is needed: it linearly combines the depthwise feature maps across channels to generate new feature maps. The operation of Depthwise Convolution and Pointwise Convolution is shown in the figures below.

 Depthwise Convolution

 Pointwise Convolution

One thing to note is that the number of feature map channels after Depthwise Convolution equals the number of input channels, i.e., Depthwise Convolution does not change the channel count, whereas Pointwise Convolution not only combines each pixel linearly across channels but can also change the number of channels.

Simply put, in code a depthwise separable convolution can be regarded as a group convolution with the number of groups equal to the number of input channels, followed by a 1×1 convolution. Generally, depthwise separable convolution performs better than group convolution and similarly to ordinary convolution, because like ordinary convolution it is fully connected across channels and locally connected in space, which matches how image pixels interact; yet it uses far fewer parameters than ordinary convolution. For example, suppose the input shape is (1, 12, 24, 24), the kernel_size is 3, and 64 output channels are required. Ordinary convolution needs 3×3×12×64 = 6912 weight parameters; depthwise separable convolution needs only 3×3×1×12 + 1×1×12×64 = 876.
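In code, the two-step decomposition described above might look like the following sketch (the module name and sizes are just for illustration):

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # Depthwise: groups = in_channels, one kernel per channel, channel count unchanged
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution combines channels linearly and can change their number
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

m = DepthwiseSeparableConv(12, 64)
print(sum(p.numel() for p in m.parameters()))   # 3*3*12 + 12*64 = 876
print(m(torch.randn(1, 12, 24, 24)).shape)      # torch.Size([1, 64, 22, 22])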

2. Pooling layer

2.1 The role of pooling

The pooling layer was introduced in imitation of the human visual system, which reduces the dimensionality of visual input and abstracts it. Pooling mainly has the following functions:

① Feature invariance: the pooling operation makes the model focus on whether certain features are present in the image rather than on the exact form in which they appear, such as their position or size. Feature invariance mainly includes translation invariance and scale invariance. Translation invariance means that the output stays essentially unchanged under small translations of the input: if the input is (4, 1, 3, 7, 2), max pooling yields 7; if the input is shifted left by one to (1, 3, 7, 2, 0), the output is still 7 (see the sketch after this list). As for scale invariance, the pooling operation is akin to resizing an image: a dog photo scaled down by half is still recognizably a dog, meaning the most important features of the dog are preserved. The information removed during compression is irrelevant; what remains is the scale-invariant feature that best expresses the image.

② Feature dimensionality reduction (downsampling): an image contains a great deal of information and many features, but some of it is useless or redundant for the task at hand. Removing such redundant information and extracting the most important features is another major role of pooling.

③ The pooling layer continuously reduces the spatial size of the data, so the number of parameters and the amount of computation also decrease, which also controls overfitting to a certain extent.

④ It introduces nonlinearity (similar to ReLU).

⑤ It enlarges the receptive field.
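To make the shift example in ① concrete, here is a toy sketch of it in code:

import torch
import torch.nn.functional as F

x = torch.tensor([[[4., 1., 3., 7., 2.]]])           # shape (1, 1, 5)
x_shifted = torch.tensor([[[1., 3., 7., 2., 0.]]])   # the same input shifted left by one

# Max pooling over the whole sequence: the output is 7 in both cases
print(F.max_pool1d(x, kernel_size=5))           # tensor([[[7.]]])
print(F.max_pool1d(x_shifted, kernel_size=5))   # tensor([[[7.]]])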

2.2 Commonly used pooling operations

1) Maximum pooling, average pooling, global maximum pooling (GMP), global average pooling (GAP)

Maximum pooling: select the maximum value of the pooling-kernel region of the image as the pooled value of that region.

Average pooling: compute the average value of the pooling-kernel region of the image as the pooled value of that region.

So on what basis do we choose max pooling or average pooling? Before answering, note that the error a network makes when extracting features from an image comes mainly from two sources: ① the limited size of the neighborhood means the data in that region is not comprehensive enough, increasing the variance of the estimate; ② errors in the convolutional layer parameters increase the bias of the estimate (the article on regularization explained in detail that error can be decomposed as error = bias + variance + noise). Broadly speaking, average pooling reduces the first kind of error and retains more of the image's background information, while max pooling reduces the second kind and retains more of the image's texture information. If this still does not settle the choice, here is another way to put it: use average pooling when all the information in the feature map should contribute to the prediction, e.g., the global average pooling commonly used in image segmentation to obtain global context, or in image classification, where feature maps are usually average-pooled rather than max-pooled because the high-level semantic information in deep layers generally helps the classifier; use max pooling to reduce the influence of useless information, e.g., max pooling is often seen in the shallow layers of a network because shallow layers contain more information that is useless for the task. In summary, shallow layers generally use max pooling, and deep layers mostly use average pooling.

Code:

# 1. Max pooling
torch.nn.MaxPool2d(kernel_size, stride=None, padding=0, dilation=1, return_indices=False, ceil_mode=False)
# Parameters:
# kernel_size: size of the pooling kernel
# stride: stride of the sliding pooling kernel; defaults to kernel_size
# padding: padding added to both sides of the input image, default 0 (no padding); the padding value itself defaults to 0
# dilation: dilation rate of the kernel, default dilation=1; with kernel_size=3 the kernel is 3x3. With dilation=2 and kernel_size=3, an extra row/column of zeros is inserted between adjacent rows and columns, so the effective kernel size becomes 5x5.
# return_indices: whether to return the indices of the maximum values. If True, the positions of the maxima after pooling are recorded, which may be needed later for upsampling; if False, the positions are not kept.
# ceil_mode: whether to use ceiling or floor when computing the output shape.
# 2. Average pooling
torch.nn.AvgPool2d(kernel_size, stride=None, padding=0, ceil_mode=False, count_include_pad=True, divisor_override=None)
# Parameters shared with max pooling have the same meaning
# count_include_pad: if True, zero padding is included when computing the average
# divisor_override: if specified, it is used as the divisor; otherwise the size of the pooling region is used
# 3. Global max pooling
torch.nn.AdaptiveMaxPool2d(output_size, return_indices=False)
# This function performs adaptive max pooling
# output_size: how many regions the input image is pooled into; output_size=1 means global max pooling
# Global average pooling
torch.nn.AdaptiveAvgPool2d(output_size)
# This function performs adaptive average pooling
# output_size: how many regions the input image is average-pooled into; output_size=1 means global average pooling
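As a usage example, global average pooling before a classifier head can be written like this (a typical pattern; the feature-map size is arbitrary):

import torch
import torch.nn as nn

gap = nn.AdaptiveAvgPool2d(1)        # output_size=1 means global average pooling
feat = torch.randn(1, 512, 7, 7)     # e.g. the last feature map of a backbone
print(gap(feat).shape)               # torch.Size([1, 512, 1, 1])
print(gap(feat).flatten(1).shape)    # torch.Size([1, 512]), ready for a linear classifier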

2) Overlap pooling

That is, adjacent pooling windows overlap. In this case, generally kernel_size > stride.
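For example, 3×3 windows sliding with stride 2 overlap by one row or column (as used, e.g., in AlexNet's overlapping pooling):

import torch.nn as nn

overlap_pool = nn.MaxPool2d(kernel_size=3, stride=2)   # kernel_size > stride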

3) Spatial Pyramid Pooling (SPP)

SPP turns a single pooling into pooling at multiple scales: pooling windows of different sizes are applied to the feature map of the previous layer. As shown in the figure below, the input feature map is pooled three times with different window sizes, and the results are then sent to the fully connected layer.

This structural design ensures that even a convolutional network with a fully connected layer can handle images of different sizes. Why can't fully connected layers themselves handle images of different sizes? Because the number of neurons on both sides of a fully connected layer must be fixed, while inputs of different sizes would produce different neuron counts. SPP can be implemented with the adaptive max pooling or adaptive average pooling mentioned above. Many similar structures have been developed from SPP, such as ASPP and ROI Pooling, which are not listed here.
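A minimal SPP sketch along these lines, built from the adaptive pooling above (the pyramid levels 1/2/4 are just one possible choice):

import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, levels=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveMaxPool2d(s) for s in levels)

    def forward(self, x):
        n = x.shape[0]
        # Pool at each scale and flatten: the output length, c*(1+4+16),
        # is fixed regardless of the input H and W
        return torch.cat([p(x).view(n, -1) for p in self.pools], dim=1)

spp = SPP()
print(spp(torch.randn(1, 64, 13, 13)).shape)   # torch.Size([1, 1344])
print(spp(torch.randn(1, 64, 24, 24)).shape)   # torch.Size([1, 1344]): same length for a different input size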

3. Nonlinear layer

3.1 The role of the activation function

The introduction of activation functions gives neural networks nonlinear capability, which is very important because most real-world data is nonlinear, and a linear network cannot learn or model nonlinear data such as images and audio. Moreover, without nonlinear layers a neural network becomes a simple stack of linear layers, and a stack of linear layers is itself expressible as a single linear function, which makes the depth of the network lose its meaning. Finally, the activation function can also map data from a nonlinear space into a space where it can be better classified.
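The collapse of stacked linear layers is easy to verify numerically (a toy demonstration):

import torch
import torch.nn as nn

x = torch.randn(4, 10)
l1 = nn.Linear(10, 20, bias=False)
l2 = nn.Linear(20, 5, bias=False)

# Two linear layers without an activation in between equal a single linear
# layer whose weight matrix is the product of the two weight matrices
stacked = l2(l1(x))
combined = x @ (l2.weight @ l1.weight).T
print(torch.allclose(stacked, combined, atol=1e-6))   # True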

3.2 Commonly used activation functions

1) sigmoid activation function

sigmoid(x)=\frac{1}{1+e^{-x}}

The sigmoid function compresses values into the range (0, 1), which matches the form of a probability and makes it usable in the output layer for probability prediction. Its advantage is that it is continuous and differentiable everywhere. Its disadvantages are that when the function value approaches 0 or 1 the gradient becomes small, easily causing vanishing gradients, and that its output is always positive rather than zero-centered, which causes the weights to be updated in only one direction and thus slows convergence.

2) Tanh activation function

tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}

The tanh activation function, also called the hyperbolic tangent activation function, compresses values into the range (-1, 1). Its advantage is that its output is zero-centered, which solves the sigmoid problem of weights updating in only one direction. Its disadvantages are that it can also suffer from vanishing gradients and that it is computationally expensive.

3) ReLU activation function

f(x)=\max(0,x)

The ReLU activation function, also known as the rectified linear unit, is a very commonly used activation function in neural networks. Its advantages are that ReLU is sparse, cheap to compute, fast to converge, and free of vanishing gradients in the region x>0. Its disadvantages are that its output is not zero-centered, and that the weights of neurons in the region x<0 are never updated.

4) Leaky ReLU activation function

y=\max(0,x)+\alpha\cdot\min(0,x)

This activation function solves the ReLU problem that the weights of neurons in the region x<0 are never updated.
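A short sketch evaluating the four activations side by side (functional leaky_relu's negative_slope plays the role of α; 0.01 is PyTorch's default):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.sigmoid(x))         # values in (0, 1), always positive
print(torch.tanh(x))            # values in (-1, 1), centered on 0
print(torch.relu(x))            # zero for x < 0
print(F.leaky_relu(x, 0.01))    # a small negative slope keeps x < 0 neurons trainable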


Origin: blog.csdn.net/Mike_honor/article/details/125999256