A detailed introduction to pooling layers in convolutional neural networks

Convolutional neural networks have changed dramatically from 2012 to 2023. The earliest convolutional neural networks consisted of convolutional layers, pooling layers, and fully connected layers: the convolutional layer extracts image features, the pooling layer reduces the number of features, and the fully connected layer combines the features nonlinearly and predicts the category. In the transformer era, however, the pooling layer has almost disappeared from neural networks. Since pooling layers can still be seen in a small number of mainstream models, it is worth recording the various pooling layers here. This blog post introduces a total of 7 simple pooling layers and also covers the basic structure of the convolutional neural network.

1. The main structure of convolutional neural network

1.1 Convolution layer

The convolutional layer was first applied to 2D images in RGB format, i.e. data of shape W x H x C_in, producing output of shape W x H x C_out; it was later extended to 3D data of shape W x H x C x D in other fields. Here C_in is the number of input channels of the convolutional layer and C_out is the number of output channels. The operation of a convolution kernel is at first exactly the same as image filtering (which gives the convolution operation the properties of local connection and parameter sharing) and has a kernel_size parameter (usually an odd number). Some other concepts were added later, for example the stride of the sliding kernel (an image filter uses a stride of 1 by default). Algorithm engineers also had to consider that the image becomes smaller after a convolution: a WxH image becomes (W-kernel_size+1)x(H-kernel_size+1) after filtering. Therefore the concept of padding was introduced for the convolution operation: padding=same automatically pads the original image so that the output keeps the same size as the input, while padding=valid does no padding. Another feature of the convolutional layer is its multiple views: it contains C_in * C_out kernels in total, and each group of C_in kernels [one group per output channel in C_out] is one perspective from which the convolutional layer observes the data [extracts features].
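A minimal PyTorch sketch (my own illustration, not from the original post) of how in_channels/out_channels, kernel_size, stride and padding interact; the tensor sizes are assumptions chosen only for demonstration:

import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)  # N, C_in, H, W

# padding=0 ("valid"): each spatial dim shrinks to (W - kernel_size + 1)
conv_valid = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=0)
print(conv_valid(x).shape)       # torch.Size([1, 16, 222, 222])

# padding=1 ("same" for a 3x3 kernel at stride 1): output keeps the input size
conv_same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)
print(conv_same(x).shape)        # torch.Size([1, 16, 224, 224])

# the layer holds C_out groups of C_in kernels, i.e. C_in * C_out kernels in total
print(conv_same.weight.shape)    # torch.Size([16, 3, 3, 3])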

1.2 Pooling layer

The pooling layer was once considered an indispensable part of the convolutional neural network, providing its robustness (translation invariance, rotation invariance, scale invariance | feature dimensionality reduction, preventing overfitting | achieving nonlinearity | enlarging the receptive field of the next convolutional layer). The pooling operation is similar to image downsampling: it reduces the spatial size of the features in a layer, and thereby the number of parameters in the features and in the subsequent fully connected layers. Translation|rotation invariance means that the recognition result remains unchanged after the image is translated|rotated. This requires that the extracted features do not change after translation|rotation, and the pooling layer can guarantee this to a certain extent. For details, refer to https://zhuanlan.zhihu.com/p/382569419, which points out that what image classification needs most is translation invariance, while what object detection needs most is translation equivariance [that is, when the target is translated, the predicted boxes are translated correspondingly]. Its experiments also show that when the input undergoes a slight rotation or translation, the pooling operation can approximately achieve translation invariance, but the effect disappears when the translation or rotation is too large.

Feature dimensionality reduction: after the pooling layer, the feature size output by the network layer is greatly reduced, which has the same effect as feature dimensionality reduction. Reducing the number of features suppresses overfitting to a certain extent.

Achieving nonlinearity: the pooling operation is irreversible, which makes it a nonlinear operation.
Since ResNet, pooling layers have gradually fallen out of use in classification networks; a convolution with stride=2 is usually used instead of a max pooling layer, as in the sketch below.
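A rough sketch (with assumed channel counts) of the two downsampling options mentioned above, a 2x2 max pooling layer versus a stride-2 convolution:

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

pool_down = nn.MaxPool2d(kernel_size=2, stride=2)                  # no learnable parameters
conv_down = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # learnable downsampling

print(pool_down(x).shape)   # torch.Size([1, 64, 28, 28])
print(conv_down(x).shape)   # torch.Size([1, 64, 28, 28])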

1.3 Fully connected layer

The fully connected layer is essentially a multi-layer perceptron | BP neural network. We can divide a convolutional neural network into two parts, the convolutional base (composed of convolutional and pooling layers) and the fully connected layers: the convolutional base is considered to extract image features, and the fully connected layers are considered to classify those features. During modeling, the convolutional base and the fully connected layers are trained together (forward-propagation prediction and backpropagation optimization), so there is a certain adaptive relationship between them, and the convolutional base and fully connected layers of different models cannot be freely mixed and used directly (retraining is required to adapt them to each other, i.e. transfer learning | fine-tuning). In addition, it has also been proposed to use the convolutional base to extract features and then train traditional methods such as decision trees, random forests, SVMs, extreme learning machines, Bayesian methods and KNN for classification.

On the concept of a feature: a feature is essentially an attribute; when it is closely related to the task it is an effective feature, and when it is unrelated to the task it is noise. In deep learning, ReLU or one of its variants is usually used as the activation function, which suppresses negative values and keeps positive values. In other words, the larger the value output by the convolutional base, the stronger the feature response.

2. Detailed explanation of the pooling layer

In deep learning, the pooling layer is just an operation rule for reducing features, and it has no parameters. According to its operation rules, there are average pooling layer, maximum pooling layer, global average|maximum pooling layer, random pooling layer, hybrid pooling, median pooling layer, combined pooling layer, etc. Among them, the first three are common pooling operations, and the latter are relatively rare.

In addition to these 7 pooling methods, https://jishuin.proginn.com/p/763bfbd5c125 from the Jishi platform also introduces power average pooling, a detail-preserving pooling that can amplify spatial changes and retain important image structure details, Local Importance Pooling, and Soft Pooling, among others.

2.1 Average pooling layer

Refer to https://zhuanlan.zhihu.com/p/77040467
In forward propagation, the average value in each image region is taken as the pooled value of that region; in backpropagation, the gradient is distributed evenly to every location of the region. From the effect shown in the figure below, mean pooling is somewhat similar to low-pass filtering: it filters out high-frequency information (blurring image edges, ignoring noise points) while preserving texture information as much as possible, which may be useful for data that is distinguished by texture features.
[Figure: effect of average pooling on an image (original picture unavailable)]
During backpropagation, as shown in the diagram, the same gradient value is assigned to every position of the 2x2 input, so the optimization tends to push the values in the region toward their average.

Average pooling takes the mean value in each rectangular region, so the information of all features in the feature map is passed to the next layer, unlike max pooling, which only keeps the feature with the largest value; therefore average pooling retains more of the image's background information.

In practice, average pooling usually appears in the form of global average pooling. It is common in SE modules and in classification heads, and is rarely used as a downsampling module in classification networks.
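A small sketch (toy 4x4 input of my own choosing) of 2x2 average pooling, showing the smoothing, low-pass-like behavior described above:

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 0., 0., 0.],
                    [0., 0., 8., 8.]]]])   # shape (1, 1, 4, 4)

avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)
print(avg_pool(x))
# tensor([[[[2.5000, 6.5000],
#           [0.0000, 4.0000]]]])  -- every value in a window contributes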

2.2 Maximum pooling layer

In the forward pass, the maximum value in each image region is selected as the pooled value of that region; in the backward pass, the gradient flows back only through the position of the forward-pass maximum, and the gradient at other positions is 0. From the effect shown in the figure below, max pooling is somewhat similar to a high-pass filter: it discards low-frequency information (removing fine textures) and keeps only the maximum of each 2x2 region (similar in spirit to the ReLU operation).

During backpropagation, as shown schematically in the figure above, only the region corresponding to the maximum value receives a gradient and is optimized, while the other regions receive no gradient.

This method discards a large amount of redundant information in the network, making the network easier to optimize. At the same time, it often loses some detail information in the feature map, so what max pooling retains is mainly the edge information of the image.
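The same toy input with 2x2 max pooling (again only an illustrative sketch): only the largest value in each window survives, and during backpropagation only that position receives a gradient:

import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 5., 6.],
                    [3., 4., 7., 8.],
                    [0., 0., 0., 0.],
                    [0., 0., 8., 8.]]]], requires_grad=True)

max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
y = max_pool(x)
print(y)             # tensor([[[[4., 8.], [0., 8.]]]], grad_fn=...)
y.sum().backward()
print(x.grad)        # 1 only at each window's max position, 0 elsewhere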

2.3 Global pooling layer

The global average pooling layer (Global Average Pooling, GAP) is the spatial generalization of the average|max pooling layer. An average|max pooling layer usually requires a pool_size (the size of the pooling region) and slides a window over the input, computing one result at a time, whereas a global pooling layer uses a pool_size of WxH by default and computes one result from all of the data at once. Accordingly, there are a global average pooling layer and a global max pooling layer.

GAP regularizes the structure of the entire network to prevent overfitting, removes the black-box character of the fully connected layer, and directly gives each channel an actual category meaning. In addition, using GAP instead of a fully connected layer allows inputs of arbitrary image size (earlier CNNs used a flatten operation to turn 2D data into a 1D vector; different input sizes produced feature vectors of different dimensions, so the image input size had to be fixed). GAP averages the entire feature map and can also be used to extract global context information, which serves as guidance to further enhance network performance.

The global pooling layer is usually placed between the convolutional base and the fully connected layer, or used inside CBAM (channel attention).
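A sketch of global average pooling replacing flatten plus a fixed-size fully connected input; the layer sizes and names here are assumptions for illustration, not from the original post:

import torch
import torch.nn as nn

gap_head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),   # (N, C, H, W) -> (N, C, 1, 1) for any H, W
    nn.Flatten(),              # (N, C)
    nn.Linear(512, 1000),      # classifier on channel-wise averages
)

for size in (7, 10):                       # e.g. backbone outputs for 224 and 320 inputs
    feat = torch.randn(2, 512, size, size)
    print(gap_head(feat).shape)            # torch.Size([2, 1000]) in both cases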

2.4 Random pooling layer

Reference from: https://zhuanlan.zhihu.com/p/77040467

Random (stochastic) pooling was proposed in Stochastic Pooling, an ICLR 2013 paper, and combines the advantages and disadvantages of the average pooling layer and the max pooling layer. The method is very simple: an element of the pooling region is randomly selected according to its probability value, so elements with larger values have a higher probability of being selected (which avoids always picking the maximum).
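A rough sketch of the idea (my own toy implementation built on F.unfold, not the paper's code): each element in a 2x2 window is sampled with probability proportional to its non-negative activation value:

import torch
import torch.nn.functional as F

def stochastic_pool2d(x, kernel_size=2, stride=2, eps=1e-12):
    n, c, h, w = x.shape
    windows = F.unfold(x, kernel_size, stride=stride)                      # (N, C*k*k, L)
    windows = windows.view(n, c, kernel_size * kernel_size, -1).permute(0, 1, 3, 2)
    probs = windows.clamp(min=0) + eps                                     # larger value -> higher probability
    probs = probs / probs.sum(dim=-1, keepdim=True)
    idx = torch.multinomial(probs.reshape(-1, kernel_size * kernel_size), 1)
    out = windows.reshape(-1, kernel_size * kernel_size).gather(1, idx)
    return out.reshape(n, c, h // stride, w // stride)

x = torch.rand(1, 1, 4, 4)
print(stochastic_pool2d(x).shape)   # torch.Size([1, 1, 2, 2])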

2.5 Hybrid Pooling

Reference from: https://jishuin.proginn.com/p/763bfbd5c125
Inspired by dropout, hybrid pooling replaces the conventional deterministic pooling operation with a random process: during training, max pooling and average pooling are used at random, which helps prevent overfitting to a certain extent. By changing the pooling rule in a random way, hybrid pooling partly alleviates the problems encountered by max pooling and average pooling.

Hybrid pooling outperforms the traditional max pooling and average pooling methods and can mitigate overfitting to improve classification accuracy. In addition, its computational overhead is negligible, it requires no hyperparameter tuning, and it can be widely applied in CNNs.
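A minimal sketch of this idea (illustrative only, not the paper's exact formulation; the inference-time behavior below is my own assumption): during training, max pooling or average pooling is chosen at random for each forward pass:

import random
import torch.nn as nn
import torch.nn.functional as F

class HybridPool2d(nn.Module):
    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.kernel_size, self.stride = kernel_size, stride

    def forward(self, x):
        if self.training:
            # randomly pick one of the two pooling rules per forward pass
            if random.random() < 0.5:
                return F.max_pool2d(x, self.kernel_size, self.stride)
            return F.avg_pool2d(x, self.kernel_size, self.stride)
        # assumed eval behavior: average the two deterministic results
        return 0.5 * (F.max_pool2d(x, self.kernel_size, self.stride)
                      + F.avg_pool2d(x, self.kernel_size, self.stride))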

2.6 Median pooling layer

Median pooling borrows from the median filter in image processing and is basically never seen in CNNs. It takes the median of the sorted window values as the output; its forward and backward passes are similar to those of max pooling.

The blogger's inference is that it might extract image texture features, but it is hard for it to be stable (the median is easily disturbed in the deeper layers of the network), so it can only be used in the shallow layers of a network. In practice it does not tolerate a large learning rate and is extremely hard to optimize.

By contrast, with max pooling, although the maximum value itself is not very stable, that value describes the influence of the feature on the result consistently throughout training, so it remains stable within the system as a whole.
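Since PyTorch has no built-in median pooling, here is a rough sketch (my own illustration) built from F.unfold, taking the median of each 2x2 window as the output:

import torch
import torch.nn.functional as F

def median_pool2d(x, kernel_size=2, stride=2):
    n, c, h, w = x.shape
    windows = F.unfold(x, kernel_size, stride=stride)        # (N, C*k*k, L)
    windows = windows.view(n, c, kernel_size * kernel_size, -1)
    # note: for an even-sized window, torch.median returns the lower of the two middle values
    out = windows.median(dim=2).values                       # (N, C, L)
    return out.reshape(n, c, h // stride, w // stride)

x = torch.rand(1, 1, 4, 4)
print(median_pool2d(x).shape)   # torch.Size([1, 1, 2, 2])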

2.7 Combined Pooling Layers

Combined pooling is a pooling strategy that draws on the advantages of both max pooling and average pooling (unlike hybrid pooling, which randomly chooses one of the two, combined pooling uses both pooling methods together). There are two common combination strategies: Cat and Add. Its code description is as follows:

import torch
import torch.nn.functional as F

def add_avgmax_pool2d(x, output_size=1):
    # "Add": average the adaptive average-pooling and max-pooling results
    x_avg = F.adaptive_avg_pool2d(x, output_size)
    x_max = F.adaptive_max_pool2d(x, output_size)
    return 0.5 * (x_avg + x_max)

def cat_avgmax_pool2d(x, output_size=1):
    # "Cat": concatenate the two results along the channel dimension
    x_avg = F.adaptive_avg_pool2d(x, output_size)
    x_max = F.adaptive_max_pool2d(x, output_size)
    return torch.cat([x_avg, x_max], 1)
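Illustrative usage of the two functions above (the feature-map shape is an assumption): Add keeps the channel count, while Cat doubles it:

x = torch.randn(2, 512, 7, 7)
print(add_avgmax_pool2d(x).shape)   # torch.Size([2, 512, 1, 1])
print(cat_avgmax_pool2d(x).shape)   # torch.Size([2, 1024, 1, 1])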

3. Summary

Since Transformer was introduced into vision models in 2021, the convolutional neural network has basically been in decline. Although ConvNeXt and SegNeXt have since successfully challenged Transformer's position, they cannot reverse the overall trend away from convolutional neural networks.

ConvNeXt uses a series of training techniques (the AdamW optimizer, data augmentation such as Mixup, Cutmix, RandAugment and Random Erasing, and regularization schemes such as stochastic depth and label smoothing), yet it only raises the performance of the ResNet-50 model from 76.1% to 78.8%. A Transformer, by contrast, does not need such a complex augmentation strategy; it only needs an ever larger training set. The whole model contains no pooling layer for feature dimensionality reduction.

SegNeXt claims to surpass Transformer models in semantic segmentation with a convolutional model. In fact it uses the patch embedding of the Transformer model and replaces multi-head self-attention with MLP layers, so it has little to do with the convolutional neural network. The whole model contains no pooling layer.

In these newer model structures, pooling layers are no longer used to reduce the dimensionality of image features, and the general structure of convolutional neural networks may be redefined in the future.
