SCNet for 2D Keypoint Detection: Improving Convolutional Networks with Self-Calibrated Convolutions


Paper link: Improving Convolutional Networks With Self-Calibrated Convolutions
Venue: CVPR 2020
Authors: Jiang-Jiang Liu, Qibin Hou, Ming-Ming Cheng, Changhu Wang, Jiashi Feng
Category: Computer Vision – Human Keypoint Detection – 2D topdown_heatmap

Table of contents:

1. SCNet background
2. SCNet self-calibrated convolution
3. SCNet network architecture diagram
4. References

1. This post is mainly a learning record. If anything infringes, please message me privately and I will correct it.
2. My level is limited; thank you for pointing out any deficiencies.


1. SCNet background

  Most improvements to convolutional neural networks have focused on adjusting the architecture of the network model so that it produces richer feature representations. In other words, advances in CNNs have mainly come from designing more complex structures to strengthen their representation-learning ability.
  This paper does not touch the overall architecture; it improves the performance of the whole network purely by improving the basic convolution module. The proposed self-calibrated convolution (SC) can locate target objects more completely and accurately.
  The figure below shows feature activation maps obtained by visualizing ResNet with different convolution types. ResNet with self-calibrated convolutions localizes the target object more accurately.
[Figure: feature activation maps of ResNet with standard vs. self-calibrated convolutions]


2. SCNet self-calibrated convolution

  This paper proposes a self-calibration module, built from several convolutions plus an attention-like gating, to replace the basic convolution structure. The module can generate a global receptive field without adding extra parameters or computation, and compared with standard convolution, the feature maps it produces are more discriminative.
 The advantages of this module are:
  1. Traditional convolution only operates on a small local region, while the self-calibrated convolution module enables each spatial position to adaptively encode relevant information from long-range regions.
  2. Self-calibrated convolution is generic: it can be plugged into standard convolutional layers without introducing extra parameters or complexity and without changing the model hyperparameters.

  1. Network structure:
      Traditional convolution:
      Given an input X, convolution kernels K, and output Y, the i-th output channel of a traditional convolution is computed as (a small verification snippet follows below):
     $\mathbf{y}_i=\mathbf{k}_i*\mathbf{X}=\sum\limits_{j=1}^{C}\mathbf{k}_i^{j}*\mathbf{x}_j$
      where C is the number of input channels.
      Problems: the feature maps extracted this way are not very discriminative.
      1. Each output feature map is calculated by summing over all input channels, and all feature maps are obtained by repeating the same formula.
      2. The receptive field of each spatial location is mainly controlled by the predefined convolution kernel size.
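
      As a quick sanity check of the formula above, the snippet below (plain PyTorch, written for this post, not from the paper; all shapes are hypothetical) rebuilds one output channel by explicitly summing per-channel convolutions and compares it with `F.conv2d`:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: X has C_in = 4 channels; the kernel bank K holds
# C_out = 6 kernels of size 3x3 (one per output channel).
torch.manual_seed(0)
X = torch.randn(1, 4, 8, 8)   # (batch, C_in, H, W)
K = torch.randn(6, 4, 3, 3)   # (C_out, C_in, kH, kW)

# Reference: standard convolution computes all output channels at once.
Y = F.conv2d(X, K, padding=1)

# y_i = sum_j k_i^j * x_j: rebuild output channel i by convolving each
# input channel x_j with the matching kernel slice k_i^j and summing.
i = 0
y_i = sum(
    F.conv2d(X[:, j : j + 1], K[i : i + 1, j : j + 1], padding=1)
    for j in range(X.shape[1])
)
print(torch.allclose(Y[:, i : i + 1], y_i, atol=1e-5))  # True
```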
      SC self-calibrated convolution:
     [Figure: architecture of the SC module]
      The figure above is the architecture diagram of the SC module: X is the input feature map, Y is the output feature map, the $\mathcal{F}_i$ are convolutional layers, and $\mathbf{K}_i$ is the kernel of the corresponding layer.
      To efficiently gather useful global information for each spatial position, the paper performs convolutional feature transformations in two different spaces: the original scale space, where the feature maps keep the same resolution as the input, and a downsampled small latent space. Because features transformed in the small latent space have a larger receptive field, they can serve as reference information to guide the feature transformation in the original scale space.
      1. Split the input feature map X into two parts, X1 and X2.
      2. X1 produces Y1 through the self-calibration operation, X2 produces Y2 through a plain convolution, and the final output Y is obtained by concatenating Y1 and Y2.
     The specific computation of Y1 is as follows (see the code sketch after the formulas). Given the input X1, first apply average pooling with filter size r × r and stride r:
     $\mathbf{T}_1=\mathrm{AvgPool}_r(\mathbf{X}_1)$
     Then transform T1 with the kernel $\mathbf{K}_2$ and upsample the result back to the original resolution:
     $\mathbf{X}_1'=\mathrm{Up}(\mathcal{F}_2(\mathbf{T}_1))=\mathrm{Up}(\mathbf{T}_1*\mathbf{K}_2)$
     where $\mathrm{Up}(\cdot)$ is a bilinear interpolation operator that maps the intermediate reference from the small-scale space back to the original feature space. The self-calibration operation can then be expressed as:
     $\mathbf{Y}_1'=\mathcal{F}_3(\mathbf{X}_1)\cdot\sigma(\mathbf{X}_1+\mathbf{X}_1')$
     where $\mathcal{F}_3(\mathbf{X}_1)=\mathbf{X}_1*\mathbf{K}_3$, $\sigma$ is the sigmoid function, "·" denotes element-wise multiplication, and $\mathbf{X}_1'$ serves as a residual term for establishing the calibration weights. The final output of the self-calibration branch is:
     $\mathbf{Y}_1=\mathcal{F}_4(\mathbf{Y}_1')=\mathbf{Y}_1'*\mathbf{K}_4$
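
     Putting both branches together, below is a minimal PyTorch sketch of the SC convolution as described above. It follows the paper's notation (F1–F4, pooling rate r), but it is an illustrative reimplementation, not the authors' official code: batch normalization, grouped convolutions, and initialization details are omitted, and the input is assumed to split evenly into X1 and X2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCConv(nn.Module):
    """Sketch of a self-calibrated convolution (SCNet, CVPR 2020).

    The input is split into X1/X2; X1 goes through the self-calibration
    branch (F2, F3, F4), X2 through a plain convolution (F1); the two
    outputs are concatenated. Normalization layers are omitted.
    """

    def __init__(self, channels: int, pooling_r: int = 4):
        super().__init__()
        c = channels // 2  # assume an even split of the input channels
        self.f1 = nn.Conv2d(c, c, 3, padding=1)  # plain conv for X2
        self.f2 = nn.Conv2d(c, c, 3, padding=1)  # transform in latent space (K2)
        self.f3 = nn.Conv2d(c, c, 3, padding=1)  # transform in original space (K3)
        self.f4 = nn.Conv2d(c, c, 3, padding=1)  # final transform (K4)
        self.pooling_r = pooling_r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.chunk(x, 2, dim=1)

        # T1 = AvgPool_r(X1): r x r average pooling with stride r
        t1 = F.avg_pool2d(x1, self.pooling_r, stride=self.pooling_r)
        # X1' = Up(F2(T1)): transform in the latent space, upsample back
        x1p = F.interpolate(self.f2(t1), size=x1.shape[2:],
                            mode="bilinear", align_corners=False)
        # Y1' = F3(X1) * sigmoid(X1 + X1'): element-wise calibration gate
        y1p = self.f3(x1) * torch.sigmoid(x1 + x1p)
        # Y1 = F4(Y1')
        y1 = self.f4(y1p)

        # X2 branch: plain convolution, then concatenate the two branches
        y2 = self.f1(x2)
        return torch.cat([y1, y2], dim=1)

# Quick shape check on a random feature map.
sc = SCConv(64)
out = sc(torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

     The paper reports a pooling rate of r = 4 as its default; varying r trades off how much long-range context the latent-space reference aggregates.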

  2. Summary
      Compared with traditional convolution, the calibration operation allows each spatial location to adaptively treat its surrounding context both as an embedding from the latent space and as a response in the original scale space, and also to model inter-channel dependencies. The field of view of convolutional layers with self-calibration is thereby effectively enlarged: as shown below, such layers encode larger but more accurate discriminative regions.
     [Figure: discriminative regions encoded by convolution with vs. without self-calibration]
      The self-calibration operation does not collect full global context; it only considers the context around each spatial location, which to some extent avoids contamination from irrelevant regions. As the visualization of the final score layer shows, convolution with self-calibration locates the target object accurately.
     [Figure: visualization of the final score layer]

  3. Evaluation results
      Comparison between SC and ordinary convolution: after adding the SC module, the training loss drops faster and the error rate is lower.
     [Figure: training curves with and without the SC module]
     [Figure: evaluation results on the ImageNet-1K dataset]
     [Figure: SCNet ablation experiments]
     [Figure: comparison with attention-mechanism methods on ImageNet]
     [Figure: object detection with state-of-the-art methods on the COCO minival set]
     [Figure: keypoint detection on the COCO val2017 set]


3. SCNet network architecture diagram

[Figure: SCNet network architecture diagram]


4. References

Reference 1
Reference 2
Reference 3


Origin blog.csdn.net/qq_54793880/article/details/131489127