[Paper reading notes] Numerical Coordinate Regression with Convolutional Neural Networks

Paper address: Numerical Coordinate Regression with Convolutional Neural Networks
Code address: https://github.com/anibali/dsntnn

Paper summary

  This paper presents a way to learn numerical coordinates directly from images. Mainstream methods use a heatmap produced by a Gaussian kernel as supervision, but converting the learned heatmap back into coordinates during post-processing introduces quantization error (for example, with a heatmap downsampled 4×, the expected quantization error is 2 pixels).

  This paper proposes a new processing module, called DSNT, which supervises the coordinates directly without adding any extra parameters. DSNT operates on the heatmap, and the idea is shown in the figure below: the heatmap is passed through a softmax to obtain a probability distribution, and this distribution is then multiplied with preset X and Y coordinate-axis matrices and summed, yielding the expected value of the coordinates. The supervision loss is computed on this expectation.
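The expectation step described above can be sketched in a few lines of numpy. This is a minimal illustration of the idea, not the paper's implementation (the official dsntnn code is in PyTorch); the axis convention assumed here is the paper's, with each axis scaled to the range (-1, 1) so the output is resolution-independent.

```python
import numpy as np

def softmax2d(h):
    """Normalize a raw heatmap into a 2-D probability distribution."""
    e = np.exp(h - h.max())          # subtract the max for numerical stability
    return e / e.sum()

def dsnt(heatmap):
    """Return the expected (x, y) coordinate of a heatmap,
    with each axis scaled to the range (-1, 1)."""
    m, n = heatmap.shape
    p = softmax2d(heatmap)
    # X varies along columns, Y along rows; with 0-indexed j,
    # xs[j] = (2j - (n - 1)) / n, matching the paper's convention
    xs = (2 * np.arange(n) - (n - 1)) / n
    ys = (2 * np.arange(m) - (m - 1)) / m
    # marginalize the probability map over each axis, then take the expectation
    x = (p.sum(axis=0) * xs).sum()
    y = (p.sum(axis=1) * ys).sum()
    return x, y

# a heatmap with all its mass at the centre should map to (0, 0)
h = np.full((5, 5), -1e9)
h[2, 2] = 0.0
print(dsnt(h))
```

Because the output is an expectation over a smooth probability map, it can fall between heatmap pixels, which is exactly what removes the quantization error of argmax decoding.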

  Although the article's headline idea is to regress coordinates directly, in practice the heatmap is still constrained, and with a weight that is not small. From another perspective, the method can also be viewed as supervising the heatmap and then adding a coordinate-based regularization term. This supervision effectively reduces the quantization loss of converting the heatmap into coordinates, as well as the cases where a direct heatmap-regression loss disagrees with the evaluation metric. The heatmap loss term itself is also carefully chosen; even without adding it, the results are better than many heatmap loss formulations.

  However, DSNT cannot directly handle keypoints that are absent from the image (such as half-body crops) or multi-person scenes. For certain scenarios this is an unavoidable problem.

Introduction

  Before this work, there were two main ways to obtain coordinates from images: (1) generating coordinates from a heatmap; (2) regressing coordinates with fully connected layers (YOLOv1 and the like). The first method is imperfect in two respects: (1) the post-processing used to extract coordinates, such as argmax, is not differentiable and cannot be learned directly; (2) converting a heatmap into coordinates introduces quantization error, and the larger the downsampling factor between the heatmap and the input resolution, the larger that error. More importantly, supervision is placed on the heatmap, which decouples the loss function from the metric we actually care about (coordinate accuracy). At inference, only one (or a few) of the pixels is used to compute the numerical coordinates, yet during training every pixel contributes to the loss.
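The quantization error mentioned above is easy to see numerically. This is a toy illustration (not from the paper's code): argmax decoding from a 4×-downsampled heatmap can only produce coordinates that are multiples of the stride, so the decoded coordinate is off by up to stride/2 = 2 pixels.

```python
# decoding argmax from a downsampled heatmap snaps to a coarse grid
stride = 4                        # heatmap is 1/4 of the input resolution
true_x = 13.7                     # ground-truth x in input-pixel coordinates
bin_idx = round(true_x / stride)  # the heatmap bin that argmax would select
decoded_x = bin_idx * stride      # mapped back to input resolution
error = abs(decoded_x - true_x)
print(decoded_x, error)           # decodes to 12, an error of 1.7 px
```

No matter how well the heatmap is learned, this decoding scheme cannot express a sub-stride offset, which is the error DSNT's expectation-based decoding avoids.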
  The second method cannot achieve spatial generalization. Spatial generalization refers to recognition being independent of image location. Generally speaking, the convolution operation is spatially invariant because its weights are shared across locations. Adding a fully connected layer means some positions no longer share weights, so a model with fully connected layers places stricter requirements on the distribution of the dataset.
  The following table shows the advantages and disadvantages of the heatmap, fully connected, and DSNT approaches to obtaining coordinates. As the table shows, the heatmap approach is not fully differentiable and performs poorly at low resolution; the fully connected approach lacks spatial generalization and overfits easily; DSNT has all of the advantages.

  Personal opinion: DSNT can regress coordinates directly while retaining spatial generalization for two reasons: (1) the heatmap is supervised toward a Gaussian distribution, which is symmetric; (2) the coordinate-axis objects X and Y are carefully designed as 1×n and n×1 vectors, and this unidirectionality makes the operation symmetric along the two coordinate axes.

  The figure below contrasts the heatmap learned with direct heatmap supervision against the heatmap learned when DSNT supervises the coordinates (nominally there is no supervision on the heatmap, but in fact the regularization term provides it); the DSNT result is more concentrated.

  Before the heatmap is input to DSNT, it is normalized into a probability distribution map: all values non-negative and summing to 1. Four normalization candidates were tried, as listed in the following table, and softmax was ultimately selected as the rectification/normalization function.
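All of these candidates follow the same "rectify to non-negative, then divide by the sum" pattern. The sketch below shows that generic scheme; the specific set of rectifiers here (ReLU, abs, sigmoid, exp/softmax) is an assumption for illustration and may not exactly match the four tried in the paper's table.

```python
import numpy as np

def normalize(h, rectify):
    """Rectify a heatmap to be non-negative, then divide by its sum."""
    z = rectify(h)
    return z / z.sum()

# candidate rectifiers (assumed for illustration; the paper's exact set may differ)
rectifiers = {
    "relu":    lambda h: np.maximum(h, 0) + 1e-12,  # epsilon avoids divide-by-zero
    "abs":     lambda h: np.abs(h) + 1e-12,
    "sigmoid": lambda h: 1.0 / (1.0 + np.exp(-h)),
    "softmax": lambda h: np.exp(h - h.max()),       # exp then normalize = softmax
}

h = np.random.randn(8, 8)
for name, f in rectifiers.items():
    p = normalize(h, f)
    # every candidate yields a valid probability map
    assert p.min() >= 0 and abs(p.sum() - 1) < 1e-6
```

Softmax has the convenient property that sharpening the raw heatmap concentrates the distribution smoothly, which suits the expectation-based decoding that follows.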

  The figure below shows that supervising only the heatmap may not yield the point you want: a smaller MSE loss is not necessarily more accurate. Supervising the loss only on the heatmap is therefore imprecise.

  The supervision loss in the paper is applied directly to the coordinates, as given by the following formula.

  For the DSNT layer, since the final coordinates are obtained by the dot product of the normalized heatmap with X and Y as above, many different heatmaps can produce the same coordinates under DSNT (for example, spreading or shrinking the blob, as the convolution kernel size allows, has no effect on the coordinates). Although this freedom is arguably beneficial, the potential drawback is that during training the model receives no strictly supervised per-pixel gradients through the heatmap. Experiments found that providing such supervision through regularization yields significant performance improvements over plain DSNT. A heatmap regularization term is added, and the loss function becomes the following formula:
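The combined objective can be sketched as follows. The Euclidean coordinate loss matches the paper's formulation; `reg_value` stands in for whichever heatmap regularizer is chosen (variance or distribution regularization, discussed below), and the weighting λ is the paper's hyperparameter.

```python
import numpy as np

def euclidean_loss(pred_xy, target_xy):
    """Coordinate loss: plain Euclidean distance between the DSNT
    expectation and the ground-truth coordinate."""
    return float(np.linalg.norm(np.asarray(pred_xy) - np.asarray(target_xy)))

def total_loss(pred_xy, target_xy, reg_value, lam=1.0):
    """Combined objective: coordinate loss plus a weighted heatmap
    regularizer, L = L_euc + lambda * L_reg (lambda = 1 in the paper)."""
    return euclidean_loss(pred_xy, target_xy) + lam * reg_value

print(total_loss((0.0, 0.0), (3.0, 4.0), reg_value=0.5))  # 5.0 + 0.5 = 5.5
```

Note how the coordinate term directly optimizes the evaluation metric, while the regularizer restores per-pixel gradients through the heatmap.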

  Variance regularization, represented by the following formula, controls the spatial variance of the blob so that it matches a target value.
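A sketch of that idea: compute the heatmap's spatial variance along each axis under the normalized probability map and penalize its squared deviation from a target variance. This assumes the same (-1, 1) normalized axis convention as the DSNT expectation, so `sigma_t2` must be expressed in those units; the paper's exact parameterization may differ.

```python
import numpy as np

def variance_reg(p, sigma_t2):
    """Variance regularizer: penalize deviation of the heatmap's spatial
    variance from a target sigma_t^2 on each axis,
    L_var = (Var[x] - sigma_t^2)^2 + (Var[y] - sigma_t^2)^2."""
    m, n = p.shape
    xs = (2 * np.arange(n) - (n - 1)) / n   # axis values in (-1, 1)
    ys = (2 * np.arange(m) - (m - 1)) / m
    px, py = p.sum(axis=0), p.sum(axis=1)   # marginal distributions
    # Var[x] = E[x^2] - E[x]^2 under the marginal, likewise for y
    var_x = (px * xs**2).sum() - (px * xs).sum()**2
    var_y = (py * ys**2).sum() - (py * ys).sum()**2
    return (var_x - sigma_t2)**2 + (var_y - sigma_t2)**2
```

Unlike distribution regularization, this constrains only the blob's spread, leaving its shape otherwise free.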


  Distribution regularization, shown in the following formula, imposes strict regularization on the shape of the heatmap to directly encourage a particular shape, where $D(\cdot \,\|\, \cdot)$ is a divergence measure.

  With a ResNet-34 model and 28×28-pixel heatmaps on the MPII dataset, the performance of the various regularization terms is shown in the following table:

  It can be seen that JS distribution regularization achieves the highest accuracy. The following figure shows how drastically the appearance of a sample heatmap changes under the different regularization options; the distribution regularizers based on KL and JS divergence are very effective at encouraging Gaussian-shaped blobs.

  After selecting JS distribution regularization as the regularization term, the values of the other hyperparameters were obtained through experiments: finally σ = 1, λ = 1.

Experimental results

  The PCKh scores of heatmaps at different resolutions, with coordinates produced by different strategies, are shown in the following table; it can be seen that DSNT performs well.

  Under different heatmap resolutions, the experimental results of DSNT with regularization terms are as follows. The loss from 16× downsampling is somewhat large. Higher heatmap resolution is beneficial at every depth, but the cost is that increased resolution greatly increases memory consumption and computation.

  The experimental results of different output strategies on different stacked hourglass are as follows:

  Comparing hourglass + heatmap with ResNet + DSNTr, ResNet with 28-pixel heatmaps (8× downsampling) is a good trade-off.

  Comparison of accuracy, inference time, and memory usage of the different methods:

Origin: blog.csdn.net/qq_19784349/article/details/110795196