HRNet for 2D keypoint detection: Deep High-Resolution Representation Learning for Human Pose Estimation


Paper link: Deep High-Resolution Representation Learning for Human Pose Estimation
Published: February 2019, CVPR 2019
Authors: Ke Sun, Bin Xiao, Dong Liu, Jingdong Wang
Category: Computer Vision – Human Keypoint Detection – 2D top-down
### Contents:

1. HRNet background
2. HRNet pose estimation
3. HRNet network architecture diagram
4. References

1. This post is mainly a learning record. If there is any infringement, please message me privately and I will correct it.
2. My level is limited; thank you for pointing out any deficiencies.


1. HRNet background

  For the human pose estimation task, there are two main deep-learning-based approaches; HRNet uses the heatmap-based one:

  1. Regression-based methods directly predict the position coordinates of each keypoint.
  2. Heatmap-based methods predict a heat map for each keypoint, scoring every position; the keypoint is taken from the highest-scoring location.
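As a concrete illustration of the heatmap-based formulation, here is a minimal NumPy sketch that renders one keypoint as a 2D Gaussian target. The 64×48 heatmap size matches the output size discussed later; the `sigma` value is an illustrative assumption:

```python
import numpy as np

def gaussian_heatmap(h, w, cx, cy, sigma=2.0):
    """Render one keypoint at (cx, cy) as a 2D Gaussian on an h x w heatmap."""
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2 * sigma ** 2))

hm = gaussian_heatmap(64, 48, cx=20, cy=30)
# The highest score sits exactly at the annotated keypoint location.
assert np.unravel_index(hm.argmax(), hm.shape) == (30, 20)
```

The network is then trained to regress one such heatmap per keypoint, rather than the raw coordinates.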

  This paper was jointly published in 2019 by the University of Science and Technology of China and Microsoft Research Asia, targeting the 2D human pose estimation task.
  Generally speaking, existing networks run convolutions from high resolution down to low resolution (as in ResNet or VGGNet) and then recover a high-resolution representation from the encoded low-resolution one. In Hourglass and cascaded pyramid networks, layers with the same resolution in the high-to-low and low-to-high paths are skip-connected in order to fuse low-level and high-level features; in cascaded pyramid networks this fusion is done with convolution operations.
  HRNet instead maintains a high-resolution representation throughout. Its two key features are: (i) convolution streams from high to low resolution connected in parallel, and (ii) repeated exchange of information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise.


2. HRNet pose estimation
  1. Network structure
      This paper proposes a new architecture, the High-Resolution Network (HRNet), which maintains a high-resolution representation throughout the entire process. The first stage builds a high-resolution subnetwork; subsequent stages gradually add lower-resolution subnetworks and connect the multi-resolution subnetworks in parallel. Multi-scale fusion is performed by repeatedly exchanging information across these parallel multi-resolution subnetworks. The two key design choices are:
    (1) Connect high-to-low-resolution subnetworks in parallel rather than in series.
    (2) Repeat multi-scale fusion, using low-resolution representations at the same depth and similar level to strengthen the high-resolution representation.

  The core idea of this paper is to continuously fuse information across different scales. HRNet first downsamples the input by a factor of 4 through two 3×3 convolutions with stride 2, and then adjusts the number of channels through the Layer1 module without changing the spatial size of the feature map.
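The stem described above can be sketched in PyTorch as follows. This is a hedged sketch rather than the official implementation; the 256×192 input size is a commonly used HRNet pose input and the 64-channel width is assumed:

```python
import torch
import torch.nn as nn

# Two 3x3, stride-2 convolutions reduce the input resolution by a factor
# of 4 (e.g. 256x192 -> 64x48) before the multi-branch stages begin.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(64),
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 256, 192)
out = stem(x)
print(out.shape)  # torch.Size([1, 64, 64, 48])
```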
  It then passes through a series of Transition and Stage structures; each Transition structure adds a new, lower-resolution scale branch.
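A minimal sketch of what a Transition step does, under the assumption (consistent with the paper's figures) that the new branch is spawned from the current lowest-resolution branch by a stride-2 convolution that halves the spatial size and doubles the channel count:

```python
import torch
import torch.nn as nn

def new_branch(in_ch):
    """Spawn a new, lower-resolution branch: stride-2 conv, 2x the channels."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch * 2, kernel_size=3, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(in_ch * 2),
        nn.ReLU(inplace=True),
    )

low = torch.randn(1, 64, 32, 24)   # lowest-resolution branch so far
extra = new_branch(64)(low)
print(extra.shape)  # torch.Size([1, 128, 16, 12])
```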

  Within each Stage, each scale branch first passes through n Basic Blocks, after which the information on the different scales is fused: the output of each scale branch is obtained by fusing the outputs of all branches.
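The cross-scale fusion can be illustrated for two branches as follows. This sketch follows the paper's scheme of upsampling via a 1×1 convolution plus nearest-neighbour interpolation and downsampling via a strided 3×3 convolution, with fusion by element-wise sum; the channel counts (32 and 64) are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Branch 0: high resolution, 32 channels; branch 1: half resolution, 64 channels.
up = nn.Conv2d(64, 32, kernel_size=1, bias=False)                 # align channels, then upsample
down = nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1, bias=False)  # downsample high-res branch

x_hi = torch.randn(1, 32, 64, 48)
x_lo = torch.randn(1, 64, 32, 24)

# Each branch's new output sums contributions from every branch.
y_hi = x_hi + F.interpolate(up(x_lo), scale_factor=2, mode="nearest")
y_lo = x_lo + down(x_hi)
print(y_hi.shape, y_lo.shape)
```

With more branches, each output simply sums one (possibly identity) transformed contribution per branch.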

  The last Exchange Block in Stage 4 keeps only the output of the 4×-downsampled branch, which is fed into a 1×1 convolution with n output channels (n being the number of keypoints) to produce the final heatmaps (64×48×n). Each keypoint coordinate is found by locating the maximum score in its heatmap; the scores of the left/right and upper/lower neighbours of that location are then compared, and the final predicted coordinate is offset by 0.25 pixel toward the higher-scoring side.
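The decoding step just described can be written as a short NumPy function (a sketch of the argmax-plus-quarter-offset rule, operating on a single keypoint's heatmap):

```python
import numpy as np

def decode_heatmap(hm):
    """Argmax location plus a 0.25-pixel shift toward the higher-scoring neighbour."""
    y, x = np.unravel_index(hm.argmax(), hm.shape)
    h, w = hm.shape
    px, py = float(x), float(y)
    if 0 < x < w - 1:
        px += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
    if 0 < y < h - 1:
        py += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
    return px, py

hm = np.zeros((64, 48))
hm[30, 20] = 1.0
hm[30, 21] = 0.5   # the right neighbour scores higher than the left one
print(decode_heatmap(hm))  # (20.25, 30.0)
```

The quarter-pixel offset compensates for the quantization introduced by the 4×-downsampled heatmap grid.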
  
  The up- and down-sampling process in Stage 2:

  2. Evaluation results
      The evaluation uses the standard two-stage top-down paradigm: a person detector first detects person instances, and keypoints are then predicted for each detection.
      Results on the COCO test set:

  Results on the MPII Human Pose Estimation benchmark:


3. HRNet network architecture diagram

(Architecture diagram: crowdpose.onnx.png)


4. References

Reference 1
Reference 2


Origin blog.csdn.net/qq_54793880/article/details/131116732