Human Keypoint Detection and MMPose
Introduction
Human pose estimation is an important research direction in computer vision and an essential step for computers to understand human actions and behavior. It refers to locating the keypoints of the human body in images or videos with computer algorithms. It is widely used in motion detection, virtual reality, human-computer interaction, video surveillance, and many other fields. This course covers an introduction to human pose estimation and its applications, 2D pose estimation, 3D pose estimation, DensePose, body meshes, and MMPose.
What is Human Pose Estimation
Locate keypoints of the face, hands, and body in a given image
3D pose estimation: predict the coordinates of human body keypoints in three-dimensional space and recover the pose of the body in 3D
Parametric human model: recover a moving 3D human body model from images or videos
PoseC3D: action recognition based on human poses
2D Pose Estimation
2D human pose estimation: locating the coordinates of human keypoints in an image
Basic idea: regression
Model keypoint detection as a regression problem and let the model regress the keypoint coordinates directly
It is difficult for a deep model to regress coordinates directly, and the resulting accuracy is not optimal
Basic idea: heatmaps
Instead of regressing the keypoint coordinates directly, predict the probability that a keypoint lies at each location
Heatmaps generated from the ground-truth keypoint coordinates serve as the supervision signal for training the network
Keypoint coordinates can be recovered from the predicted heatmap, for example by taking the location of its maximum value
Predicting a heatmap is easier for the model than regressing coordinates directly, and accuracy is generally higher, so most mainstream algorithms are heatmap-based. The trade-off is that predicting heatmaps costs more computation than direct regression
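The heatmap idea can be illustrated with a minimal sketch. It assumes a single keypoint, a Gaussian-shaped target, and plain argmax decoding; the function names and the `sigma` parameter are illustrative choices, not taken from any particular library.

```python
import math

def make_heatmap(cx, cy, width, height, sigma=1.0):
    """Render a 2D Gaussian centered on the keypoint (cx, cy) as the training target."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]

def decode_argmax(heatmap):
    """Recover the keypoint as the (x, y) cell with the highest score."""
    best_x, best_y, best_v = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best_v:
                best_x, best_y, best_v = x, y, v
    return best_x, best_y

target = make_heatmap(cx=5, cy=3, width=8, height=6)
print(decode_argmax(target))  # (5, 3)
```

In training, a network would be asked to reproduce `target`; at inference time the same decoding step is applied to the predicted heatmap.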
Multi-Person Pose Estimation: A Top-Down Approach
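First detect each person with an object detector, then estimate the pose of each detected person separately
Overall accuracy is limited by the accuracy of the detector
Inference time and computation grow in proportion to the number of people
Some recent work merges the two stages into one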
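A minimal sketch of the two-stage top-down flow follows. Both `detect_people` and `estimate_pose` are hypothetical stubs standing in for a real detector and a real single-person pose model, and the `image` dictionary is just a placeholder input; the point is only that the pose model runs once per detected box.

```python
def detect_people(image):
    """Hypothetical detector stub: returns one (x, y, w, h) box per person."""
    return image["boxes"]

def estimate_pose(image, box):
    """Hypothetical pose stub: returns the box center as a single keypoint."""
    x, y, w, h = box
    return [(x + w / 2, y + h / 2)]

def top_down(image):
    """Stage 1 finds people, stage 2 runs the pose model once per person."""
    poses = []
    pose_calls = 0
    for box in detect_people(image):
        poses.append(estimate_pose(image, box))
        pose_calls += 1  # cost grows linearly with the number of people
    return poses, pose_calls

poses, calls = top_down({"boxes": [(0, 0, 10, 20), (30, 40, 10, 10)]})
print(calls)  # 2
```

A person the detector misses never reaches the pose stage, which is why detector accuracy bounds the accuracy of the whole pipeline.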
Multi-Person Pose Estimation: A Bottom-Up Approach
First detect all keypoints in the image, then group them into individual people
Advantage: inference speed is independent of the number of people
Multi-Person Pose Estimation: A Single-Stage Approach
Human body detection and pose estimation in one step
Regression-Based Top-Down Approach
DeepPose
Built on a classification network: the final classification layer is replaced with a regression layer that predicts the coordinates of all keypoints at once
The network is trained by minimizing the squared error
Accuracy is improved through cascaded refinement
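The training objective and the cascade can be sketched as follows. This is a simplified illustration, not DeepPose's actual implementation: it assumes each cascade stage outputs a per-keypoint offset that is added to the previous estimate, and the names `squared_error` and `cascade` are hypothetical.

```python
def squared_error(pred, target):
    """Sum of squared coordinate differences over all keypoints."""
    return sum(
        (px - tx) ** 2 + (py - ty) ** 2
        for (px, py), (tx, ty) in zip(pred, target)
    )

def cascade(initial, stage_offsets):
    """Refine a pose: each stage contributes a correction to the previous estimate."""
    pose = list(initial)
    for offsets in stage_offsets:
        pose = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(pose, offsets)]
    return pose

print(squared_error([(1, 2), (3, 4)], [(1, 1), (5, 4)]))  # 5
```

Each stage only has to correct the residual error left by the previous one, which is the intuition behind improving accuracy through cascading.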
Advantages:
In theory a regression model can reach arbitrary precision, whereas the precision of heatmap methods is limited by the spatial resolution of the feature map
A regression model does not need to maintain high-resolution feature maps, so it is computationally more efficient. Heatmap methods, in contrast, must compute and store high-resolution heatmaps and feature maps, which costs more computation
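The resolution limit can be made concrete with a small sketch. It assumes a heatmap downsampled by a stride of 4 relative to the input image and argmax decoding without any sub-pixel refinement; both are assumptions made for illustration.

```python
def heatmap_roundtrip(x, stride=4):
    """Quantize an image coordinate to a heatmap cell, then map the cell back."""
    cell = round(x / stride)
    return cell * stride

x = 101.7
recovered = heatmap_roundtrip(x)
print(recovered)  # 100
```

Under these assumptions the recovered coordinate can be off by up to half a stride per axis (2 pixels here), no matter how well the network is trained, whereas a regressed coordinate is a continuous value with no such grid.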
Disadvantages:
The mapping from images to keypoint coordinates is highly nonlinear, which makes regressing coordinates harder than regressing heatmaps, and the accuracy of regression methods lags behind that of heatmap methods. As a result, for a long time after DeepPose was proposed, 2D keypoint prediction algorithms were mainly heatmap-based
RLE (Residual Log-likelihood Estimation)
Core idea: model the probability distribution of keypoint positions more accurately, thereby improving the accuracy of position prediction
Classical regression paradigm
The loss is the error between the keypoint coordinates predicted by the model and the ground truth, which implicitly assumes a Gaussian distribution that does not necessarily match the actual distribution of the data
The RLE paradigm
Explicitly model the probability distribution of keypoint positions and fit the optimal distribution by maximum likelihood
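The link between an error-based loss and a distributional assumption can be written out for one coordinate. The symbols are introduced here for illustration: x is the ground-truth coordinate, \hat{\mu} the predicted coordinate, and \sigma, b fixed scale parameters. Under a Gaussian the negative log-likelihood reduces to the squared error up to constants, and under a Laplace distribution it reduces to the absolute error:

```latex
-\log P_{\text{Gauss}}(x \mid I)
  = \frac{(x - \hat{\mu})^2}{2\sigma^2} + \log \sigma + \tfrac{1}{2}\log 2\pi
\qquad
-\log P_{\text{Laplace}}(x \mid I)
  = \frac{\lvert x - \hat{\mu} \rvert}{b} + \log 2b
```

With \sigma (or b) held constant, minimizing either expression is the same as minimizing the L2 (or L1) loss, so choosing such a loss fixes the shape of the assumed distribution in advance. Maximum likelihood with a learned density removes that restriction.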
The overall design of RLE
The goal of RLE is to model the probability distribution of keypoint positions: given an image, output the position distribution of each keypoint
This distribution can be constructed with a normalizing flow, and RLE additionally introduces two tricks that make the true distribution easier to model
- Reparameterization
- Residual log-likelihood function
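The two tricks can be sketched for a single coordinate. This is a minimal illustration under stated assumptions, not the reference implementation: the simple base density is assumed to be a standard Laplace distribution, `residual_log_density` is a hypothetical stand-in for the term a normalizing flow would learn, and any constant normalization term is omitted.

```python
import math

def laplace_log_density(z):
    """Log density of a standard Laplace distribution, used as the simple base term."""
    return -abs(z) - math.log(2.0)

def rle_nll(x_true, mu, sigma, residual_log_density=lambda z: 0.0):
    """Negative log-likelihood of x_true under the reparameterized density.

    mu and sigma play the role of the network's predicted location and scale.
    residual_log_density is a hypothetical placeholder for the flow-learned residual.
    """
    # Reparameterization: x = mu + sigma * z, so the flow only models the shape of z.
    z = (x_true - mu) / sigma
    # Residual log-likelihood: simple base term plus a learned correction.
    log_p_z = laplace_log_density(z) + residual_log_density(z)
    # Change of variables from z back to x contributes -log(sigma).
    return -(log_p_z - math.log(sigma))

print(rle_nll(3.0, mu=1.0, sigma=2.0))
```

With the residual term set to zero, the loss falls back to a Laplace likelihood with a learned scale; the learned residual only has to capture how the true distribution deviates from that simple shape.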
Heatmap-Based Top-Down Approach
Hourglass module
The Hourglass model led on the FLIC and MPII datasets at the time, reaching the state of the art on every item of MPII
Simple Baseline (2018)
Aims for a simple structure: a ResNet followed by deconvolution layers forms an encoder-decoder
HRNet
Core idea: keep high-resolution feature maps and their spatial position information throughout the network by retaining a high-resolution branch while downsampling, and use a dedicated network structure to fuse multi-scale features across resolutions
Bottom-Up Approach
Part Affinity Fields & OpenPose
Basic idea: predict joint positions and limb directions from the image simultaneously, and use the limb directions to help group keypoints: if two keypoints are connected by a limb, they belong to the same person
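One way to use limb directions for grouping is to score a candidate pair of keypoints by how well the predicted direction field agrees with the segment between them. The sketch below assumes the field is available as a callable `paf(x, y) -> (vx, vy)`, which is a hypothetical stand-in for the network output, and it samples the segment at a fixed number of points (`num_samples` must be at least 2).

```python
import math

def limb_score(paf, p1, p2, num_samples=10):
    """Average alignment between the field and the unit vector from p1 to p2."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length
    total = 0.0
    for i in range(num_samples):
        t = i / (num_samples - 1)           # walk from p1 (t=0) to p2 (t=1)
        vx, vy = paf(p1[0] + t * dx, p1[1] + t * dy)
        total += vx * ux + vy * uy          # dot product with the limb direction
    return total / num_samples

# Toy field that points to the right everywhere.
rightward = lambda x, y: (1.0, 0.0)
print(limb_score(rightward, (0, 0), (10, 0)))  # 1.0
print(limb_score(rightward, (0, 0), (0, 10)))  # 0.0
```

Candidate pairs whose connecting segment follows the field score close to 1 and are likely to be joined by a real limb, so they are assigned to the same person; pairs that cut across the field score near 0.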
Single-Stage Approach
SPM
SPM was the first to propose a single-stage solution for multi-person pose estimation. It gains a speed advantage while achieving detection performance that is not inferior to two-stage methods, and it extends directly from 2D to 3D human pose estimation
Structured Pose Representation (SPR)
To unify the position information of person instances and body joints and enable a single-stage solution for multi-person pose estimation, SPR introduces an auxiliary joint, the root joint, that represents the position of a person instance and uniquely identifies it
However, SPR has an obvious disadvantage: large pose variations can produce long-range displacements between body joints and the root joint, and such displacements are hard to estimate when mapping from image representations to the vector domain
Hierarchical SPR is therefore built on top of SPR, dividing the root joint and body joints into four levels according to their degrees of freedom and extent of deformation
Level 1: root joint
Level 2: neck, shoulders, and hips
Level 3: head, elbows, and knees
Level 4: wrists and ankles
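The difference between the flat and hierarchical representations can be sketched as two decoding routines. The sketch assumes that in the hierarchical case each joint stores a displacement from a parent joint one level up; the joint names, the `parent` mapping, and the function names are illustrative.

```python
def decode_spr(root, displacements):
    """Flat SPR: every joint is the root position plus its own displacement."""
    return {
        name: (root[0] + dx, root[1] + dy)
        for name, (dx, dy) in displacements.items()
    }

def decode_hierarchical(root, parent, displacements):
    """Hierarchical SPR: each joint is offset from its parent, so offsets stay short."""
    positions = {"root": root}

    def locate(name):
        if name not in positions:
            px, py = locate(parent[name])
            dx, dy = displacements[name]
            positions[name] = (px + dx, py + dy)
        return positions[name]

    for name in displacements:
        locate(name)
    return positions

parent = {"shoulder": "root", "elbow": "shoulder", "wrist": "elbow"}
offsets = {"shoulder": (2, -5), "elbow": (3, 2), "wrist": (3, 1)}
print(decode_hierarchical((10, 20), parent, offsets)["wrist"])  # (18, 18)
```

In the flat form the wrist would need a single displacement of (8, -2) from the root, and that vector grows with the pose; in the hierarchical form each step only spans one limb segment.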
Transformer-Based Approach
PRTR
Human pose estimation and object detection are similar in that both involve localizing image content
In DETR, each query gradually focuses on a specific object through the attention mechanism
Pose estimation can imitate DETR by letting each query gradually focus on a specific human keypoint
Two-stage algorithm
Person detection stage: use DETR to detect the different people in the image
Keypoint detection stage: also uses the DETR structure, except that the queries learn keypoint information and finally regress the keypoint positions
TokenPose
Feeding visual tokens and keypoint tokens into the encoder together lets the model learn both visual appearance from the image and the constraint relationships between keypoints
The classification model ViT takes a similar approach, applying self-attention to a classification token together with the visual tokens
3D Human Pose Estimation
Predict the coordinates of human body keypoints in three-dimensional space from a given image and recover the pose of the body in 3D