[OpenMMLab AI Combat Camp Second Notes] Human body key point detection and MMPose

Human key point detection and MMPose

Introduction

Human pose estimation is an important research direction in computer vision and an essential step for computers to understand human actions and behaviors. It refers to locating the key points of the human body in images or videos with computer algorithms. It is widely used in motion detection, virtual reality, human-computer interaction, video surveillance, and many other fields. This course covers the introduction and applications of human pose estimation, 2D pose estimation, 3D pose estimation, DensePose, body mesh, and MMPose.

What is Human Pose Estimation

Identify key points on the face, hands, and body in a given image

3D pose estimation: predict the coordinates of key points of the human body in three-dimensional space, and restore the pose of the human body in three-dimensional space

Human parametric model: recover a 3D human body model, including its motion, from images or videos

PoseC3D: action recognition based on human poses

2D Pose Estimation

2D Human Pose Estimation: Locating the Coordinates of Human Key Points on Images

Basic idea: based on regression

Model key point detection as a regression problem and let the model directly regress the coordinates of the key points

It is difficult for a deep model to regress the coordinates directly, so the accuracy is not optimal

Basic idea: based on heat map

Instead of directly regressing the coordinates of the key points, the model predicts the probability that a key point lies at each location

A heat map can be generated from the ground-truth key point coordinates and used as the supervision signal for training the network

The key point coordinates can be recovered from the heat map predicted by the network, for example by taking the location of the maximum value

Predicting a heat map is easier for the model than directly regressing coordinates, and the resulting accuracy is relatively higher, so mainstream algorithms are mostly heat-map-based. However, predicting heat maps costs more computation than direct regression
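The round trip described above (coordinates to heat map for supervision, heat map back to coordinates at inference) can be sketched in a few lines. This is a minimal illustration, not MMPose's implementation: the function names, the Gaussian shape of the target, and `sigma=2.0` are assumptions, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def make_heatmap(x, y, width, height, sigma=2.0):
    """Render a 2D Gaussian centred on (x, y) as a height x width grid."""
    return [
        [math.exp(-((col - x) ** 2 + (row - y) ** 2) / (2 * sigma ** 2))
         for col in range(width)]
        for row in range(height)
    ]

def decode_heatmap(heatmap):
    """Return (x, y, score) of the cell with the maximum response."""
    best_score, best_x, best_y = max(
        (value, col, row)
        for row, line in enumerate(heatmap)
        for col, value in enumerate(line)
    )
    return best_x, best_y, best_score

# Encode a key point at (20, 12) on a 48 x 64 grid, then decode it again.
target = make_heatmap(x=20, y=12, width=48, height=64)
print(decode_heatmap(target))  # (20, 12, 1.0)
```

During training the network output would be compared with `target` (for example with a per-pixel squared error); at inference `decode_heatmap` is applied to the predicted map instead.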

Multi-Person Pose Estimation: A Top-Down Approach

Overall accuracy is limited by the accuracy of the detector
Speed and computation are proportional to the number of people
Some recent work considers merging the two stages into one
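A top-down pipeline first detects people and then runs a single-person pose model on each detection, which is why the cost grows with the number of people. The sketch below only illustrates that control flow: `detect_people` and `estimate_single_pose` are hypothetical stubs standing in for real models, and the boxes and key points they return are made up.

```python
def detect_people(image):
    """Hypothetical detector stub: returns boxes as (x0, y0, x1, y1)."""
    return [(10, 20, 60, 140), (80, 30, 130, 150)]

def estimate_single_pose(image, box):
    """Hypothetical single-person model stub: key points in crop coordinates."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    return [(w // 2, h // 10), (w // 2, h // 2)]  # e.g. head, hip

def top_down(image):
    """Stage 1: detect people. Stage 2: one pose-model call per detection."""
    poses = []
    for box in detect_people(image):
        x0, y0, _, _ = box
        crop_kpts = estimate_single_pose(image, box)
        # Map key points from crop coordinates back to image coordinates.
        poses.append([(x0 + kx, y0 + ky) for kx, ky in crop_kpts])
    return poses

print(top_down(image=None))
# [[(35, 32), (35, 80)], [(105, 42), (105, 90)]]
```

A missed or badly placed box in stage 1 cannot be repaired in stage 2, which is the detector dependence noted above.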

Multi-Person Pose Estimation: A Bottom-Up Approach

Advantages: Inference speed is independent of the number of people

Multi-Person Pose Estimation: A Single-Stage Approach

Human body detection and pose estimation in one step

Regression-Based Top-Down Approach

DeepPose

Starting from a classification network, the final classification layer is replaced with a regression layer that predicts the coordinates of all key points at once

Train the network by minimizing the squared error

Improve accuracy through cascading
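The squared-error objective can be written out directly. The sketch below is only an illustration of the loss: the function name is hypothetical, the sample values are made up, and the coordinates are assumed to be normalized to the person's bounding box.

```python
def squared_error_loss(pred, target):
    """Sum of squared coordinate errors over all key points."""
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, target))

# Two key points as (x, y); values are illustrative only.
pred = [(0.50, 0.20), (0.40, 0.60)]
target = [(0.50, 0.25), (0.45, 0.60)]
loss = squared_error_loss(pred, target)
print(loss)
```

In a cascade, a later stage would be trained with the same loss, but on crops taken around the previous stage's predictions, so that it learns a refinement rather than the full mapping.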

Advantages:

A regression model can in theory reach arbitrary precision, whereas the precision of the heat map method is limited by the spatial resolution of the feature map

A regression model does not need to maintain high-resolution feature maps, so it is computationally more efficient. In contrast, the heat map method has to compute and store high-resolution heat maps and feature maps, so its computational cost is higher.
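The resolution limit can be made concrete with a small calculation. The sketch assumes a heat map that is 4 times smaller than the input image (the stride value and the function name are assumptions for illustration) and a plain argmax decode without sub-pixel refinement.

```python
def heatmap_roundtrip(x, stride=4):
    """Quantize an image-space coordinate to a heat-map cell and map it back."""
    cell = round(x / stride)   # nearest heat-map cell
    return cell * stride       # back to image coordinates

true_x = 101.3
decoded_x = heatmap_roundtrip(true_x)
error = abs(decoded_x - true_x)
print(decoded_x, error)  # decoded_x is 100; the error can reach stride / 2
```

A regression head outputs a continuous value and has no such grid, which is the sense in which its precision is not bounded by the feature-map resolution.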

Disadvantages:
The mapping from images to key point coordinates is highly nonlinear, which makes regressing coordinates harder than regressing heat maps, and the accuracy of regression methods is lower than that of heat map methods. For a long time after DeepPose was proposed, 2D key point prediction algorithms were therefore mainly based on heat maps

RLE (Residual Log-likelihood Estimation)

Core idea: more accurate probability modeling for the position of key points, thereby improving the accuracy of position prediction

Classical regression paradigm

The error between the key point coordinates predicted by the model and the ground truth is used as the loss. This implicitly assumes a Gaussian distribution, which does not necessarily match the actual distribution of the data

The paradigm of RLE
Explicitly models the probability distribution of the key points and fits the optimal position distribution by maximum likelihood
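The link between a squared-error loss and a Gaussian assumption follows from writing out the negative log-likelihood. The notation is an assumption for illustration: $\hat{\mu}$ is the predicted coordinate, $\mu_g$ the ground truth, $I$ the image, and $\sigma$ a fixed standard deviation.

```latex
% Gaussian likelihood of the ground truth \mu_g around the prediction \hat{\mu}
P(\mu_g \mid I) = \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left(-\frac{(\mu_g - \hat{\mu})^2}{2\sigma^2}\right)

% Negative log-likelihood
-\log P(\mu_g \mid I) = \frac{(\mu_g - \hat{\mu})^2}{2\sigma^2}
  + \log\!\left(\sqrt{2\pi}\,\sigma\right)

% With \sigma fixed, the second term is constant, so
\arg\min_{\hat{\mu}} \bigl[-\log P(\mu_g \mid I)\bigr]
  = \arg\min_{\hat{\mu}} \,(\mu_g - \hat{\mu})^2
```

Minimizing the squared error is therefore maximum likelihood under a fixed-variance Gaussian. The same argument with a Laplace density gives the absolute (L1) error instead.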

The overall design of RLE

The goal of RLE is to model the probability distribution of key point positions: given an image, predict the position distribution of each key point

This distribution can be constructed with a normalizing flow. RLE also introduces two tricks that reduce the difficulty of modeling and fitting the real distribution

  1. Reparameterization
  2. Residual likelihood function
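The two tricks can be summarized in equations. This is a sketch of how the formulation is commonly written, not a verbatim copy of the paper; the symbols are assumptions for illustration: $\hat{\mu}$ and $\hat{\sigma}$ are the shift and scale predicted by the network, $Q$ is a simple base density (such as a Gaussian or Laplace), $G_\phi$ is the part learned by the flow, and $s$ is a normalizing constant.

```latex
% 1. Reparameterization: the network predicts \hat{\mu} and \hat{\sigma};
%    the flow only models the standardized deviation \bar{x}.
x = \hat{\mu} + \hat{\sigma}\,\bar{x}, \qquad \bar{x} \sim P_\phi(\bar{x})

\log P(x \mid I) = \log P_\phi(\bar{x}) - \log \hat{\sigma},
\qquad \bar{x} = \frac{x - \hat{\mu}}{\hat{\sigma}}

% 2. Residual log-likelihood: a simple base density plus a learned correction
\log P_\phi(\bar{x}) = \log Q(\bar{x}) + \log G_\phi(\bar{x}) + \log s

% Resulting loss, evaluated at the ground truth \mu_g
\mathcal{L} = -\log Q(\bar{\mu}_g) - \log G_\phi(\bar{\mu}_g) - \log s
  + \log \hat{\sigma},
\qquad \bar{\mu}_g = \frac{\mu_g - \hat{\mu}}{\hat{\sigma}}
```

With this split, the flow does not depend on the image and only has to learn how the true density deviates from $Q$, which is an easier target than the full distribution.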

Heat-Map-Based Top-Down Approach

Hourglass module

The Hourglass model led on the FLIC and MPII datasets at the time, reaching state of the art on every MPII metric.

Simple Baseline (2018)

Aims for a simple structure: ResNet followed by deconvolution layers forms an encoder-decoder structure

HRNet

Core idea: keep the high resolution and spatial position information of the feature map throughout the network by retaining a high-resolution branch while downsampling, and fuse multi-scale features across resolutions through a dedicated network structure

Bottom-Up Approach

Part Affinity Fields & OpenPose

Basic idea: Predict the joint positions and limb directions from the image at the same time, and use the limb directions to help group the key points: if two key points are connected by a limb, they belong to the same person
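One way to turn "connected by a limb" into a number is to sample the predicted limb-direction field along the segment between two candidate joints and measure how well it aligns with that segment. The sketch below is a simplified illustration: `paf` is a hypothetical lookup function, `toy_paf` is a made-up field, and the final assignment uses a plain argmax rather than the matching step of a full system.

```python
import math

def paf_score(paf, a, b, samples=10):
    """Average alignment between the field and the segment a -> b.

    `paf(x, y)` is a hypothetical lookup returning the predicted 2D limb
    direction at image position (x, y).
    """
    dx, dy = b[0] - a[0], b[1] - a[1]
    length = math.hypot(dx, dy)
    if length == 0:
        return 0.0
    ux, uy = dx / length, dy / length          # unit vector from a to b
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)                  # evenly spaced points on a -> b
        vx, vy = paf(a[0] + t * dx, a[1] + t * dy)
        total += vx * ux + vy * uy             # dot product with the direction
    return total / samples

def toy_paf(x, y):
    """Toy field: a horizontal forearm occupies the band 9 <= y <= 11."""
    return (1.0, 0.0) if 9 <= y <= 11 else (0.0, 0.0)

elbow = (10, 10)
wrists = [(30, 10), (30, 40)]                  # candidate wrists of two people
scores = [paf_score(toy_paf, elbow, w) for w in wrists]
best = max(range(len(wrists)), key=lambda i: scores[i])
print(best)  # 0: the wrist at (30, 10) lies along the predicted limb
```

Because the score is computed per candidate pair on maps predicted once for the whole image, the network cost does not depend on how many people are present; only the grouping step does.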

Single-Stage Approach

SPM

SPM proposes, for the first time, a single-stage solution for human pose estimation. It gains a speed advantage while achieving a detection rate not inferior to two-stage methods, and the method can be extended directly from 2D to 3D human pose estimation

Structured Pose Representation (SPR)

To unify the position information of human instances and body joints and provide a single-stage solution for multi-person pose estimation, SPR introduces an auxiliary joint, the root joint, which represents the position of a human instance and serves as its unique identifying joint. Each body joint is then represented by its displacement from the root joint

However, SPR has an obvious drawback: large pose variations can involve long-range displacements between body joints and the root joint, and such displacements are hard to estimate when mapping from image representations to the vector domain.

Hierarchical SPR is therefore proposed on top of SPR: the root joint and body joints are divided into four levels according to their degrees of freedom and extent of deformation.

Level 1: root joint
Level 2: neck, shoulders, and hips
Level 3: head, elbows, and knees
Level 4: wrists and ankles
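With the four levels, each joint is located by a short displacement from a joint one level up rather than by one long displacement from the root. The sketch below decodes a single arm chain this way; the `PARENT` table, the joint names, and all coordinates are hypothetical values chosen for illustration.

```python
# Parent of each joint in a hypothetical chain rooted at "root":
# level 2 joints hang off the root, level 3 off level 2, level 4 off level 3.
PARENT = {"shoulder": "root", "elbow": "shoulder", "wrist": "elbow"}

def decode_hierarchical(root, displacements):
    """Recover absolute joint positions by chaining parent -> child offsets."""
    positions = {"root": root}
    for joint, parent in PARENT.items():   # insertion order: parents come first
        px, py = positions[parent]
        dx, dy = displacements[joint]
        positions[joint] = (px + dx, py + dy)
    return positions

root = (100, 120)
disp = {"shoulder": (20, -45), "elbow": (30, 0), "wrist": (30, 0)}
pose = decode_hierarchical(root, disp)
print(pose["wrist"])  # (180, 75)
```

In plain SPR the wrist of this outstretched arm would need the single displacement (80, -45) from the root; in the hierarchical form every predicted offset is shorter, which is the motivation given above.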

Transformer-based method

PRTR

Human pose estimation and object detection are similar in that both involve localizing image content.
In DETR, each query gradually focuses on a specific object through the attention mechanism.
Pose estimation can imitate DETR: let each query gradually focus on a specific human key point

Two-stage algorithm
Human detection stage: use DETR to detect the different people in the picture
Key point detection stage: also uses the DETR structure, except that the queries learn key point information and finally regress the key point positions

TokenPose

Feeding the visual tokens and the key point tokens into the encoder together lets the model learn both the visual appearance from the image and the constraint relationships between key points

The classification model ViT takes a similar approach, applying self-attention jointly over a classification token and the visual tokens

3D Human Pose Estimation

Given an image, predict the coordinates of the human body's key points in three-dimensional space and recover the pose of the human body in three-dimensional space

Summary

[Figure: summary image]

Origin blog.csdn.net/yichao_ding/article/details/131003994