[Human Pose Estimation] (1) Principle Introduction


1. Background

Human pose estimation is essentially a keypoint detection task.

Keypoint detection is widely used in practice, for example in face recognition and gesture recognition; human pose estimation detects the keypoints of the body.

This article introduces some common datasets, evaluation metrics, and several classic algorithms.

2. Datasets and Evaluation Metrics

For the evaluation metrics used with keypoints, see the following articles:

COCO Dataset Evaluation Metrics - Keypoints - Jianshu (jianshu.com)

COCO official evaluation metrics

In essence, the metric scores a prediction by the Euclidean distance between each predicted point and its ground-truth point.

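One parameter deserves attention, the visibility flag v attached to each keypoint:

v = 0 : not labeled
v = 1 : labeled but not visible in the image (for example, occluded)
v = 2 : labeled and visible in the image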
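As a rough illustration of a distance-based keypoint metric, the sketch below computes the Object Keypoint Similarity (OKS) used by COCO for one prediction and one ground-truth person. The function name `oks` and its argument layout are assumptions made for this example, and the per-keypoint constants are the values commonly quoted from the COCO evaluation code; check them against pycocotools before relying on them.

```python
import numpy as np

# Per-keypoint constants for the 17 COCO keypoints, as commonly quoted from
# the COCO evaluation code (assumption: verify against pycocotools).
COCO_SIGMAS = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72,
                        .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks(pred_xy, gt_xy, gt_vis, area, sigmas=COCO_SIGMAS):
    """OKS between one predicted pose and one ground-truth pose.

    pred_xy, gt_xy: (K, 2) keypoint coordinates in pixels.
    gt_vis: (K,) visibility flags (0, 1 or 2) of the ground truth.
    area: ground-truth object area in pixels squared (the scale term s^2).
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    gt_xy = np.asarray(gt_xy, dtype=float)
    gt_vis = np.asarray(gt_vis)

    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)   # squared Euclidean distances
    k2 = (2.0 * np.asarray(sigmas)) ** 2          # per-keypoint falloff
    e = d2 / (2.0 * k2 * (area + 1e-12))

    labeled = gt_vis > 0                          # keypoints with v = 0 are ignored
    if not labeled.any():
        return 0.0
    return float(np.mean(np.exp(-e[labeled])))
```

A perfect prediction gives an OKS of 1, and the value decays towards 0 as the predicted points move away from the labeled ones. Dividing by the object area makes the same pixel error count less on a large person than on a small one.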

The most common human pose datasets are MPII and COCO. The COCO dataset is shown in the following figure:

In the annotation file, focus on the categories section.

There, keypoints lists the names of the keypoint types, and skeleton lists the pairs of keypoints that are connected to each other.
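To make the roles of keypoints and skeleton concrete, here is a small sketch that reads them from a COCO-style annotation. The dictionary is an abridged, hypothetical excerpt (only five of the keypoint names and a few skeleton pairs, with made-up coordinates), and the helper names `named_keypoints` and `limbs` are invented for this example.

```python
# Abridged, hypothetical excerpt of a COCO person-keypoints annotation file.
coco = {
    "categories": [{
        "id": 1,
        "name": "person",
        "keypoints": ["nose", "left_eye", "right_eye", "left_ear", "right_ear"],
        # Pairs of keypoints to connect; the indices are 1-based.
        "skeleton": [[2, 3], [1, 2], [1, 3], [2, 4], [3, 5]],
    }],
    "annotations": [{
        "category_id": 1,
        "num_keypoints": 3,  # number of keypoints with v > 0
        # Flat list of (x, y, v) triples, one triple per keypoint.
        "keypoints": [100, 50, 2, 95, 45, 2, 105, 45, 1, 0, 0, 0, 0, 0, 0],
    }],
}

def named_keypoints(annotation, category):
    """Map each keypoint name to its (x, y, v) triple."""
    flat = annotation["keypoints"]
    triples = [tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)]
    return dict(zip(category["keypoints"], triples))

def limbs(category):
    """Translate the 1-based skeleton index pairs into keypoint name pairs."""
    names = category["keypoints"]
    return [(names[a - 1], names[b - 1]) for a, b in category["skeleton"]]

category = coco["categories"][0]
annotation = coco["annotations"][0]
print(named_keypoints(annotation, category)["nose"])
print(limbs(category)[0])
```

Note that the skeleton indices start at 1, so they need to be shifted by one before indexing into the keypoints list.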

3. Top Down Algorithm

Introduction: In a top-down method, people are detected first and their keypoints are located afterwards.

The most classic network is Mask R-CNN. Its steps are: detect people, perform instance segmentation, then detect keypoints.

Let's look at how Mask R-CNN differs from a traditional two-stage detection algorithm:

As the figure above shows, Mask R-CNN adds a mask branch to the traditional two-stage detection network.

The details are shown in the figure below:

Features are extracted from each detected ROI and reduced in channel dimension, so that the number of output channels equals the number of keypoints to be detected.

Viewed separately, each output channel is a heatmap for one keypoint.
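One simple way to turn such per-keypoint heatmaps into coordinates is to take the location of the maximum in each channel. The sketch below shows this with NumPy; the function name `decode_heatmaps` is an assumption for illustration, and real implementations usually add sub-pixel refinement and map the result from ROI coordinates back to the image.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Pick the peak of each heatmap channel.

    heatmaps: array of shape (K, H, W), one channel per keypoint.
    Returns an array of shape (K, 3) holding (x, y, score) per keypoint.
    """
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    idx = flat.argmax(axis=1)                 # flat index of each channel's peak
    ys, xs = np.unravel_index(idx, (h, w))    # back to row/column coordinates
    scores = flat[np.arange(k), idx]
    return np.stack([xs, ys, scores], axis=1)

# Two toy 4x5 heatmaps with a single peak each.
hm = np.zeros((2, 4, 5))
hm[0, 1, 3] = 0.9   # keypoint 0 at x=3, y=1
hm[1, 2, 0] = 0.7   # keypoint 1 at x=0, y=2
print(decode_heatmaps(hm))
```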

4. Bottom Up Algorithm

Introduction: In a bottom-up method, all keypoints are detected first and then grouped into individual people.

The most classic algorithm is OpenPose.

Source code: https://github.com/CMU-Perceptual-Computing-Lab/openpose

Paper: [1812.08008] OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields (arxiv.org)

Principle:

The method has two main steps: Parts Detection (keypoint prediction) and Parts Association (keypoint connection).

First, a feature extraction network produces a feature map, from which a heatmap is predicted for each keypoint.

The second key step is to compute the degree of association between two points using Part Affinity Fields (PAFs).

The figure above shows how the score between two points is computed. A PAF stores, for every pixel that lies within the region of a true limb, the direction of that limb. The score of a candidate connection is then computed from how well the field along the line between the two points agrees with the direction of that line.
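The score between two candidate points can be sketched as a sampled line integral over the PAF. This is a minimal illustration under stated assumptions, not the OpenPose source: the function name `paf_score`, the nearest-pixel sampling, and the default of 10 samples are all choices made for this example.

```python
import numpy as np

def paf_score(paf_x, paf_y, p1, p2, num_samples=10):
    """Average alignment between a PAF and the segment from p1 to p2.

    paf_x, paf_y: (H, W) arrays holding the x and y components of the field.
    p1, p2: candidate keypoints as (x, y) pixel coordinates.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    vec = p2 - p1
    norm = np.linalg.norm(vec)
    if norm < 1e-8:
        return 0.0
    unit = vec / norm                          # direction of the candidate limb

    # Sample points evenly along the segment and read the nearest pixel.
    ts = np.linspace(0.0, 1.0, num_samples)
    pts = p1 + ts[:, None] * vec
    xs = np.clip(np.round(pts[:, 0]).astype(int), 0, paf_x.shape[1] - 1)
    ys = np.clip(np.round(pts[:, 1]).astype(int), 0, paf_x.shape[0] - 1)

    # Dot product of the field with the limb direction at each sample.
    dots = paf_x[ys, xs] * unit[0] + paf_y[ys, xs] * unit[1]
    return float(dots.mean())
```

A pair of points whose connecting line follows the field gets a score close to 1, a perpendicular pair gets a score close to 0, and a pair connected in the opposite direction gets a negative score. The scores of all candidate pairs can then be used to decide which points belong to the same person.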

For the details of the algorithm, see the subsequent code explanation part.

5. Frontier Algorithms

1) MSPN

Paper address: https://arxiv.org/pdf/1901.00148.pdf

The figure above shows the main structural modules.

Structurally, the single-stage module is optimized, and two stages, each running from downsampling to upsampling (similar to U-Net), are stacked.

A closer look at the key structure:

A large amount of feature information is lost during repeated downsampling and upsampling, so the feature aggregation between adjacent stages shown in the figure above is used to strengthen the propagation of feature information and make training easier. For a downsampling step, the input has three parts:

  • the same-size downsampling features of the previous stage, after a 1×1 convolution;
  • the same-size upsampling features of the previous stage, after a 1×1 convolution;
  • the downsampling features of the current stage.

The following is the optimization strategy applied to the stage outputs:

A coarse-to-fine, multi-branch supervision scheme is used to improve each stage. As shown in the figure above, the labels for the features of different stages are generated with Gaussian kernels of different sizes.
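The idea of making labels with Gaussian kernels of different sizes can be sketched as follows. The sigma values are hypothetical placeholders chosen for illustration, not the settings from the MSPN paper, and the function names are invented for this example.

```python
import numpy as np

def gaussian_heatmap(height, width, center, sigma):
    """Label heatmap with a Gaussian bump of the given sigma at center=(x, y)."""
    xs = np.arange(width)[None, :]
    ys = np.arange(height)[:, None]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def coarse_to_fine_targets(height, width, center, sigmas=(4.0, 3.0, 2.0, 1.0)):
    """One label per supervision branch, from a wide (coarse) to a narrow (fine) kernel.

    The sigma values are illustrative assumptions only.
    """
    return [gaussian_heatmap(height, width, center, s) for s in sigmas]

targets = coarse_to_fine_targets(32, 32, center=(10, 20))
for t in targets:
    print(t.shape, round(float(t.sum()), 2))
```

A wide kernel gives an easier, more tolerant target for early supervision, while a narrow kernel demands more precise localization from the later output.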

2) HRNet

Reference: Refreshing three COCO records! The pose estimation model HRNet is open source, from the University of Science and Technology of China | CVPR (qq.com)

Introduction: HRNet is short for High-Resolution Net. It maintains a high-resolution representation throughout the whole process of representation learning. To achieve this, the model uses a parallel structure in which sub-networks of different resolutions are connected in parallel.

The figure above shows some existing methods:

  • (a) Symmetric structure: downsample first, then upsample, using skip connections to recover the information lost in downsampling;
  • (b) Cascaded pyramid;
  • (c) Downsample first, then upsample with transposed convolutions, without skip connections for feature fusion;
  • (d) Dilated convolutions, which reduce the number of downsampling steps, without skip connections for feature fusion.

The figure above shows the structure of HRNet, which has two main features: high-resolution sub-networks connected in parallel, and repeated multi-scale fusion.

Compared with traditional feature extraction that only downsamples, the network uses both upsampling and downsampling and fuses features of different resolutions during feature extraction.
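The exchange between resolutions can be illustrated with a heavily simplified sketch for two branches. It is not the HRNet implementation: the real network uses learned convolutions for resampling and different channel widths per branch, whereas this example assumes equal channel counts and swaps in parameter-free nearest-neighbor upsampling and average pooling. All function names are invented for illustration.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def downsample_avg(x, factor):
    """Average pooling of a (C, H, W) map (stand-in for a strided convolution)."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def fuse(high, low):
    """Exchange information between a high- and a low-resolution branch.

    high: (C, H, W); low: (C, H/2, W/2). Each branch receives the other
    branch's features resampled to its own resolution and adds them.
    """
    fused_high = high + upsample_nearest(low, 2)
    fused_low = low + downsample_avg(high, 2)
    return fused_high, fused_low

high = np.ones((2, 4, 4))
low = np.full((2, 2, 2), 2.0)
fused_high, fused_low = fuse(high, low)
print(fused_high.shape, fused_low.shape)
```

Both branches keep their own resolution after fusion, which is what allows the high-resolution branch to persist through the whole network while still receiving context from the coarser one.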

Additional information

For open-source keypoint detection code and papers, see: Keypoint Detection | Papers With Code

A summary of IoU variants: One article to understand the various IoU loss functions in object detection - Zhihu (zhihu.com)


Reposted from: blog.csdn.net/weixin_40620310/article/details/130367486