【Paper reading】 AdaptivePose: Human Parts as Adaptive Points

DOI:https://doi.org/10.1609/aaai.v36i3.20185

AAAI 2022, published 2022-06-28

Other readings/notes: Translation 1, Translation 2

Intro & Background

Multi-person pose estimation methods

Two-stage methods [Figure a]

These methods use absolute keypoint locations and locate independent points, so additional steps are required to model the relationship between them.

1. Top-down methods: first detect each person, crop and resize the person region, then locate the keypoints inside each crop (e.g., HRNet).

Mainly focus on network design to extract better feature representations.

Disadvantages: ① performance depends heavily on the quality of the detected boxes; ② the detect-first pipeline has high memory cost and low efficiency.

2. Bottom-up methods: first locate the keypoints of all people at different scales, then group them into the corresponding persons.

Mainly focus on an effective grouping process.

Disadvantages: although fast, the grouping post-processing is complicated and relies on hand-crafted tricks.

3. Point-based Representation

CenterNet: center point + center-to-joint offsets [Figure b]

Because poses vary widely and the center has a fixed receptive field, long-range center-to-joint offsets are hard to regress, which limits performance.

SPM: uses a root joint to represent each instance and divides the root joint and keypoints into four hierarchies according to joint kinematics. [Figure c]

Long-range offsets are decomposed into short-range offsets that are accumulated, but errors also accumulate along the skeleton.
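
To make the contrast between the two point-based representations concrete, here is a minimal sketch in plain Python. The coordinates, offsets, and function names are invented for illustration and are not taken from the CenterNet or SPM code; the sketch only shows that a single long offset and a chain of short offsets reach the same joint, and that an early error in the chain shifts every later joint.

```python
def decode_direct(center, offset):
    """CenterNet-style decoding: joint = center + one long center-to-joint offset."""
    return (center[0] + offset[0], center[1] + offset[1])


def decode_hierarchical(root, offsets_along_chain):
    """SPM-style decoding: walk the kinematic chain, adding one short offset per level.

    An error in an early offset shifts every joint further down the chain.
    """
    x, y = root
    joints = []
    for dx, dy in offsets_along_chain:
        x, y = x + dx, y + dy
        joints.append((x, y))
    return joints


# Hypothetical chain: root -> shoulder -> elbow -> wrist.
root = (100.0, 100.0)
chain = [(20.0, -30.0), (15.0, 25.0), (10.0, 25.0)]
wrist_direct = decode_direct(root, (45.0, 20.0))
wrist_chain = decode_hierarchical(root, chain)[-1]

# A 2-pixel error at the first level propagates unchanged to the wrist.
noisy_chain = [(22.0, -30.0)] + chain[1:]
wrist_noisy = decode_hierarchical(root, noisy_chain)[-1]
```

Here `wrist_direct` and `wrist_chain` coincide, while `wrist_noisy` is shifted by the same 2 pixels that were injected at the shoulder level.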

Method of this paper: Body Representation

1. Basic idea of the method: center -> 7 parts -> joints

White marks the center of each individual; the human body is divided into 7 adaptive parts (the other 7 points in a); the keypoints are then located from each part.

The novel representation starts from instance-wise (body center) to part-wise (adaptive human-part related points), then to joint-wise (body keypoints) to form human pose.
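
As a rough illustration of the center -> part -> joint hierarchy, the sketch below groups the 17 COCO keypoints into 7 parts. The exact grouping and the part names are an assumption made for this example (these notes do not list them), so treat the table as hypothetical rather than as the paper's definition.

```python
# Assumed grouping of the 17 COCO keypoints into 7 parts (hypothetical names).
PART_TO_JOINTS = {
    "head": ["nose", "left_eye", "right_eye", "left_ear", "right_ear"],
    "shoulder": ["left_shoulder", "right_shoulder"],
    "left_arm": ["left_elbow", "left_wrist"],
    "right_arm": ["right_elbow", "right_wrist"],
    "hip": ["left_hip", "right_hip"],
    "left_leg": ["left_knee", "left_ankle"],
    "right_leg": ["right_knee", "right_ankle"],
}


def joint_to_part(part_to_joints):
    """Invert the table: for each joint, which part point is it regressed from?"""
    return {
        joint: part
        for part, joints in part_to_joints.items()
        for joint in joints
    }


JOINT_TO_PART = joint_to_part(PART_TO_JOINTS)
NUM_PARTS = len(PART_TO_JOINTS)
NUM_JOINTS = len(JOINT_TO_PART)
```

With this table every joint belongs to exactly one part, so a pose is fully described by one center, 7 part points, and 17 joints.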

2. Overall architecture: a single-stage network with three modules plus one auxiliary branch

AdaptivePose [Figure d]: an end-to-end differentiable network following body -> part -> joint (center -> adaptive points -> keypoints)

Composition: three branches + auxiliary parallel branch

(1) Part Perception Module: obtains the 7 adaptive part points

These adaptive points act as intermediate nodes, which are used for subsequent predictions.

(2) Enhanced Center-aware Branch: obtains the center

Aggregates the features of the seven adaptive human-part related points for precise center estimation.

(3) Two-hop Regression Branch: predicts the displacements center2part and part2joint

Predicts displacements in two hops instead of directly regressing center2joint offsets.

Building on the 7 parts from (1), the long-range center2joint offset becomes center2part + part2joint offsets.

(4) Parallel auxiliary branch (training only): 17 keypoints supervised by ground truth (gt)
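
The two-hop idea in (3) can be sketched as a small decoding routine. This is a minimal illustration under assumed inputs (plain lists of offsets and a joint-to-part index table with hypothetical values), not the network's actual output format.

```python
def two_hop_decode(center, center2part, part2joint, part_of_joint):
    """Decode joints as center + (center -> part) + (part -> joint).

    center        : (x, y) of the body center
    center2part   : list of (dx, dy), one per adaptive part point
    part2joint    : list of (dx, dy), one per joint, relative to its part point
    part_of_joint : list mapping each joint index to a part index
    """
    # First hop: center -> adaptive part points.
    parts = [(center[0] + dx, center[1] + dy) for dx, dy in center2part]
    # Second hop: each part point -> the joints assigned to it.
    joints = []
    for (dx, dy), part_index in zip(part2joint, part_of_joint):
        px, py = parts[part_index]
        joints.append((px + dx, py + dy))
    return parts, joints


# Hypothetical toy instance with 2 parts and 3 joints.
center = (50.0, 60.0)
center2part = [(0.0, -40.0), (-10.0, 30.0)]
part2joint = [(0.0, -5.0), (-2.0, 20.0), (-3.0, 45.0)]
part_of_joint = [0, 1, 1]
parts, joints = two_hop_decode(center, center2part, part2joint, part_of_joint)
```

Every joint is reached in exactly two hops from the center, so no offset depends on another joint's prediction, unlike accumulation along a kinematic chain.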

Loss_total (Eq. 6) = Loss_ct (Eq. 3) + Loss_kp (Eq. 5) + Loss_hm (Eq. 3)
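
A hedged sketch of how the three terms could be combined is given below. The concrete loss forms are assumptions: a CenterNet-style penalty-reduced focal loss for the center and keypoint heatmaps, an L1 loss for the regressed displacements, and unit weights because the formula in these notes shows none. The paper's exact definitions and weights may differ.

```python
import math


def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over flat lists of heatmap values (assumed form)."""
    pos_loss, neg_loss, num_pos = 0.0, 0.0, 0
    for p, y in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if y == 1.0:
            pos_loss += ((1.0 - p) ** alpha) * math.log(p)
            num_pos += 1
        else:
            neg_loss += ((1.0 - y) ** beta) * (p ** alpha) * math.log(1.0 - p)
    return -(pos_loss + neg_loss) / max(num_pos, 1)


def l1_loss(pred, gt):
    """Mean absolute error over flat lists of offset values (assumed form)."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / max(len(pred), 1)


def total_loss(ct_pred, ct_gt, kp_pred, kp_gt, hm_pred, hm_gt):
    """Loss_total = Loss_ct + Loss_kp + Loss_hm, with unit weights (assumption)."""
    loss_ct = focal_loss(ct_pred, ct_gt)  # center heatmap
    loss_kp = l1_loss(kp_pred, kp_gt)     # two-hop displacements
    loss_hm = focal_loss(hm_pred, hm_gt)  # auxiliary keypoint heatmap
    return loss_ct + loss_kp + loss_hm
```

Since the auxiliary branch in (4) is described as training-only, the Loss_hm term affects training but adds nothing to the inference path.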

Experiments

Parameter settings:

Dataset: COCO 2017

Metric: average precision and average recall based on OKS (Object Keypoint Similarity)

Augmentation: random flip, random rotation, random scaling and color jitter

Preprocessing: each input is cropped to a fixed size: 512 or 640 for DLA-34, 800 for HRNet-W48

Adam optimizer with a mini-batch size of 64 (8 per GPU)
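
For readers unfamiliar with the OKS metric listed in the settings above, here is a minimal sketch of the similarity between one predicted and one ground-truth pose. It follows the form used by the COCO keypoint evaluation, where the per-keypoint constant is k_i = 2 * sigma_i and the squared scale is the object area. The function name and the sigma values passed in are illustrative, and the zero-visible-keypoint case is simplified to return 0.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """Object Keypoint Similarity between one predicted and one ground-truth pose.

    pred, gt : lists of (x, y)
    visible  : visibility flags of the ground truth (> 0 means labeled)
    area     : ground-truth object area, used as the squared scale s^2
    sigmas   : per-keypoint constants (COCO evaluation uses k_i = 2 * sigma_i)
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, sigma in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        k2 = (2.0 * sigma) ** 2
        total += math.exp(-d2 / (2.0 * area * k2))
        count += 1
    return total / count if count else 0.0


# Hypothetical 3-keypoint example; the third keypoint is unlabeled.
gt_pose = [(10.0, 10.0), (20.0, 10.0), (15.0, 30.0)]
flags = [2, 1, 0]
toy_sigmas = [0.05, 0.05, 0.05]
perfect = oks(gt_pose, gt_pose, flags, area=400.0, sigmas=toy_sigmas)
shifted = oks([(12.0, 10.0), (20.0, 13.0), (0.0, 0.0)], gt_pose, flags,
              area=400.0, sigmas=toy_sigmas)
```

AP and AR are then obtained by thresholding OKS in the same way that box IoU is thresholded in object detection.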

SOTA comparison

Ablation Experiment Analysis

Part Perception Module (locates the 7 part points): experiments comparing shared adaptive points with unshared adaptive points

Enhanced Center-aware Branch (locates the center): controlled experiments exploring the effect of the receptive field adaptation process

Two-hop Regression Branch (offsets):

controlled experiments on the branch's ability to factorize long-range center-to-joint offsets and avoid accumulated errors

Auxiliary loss (helps training) [Experiment 4/5]

The keypoint heatmap retains more structural geometric information, which improves regression performance.

Heatmap refinement of our regression results.

Snap each regressed keypoint to the closest confidence peak on the keypoint heatmap to refine the regressed predictions.

Conclusion: the gain from heatmap refinement is negligible for our two-hop regression method (the results are shown in the figure below).
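
The snapping step described above can be sketched in plain Python. The peak definition (a cell that is at least as large as its 8 neighbours), the threshold, and the maximum snapping distance are all assumptions chosen for illustration; the paper's refinement may use different rules.

```python
def find_peaks(heatmap, threshold=0.1):
    """Return (x, y, score) for cells above threshold that are >= all 8 neighbours."""
    h, w = len(heatmap), len(heatmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = heatmap[y][x]
            if v < threshold:
                continue
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            ]
            if all(v >= n for n in neighbours):
                peaks.append((x, y, v))
    return peaks


def snap_to_peak(point, peaks, max_dist=2.0):
    """Move a regressed (x, y) to the nearest peak if one lies within max_dist."""
    best, best_d2 = point, max_dist ** 2
    for px, py, _ in peaks:
        d2 = (px - point[0]) ** 2 + (py - point[1]) ** 2
        if d2 <= best_d2:
            best, best_d2 = (float(px), float(py)), d2
    return best


# Hypothetical 5x5 heatmap for one keypoint type with a single peak at (x=3, y=1).
heatmap = [[0.0] * 5 for _ in range(5)]
heatmap[1][3] = 0.9
heatmap[1][2] = 0.4
peaks = find_peaks(heatmap)
refined = snap_to_peak((2.4, 1.3), peaks)    # close to the peak: snapped
unchanged = snap_to_peak((0.0, 4.0), peaks)  # too far from any peak: kept
```

A regressed point with no peak nearby keeps its regressed location, so the refinement can only change predictions that already land close to a heatmap response.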

scraps

Abstract (machine translated)

Multi-person pose estimation methods usually follow top-down and bottom-up paradigms, both of which can be considered two-stage approaches and therefore lead to high computational cost and low efficiency. Towards a compact and efficient pipeline for multi-person pose estimation, this paper proposes to represent human parts as points and presents a novel body representation that uses an adaptive point set, consisting of the human body center and seven human-part related points, to represent a human instance in a more fine-grained way. The new representation is better at capturing various pose deformations and adaptively factorizes the long-range center-to-joint displacement, which yields a single-stage differentiable network, called AdaptivePose, that regresses multi-person poses more precisely. For inference, the proposed network eliminates grouping and refinement and needs only a single-step disentangling process to form multi-person poses. Without any bells and whistles, it achieves 67.4% AP / 29.4 fps with DLA-34 and 71.3% AP / 9.1 fps with HRNet-W48.

AdaptivePose: end-to-end differentiable network with two advantages (fine-grained point representation; long-range displacement decomposed into short displacements)

① Compared with the center representation, this fine-grained point set representation is better able to capture different degrees of deformation of the human body.

② It adaptively decomposes long-distance displacements into shorter displacements, and at the same time automatically learns adaptive human body part related points through the neural network, avoiding cumulative errors propagated along the skeleton.

Conclusion (machine translated)

In this paper, we propose to represent human parts as points and introduce an adaptive body representation that represents the human body in a fine-grained manner. On this basis, we construct a single-stage network with three effective components: the Part Perception Module, the Enhanced Center-aware Branch, and the Two-hop Regression Branch. During inference, we eliminate grouping and refinement and require only a single-step process to form human poses. We experimentally demonstrate that AdaptivePose achieves the best speed-accuracy trade-off and outperforms previous state-of-the-art bottom-up and single-stage methods.

Knowledge points

1. Warp operation

2. AE (Associative Embedding)

3. centernet: paper


4. Fine-grained


Origin blog.csdn.net/sinat_40759442/article/details/128254899