ACM MM 2023 | Human pose estimation method based on decentralized representation

01. Preface

Beijing University of Posts and Telecommunications and the EVOL innovation team have jointly proposed DecenterNet, a human pose estimation method designed to improve accuracy in crowded scenes. It introduces a decentralized pose representation that lets the network express human poses more robustly in entangled or crowded regions, together with a decoupled pose evaluation mechanism that adaptively selects the best pose among multiple pose representations. The paper, DecenterNet: Bottom-Up Human Pose Estimation Via Decentralized Pose Representation, has been accepted by ACM MM 2023.

02. Background and motivation

Multi-person pose estimation in crowded scenes remains an extremely challenging task. We find that most failures of current human pose estimation methods in crowded scenes come from the inability to locate or group visible keypoints, rather than from reasoning about invisible keypoints, as shown in Table 1.

Therefore, this paper divides crowded scenes into two cases, entanglement and occlusion, and observes that entanglement is a major source of errors in crowded scenes. Based on this observation, we propose DecenterNet, an end-to-end human pose estimation method that enables robust and efficient pose estimation in crowded scenes.

In DecenterNet, we introduce a decentralized pose representation that uses all visible keypoints as representation points, so that the network expresses human poses more robustly in entangled or crowded regions. Because using this many representation points introduces many false positives, we also propose a decoupled pose evaluation mechanism that introduces a location map to adaptively select the best pose among multiple pose representations. In addition, we constructed a new dataset called SkatingPose, which contains more figure skating scenes with entanglements.

03. Method

3.1 Decentralized Pose Representation

Previous work uses the center point of the pose, the pelvis point, or the center points of human body parts to represent the pose, aggregates the outputs of these representation points, and then applies the NMS algorithm to obtain the final human poses. However, when human poses are entangled in a crowded scene, their representation points may occlude each other, so the pose predicted from such a point can be wrong. We therefore propose Decentralized Pose Representation to alleviate the entanglement problem in crowded scenes. Specifically, this representation uses all visible keypoints of the pose as representation points and narrows the range of each representation point to reduce the chance of mutual occlusion. On the one hand, the visible keypoints of a pose are difficult to occlude completely and are more discriminative than the center point. On the other hand, fusing predictions from representation points at many different locations yields more comprehensive and robust predictions.
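The idea of letting every visible keypoint propose a full pose can be illustrated with a small sketch. This is not the DecenterNet implementation: the function names, the toy coordinates, and the coordinate-wise median used for fusion are all assumptions made for illustration, since the article does not specify the fusion rule.

```python
from statistics import median

def decode_candidates(rep_points, offsets):
    """Each representation point proposes a full pose: its own position
    plus one predicted (dx, dy) offset per keypoint type."""
    candidates = []
    for (px, py), offs in zip(rep_points, offsets):
        candidates.append([(px + dx, py + dy) for dx, dy in offs])
    return candidates

def fuse_candidates(candidates):
    """Fuse the proposals keypoint by keypoint.
    ASSUMPTION: a coordinate-wise median stands in for the real fusion rule."""
    num_kpts = len(candidates[0])
    fused = []
    for k in range(num_kpts):
        xs = [pose[k][0] for pose in candidates]
        ys = [pose[k][1] for pose in candidates]
        fused.append((median(xs), median(ys)))
    return fused

# Toy pose with three visible keypoints, each acting as a representation point.
rep_points = [(10, 10), (20, 10), (15, 30)]
offsets = [
    [(0, 0), (10, 0), (5, 20)],      # proposal from keypoint 0
    [(-10, 0), (0, 0), (-5, 20)],    # proposal from keypoint 1
    [(-5, -20), (5, -20), (40, 0)],  # keypoint 2 is disturbed by a neighbor
]

candidates = decode_candidates(rep_points, offsets)
fused = fuse_candidates(candidates)
print(fused)
```

In this toy case the third representation point places its own keypoint far off target, but the two other proposals outvote it. A single center-based representation point has no such redundancy.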

3.2 Decoupled Pose Evaluation

As expected, using so many representation points causes the above pose representation to introduce a large number of false positives. We therefore propose a decoupled pose evaluation mechanism. A traditional heatmap serves two functions, selecting representation points and scoring poses, and we decouple these two functions into a heatmap and a location map, as shown in the figure below.

In this pose evaluation mechanism, the location map plays a particularly critical role. On the one hand, it is used to select representation points from the offset map; on the other hand, it further strengthens the scoring function of the heatmap. Specifically, the location map is supervised by a 4x4 square region of all ones and is multiplied with the loss of the offset map, so that it dynamically reflects the confidence of the pose on the offset map. The peak of a traditional representation-point heatmap does not necessarily correspond to the best pose quality at that representation point, whereas the location map can adaptively select pose representation points with high confidence to obtain a better result.
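A minimal sketch of the two roles described above, under stated assumptions: `location_map_target` builds the 4x4 all-ones supervision region around each visible keypoint (how the block is positioned relative to the keypoint is an assumption), and `select_best_pose` picks the candidate pose by the product of heatmap score and location-map score (the product is an assumed combination rule, not one stated in the article). All names and numbers are hypothetical.

```python
def location_map_target(height, width, keypoints, size=4):
    """Binary target map with a size x size block of ones per visible keypoint.
    ASSUMPTION: the block starts size // 2 pixels up and left of the keypoint."""
    target = [[0.0] * width for _ in range(height)]
    for x, y in keypoints:
        x0, y0 = x - size // 2, y - size // 2
        # Clip the block to the map borders.
        for yy in range(max(y0, 0), min(y0 + size, height)):
            for xx in range(max(x0, 0), min(x0 + size, width)):
                target[yy][xx] = 1.0
    return target

def select_best_pose(heat_scores, loc_scores):
    """Return the index of the candidate with the highest combined score.
    ASSUMPTION: the combined score is heatmap score times location score."""
    return max(range(len(heat_scores)),
               key=lambda i: heat_scores[i] * loc_scores[i])

# One visible keypoint at (x=4, y=4) on an 8x8 map.
target = location_map_target(8, 8, [(4, 4)])
num_ones = sum(sum(row) for row in target)

# Three candidate poses: the heatmap peak (index 0) has low location confidence.
heat_scores = [0.9, 0.7, 0.8]
loc_scores = [0.3, 0.9, 0.5]
best = select_best_pose(heat_scores, loc_scores)
heat_only = max(range(len(heat_scores)), key=lambda i: heat_scores[i])
print(num_ones, best, heat_only)
```

Here the candidate at the heatmap peak is not the one selected once the location score is taken into account, which mirrors the point that the heatmap maximum alone does not indicate the best pose quality.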

04. Experimental results

We conducted experiments on three datasets: COCO, CrowdPose, and SkatingPose. Compared with other bottom-up human pose estimation methods, DecenterNet achieves SOTA results with fewer parameters and less computation. Because the CrowdPose dataset does not distinguish between visible and invisible keypoints, we use the instance segmentation method Mask2Former to tell them apart.

05. Summary

DecenterNet is an end-to-end method for human pose estimation in crowded scenes. It adopts a decentralized pose representation that uses all visible keypoints as representation points to characterize human poses, thereby obtaining better results in entangled regions. In addition, DecenterNet adopts a decoupled pose evaluation mechanism that adaptively selects the optimal pose through a location map. We also built a new dataset called SkatingPose, which contains more figure skating scenes with entanglements.

EVOL Innovation Team Members Introduction
EVOL Joint Innovation Team Leader:
Zhao Jian (Academy of Military Sciences), Ph.D., director of the Beijing Image and Graphics Society, selected for the "Young Talent Promotion Project" of the Beijing Association for Science and Technology and the China Association for Science and Technology, and awarded the first prize of the Wu Wenjun Natural Science Award. Research direction: unconstrained visual perception and understanding.
Personal homepage:
https://zhaoj9014.github.io/
Jin Lei (Beijing University of Posts and Telecommunications), Ph.D., Distinguished Associate Researcher at Beijing University of Posts and Telecommunications. Research interests include human pose estimation, human body parsing, and human action recognition.

