ACM MM 2023 | Single-stage multi-person human body parsing method based on point sets and offsets

This article introduces our work "Single-Stage Multi-Human Parsing via Point Sets and Center-Based Offsets", which was recently accepted by ACM MM 2023.

Paper link: https://arxiv.org/abs/2304.11356

01. Preface

The EVOL innovation team and Beijing University of Posts and Telecommunications jointly propose SMP, a multi-person human parsing method that represents human body parts with point sets and center-based offset vectors. SMP introduces a single-stage representation of human part instances, which lets the network parse the human body more simply and intuitively. It also introduces two pluggable modules, RFRM and MIRM, which strengthen the network's feature extraction from the instance side and the semantic side, respectively, to alleviate the irregular target shapes and the long-tail distribution found in human parsing tasks. The paper has been accepted by ACM MM 2023.

02. Background and motivation

Instance-aware multi-human parsing (IAMHP) aims to segment human body parts by semantic category and to group them by human instance. It is more challenging than semantic segmentation or instance segmentation, because every pixel in the image must be assigned both a part-level semantic label and a human-level instance label.

Existing multi-person parsing methods fall roughly into two categories: top-down and bottom-up. Top-down methods usually detect human instances first and then run single-human parsing on each detected instance in turn. Bottom-up methods, by contrast, first parse all the parts in the image and then use human instance segmentation results or boundary predictions to group the parts. Although both achieve good results, their two-stage design brings complex post-processing and redundant computation.

To address these problems, we aim to represent the relationship between human instances and part instances more concisely.

In this paper, we explore using point sets and center-based offsets to understand the human body. Specifically, the point set consists of the center of gravity of the human body and the centers of gravity of its parts, and a center-based offset is the vector from the human center of gravity to a part's center of gravity. With this representation, we build a single-stage multi-person parsing (SMP) framework that omits the time-consuming ROI and grouping steps.
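The representation described above can be illustrated with a minimal sketch. It uses plain Python lists and binary toy masks to stay dependency-free; the function names `barycenter` and `center_based_offsets` and the 4x4 example are assumptions made for illustration and are not taken from the paper's code.

```python
def barycenter(mask):
    """Return the (row, col) center of gravity of the nonzero pixels of a binary mask."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask is empty")
    n = len(points)
    return (sum(r for r, _ in points) / n, sum(c for _, c in points) / n)


def center_based_offsets(human_mask, part_masks):
    """Map each part name to the offset vector from the human center to the part center."""
    hr, hc = barycenter(human_mask)
    offsets = {}
    for name, mask in part_masks.items():
        pr, pc = barycenter(mask)
        offsets[name] = (pr - hr, pc - hc)
    return (hr, hc), offsets


# Toy 4x4 person: a "head" in the top row and a "torso" in the three rows below.
head = [[0, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
torso = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]]
human = [[h | t for h, t in zip(head_row, torso_row)]
         for head_row, torso_row in zip(head, torso)]

center, offsets = center_based_offsets(human, {"head": head, "torso": torso})
print(center)   # (1.5, 1.5)
print(offsets)  # {'head': (-1.5, 0.0), 'torso': (0.5, 0.0)}
```

In this toy case the point set is the human center plus the two part centers, and each offset is enough to get from the human center to the part that belongs to it.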

In addition, we decouple the multi-human parsing (MHP) task into four subtasks: human instance localization, part instance localization, part instance segmentation, and prediction of the affiliation mapping between the two centers of gravity.

Because MHP datasets have a long-tail distribution and large differences in instance scale, we also propose a Refined Feature Retention Module (RFRM) and a Mask of Interest Reclassification Module (MIRM). The former exploits the correlation among an instance's features in the mask feature space as a form of attention, which improves extraction of the instance's overall features. The latter borrows the idea of ROI Align and aligns features using the predicted masks, removing the influence of instance scale on semantic feature extraction. With these designs, SMP achieves the best performance on both the MHPv2 and DensePose-COCO datasets, and it also has the fastest inference speed among the compared methods.

03. Method and implementation

3.1 Overview

An overview of our single-stage multi-person parsing (SMP) framework is shown in Figure 2. First, the image is fed to a Feature Pyramid Network (FPN) to generate feature maps at different scales. The center head, offset head, and part head then process these feature maps to predict human positions and mask information. Finally, the multi-person parsing result is obtained from the outputs of the three heads.

  • The center head predicts the position of each individual human instance, completing the human instance localization subtask. To avoid overlapping centers, each instance is represented by the center of gravity of its visible mask.
  • The offset head predicts the offset from each human center of gravity to the center of gravity of each of its part instances, completing the subtask of predicting the affiliation mapping between the two centers of gravity.
  • The part head predicts the center of gravity of each individual part instance in the image together with its fine mask. It consists of three sub-heads: the category localization sub-head, the part kernel sub-head, and the mask feature sub-head. Following the idea of conditional convolution, we generate a convolution kernel for each part instance in the image and apply it to the mask features to compute that part's fine mask. The category localization sub-head completes the part instance localization subtask, while the part kernel sub-head and the mask feature sub-head jointly complete the part instance segmentation subtask.

In this way, the model yields the three elements of multi-person parsing (human instances, part instances, and their affiliations), and the four subtasks are completed at the same time. At inference, it is enough to index the part convolution kernels that belong to each human instance and convolve them with the feature map to obtain the parsing result for each person.
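The inference step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the kernel map is modeled as a hypothetical dictionary keyed by pixel location, the conditional convolution is reduced to a 1x1 kernel (a per-pixel inner product), and a logit above 0 is treated as foreground. The names `dynamic_mask` and `parse_person` are invented for this example.

```python
def dynamic_mask(kernel, mask_features):
    """Apply a 1x1 conditional convolution: the per-pixel inner product of the
    kernel with the feature vector, thresholded at 0 (i.e. sigmoid > 0.5)."""
    channels = len(mask_features)
    height, width = len(mask_features[0]), len(mask_features[0][0])
    return [
        [
            int(sum(kernel[c] * mask_features[c][y][x] for c in range(channels)) > 0)
            for x in range(width)
        ]
        for y in range(height)
    ]


def parse_person(center, part_offsets, kernel_map, mask_features):
    """Follow each offset from a human center to a part center, index the
    kernel stored there, and decode the mask of that part."""
    result = {}
    for part, (dy, dx) in part_offsets.items():
        y, x = round(center[0] + dy), round(center[1] + dx)
        kernel = kernel_map.get((y, x))
        if kernel is None:  # no part instance was predicted at that location
            continue
        result[part] = dynamic_mask(kernel, mask_features)
    return result


# Toy mask features with 2 channels on a 3x3 grid (channels x height x width).
mask_features = [
    [[1, 1, 1], [-1, -1, -1], [-1, -1, -1]],  # channel 0 fires on the top row
    [[-1, -1, -1], [1, 1, 1], [1, 1, 1]],     # channel 1 fires on the lower rows
]
# Hypothetical part kernels stored at the predicted part centers.
kernel_map = {(0, 1): [1.0, 0.0], (2, 1): [0.0, 1.0]}
person = parse_person(
    center=(1, 1),
    part_offsets={"head": (-1, 0), "torso": (1, 0)},
    kernel_map=kernel_map,
    mask_features=mask_features,
)
print(person["head"])   # [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
print(person["torso"])  # [[0, 0, 0], [1, 1, 1], [1, 1, 1]]
```

Because every part is reached through an index lookup from its human center, no ROI cropping or separate grouping pass appears in this sketch.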

3.2 Feature enhancement module

Even so, SMP still faces two unresolved problems: the long-tail distribution and the classification of small targets.

To address these problems from the instance perspective, we propose the Refined Feature Retention (RFR) module.

The main idea of the RFR module is to use mask features as attention to guide the learning of the category branch. The part head performs instance segmentation through conditional convolution, so the output value at each pixel of the segmentation map is in fact the inner-product similarity between the convolution kernel and the feature at that position of the feature map. By computing the autocorrelation of the convolutional feature map, we obtain a similarity self-attention map of the instance at each position. This self-attention map, which we call mask attention, provides strong instance guidance; weighting the category features with the mask attention at each position yields a new, refined feature map. The refined features are then used as the offset input of a warp operation, guiding the model to adaptively gather more instance information.
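The mask-attention idea can be illustrated with a small sketch. It is only one plausible reading of the description: the softmax normalization, the flattened list-of-positions layout, and the names `mask_attention` and `refine` are assumptions for this example, and the subsequent warp step is not modeled.

```python
import math


def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mask_attention(mask_feats):
    """Autocorrelation of the mask features: row p holds the normalized
    inner-product similarity between position p and every position q."""
    return [
        softmax([sum(a * b for a, b in zip(fp, fq)) for fq in mask_feats])
        for fp in mask_feats
    ]


def refine(category_feats, mask_feats):
    """Weight the category features with the mask attention, so each position
    aggregates category features from positions with similar mask features."""
    attention = mask_attention(mask_feats)
    dims = len(category_feats[0])
    return [
        [sum(w * g[d] for w, g in zip(row, category_feats)) for d in range(dims)]
        for row in attention
    ]


# Three flattened positions: the first two share an instance, the third does not.
mask_feats = [[10.0, 0.0], [10.0, 0.0], [0.0, 10.0]]
category_feats = [[1.0], [3.0], [10.0]]
refined = refine(category_feats, mask_feats)
print([round(row[0], 3) for row in refined])  # [2.0, 2.0, 10.0]
```

In the toy data, the two positions with matching mask features end up sharing an averaged category feature, while the unrelated position keeps its own.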

In addition, the model can use the Mask of Interest Reclassification Module (MIRM), which treats the segmentation output as a region of interest (ROI) to perform a second classification. The MIR module is independent and can consume the outputs of the other branches. We take the fused feature-pyramid features as input, transform the feature space with successive convolutional layers, and supervise them with semantic segmentation labels to learn latent semantic features. The masks generated by the part head serve as ROIs for extracting local features. These features are interpolated to a fixed size by ROI Align with an output size of 14 and then transformed again by a convolutional layer with a kernel size of 14. Finally, two consecutive fully connected layers output the classification result.
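The scale-normalizing step of this module can be sketched as below. This is a simplified stand-in for ROI Align, written under assumptions: the ROI is taken as the tight bounding box of the predicted mask, each output bin takes a single bilinear sample at its center, and the feature map has one channel. The helper names are hypothetical, and the 14x14 convolution and the two fully connected layers that follow are not modeled.

```python
def mask_to_box(mask):
    """Tight bounding box (y0, x0, y1, x1) of a binary mask; y1 and x1 are exclusive."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("mask is empty")
    return min(ys), min(xs), max(ys) + 1, max(xs) + 1


def bilinear(feature, y, x):
    """Bilinearly interpolate a 2-D feature map at continuous pixel-index coordinates."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = feature[y0][x0] * (1 - wx) + feature[y0][x1] * wx
    bottom = feature[y1][x0] * (1 - wx) + feature[y1][x1] * wx
    return top * (1 - wy) + bottom * wy


def mask_roi_align(feature, mask, size=14):
    """Resample the features under the mask's bounding box to a size x size grid."""
    y0, x0, y1, x1 = mask_to_box(mask)
    bin_h, bin_w = (y1 - y0) / size, (x1 - x0) / size
    return [
        [
            # Bin center in pixel-edge coordinates, shifted by 0.5 to pixel indices.
            bilinear(feature, y0 + (i + 0.5) * bin_h - 0.5, x0 + (j + 0.5) * bin_w - 0.5)
            for j in range(size)
        ]
        for i in range(size)
    ]


# Toy check: a small 2x3 part mask is resampled to the same 14 x 14 grid
# that a large mask would produce.
feature = [[float(x) for x in range(8)] for _ in range(6)]
mask = [[1 if 1 <= y <= 2 and 2 <= x <= 4 else 0 for x in range(8)] for y in range(6)]
pooled = mask_roi_align(feature, mask)
print(len(pooled), len(pooled[0]))  # 14 14
```

Since every instance is resampled to the same 14x14 grid, a convolution with a 14x14 kernel reduces it to a single feature vector regardless of the instance's original scale.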

04. Experimental results

We conducted experiments on two datasets, MHPv2 and DensePose-COCO. Compared with other multi-person parsing methods, SMP achieves state-of-the-art (SOTA) results with the fastest inference speed.

05. Summary

This paper proposes using point sets and center-based offsets to understand humans, leading to SMP, a new framework that solves the instance-aware multi-person parsing task in a single stage. Specifically, point features at the centers of gravity of human parts are used to generate the masks of part instances, and the offsets from the body center to the part centers of gravity are used to assign the parts to human instances. To enhance the instance feature representation used for classification, we propose the Refined Feature Retention (RFR) module, which uses mask features to generate mask attention that guides feature extraction. To reduce classification errors caused by high inter-class similarity and the long-tail distribution, we propose the Mask of Interest Reclassification (MIR) module, which uses the generated masks as regions of interest to refine the classification results. SMP is fast at inference, accurate, and simple, and can support further human-centric research.

EVOL Innovation Team Members Introduction
EVOL Joint Innovation Team Leader:
Zhao Jian (Academy of Military Sciences), Ph.D., is a director of the Beijing Image and Graphics Society, was selected for the "Young Talent Promotion Project" of the Beijing Association for Science and Technology/China Association for Science and Technology, and was awarded the first prize of the Wu Wenjun Natural Science Award. Research direction: unconstrained visual perception and understanding.
Personal homepage: https://zhaoj9014.github.io/
Jin Lei (Beijing University of Posts and Telecommunications), Ph.D., is a Distinguished Associate Researcher at Beijing University of Posts and Telecommunications. Research interests include human pose estimation, human parsing, and human action recognition.
Personal homepage: https://teacher.bupt.edu.cn/jin
