Surpassing RTMPose and topping the COCO-WholeBody leaderboard: Tsinghua and IDEA propose DWPose, a SOTA whole-body keypoint detection model

guide

TL;DR: This paper introduces a new method for whole-body pose estimation and shows how its efficiency and accuracy can be improved with knowledge distillation techniques. The proposed method currently sits at the top of the Papers with Code leaderboard for 2D Human Pose Estimation on COCO-WholeBody, surpassing RTMPose, the SOTA model previously released by the OpenMMLab community.

Whole-body pose estimation refers to the task of locating keypoints of the human body, hands, face, and feet in an image. Because it involves body parts at multiple scales, fine-grained localization in low-resolution regions, and scarce data, it is a very challenging task, which is reflected in two points:

  1. The hierarchical structure of the human body, the small resolution of hands and faces, and the complex matching of body parts in multi-person images, especially under occlusion and with complex hand poses.
  2. In addition, to deploy the model in practice it must be compressed into a lightweight network in order to meet real-time requirements.

To this end, the authors propose a knowledge distillation method for a two-stage whole-body pose estimator, named DWPose, to improve its effectiveness and efficiency:

  • The first-stage distillation devises a weight-decay strategy to supervise a student model trained from scratch, using both the intermediate features of the teacher model and its final logit information, covering both visible and invisible keypoints.
  • The second-stage distillation further improves the performance of the student model. Unlike previous self-knowledge distillation, this stage fine-tunes only the head of the student model for 20% of the training time, adopting a plug-and-play training strategy.

In addition, to cope with data limitations, the paper explores a dataset named UBody, which contains a variety of facial expressions and hand gestures drawn from real application scenarios.

Finally, comprehensive experiments demonstrate the simplicity and effectiveness of the proposed method. As mentioned above, a new state of the art is achieved on COCO-WholeBody: DWPose improves whole-body average precision (AP) from the 64.8% of RTMPose-l to 66.5%, even surpassing the 65.3% AP of the teacher model RTMPose-x. At the same time, to serve a variety of downstream tasks, the authors also release a series of models of different sizes that can be used on demand, and the code has been open sourced.

background introduction

2D Whole-body Pose Estimation

List several existing SOTA models, including:

  • OpenPose: combines different datasets to train separate keypoint detectors for different body parts.
  • MediaPipe: builds a perception pipeline, particularly for holistic human keypoint detection.
  • ZoomNet: For the first time, a top-down approach is proposed to address scale variation across different body parts using a hierarchically structured single network.
  • ZoomNAS: A neural architecture search framework is further explored to simultaneously search model structures and connections between different submodules for improved accuracy and efficiency.
  • TCFormer: introduces stepwise clustering and merging of visual features to capture keypoint information of different locations, sizes and shapes in multiple stages.
  • RTMPose: discusses the key factors in pose estimation, builds a real-time model, and achieves state-of-the-art results on the COCO-WholeBody dataset. However, redundancies in the model design and data limitations remain, especially for diverse hand and face poses.

If you are not yet familiar with MediaPipe, it is worth looking into first. It is a large open-source repository from Google, and we will cover it in a separate post later.

Knowledge Distillation

Knowledge distillation is a common method for compressing models. Originally proposed by Hinton et al., it guides the student model with the soft labels output by the teacher model. The method was initially designed for classification tasks and is also known as logit-based distillation. Subsequent works have exploited the teacher model's logit information in different ways to transfer more knowledge, including soft labels and target versus non-target logits. Beyond logit-based distillation, feature-based distillation transfers knowledge from intermediate layers and has been extended to various tasks including detection, segmentation, and generation. DWPose, presented in this paper, is the first work to explore effective knowledge distillation strategies for whole-body pose estimation.
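To make the background concrete, below is a minimal PyTorch sketch of classic logit-based (soft-label) distillation in the spirit of Hinton et al.; the temperature value and the loss weighting in the usage comment are illustrative assumptions, not values from this paper.

```python
import torch.nn.functional as F

def soft_label_kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Classic logit-based distillation: KL divergence between the
    temperature-softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

# Usage sketch: combine with the ordinary hard-label loss, e.g.
# loss = ce_loss(student_logits, labels) + 0.5 * soft_label_kd_loss(student_logits, teacher_logits)
```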

method

![](https://files.mdnice.com/user/13977/4d19893b-e897-43ec-bc29-924e4b69bf48.png)

As shown in the figure above, DWPose is built on a Two-stage Pose Distillation (TPD) method consisting of two separate stages. Each stage is explained in detail below.

first-stage

The goal of this stage is to use the pre-trained teacher model to guide a student model trained from scratch, distilling at both the feature level and the logit level.

feature-based distillation

In this distillation method, the authors have the student imitate features from a layer of the teacher model's backbone. To achieve this, a mean squared error (MSE) loss measures the distance between the student and teacher features:

$$L_{\text{fea}} = \frac{1}{CHW}\sum_{c=1}^{C}\sum_{h=1}^{H}\sum_{w=1}^{W}\left(F_t^{c,h,w} - f\!\left(F_s\right)^{c,h,w}\right)^{2}$$

where $F_t$ is the teacher feature, $F_s$ is the student feature, and $f$ is a simple $1 \times 1$ convolutional layer used only to align dimensions, reshaping the student features to the same size as the teacher features.
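As a rough illustration, here is a minimal PyTorch sketch of this feature-level distillation on NCHW feature maps; the channel counts are placeholders, and the 1×1 projection plays the role of $f$ above.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistillLoss(nn.Module):
    """MSE between teacher backbone features and projected student features."""

    def __init__(self, student_channels=768, teacher_channels=1024):
        super().__init__()
        # 1x1 convolution playing the role of f: aligns the student feature
        # map with the teacher's channel dimension.
        self.align = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, feat_student, feat_teacher):
        aligned = self.align(feat_student)
        if aligned.shape[-2:] != feat_teacher.shape[-2:]:
            # Resize spatially if the two backbones have different strides.
            aligned = F.interpolate(aligned, size=feat_teacher.shape[-2:],
                                    mode="bilinear", align_corners=False)
        # The teacher is frozen, so its features carry no gradient.
        return F.mse_loss(aligned, feat_teacher.detach())
```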

logit-based distillation

RTMPose uses the SimCC algorithm to predict keypoints, treating keypoint localization as a classification task over the horizontal and vertical coordinates.

The paper points out that logit-based knowledge distillation is also applicable in this setting. Here the authors simplify RTMPose's original classification loss and remove the target weight mask, yielding a loss in which $N$ is the number of people in the batch, $K$ is the number of keypoints, $L$ is the length of the localization bins for the $x$ and $y$ coordinates, $T_i$ is the target value, and $S_i$ is the predicted value.
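For intuition, here is a minimal sketch of what such a logit-level loss could look like for SimCC-style outputs, where each keypoint is represented by a 1-D classification distribution over x bins and over y bins; the KL-divergence form and the tensor layout are assumptions made for illustration, not the paper's exact equation.

```python
import torch.nn.functional as F

def simcc_logit_distill_loss(student_x, student_y, teacher_x, teacher_y):
    """Distill SimCC coordinate classifiers from teacher to student.

    Each tensor has shape (N, K, L): N people in the batch, K keypoints,
    L bins along the x (or y) axis. The teacher distributions play the role
    of the targets T_i, the student distributions of the predictions S_i.
    """
    def kl(student_logits, teacher_logits):
        n, k, l = student_logits.shape
        target = F.softmax(teacher_logits, dim=-1).detach()
        log_pred = F.log_softmax(student_logits, dim=-1)
        # Sum the divergence over the L bins, average over all N*K distributions.
        return F.kl_div(log_pred.reshape(n * k, l), target.reshape(n * k, l),
                        reduction="batchmean")

    return kl(student_x, teacher_x) + kl(student_y, teacher_y)
```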

Weight Decay Strategy for Distillation

Based on the feature distillation loss and the logit distillation loss, the paper proposes a weight decay strategy to gradually reduce the distillation penalty. This helps the student model focus more on the labels and thus achieve better performance. Specifically, a time-dependent coefficient $r(t)$ implements this strategy, where $t$ is the current training epoch and $t_{\text{max}}$ is the total number of training epochs. The loss of the first-stage distillation can then be expressed as:

![](https://files.mdnice.com/user/13977/2c69be7b-9e1b-45dc-a4b4-66dcbdece6c4.png)
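As a rough sketch of how such a decay schedule could be wired up, the snippet below assumes a linear coefficient $r(t) = 1 - t/t_{\text{max}}$ applied to the two distillation terms on top of the original task loss, with the $\alpha$ and $\beta$ weights quoted later in the experiments section; this combination is inferred from the description above rather than copied verbatim from the paper's equation.

```python
def first_stage_loss(loss_ori, loss_fea, loss_logit, epoch, max_epoch,
                     alpha=0.00005, beta=0.1):
    """First-stage distillation objective with a decaying distillation weight.

    loss_ori:   ordinary pose loss against the ground-truth labels
    loss_fea:   feature-level distillation loss (teacher backbone features)
    loss_logit: logit-level distillation loss (teacher SimCC outputs)
    """
    # Linear decay: the distillation terms fade out as training progresses,
    # letting the student rely increasingly on the real labels.
    r = 1.0 - epoch / max_epoch
    return loss_ori + r * (alpha * loss_fea + beta * loss_logit)
```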

To briefly summarize, this section has described the feature and logit losses of the first-stage distillation, as well as the introduced weight decay strategy. Through these components, DWPose lets the teacher model guide the student model and achieves effective distillation.

second-stage

In the second stage of distillation, the authors use the trained student model to teach itself in order to achieve better performance. In this way, a performance gain can be obtained regardless of whether the student model was trained with distillation or from scratch.

In simple terms, we build a student model from the trained model, with a trained backbone and an untrained head; the teacher model has the same architecture but with both a trained backbone and a trained head. During training, the backbone of the student model is frozen and only the head is updated. Since teacher and student share the same architecture, features only need to be extracted from the backbone once. These features are then fed into the teacher's trained head and the student's untrained head respectively, producing the corresponding logit outputs $T_i$ and $S_i$.

Then, the logit distillation loss $L_{\text{logit}}$ is employed to train the student in the second-stage distillation. Notably, the original loss $L_{\text{ori}}$, which is computed against the label values, is discarded here. A hyperparameter $\gamma$ is introduced to scale the loss, so the final second-stage distillation loss is simply $\gamma \cdot L_{\text{logit}}$.

Unlike previous self-knowledge distillation methods, the head-aware distillation proposed in the paper can efficiently distill knowledge into the head using only 20% of the training time, further improving localization ability. Taken together, the second-stage distillation uses the trained student model for self-guidance and achieves a performance improvement by updating only the head. The approach is efficient and effective, and matters for getting better results from the model within a limited training budget.
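A minimal PyTorch-style sketch of this head-only self-distillation loop might look like the following; the module names (`backbone`, `head`), the `reset_parameters` helper, the optimizer choice, and the reuse of the `simcc_logit_distill_loss` sketch from above are all illustrative assumptions rather than the authors' actual training code.

```python
import copy
import torch

def second_stage_distill(model, train_loader, epochs, gamma=1.0, lr=1e-3):
    """Head-only self-distillation: the frozen trained head teaches a fresh head.

    `model` is the student produced by the first stage, assumed to expose a
    `backbone` and a `head` whose forward returns SimCC x/y logits.
    """
    teacher_head = copy.deepcopy(model.head).eval()   # trained head as teacher
    for p in teacher_head.parameters():
        p.requires_grad_(False)
    model.head.reset_parameters()                     # untrained head (assumed helper)
    for p in model.backbone.parameters():             # freeze the backbone
        p.requires_grad_(False)

    optim = torch.optim.AdamW(model.head.parameters(), lr=lr)
    for _ in range(epochs):                           # roughly 20% of the original schedule
        for images, _labels in train_loader:
            with torch.no_grad():
                feats = model.backbone(images)        # extract features only once
                t_x, t_y = teacher_head(feats)        # teacher logits T_i
            s_x, s_y = model.head(feats)              # student logits S_i
            loss = gamma * simcc_logit_distill_loss(s_x, s_y, t_x, t_y)
            optim.zero_grad()
            loss.backward()
            optim.step()
```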

experiment

Since details such as hyperparameters are critical for reproducing the method, we go through them in detail below.

data set

The paper conducts experiments on the COCO and UBody datasets.

  • For the COCO dataset, the standard train2017/val2017 split is used, with 118K training images and 5K validation images. On the COCO validation set, the general-purpose person detector provided by SimpleBaseline is used, which has an average precision (AP) of 56.4%.
  • The UBody dataset contains more than 1 million frames from 15 real-world scenarios, with corresponding 133 2D keypoints and SMPL-X parameters. Note that the original dataset focuses only on 3D whole-body pose estimation and does not verify the effectiveness of its 2D annotations. The paper samples one frame out of every 10 from the videos for training and testing.

implementation details

  • For the first-stage distillation, two hyperparameters α and β are used in Equation 6 to balance the losses; {α = 0.00005, β = 0.1} is used in all experiments, including those on the COCO and UBody datasets.
  • The second-stage distillation has a single hyperparameter γ to balance the loss in Equation 7, where γ = 1.

quantitative analysis

As can be seen from Table 1, the authors use the larger RTMPose-x and RTMPose-l as teacher models to guide DWPose-l and the other student models, respectively. With the proposed two-stage pose distillation (TPD) method, models of different sizes and input resolutions are significantly improved. In particular, DWPose-m achieves an overall average precision (AP) of 60.6 at 2.2 GFLOPs, 4.1% higher than the baseline, while the inference cost stays the same and the model remains easy to deploy.

Interestingly, DWPose-l achieves overall average precisions (AP) of 63.1 and 66.5 at two different input resolutions, respectively, surpassing the teacher model RTMPose-x with fewer parameters and less computation.

qualitative analysis

Figure 3 presents qualitative comparisons illustrating how distillation helps the student model perform better. Two-stage pose distillation (TPD) helps the model predict more accurately, reducing false pose detections and increasing true pose detections, with particularly clear gains in finger keypoint localization.

![](https://files.mdnice.com/user/13977/1149c0d8-d18f-47a3-ae5a-f50f8695e32e.png)

In addition, Figure 4 compares the proposed method with the commonly used OpenPose and MediaPipe baselines. DWPose significantly outperforms the other two methods, especially in robustness to truncation and occlusion and in fine-grained localization.

generalization

summary

This paper addresses the efficiency and effectiveness of whole-body human pose estimation. Applying distillation techniques to RTMPose, the authors propose a method called Two-stage Pose Distillation to enhance the performance of lightweight models. With this approach, they first let the teacher model guide the training of the student model at the feature and logit levels to obtain better performance. When a larger teacher model is not available, the second-stage distillation further improves performance by letting the student's head teach itself within a short amount of training time.

In addition, by bringing in the UBody dataset, performance is further improved, finally yielding the DWPose model. Experiments show that the method is simple yet very effective. The paper also explores the impact of better pose estimation models on controllable image generation. Taken together, the proposed distillation method provides new ideas and experimental results for improving the efficiency and accuracy of models in human pose estimation.

write at the end

Readers interested in fundamental vision tasks and applications such as pose estimation are welcome to scan the QR code at the bottom of the page, or search the WeChat ID cv_huber directly, to add the editor as a friend. Please note: school/company - research direction - nickname, and join more friends in studying and exchanging ideas!


Origin blog.csdn.net/CVHub/article/details/132521636