CVPR 2023: IDEA and Tsinghua propose the first one-stage 3D whole-body mesh recovery algorithm, with the code open-sourced!

GitHub - IDEA-Research/OSX: [CVPR 2023] Official implementation of the paper "One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer"

3D Whole-Body Mesh Recovery is a fundamental task in 3D human reconstruction and an important part of human behavior modeling. It aims to capture accurate body pose and shape from monocular images, and has a wide range of applications in downstream tasks such as human body reconstruction and human-computer interaction.

Researchers from the International Digital Economy Academy (IDEA) and Tsinghua Shenzhen International Graduate School propose OSX, the first one-stage algorithm for whole-body mesh recovery. With a component-aware Transformer network, OSX reconstructs the whole-body human mesh efficiently and accurately. They also introduce UBody, a large-scale upper-body human reconstruction dataset focused on real application scenarios.

Since its submission, the proposed algorithm has ranked first on the SMPL-X track of the AGORA leaderboard (2022.11–2023.04). The work has been accepted to CVPR 2023, a top computer-vision conference, and the code and pretrained models have all been open-sourced.

Paper: https://arxiv.org/abs/2303.16160

Code: https://github.com/IDEA-Research/OSX

Project home page: https://osx-ubody.github.io/

Affiliations: IDEA; Tsinghua Shenzhen International Graduate School

1. Background

3D Whole-Body Mesh Recovery is an important part of human behavior modeling. It estimates body pose, hand gesture, and facial expression simultaneously, and has wide applications in downstream real-world scenarios such as motion capture and human-computer interaction. Thanks to the development of parametric models such as SMPL-X, the accuracy of whole-body mesh recovery has steadily improved, and the task has been receiving more and more attention.

Compared with body-only mesh recovery, whole-body mesh recovery must additionally estimate hand and face parameters, yet hands and faces typically occupy only a small portion of the image, which makes it difficult for a single one-stage network to estimate all whole-body parameters. Most previous methods therefore adopt a multi-stage copy-paste framework: they first detect the bounding boxes of the hands and face, crop and enlarge those regions, feed them into three independent networks to estimate the body, hand, and face parameters, and finally fuse the results. This multi-stage approach mitigates the low resolution of hands and faces; however, because the three sets of parameters are estimated largely independently, the connections between the parts in the final result are often unnatural and unrealistic, and the model becomes more complex. To address these problems, we propose OSX, the first one-stage algorithm, which uses a component-aware Transformer model to estimate body pose, hand gesture, and facial expression simultaneously. OSX outperforms existing whole-body mesh recovery algorithms on three public datasets (AGORA, EHF, 3DPW) with modest computation and running time.

We also observed that most existing whole-body mesh recovery datasets are collected in laboratory or simulated environments, whose distributions differ substantially from real scenes; models trained on them tend to perform poorly when applied in the wild. Moreover, in many real scenarios, such as live streaming and sign language, only the upper body appears on screen, whereas current datasets contain mostly full bodies with low hand and face resolution. To make up for these shortcomings, we propose UBody, a large-scale upper-body dataset covering 15 real scenarios, with one million frames and corresponding 2D whole-body keypoints, person bounding boxes, hand bounding boxes, and SMPL-X labels. The figure below visualizes a portion of UBody.

Figure 1: Visualization of samples from the UBody dataset
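As a concrete illustration, a single annotated frame could be represented roughly as below; the field names and shapes here are assumptions for illustration only (e.g. the 133-keypoint COCO-WholeBody convention), and the official UBody release should be consulted for the real schema:

```python
import numpy as np

# Hypothetical annotation record for one UBody frame. Field names and array
# shapes are illustrative assumptions, not the dataset's actual file format.
frame_annotation = {
    "image": "SignLanguage/clip_0001/000123.jpg",
    "person_bbox": np.array([120.0, 40.0, 360.0, 520.0]),  # x, y, w, h
    "lhand_bbox": np.array([150.0, 300.0, 60.0, 60.0]),    # left-hand box
    "rhand_bbox": np.array([380.0, 310.0, 60.0, 60.0]),    # right-hand box
    # 2D whole-body keypoints (COCO-WholeBody uses 133): x, y, confidence
    "keypoints_2d": np.zeros((133, 3)),
    "smplx": {                                  # SMPL-X labels
        "betas": np.zeros(10),                  # body shape coefficients
        "body_pose": np.zeros((21, 3)),         # axis-angle body joint rotations
        "left_hand_pose": np.zeros((15, 3)),    # hand articulation
        "right_hand_pose": np.zeros((15, 3)),
        "jaw_pose": np.zeros(3),
        "expression": np.zeros(10),             # facial expression coefficients
    },
}
```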

The contributions of this work can be summarized as follows:

  • We propose OSX, the first one-stage whole-body mesh recovery algorithm, which estimates SMPL-X parameters in a simple and efficient way.

  • OSX outperforms existing whole-body mesh recovery algorithms on three public datasets.

  • We propose UBody, a large-scale upper-body dataset, to facilitate the application of the fundamental task of whole-body mesh recovery in real-world scenarios.

2. Introduction to the one-stage reconstruction algorithm

2.1 Overall framework of OSX

As shown in the figure below, we propose a Component-Aware Transformer model that estimates all whole-body parameters at once and feeds them into the SMPL-X model to obtain the whole-body mesh. We observe that body pose estimation needs global, whole-body dependency information, whereas hand gesture and facial expression rely more on local region features. We therefore design a global encoder and a local decoder: the encoder uses a global self-attention mechanism to capture whole-body dependencies and estimate body pose and shape, while the decoder upsamples the feature map and uses a keypoint-guided cross-attention mechanism to estimate the hand and face parameters.
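As a rough structural sketch of how the two stages compose (assuming encoder, decoder, and SMPL-X modules along the lines of the sketches given in Sections 2.2 and 2.3 below; this is not the official implementation):

```python
import torch.nn as nn

class OSXSketch(nn.Module):
    """Structural sketch of the component-aware encoder-decoder design.
    Submodules are injected; interfaces are illustrative assumptions."""
    def __init__(self, encoder, decoder, smplx_layer):
        super().__init__()
        self.encoder = encoder      # global self-attention encoder (Sec. 2.2)
        self.decoder = decoder      # high-resolution local decoder (Sec. 2.3)
        self.smplx = smplx_layer    # differentiable SMPL-X layer

    def forward(self, img, hand_face_boxes):
        # Stage 1: body tokens capture whole-body dependencies and regress
        # body pose/shape; feature tokens are reshaped into a feature map.
        body_params, feat_map = self.encoder(img)
        # Stage 2: hand/face regions (here assumed given; in the full method
        # they are located via predicted keypoints) are upsampled and queried
        # with keypoint-guided cross-attention for hand pose and expression.
        hand_pose, expression = self.decoder(feat_map, hand_face_boxes)
        # All estimated parameters drive SMPL-X to produce the whole-body mesh.
        return self.smplx(body_params, hand_pose, expression)
```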

Figure 2: Schematic diagram of the OSX network architecture

2.2 Global Encoder

In the global encoder, the input image is first split into non-overlapping patches, which are converted into feature tokens by a convolutional layer plus positional encoding. These feature tokens are concatenated with several learnable body tokens and fed into the global encoder. The encoder consists of multiple Transformer blocks, each containing a multi-head self-attention module, a feed-forward network, and two layer normalization (LayerNorm) modules. After passing through these blocks, information has been exchanged among the body parts: the body tokens, which now capture the whole-body dependencies, are passed through a fully connected layer to regress the body pose and shape, while the feature tokens are reshaped into a feature map for use by the decoder.
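A minimal PyTorch sketch of this encoder stage, with illustrative sizes and head dimensions rather than the paper's actual configuration:

```python
import torch
import torch.nn as nn

class GlobalEncoderSketch(nn.Module):
    """Minimal sketch of the ViT-style global encoder; sizes are illustrative."""
    def __init__(self, img_size=256, patch=16, dim=768, depth=12):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.body_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        block = nn.TransformerEncoderLayer(dim, nhead=12, dim_feedforward=4 * dim,
                                           batch_first=True)  # MHSA + FFN + 2 LayerNorms
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.head = nn.Linear(dim, 22 * 6 + 10)  # e.g. rot6d body pose + 10 shape betas

    def forward(self, img):                                       # img: (B, 3, H, W)
        feat = self.patch_embed(img).flatten(2).transpose(1, 2)   # patch feature tokens
        tok = torch.cat([self.body_token.expand(img.size(0), -1, -1), feat], dim=1)
        tok = self.blocks(tok + self.pos_embed)                   # global self-attention
        body_params = self.head(tok[:, 0])          # pose & shape from the body token
        h = w = int(feat.size(1) ** 0.5)
        feat_map = tok[:, 1:].transpose(1, 2).reshape(-1, tok.size(-1), h, w)
        return body_params, feat_map                 # feature map reused by the decoder
```

For a 256×256 input this yields a (B, 142) parameter vector and a 768-channel 16×16 feature map.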

2.3 High-Resolution Local Decoder

In the decoder, we first upsample the feature map to address the low resolution of hands and faces. Specifically, we use a differentiable region-of-interest alignment (RoI-Align) operation to crop and upsample the hand and face regions, obtaining multi-scale, high-resolution hand and face features. Next, we define multiple component tokens, each representing a keypoint, and feed them into the decoder, where a keypoint-guided cross-attention mechanism gathers useful information from the high-resolution features and updates the component tokens:
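(The exact equation is not reproduced in this post; assuming the standard cross-attention formulation that the description implies, the update takes the form)

$$T_c \leftarrow T_c + \operatorname{softmax}\!\left(\frac{(T_c W_Q)\,(F W_K)^\top}{\sqrt{d}}\right) F W_V,$$

where $T_c$ denotes the component tokens (queries), $F$ the high-resolution hand/face features (keys and values), $W_Q$, $W_K$, $W_V$ learned projection matrices, and $d$ the feature dimension.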

Finally, these component tokens are mapped by fully connected layers to hand pose and facial expression, which, together with the body pose and shape, are fed into the SMPL-X model to produce the human mesh.
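A minimal PyTorch sketch of this decoder stage (a simplified reading of the description above, with illustrative token counts and output sizes; not the official implementation):

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class LocalDecoderSketch(nn.Module):
    """Minimal sketch of the keypoint-guided high-resolution local decoder."""
    def __init__(self, dim=768, n_tokens=65, out_size=16):
        super().__init__()
        # One learnable token per keypoint (count is illustrative).
        self.component_tokens = nn.Parameter(torch.zeros(1, n_tokens, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.hand_head = nn.Linear(dim, 2 * 15 * 6)   # both hands, e.g. rot6d
        self.expr_head = nn.Linear(dim, 10)           # expression coefficients
        self.out_size = out_size

    def forward(self, feat_map, boxes):   # boxes: list of (K, 4) in feature coords
        # Differentiable RoI-Align crops and upsamples the hand/face regions
        # of the encoder feature map to a higher resolution.
        roi = roi_align(feat_map, boxes, output_size=self.out_size)
        kv = roi.flatten(2).transpose(1, 2)            # (K, S, dim) feature tokens
        q = self.component_tokens.expand(kv.size(0), -1, -1)
        # Keypoint-guided cross-attention: component tokens query the
        # high-resolution features and are updated residually.
        upd, _ = self.cross_attn(q, kv, kv)
        tokens = self.norm(q + upd)
        pooled = tokens.mean(dim=1)
        return self.hand_head(pooled), self.expr_head(pooled)
```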

3. Introduction to the upper-body dataset UBody

3.1 Dataset Highlights

To narrow the gap between the fundamental task of whole-body mesh recovery and its downstream applications, we collected and annotated more than one million frames from 15 real-world scenarios, including music performance, talk shows, sign language, and magic shows. Compared with existing datasets such as AGORA, these scenes mostly contain the upper body, so the hands and face occupy a larger resolution and exhibit richer hand movements and facial expressions. The scenes also feature highly diverse occlusions, interactions, camera cuts, backgrounds, and lighting changes, making them more challenging and closer to realistic conditions. In addition, UBody is video-based and each video carries audio, so the dataset can also support multimodal tasks in the future.

Figure 3: The 15 scenarios covered by UBody

3.2 IDEA's self-developed high-precision whole-body motion-capture annotation framework

To annotate such large-scale data, we propose an automatic annotation scheme, as shown in Figure 4. We first train a ViT-based keypoint estimation network to predict high-precision whole-body keypoints. Next, we use a multi-stage progressive fitting technique: the human mesh output by OSX is converted into 3D keypoints and projected onto the image plane, and a loss is computed against the estimated 2D keypoints to optimize the OSX network parameters, until the predicted mesh closely matches the 2D keypoints.
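A minimal sketch of such a fitting loop (assuming an OSX-style `osx_model` returning SMPL-X parameters as a dict, a differentiable `smplx_layer` such as the one from the smplx library, and a camera projection function `project`; the names and the loss form are illustrative, not the exact pipeline):

```python
import torch

def fit_to_keypoints(osx_model, smplx_layer, project, images, kpts_2d,
                     steps=500, lr=1e-5):
    """Optimize OSX parameters so the projected mesh matches 2D keypoints.

    kpts_2d: (B, J, 3) ViT-estimated whole-body keypoints (x, y, confidence).
    """
    optimizer = torch.optim.Adam(osx_model.parameters(), lr=lr)
    for _ in range(steps):
        params = osx_model(images)                  # predicted SMPL-X parameters
        joints_3d = smplx_layer(**params).joints    # mesh -> 3D keypoints
        proj_2d = project(joints_3d)                # project onto the image plane
        # Confidence-weighted reprojection loss against the 2D keypoints.
        conf = kpts_2d[..., 2:3]
        loss = (conf * (proj_2d - kpts_2d[..., :2]).abs()).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return osx_model
```

In the full scheme this fitting proceeds progressively over multiple stages, as described above.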

Figure 4: Framework of the whole-body motion-capture annotation pipeline

Below are scenarios from the UBody dataset together with their annotation results:

SignLanguage

Singing

OnlineClass

Olympic

Entertainment

Fitness

LiveVlog

Conference

TVShow

ConductMusic

Speech

TalkShow

MagicShow

4. Experimental results

4.1 Quantitative comparison

Since its submission (2022.11–2023.04), OSX has ranked first on the SMPL-X track of the AGORA leaderboard. The quantitative comparison on AGORA-test (https://agora-evaluation.is.tuebingen.mpg.de/) is shown below:

Table 1: Quantitative results of OSX and SOTA algorithms on AGORA-test

The quantitative comparison results on AGORA-val are shown in the following table:

Table 2: Quantitative results of OSX and SOTA algorithms on AGORA-val

The quantitative results on EHF and 3DPW are as follows:

Table 3: Quantitative results of OSX and SOTA algorithms on EHF and 3DPW

It can be seen that, thanks to the component-aware Transformer network, OSX models global dependencies and captures local features at the same time. On the existing datasets, and especially on the challenging AGORA benchmark, it significantly outperforms previous methods.

4.2 Qualitative comparison

The qualitative comparison results on AGORA are shown in the figure:

From left to right: input image, ExPose, Hand4Whole, OSX (Ours)

The qualitative comparison results on EHF are shown in the figure:

From left to right: input image, ExPose, Hand4Whole, OSX (Ours)

The comparison results on the UBody dataset are shown in the figure:

From left to right: input image, ExPose, Hand4Whole, OSX (Ours)

It can be seen that our algorithm OSX estimates more accurate body poses, hand gestures, and facial expressions; the reconstructed human mesh is more accurate, aligns better with the original image, and is more robust.

5. Summary

OSX is the first one-stage whole-body mesh recovery algorithm. With a component-aware Transformer model, it estimates body pose, hand gesture, and facial expression simultaneously, and achieves the best whole-body mesh recovery results on three public leaderboards. Furthermore, we propose UBody, a large-scale dataset of upper-body scenes, to facilitate the application of human mesh recovery in downstream scenarios. Our code has been open-sourced in the hope of advancing the field.
