CVPR 2022 | Mobile-Friendly 3D Hand Reconstruction

Author丨Chen Xingyu@zhihu

Source丨https://zhuanlan.zhihu.com/p/494755253

Editor丨3D Vision Workshop

This post presents our CVPR 2022 paper, MobRecon: Mobile-Friendly Hand Mesh Reconstruction from Monocular Image. Its main contribution is a 3D hand reconstruction method that is simultaneously lightweight, accurate, and temporally stable, and that has been deployed on mobile devices.


Paper link: https://arxiv.org/abs/2112.02753

Code link: https://github.com/SeanChenxy/HandMesh

1. Background and research motivation

Our society is becoming increasingly virtual: wearable mixed-reality and immersive virtual-reality products will keep proliferating. Just as in the real world, the hand will be an important tool for interacting with the virtual world, so virtualizing the hand is of great significance for future XR technology.

Estimating hand geometry from a monocular RGB image has matured into a field at the intersection of vision and graphics. For recent research progress in this area, please refer to the following link.

https://github.com/SeanChenxy/Hand3DResearch

However, existing work rarely guarantees the efficiency, accuracy, and temporal consistency of hand reconstruction all at once. Motivated by this gap, we explore a mobile-oriented hand mesh estimation method:

· We propose the MobRecon framework, which contains only 123M Mult-Adds (multiply-add operations) and 5M #Param (parameters) and reaches 83 FPS on an Apple A14 CPU.

· We design lightweight 2D encoding and 3D decoding structures.

· We propose a feature lifting module that bridges 2D and 3D feature representations, comprising MapReg (map-based position regression) for 2D keypoint estimation, pose pooling for per-joint feature extraction, and PVL (pose-to-vertex lifting) for feature mapping.

2. Method

Overview

MobRecon follows the graph-based vertex regression paradigm [4] and inserts a feature lifting stage into the middle of the traditional encoding-decoding pipeline [1, 2]. "Lifting" refers to the mapping from 2D space to 3D space. We focus on this stage, reducing its parameter cost while improving accuracy and temporal stability. We also make the 3D decoding part lightweight, cutting its computational cost while keeping other performance as unchanged as possible. The overall framework of MobRecon is shown in Figure 1.


Figure 1. MobRecon overview

2D encoding: image feature encoding

Feature lifting: mapping 2D features into 3D space

3D decoding: decoding 3D features into 3D vertex coordinates

2D encoding

As shown in Figure 2, we design two hourglass-style structures [3] for image feature encoding, DenseStack and GhostStack, with the following computation and parameter costs:

· DenseStack: 373M Mult-Adds, 6M #Param

· GhostStack: 96M Mult-Adds, 5M #Param


Figure 2. 2D encoding
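For readers who want to reproduce such cost measurements, the snippet below is a minimal sketch using the thop profiler. The mobilenet_v2 model is only a stand-in backbone (DenseStack/GhostStack are not reproduced here), and the 128x128 input resolution is our assumption:

```python
import torch
from torchvision.models import mobilenet_v2
from thop import profile  # pip install thop; counts MACs (= Mult-Adds) and #Param

# Hypothetical sanity check of a backbone's cost; sizes are illustrative.
model = mobilenet_v2()
x = torch.randn(1, 3, 128, 128)
macs, params = profile(model, inputs=(x,))
print(f"{macs / 1e6:.0f}M Mult-Adds, {params / 1e6:.1f}M #Param")
```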


Table 1. 2D encoding structure analysis

Table 1 gives a detailed analysis of the 2D encoding: our modules maintain good reconstruction accuracy while greatly reducing computation and parameter counts. Table 1 also shows the contribution of our synthetic data; for its design, please refer to the paper's supplementary material.

Feature lifting

The purpose of this stage is to map image features from 2D space into 3D space. As shown in Figure 3, the traditional methods [1, 2] have no explicit feature lifting stage; instead, a fully connected operation maps the global image feature into one long vector, which is then reshaped into 3D point features. Our design consists of three steps: 2D keypoint estimation, keypoint feature extraction, and feature mapping.


Figure 3. Feature lifting
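To make the parameter cost of the traditional lifting concrete, here is a back-of-the-envelope sketch; V, C, and the global feature dimension are hypothetical, not the exact values in [1, 2]:

```python
import torch.nn as nn

# Lifting in the style of [1, 2] (sizes illustrative): one large FC maps the
# global image feature into V*C values, then reshapes them into per-vertex
# features. The FC's parameter count dominates this stage.
V, C, global_dim = 49, 256, 1024          # hypothetical sizes
fc_lift = nn.Linear(global_dim, V * C)    # ~12.8M parameters for lifting alone
# PVL (described below) replaces this with a V x J lifting matrix over pooled
# joint features: 49 * 21 ~ 1K parameters.
```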

We propose map-based position regression (MapReg) to estimate 2D keypoints. As Figure 4 suggests, there are many mature methods for 2D keypoint estimation, so the problem is often taken for granted, and little work explores how to improve the accuracy and the temporal consistency of 2D keypoints at the same time. Our thinking on this question is as follows:


Figure 4. Different 2D keypoint representations: (a) heatmap, (b) heatmap + soft-argmax, (c) regression, (d) MapReg

· Heatmap, Figure 4(a).

A high-resolution representation (e.g., 64x64) that fuses shallow and semantic features, giving fine-grained localization.

However, its receptive field is small, so it struggles to model constraints between keypoints.

· Regression, Figure 4(c).

A low-resolution representation (i.e., 1x1) that always keeps a semantically global receptive field and thus has stronger structural expressiveness.

However, shallow features are lost, so its ability to express fine detail is insufficient.

The two basic methods above have complementary strengths and weaknesses. Can they be combined?

· Heatmap + soft-argmax, Figure 4(b).

A high-resolution representation, inheriting the advantages of the heatmap.

It does have a global receptive field, but one derived from a heuristic rule (a probability-weighted average), so it does not inherit the advantages of regression.

· MapReg, Figure 4(d), the method proposed in this paper. We design a small 4x upsampling structure that incorporates shallow features during upsampling. The fused features are then unrolled along the spatial dimensions, i.e., each 2D map becomes a 1D vector, and an MLP regresses these vectors to the 2D coordinates of the keypoints (a minimal sketch follows this list). MapReg has the following properties:

A medium-resolution representation (e.g., 16x16), inheriting the advantage of the heatmap.

A semantically global receptive field, inheriting the advantage of regression.

Time and space complexity between those of heatmap and regression.
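The sketch below illustrates the MapReg idea in PyTorch. All layer sizes (channel widths, a 21-keypoint hand, 16x16 maps) are assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MapReg(nn.Module):
    """Map-based position regression, sketched with illustrative sizes."""
    def __init__(self, deep_ch=256, skip_ch=64, num_kpts=21, map_size=16):
        super().__init__()
        # 4x upsampling of the deep semantic feature, e.g. 4x4 -> 16x16
        self.up = nn.Sequential(
            nn.ConvTranspose2d(deep_ch, 128, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )
        # fuse shallow (skip) features; one output map per keypoint
        self.fuse = nn.Conv2d(64 + skip_ch, num_kpts, 3, padding=1)
        # shared MLP regressing each flattened map to an (x, y) coordinate
        self.mlp = nn.Sequential(
            nn.Linear(map_size * map_size, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 2),
        )

    def forward(self, deep_feat, skip_feat):
        x = self.up(deep_feat)                # [B, 64, 16, 16]
        x = torch.cat([x, skip_feat], dim=1)  # incorporate shallow features
        maps = self.fuse(x)                   # [B, K, 16, 16]
        vecs = maps.flatten(2)                # [B, K, 256]: 2D map -> 1D vector
        return self.mlp(vecs)                 # [B, K, 2] keypoint coordinates
```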

The advantages of MapReg can be observed clearly in Figure 5: (1) the heatmap is fine-grained, but each point is expressed independently; (2) the global receptive field of heatmap + soft-argmax is heuristic, so its result is merely a smoothing of the heatmap; (3) MapReg learns to express constraints between keypoints on its own.


Figure 5. Comparison of different 2D keypoint representations. The blue dots are the predicted 2D keypoint locations.
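To make point (2) above concrete, here is a minimal soft-argmax readout (a generic implementation, not MobRecon code): the coordinate is just a probability-weighted average of grid locations, a fixed smoothing rule rather than a learned global mapping.

```python
import torch

def soft_argmax_2d(heatmaps):
    """Soft-argmax over 2D heatmaps. heatmaps: [B, K, H, W] -> coords: [B, K, 2]."""
    B, K, H, W = heatmaps.shape
    probs = heatmaps.flatten(2).softmax(dim=-1).reshape(B, K, H, W)
    ys = torch.linspace(0, 1, H, device=heatmaps.device)
    xs = torch.linspace(0, 1, W, device=heatmaps.device)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # marginal over rows, expectation of x
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # marginal over cols, expectation of y
    return torch.stack([x, y], dim=-1)
```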

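The pose pooling step mentioned earlier sits between keypoint estimation and lifting: each joint's feature is sampled from the feature map at its predicted 2D location. A minimal sketch, assuming bilinear grid sampling (the exact sampling scheme is our assumption):

```python
import torch
import torch.nn.functional as F

def pose_pooling(feat_map, kpts2d):
    """Sample per-joint features at the predicted 2D keypoints.
    feat_map: [B, C, H, W]; kpts2d: [B, K, 2] in [0, 1] image coordinates."""
    grid = kpts2d * 2.0 - 1.0                     # rescale to [-1, 1] for grid_sample
    grid = grid.unsqueeze(2)                      # [B, K, 1, 2]
    pooled = F.grid_sample(feat_map, grid, align_corners=False)  # [B, C, K, 1]
    return pooled.squeeze(-1).transpose(1, 2)     # [B, K, C] per-joint features
```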

We propose pose-to-vertex lifting (PVL) to implement the feature mapping. As shown in Figure 3, traditional methods usually rely on fully connected operations over global features, which incurs a large computational cost; we design the lighter PVL instead (Figures 7-9).


Figure 7. Pose-to-vertex lifting


Figure 8. The optimized lifting matrix


Figure 9. Highly correlated feature propagation from joint to vertex
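A minimal sketch of the PVL idea, under our reading of Figures 7-9: a learnable lifting matrix propagates pooled joint features to the vertices of a coarse mesh. The joint/vertex counts are illustrative:

```python
import torch
import torch.nn as nn

class PoseToVertexLifting(nn.Module):
    """Sketch of PVL with assumed sizes: a learnable [V, J] matrix maps
    J per-joint features to V coarse-mesh vertex features."""
    def __init__(self, num_joints=21, num_verts=49):
        super().__init__()
        # learnable lifting matrix; each vertex feature is a weighted
        # combination of joint features (cf. Figure 8)
        self.lift = nn.Parameter(torch.randn(num_verts, num_joints) * 0.01)

    def forward(self, joint_feat):  # joint_feat: [B, J, C] pooled joint features
        return torch.einsum('vj,bjc->bvc', self.lift, joint_feat)  # [B, V, C]
```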

Table 2 gives a detailed analysis of the whole feature lifting process: MapReg simultaneously achieves the best 2D accuracy (2D AUC) and the best temporal acceleration (2D Acc), while PVL has a lower computational cost and simultaneously achieves the best 3D accuracy (3D AUC) and acceleration (3D Acc).


Table 2. Feature lifting analysis

Consistency constraints

Before turning to the 3D decoding stage, we further enforce temporal consistency through a consistency constraint. As shown in Figure 10, an affine transformation of a single sample is used to create a sample pair, and the mesh vertices and 2D keypoints predicted for the two views are constrained to be consistent once mapped back to the original space.


Figure 10. Consistency constraints based on affine transformations
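A sketch of such an affine-consistency loss is below. This is our illustrative formulation, not the paper's exact loss; it assumes the 2x3 affine matrices relating the two views are known from the augmentation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(verts_a, verts_b, kpts2d_a, kpts2d_b, affine_b_to_a):
    """Illustrative affine-consistency loss. Two views of one sample come from
    2D affine augmentation; view B's 2D keypoints are warped back into view
    A's frame and both predictions are encouraged to agree."""
    # warp view-B 2D keypoints back to view A with the known 2x3 affine
    ones = torch.ones_like(kpts2d_b[..., :1])
    kpts_b_in_a = torch.cat([kpts2d_b, ones], dim=-1) @ affine_b_to_a.transpose(-1, -2)
    loss_2d = F.smooth_l1_loss(kpts_b_in_a, kpts2d_a)
    # 3D vertices are assumed to live in a root-relative space; for a
    # rotation-free augmentation they can be compared directly (an in-plane
    # rotation would additionally have to be undone here)
    loss_3d = F.smooth_l1_loss(verts_b, verts_a)
    return loss_2d + loss_3d
```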

Experiments show that the consistency constraint reduces temporal acceleration (i.e., jitter) and also slightly improves reconstruction accuracy.


Table 3. Consistency constraint analysis

3D decoding

Since a mesh is intrinsically two-dimensional, we use simple and efficient spiral convolution for 3D decoding. The spiral neighborhood of a mesh vertex is defined as shown in Figure 11, fully analogous to the neighborhood definition in image convolution.


Figure 11. Spiral sampling
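The analogy to image convolution can be seen in a few lines: with a precomputed, fixed-order spiral index table, gathering a vertex neighborhood is the mesh counterpart of im2col. The toy indices below are hypothetical:

```python
import torch

# Toy illustration: image convolution gathers a fixed-order pixel neighborhood;
# spiral convolution gathers a fixed-order vertex neighborhood.
spiral = torch.tensor([[0, 1, 2],
                       [1, 0, 3],
                       [2, 0, 4],
                       [3, 1, 4],
                       [4, 2, 3]])                       # [V, S]; entry 0 is the vertex itself
feat = torch.randn(2, 5, 8)                              # [B, V, C] vertex features
neigh = feat[:, spiral.reshape(-1)].reshape(2, 5, 3, 8)  # [B, V, S, C]
# 'neigh' plays the same role as an im2col patch in image convolution.
```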

With the neighborhood defined, the next step of a convolution is feature fusion. As shown in Figure 12, traditional methods use an LSTM [5] or a very large FC [6] for fusion, which either cannot be parallelized or is computationally expensive. We propose DSConv, which transfers depthwise separable convolution to feature aggregation on mesh vertices. Compared with [6], DSConv has lower computational complexity: per vertex, roughly O(S·C_in + C_in·C_out) vs. O(S·C_in·C_out), where S is the spiral length.


Figure 12. Comparison of SpiralConv, SpiralConv++ and DSConv
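A minimal sketch of a depthwise separable spiral convolution in the spirit of DSConv (layer sizes and the spiral index table are assumptions): a depthwise step aggregates over the spiral neighborhood per channel, then a pointwise step mixes channels, giving the O(S·C_in + C_in·C_out) cost mentioned above.

```python
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable spiral convolution, sketched with assumed sizes."""
    def __init__(self, in_ch, out_ch, spiral_indices):
        super().__init__()
        self.register_buffer('spiral', spiral_indices)   # [V, S] neighbor ids
        S = spiral_indices.shape[1]
        # depthwise step: one weight per (channel, spiral position)
        self.depthwise = nn.Parameter(torch.randn(in_ch, S) * 0.01)
        # pointwise step: 1x1 mixing across channels
        self.pointwise = nn.Linear(in_ch, out_ch)

    def forward(self, x):                                # x: [B, V, C]
        B, V, C = x.shape
        S = self.spiral.shape[1]
        # gather spiral neighborhoods: [B, V, S, C]
        neigh = x[:, self.spiral.reshape(-1)].reshape(B, V, S, C)
        # depthwise aggregation over the spiral dimension, per channel
        out = (neigh * self.depthwise.t().unsqueeze(0).unsqueeze(0)).sum(dim=2)
        return self.pointwise(out)                       # [B, V, out_ch]
```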

Experiments show that DSConv effectively reduces computation and parameters while keeping reconstruction performance essentially unchanged. The full MobRecon model reaches 83 FPS on an Apple A14 CPU.


Table 4. DSConv and overall model analysis. For Mult-Adds and #Param, the value left of "/" refers to the 3D decoder and the value right of "/" to the full MobRecon; for Acc, the left value refers to 2D space and the right to 3D space.

Limitation

MobRecon is a mobile-CPU-friendly framework, but its parallel efficiency on GPUs is not high. The main reason is that separable convolution and spiral neighborhood sampling increase the memory access cost.

3. Comparative experiments

Reconstruction accuracy

As shown in Figure 13, on the FreiHAND dataset the reconstruction accuracy of MobRecon is nearly on par with several large-model methods. Replacing MobRecon's 2D encoder with ResNet50 yields very strong accuracy. For more comparisons, please refer to the paper.


Figure 13. Comparative experiments based on the FreiHAND dataset

Temporal consistency

In Figure 14, we compare temporal performance with [2]. The video content is shown in the lower-right subfigure: the hand pose stays unchanged throughout the video, yet the predictions of [2] still jitter. The three plots on the left show the predicted acceleration in 2D space, root-relative 3D space, and camera space; the red curves are MobRecon's results, which are clearly superior to the heatmap-based [2]. As the lower-right subfigure shows, compared with the heatmap, MapReg produces better inter-keypoint constraints and 2D structure, and therefore stronger temporal stability. Note that MobRecon is a single-frame monocular method with no temporal module; its stability in the time dimension essentially comes from the structured representation in the spatial dimension.


Figure 14. Temporal consistency comparison

4. Outlook

In terms of accuracy, RGB-based hand pose/mesh estimation has basically reached a practically usable level. Going forward, the community will pay more attention to higher-level topics such as hand rendering, self-supervision, temporal modeling, and hand action understanding. There will also be more and more work on hand-hand, hand-object, and hand-body interaction. In addition, directions such as hand muscle modeling, robotic manipulation, and hand + voice multimodal interaction are also worth attention.

References

[1] Dominik Kulon, Riza Alp Guler, Iasonas Kokkinos, Michael Bronstein, Stefanos Zafeiriou. Weakly-supervised mesh-convolutional hand reconstruction in the wild. CVPR 2020.

[2] Xingyu Chen, Yufeng Liu, Chongyang Ma, Jianlong Chang, Huayan Wang, Tian Chen, Xiaoyan Guo, Pengfei Wan, Wen Zheng. Camera-space hand mesh recovery via semantic aggregation and adaptive 2D-1D registration. CVPR 2021.

[3] Alejandro Newell, Kaiyu Yang, Jia Deng. Stacked hourglass networks for human pose estimation. ECCV 2016.

[4] Thomas N. Kipf, Max Welling. Semi-supervised classification with graph convolutional networks. ICLR 2017.

[5] Isaak Lim, Alexander Dielen, Marcel Campen, Leif Kobbelt. A simple approach to intrinsic correspondence learning on unstructured 3D meshes. ECCV 2018.

[6] Shunwang Gong, Lei Chen, Michael Bronstein, Stefanos Zafeiriou. SpiralNet++: A fast and highly efficient mesh convolution operator. ICCV Workshops, 2019.
