CVPR 2023 | An Interpretation of HRN, DAMO Academy's Champion Model on the REALY Face Reconstruction Benchmark


Foreword

High-fidelity 3D face reconstruction has wide applications in many scenarios, such as AR/VR, medical imaging, and film production. Although many works achieve excellent reconstruction results using specialized hardware such as Light Stage, estimating highly detailed face models from monocular images taken from single or sparse viewpoints remains a challenging task. In this article, we introduce DAMO Academy's latest face reconstruction paper from CVPR 2023. This work ranks first on both the frontal-face and side-face tracks of the single-image face reconstruction benchmark REALY, and also achieves state-of-the-art results on several other datasets.

1. Paper & Code

Paper title: A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction from In-The-Wild Images

Paper address: https://arxiv.org/abs/2302.14434

Project home page: HRN

ModelScope (demo): ModelScope community

2. Summary

Limited by the low-dimensional representation of 3DMM, most 3DMM-based face reconstruction methods cannot recover high-frequency facial details such as wrinkles and dimples. Some methods try to introduce detail maps or non-linear operations, but the results are still not ideal. To this end, we propose a novel Hierarchical Representation Network (HRN) to achieve accurate and detailed face reconstruction from a single image. Specifically, we decouple facial geometric details and introduce hierarchical representations for fine-grained face modeling. At the same time, we incorporate 3D priors of facial details to improve the accuracy and realism of the reconstruction results. We also propose a de-retouching module for better decoupling of geometry and texture. Notably, our framework can be extended to multi-view reconstruction by considering the consistency of details across different views. Extensive experiments on two single-view and two multi-view face reconstruction benchmarks demonstrate that our method outperforms existing methods in both reconstruction accuracy and visual quality.

3. Interpretation of the method

3.1 Core idea

Some existing methods [1, 2, 3] try to capture high-frequency facial details such as wrinkles by predicting a displacement map, and have achieved good results. However, due to the way it is defined, the displacement map cannot model larger-scale details, such as the contours of the chin, cheeks, etc. To this end, we decompose the facial geometry and represent each part with a different representation, as shown in the figure above. Specifically, we split the facial geometry into a low-frequency part, mid-frequency details, and high-frequency details:

  • The low-frequency part describes the overall skeleton of the face (fatness/thinness, facial features, and general shape). For this part, we use an existing parametric 3DMM, representing it with low-dimensional coefficients and the corresponding shape bases.
  • The mid-frequency part describes larger-scale details on top of the facial skeleton (such as muscle orientation, facial contours, etc.). For this part, we use a 3-channel deformation map in UV space as the representation, which describes the deformation of each vertex of the low-frequency base mesh in the x, y, and z directions.
  • The high-frequency part describes small-scale facial details such as wrinkles. For this part, we use a displacement map to model details at the pixel scale.

In general, we split the facial geometry into three parts and introduce three hierarchical representations according to their scale and detail characteristics, modeling the face at three granularities (whole face, vertex, and pixel) to achieve accurate and detailed reconstruction.
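
To make the three-level decomposition more concrete, here is a minimal NumPy sketch (not the released HRN code; the function names, the UV-sampling scheme, and the per-vertex application of the displacement map are illustrative assumptions) of how a final mesh could be assembled from the low-frequency base shape, the mid-frequency deformation map, and the high-frequency displacement map:

```python
import numpy as np

def sample_uv_map(uv_map, uv_coords):
    """Nearest-neighbor lookup of an (H, W, C) UV-space map at per-vertex UV coords in [0, 1]."""
    h, w = uv_map.shape[:2]
    cols = np.clip((uv_coords[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
    rows = np.clip(((1.0 - uv_coords[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)
    return uv_map[rows, cols]  # (N, C)

def compose_hierarchical_geometry(base_vertices, vertex_uvs, vertex_normals,
                                  deformation_map, displacement_map):
    """Assemble the final geometry from the three hierarchical representations.

    base_vertices:    (N, 3) low-frequency 3DMM shape (overall facial skeleton)
    deformation_map:  (H, W, 3) mid-frequency xyz offsets stored in UV space
    displacement_map: (H, W, 1) high-frequency offsets along the surface normal
    """
    # Mid-frequency: add the 3-channel deformation sampled at each vertex's UV coordinate.
    mid_freq_vertices = base_vertices + sample_uv_map(deformation_map, vertex_uvs)
    # High-frequency: displace along the vertex normals. In HRN this level lives at the
    # pixel scale during rendering; applying it per vertex here is only an approximation.
    displacement = sample_uv_map(displacement_map, vertex_uvs)  # (N, 1)
    return mid_freq_vertices + displacement * vertex_normals
```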

3.2 Network structure

In the HRN (Hierarchical Representation Network) architecture, we adopt a coarse-to-fine framework. First, we use Deep3D [4], an existing 3DMM-based method, to predict the low-frequency geometry of the face (blue area in Figure 2); at the same time, we obtain the corresponding position map and texture map, which serve as the input for detail prediction. Then, we use two cascaded pix2pix networks to predict the deformation map and the displacement map (green area in Figure 2). Finally, we combine the predicted refined geometry, lighting, and the optimized diffuse albedo map to perform differentiable rendering and obtain the reconstructed face image (purple area in Figure 2). By computing losses between the mid-frequency and high-frequency renderings and the original image, the geometric deformation of the face is guided to capture the corresponding geometric details. Within this overall pipeline, we also introduce several novel modules and loss functions to improve modeling accuracy.
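
As a rough structural sketch of this coarse-to-fine flow (not the authors' implementation; the sub-networks below are trivial placeholders for the Deep3D-style regressor and the two pix2pix generators, and the differentiable renderer is omitted):

```python
import torch
import torch.nn as nn

class HRNPipelineSketch(nn.Module):
    """Highly simplified stand-in for HRN's coarse-to-fine structure (illustrative only)."""

    def __init__(self, n_3dmm_coeffs=257):
        super().__init__()
        # Placeholder for the Deep3D-style regressor: image -> 3DMM coefficients
        # (identity, expression, texture, pose, lighting).  Blue area in Figure 2.
        self.coarse_net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(3, n_3dmm_coeffs))
        # Placeholders for the two cascaded pix2pix generators.  Green area in Figure 2.
        self.deform_net = nn.Conv2d(6, 3, 3, padding=1)  # (position map, texture map) -> deformation map
        self.disp_net = nn.Conv2d(9, 1, 3, padding=1)    # (+ deformation map) -> displacement map

    def forward(self, image, position_map, texture_map):
        coeffs = self.coarse_net(image)  # low-frequency geometry
        deformation_map = self.deform_net(torch.cat([position_map, texture_map], dim=1))
        displacement_map = self.disp_net(
            torch.cat([position_map, texture_map, deformation_map], dim=1))
        # A differentiable renderer would combine the refined geometry, lighting and the
        # optimized albedo here to produce the rendered face image (purple area in Figure 2).
        return coeffs, deformation_map, displacement_map
```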

3.3 3D detail prior

Although facial details can be roughly reconstructed from a single image using a reconstruction loss, the task is highly ill-posed, so details obtained from a single image alone are ambiguous and inaccurate. Adding extra regularization may help shrink the solution space, but it also leads to a severe loss of detail accuracy and fidelity. To address this issue, we extract real facial 3D details from real 3D data as prior information to guide the network's predictions. As shown in the figure above, we use the proposed network structure to fit real 3D meshes and obtain the ground truth of the deformation map and displacement map. Then, during training, we introduce a discriminator network and use the real distribution to guide the generation of the detail maps. Ablation experiments show that introducing the 3D detail prior makes the predicted facial geometry smoother and more realistic.
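
A minimal sketch of how such a prior could be imposed adversarially (illustrative only; the discriminator architecture and GAN loss below are generic stand-ins, and the real deformation/displacement maps are assumed to be pre-fitted from 3D scans):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DetailMapDiscriminator(nn.Module):
    """Generic patch discriminator over UV-space detail maps (deformation or displacement)."""

    def __init__(self, in_channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 1, 4, stride=1, padding=1))  # patch-wise real/fake logits

    def forward(self, detail_map):
        return self.net(detail_map)

def adversarial_losses(disc, real_maps, fake_maps):
    """Standard GAN losses pushing predicted detail maps toward the distribution of
    detail maps fitted from real 3D scans; HRN's exact formulation may differ."""
    real_logits = disc(real_maps)
    fake_logits = disc(fake_maps.detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    gen_logits = disc(fake_maps)
    g_loss = F.binary_cross_entropy_with_logits(gen_logits, torch.ones_like(gen_logits))
    return d_loss, g_loss
```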

3.4 De-Retouching module

A face image is the combined result of geometry, lighting, and the face's diffuse albedo. Previous work assumed that the diffuse albedo is smooth and modeled it with the low-frequency albedo of a 3DMM. However, real skin texture is full of high-frequency details such as moles, scars, freckles, and other blemishes, which introduce ambiguity into geometric detail learning, especially in the single-view face reconstruction task. Inspired by [5], we propose a de-retouching module that generates facial albedo with high-frequency details and facilitates more accurate decoupling of geometry and appearance. We first collected 10,000 face images from the FFHQ dataset and trained a retouching network G to remove high-frequency details such as facial blemishes. Given the face texture T', we first use G to remove its texture details and obtain T0, as shown in the figure above. We then aim to bake the texture details into the coarse albedo A0 to obtain an optimized albedo A' for rendering. We assume that the shading that maps A0 to T0 is consistent with the shading that maps A' to T', i.e.:

T0 = A0 ⊙ S,   T' = A' ⊙ S

where S represents the shading and ⊙ represents element-wise matrix multiplication. Solving these two equations for A' gives:

A' = A0 ⊙ T' / ϕ(T0)   (division taken element-wise)

where ϕ(T0) avoids exploding values near 0, with ε = 1e−6 by default. Compared with A0, the optimized albedo A' contains more high-frequency texture details, which alleviates the ambiguity between geometry and texture, especially in the single-view face reconstruction task.
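
The albedo optimization itself amounts to a per-pixel rescaling; here is a minimal NumPy sketch (the exact form of ϕ is an assumption, written as a simple clamp at ε):

```python
import numpy as np

def de_retouched_albedo(coarse_albedo, texture, retouched_texture, eps=1e-6):
    """Bake high-frequency texture details into the coarse albedo.

    Assuming the same shading S maps A0 -> T0 and A' -> T'
    (T0 = A0 * S, T' = A' * S, element-wise), solving for A' gives
    A' = A0 * T' / phi(T0), where phi keeps values away from 0.

    coarse_albedo (A0), texture (T'), retouched_texture (T0): float arrays in [0, 1].
    """
    phi_t0 = np.maximum(retouched_texture, eps)  # assumed form of phi: simple clamp at eps
    return coarse_albedo * texture / phi_t0
```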

3.5 Contour-aware Loss

We propose a novel contour-aware loss L_con to achieve accurate modeling of facial contours. L_con acts on the mid-frequency geometry M1 (Figure 2) and aims to pull the boundary vertices into alignment with the facial contours. As shown above, we first project the vertices of M1 into image space. We then predict the face mask M_face using a pre-trained face matting network [6] and post-process it to obtain the leftmost and rightmost contour points of each row. Given a vertex p and its corresponding projected point p' on M_face, we obtain the vectors l_p and r_p (pointing from p' horizontally to the left and right contour points). L_con can then be described as:

As can be seen, L_con penalizes vertices outside the soft margin of the face (such as the blue and gray points in the figure above) and pulls them toward the facial contour, while leaving vertices inside the face untouched. We only apply it to the lower part of the facial contour to avoid interference from hair. Compared with a common segmentation loss, L_con gives a more direct optimization direction for the facial contour and is easier to train. Ablation studies also confirm the effectiveness of L_con in improving the accuracy of the reconstructed contours.
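
Since the exact formula is omitted above, the following is only an illustrative sketch of the idea behind L_con (not the paper's exact definition): projected vertices that fall outside the face mask beyond a soft margin are penalized by their horizontal distance to the contour, while vertices inside contribute nothing.

```python
import torch

def contour_aware_loss_sketch(proj_vertices, left_bounds, right_bounds, soft_margin=2.0):
    """Illustrative approximation of a contour-aware loss (not the paper's exact L_con).

    proj_vertices: (N, 2) projected (x, y) vertex positions in pixel coordinates
    left_bounds, right_bounds: (H,) float tensors holding the leftmost / rightmost
        x-coordinate of the face mask in each image row
    """
    x = proj_vertices[:, 0]
    rows = proj_vertices[:, 1].long().clamp(0, left_bounds.shape[0] - 1)
    lx, rx = left_bounds[rows], right_bounds[rows]
    # Horizontal distance by which a vertex lies outside the mask, beyond the soft margin.
    outside_left = torch.relu((lx - soft_margin) - x)
    outside_right = torch.relu(x - (rx + soft_margin))
    # Inside vertices give zero; outside vertices are pulled toward the facial contour.
    return (outside_left + outside_right).mean()
```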

3.6 MV-HRN

Thanks to hierarchical modeling and the 3D prior guidance, HRN can easily be adapted to the multi-view face reconstruction task. By adding geometric consistency constraints between different views, we can accurately model the overall facial geometry from only a few views. The figure above shows the pipeline of MV-HRN. We assume that the low-frequency part and mid-frequency details of the face are consistent across views, while lighting, pose, expression, high-frequency details, etc. are view-dependent. Therefore, we introduce a canonical space to model the shared intrinsic face shape, and view-dependent spaces to model the pose, lighting, expression, and high-frequency details of each view. During the fitting process, the face shape is gradually confined to a smaller and more accurate space under the supervision of images from different viewpoints. Experiments show that MV-HRN achieves accurate reconstruction in a short time (less than a minute) given only a small number (2–5) of views.
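
Conceptually, the fitting can be organized as below (a sketch under stated assumptions, not the released MV-HRN code: the coefficient dimensions follow common 3DMM conventions, and `render_and_compare` stands in for differentiable rendering plus the photometric/landmark/prior losses):

```python
import torch

def fit_mv_hrn_sketch(images, render_and_compare, steps=200, lr=0.01):
    """Multi-view fitting sketch: identity coefficients and the mid-frequency deformation
    map are shared across views (canonical space); pose, lighting, expression and the
    high-frequency displacement map are optimized per view (view-dependent spaces)."""
    shared = {
        "id_coeffs": torch.zeros(80, requires_grad=True),  # illustrative dimensions
        "deformation_map": torch.zeros(3, 256, 256, requires_grad=True),
    }
    per_view = [{
        "pose": torch.zeros(6, requires_grad=True),
        "lighting": torch.zeros(27, requires_grad=True),
        "expression": torch.zeros(64, requires_grad=True),
        "displacement_map": torch.zeros(1, 256, 256, requires_grad=True),
    } for _ in images]

    params = list(shared.values()) + [p for view in per_view for p in view.values()]
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Every view supervises the shared shape; per-view parameters only see their own image.
        loss = sum(render_and_compare(img, shared, view) for img, view in zip(images, per_view))
        loss.backward()
        optimizer.step()
    return shared, per_view
```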

5. Experimental results

5.1 Comparison with SOTA methods

5.1.1 Qualitative comparison

It can be seen that in both single-image and multi-image reconstruction, our method substantially improves geometric accuracy and detail recovery compared with existing methods.

5.1.2 Quantitative comparison

Similarly, in quantitative comparisons such as the average error against the ground-truth mesh, our method surpasses existing SOTA methods on multiple single-image and multi-image face reconstruction benchmarks.

5.2 Ablation experiments

6. References

[1] Anpei Chen, Zhang Chen, Guli Zhang, Kenny Mitchell, and Jingyi Yu. Photo-realistic facial details synthesis from single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9429–9439, 2019.

[2] Yudong Guo, Juyong Zhang, Jianfei Cai, Boyi Jiang, and Jianmin Zheng. CNN-based real-time dense face reconstruction with inverse-rendered photo-realistic face images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[3] E. Richardson, M. Sela, R. Or-El, and R. Kimmel. Learning detailed face reconstruction from a single image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[4] Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. Accurate 3D face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.

[5] Biwen Lei, Xiefan Guo, Hongyu Yang, Miaomiao Cui, Xuansong Xie, and Di Huang. ABPN: Adaptive blend pyramid network for real-time local retouching of ultra high-resolution photo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2108–2117, 2022.

[6] Jinlin Liu, Yuan Yao, Wendi Hou, Miaomiao Cui, Xuansong Xie, Changshui Zhang, and Xian-Sheng Hua. Boosting semantic human matting with coarse annotations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8563–8572, 2020.

7. Application

In addition, we would like to introduce the open-source and free models from our group in the CV domain. You are welcome to try and download them (they can be experienced on most mobile phones):

ModelScope community https://modelscope.cn/models/damo/cv_ddsar_face-detection_iclr23-damofd/summary

ModelScope community https://modelscope.cn/models/damo/cv_resnet50_face-detection_retinaface/summary

ModelScope community https://modelscope.cn/models/damo/cv_resnet101_face-detection_cvpr22papermogface/summary

ModelScope community https://modelscope.cn/models/damo/cv_manual_face-detection_tinymog/summary

ModelScope community https://modelscope.cn/models/damo/cv_manual_face-detection_ulfd/summary

ModelScope community https://modelscope.cn/models/damo/cv_manual_face-detection_mtcnn/summary

ModelScope community https://modelscope.cn/models/damo/cv_resnet_face-recognition_facemask/summary

ModelScope community https://modelscope.cn/models/damo/cv_ir50_face-recognition_arcface/summary

ModelScope community https://modelscope.cn/models/damo/cv_manual_face-liveness_flir/summary

ModelScope community https://modelscope.cn/models/damo/cv_manual_face-liveness_flrgb/summary

ModelScope community https://modelscope.cn/models/damo/cv_manual_facial-landmark-confidence_flcm/summary

ModelScope community https://modelscope.cn/models/damo/cv_vgg19_facial-expression-recognition_fer/summary

ModelScope community https://modelscope.cn/models/damo/cv_resnet34_face-attribute-recognition_fairface/summary
