[Computer Vision | Face Modeling] SOFA: Style-Based One-Shot 3D Facial Animation Driven by 2D Keypoints

This series of blog posts contains notes on deep learning/computer vision papers; please credit the source when reposting.

Title: SOFA: Style-based One-shot 3D Facial Animation Driven by 2D landmarks

Link: SOFA: Style-based One-shot 3D Facial Animation Driven by 2D landmarks | Proceedings of the 2023 ACM International Conference on Multimedia Retrieval

Authorization statement:

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

ICMR'23, 12-15 June 2023, Thessaloniki, Greece

© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM ISBN 979-8-4007-0178-8/23/06. . . $15.00

https://doi.org/10.1145/3591106.3592291

Figure 1: Visualization results of our method. (a) target face image, (b) target keypoint map, (c) our rendered avatar, (d) real avatar, (e) our texture map, (f) real texture map.

Abstract

We propose a 2D landmark-driven 3D facial animation framework that does not require a 3D facial dataset for training. Our method decomposes the 3D facial avatar into geometry and texture components. Given 2D keypoints as input, our model learns to estimate FLAME parameters and to warp the target texture to different facial expressions. Experimental results show that our method achieves remarkable results. Because it takes 2D keypoints as input, our method has the potential to be deployed in scenarios where a full RGB facial image is difficult to obtain (e.g., when the face is occluded by a VR headset).

CCS Concepts

• Computing methodologies → Animation

Keywords

Facial animation, 3D avatars, morphable models

ACM Reference Format

Pu Ching, Hung-Kuo Chu, and Min-Chun Hu. 2023. SOFA: Style-based One-shot 3D Facial Animation Driven by 2D landmarks. In International Conference on Multimedia Retrieval (ICMR '23), June 12–15, 2023, Thessaloniki, Greece. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3591106.3592291

1 Introduction

Facial animation has long been an important task in computer graphics and computer vision. Although cartoon-style avatars are widely used in teleconferencing scenarios, it remains challenging to provide realistic facial animation for users who require a more immersive and vivid experience. Based on the final output representation, facial animation methods can be roughly divided into two categories: 2D-based and 3D-based methods. 3D-based methods can be further divided into model-based and model-free methods, depending on whether a **parameterized facial model** is used.

**2D-based facial animation.** 2D-based facial animation focuses on a specific range of camera angles and generates a corresponding sequence of 2D facial images given a target face image and driving information such as keypoints. Depending on the application constraints, 2D-based facial animation can be divided into subject-dependent and subject-agnostic methods.

  • Subject-dependent methods can only be used for specific characters.
  • Most subject-agnostic methods are based on a one-shot setting, which drives the target image given a single image of the user and source information from different modalities.

For example

  1. Gu et al. [4] stitched the keypoint map with the input image and learned a deformation-based network for face retargeting.
  2. Zakharov et al. [18] utilize adaptive instance normalization to fuse keypoints and raw image features.
  3. To generate high-quality face images, Yi et al. [17] included a two-stage refinement step in the generator.
  4. To further exploit facial control signals, Zhao et al. [19] use local branches to improve fine-grained facial details;
  5. Meshry et al. [12] learned a spatial layout map to generate more information;
  6. Tao et al. [15] proposed to use deformable anchors to model complex structures.

Extensive 2D face datasets have encouraged researchers to develop various face parsers, such as landmark (keypoint) predictors and facial segmentation models, to drive virtual characters in an easy way.

Most existing 2D-based methods can accurately output high-quality images of frontal faces, but cannot generate images of faces with different head poses. In contrast, 3D-based methods have greater potential to generate facial animations with different head poses.

**3D-based facial animation.** 3D facial animation can be divided into model-free and model-based approaches, i.e., according to whether a morphable model is used as a prior.

  • Model-free methods [9, 13, 16] usually pre-train a variational autoencoder (VAE), learning a latent space to compress the semantic information of texture and geometry. The avatar is then driven based on the pre-trained decoder and inputs of different data modalities (e.g. NIR eye images or keypoint positions of user's eyes/lips).
    • Since the above methods are user-specific, Cao et al. [2] propose a framework to learn global priors and decode texture and geometry under different identity conditions.
    • Model-free methods are able to learn a global latent space for different input data modalities, but training 3D model-free methods usually relies on a large amount of 3D facial geometry and texture data for a specific user, resulting in overfitting to that user and poor generalization ability.
  • 3D model-based methods typically train an encoder to regress parameters for different facial attributes (e.g., pose, shape, and expression) given a 2D face image of the user, and a decoder to generate the user's 3D face.
    • 3DMM [1, 7] has been used as an effective method for registering faces. Recently, FLAME [8] was proposed to parameterize expression, pose, and shape. Previous methods aim to regress these parameters and use a 2D face image reconstruction loss as the training target. Model-based methods compress geometric information into low-dimensional representations and are thus widely used in recent facial animation work.
    • Sanyal et al. [8] use cycle consistency on projected 2D keypoints to achieve 3D face reconstruction without using ground truth data of 3D faces. Feng et al. [3] considered texture mapping and detailed displacement, using differentiable rendering to train their generator. Medin et al. [11] tackled the facial animation problem similar to Feng's work [3], but the final output is a 2D image instead of a 3D face.
    • In contrast to 3D model-free methods, 3D model-based methods typically use differentiable renderers to compute a reconstruction loss between images, and thus do not require ground-truth 3D facial geometry and texture data. However, for virtual reality applications, existing 3D model-based methods may fail because the user's face is occluded by the head-mounted display (HMD), resulting in incomplete input face information.

To summarize, current approaches to facial animation suffer from some trade-offs in terms of data acquisition and fidelity.

  • 2D methods are convenient in terms of data acquisition and cross-modal inference, but cannot provide highly immersive rendering results. 3D model-free methods provide good rendering results but are limited by the difficulty of data acquisition.
  • 3D model-based methods strike a balance between data acquisition and rendering quality, but they rely on full RGB facial images and are therefore impractical in occluded environments such as virtual reality.

In this paper, we adopt a one-shot framework that drives 3D facial animation from a single full facial image of the user and a sequence of facial keypoints.

Facial keypoint sequences can be obtained from full facial images, or from partially occluded facial images in VR scenes, with the aid of additional NIR (near infrared) images.

Our proposed architecture builds on the concept of 3D model-based methods and can operate without 3D facial ground truth. Based on the proposed framework, we further propose to use the facial expression represented by keypoints as a style, and to adapt the target facial texture map with a StyleGAN-based generator.

Experimental results demonstrate that our proposed method generates remarkable face synthesis results in real time.

2 Method

Figure 2(a) shows our system framework.

Figure 2: Overall system framework

The driving keypoint map (landmark map) $L'$ is obtained by an off-the-shelf keypoint predictor $E_L$ from the source face image $I'$ (or from an additional near-infrared image capturing a partially occluded face).

Given the keypoint map $L'$, the facial geometry regressor $E_R$ predicts the facial parameters.

Meanwhile, given the complete user face image $I_0$ from the one-shot setup described earlier, a pre-trained avatar estimator $E_T$ estimates the user's initial facial texture $T_0$, and the keypoint predictor $E_L$ is applied to obtain the user's initial keypoint map $L_0$.

We propose a style-based texture translator $S_T$ that, given the keypoint maps $L_0$ and $L'$, warps the initial facial texture $T_0$ into the target texture $T'$; the warp is driven by the residual information $\Delta S$ computed between $L_0$ and $L'$.

Finally, for each source frame $I'$, the avatar generator $D_A$ combines the facial parameters and the texture $T'$ to generate the final avatar $Y$.
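
To make the data flow above concrete, here is a minimal PyTorch-style sketch of one inference step. The module interfaces and variable names are hypothetical stand-ins for $E_L$, $E_R$, $E_T$, $S_T$, and $D_A$, not the authors' code.

```python
# Hedged sketch of the SOFA inference pipeline in Figure 2(a); all module
# interfaces are assumptions made for illustration.
import torch

@torch.no_grad()
def animate_frame(E_L, E_R, E_T, S_T, D_A, user_image, source_frame):
    # One-shot setup: initial texture T0 and keypoint map L0 from the full face image I0.
    T0 = E_T(user_image)        # pre-trained avatar estimator
    L0 = E_L(user_image)        # off-the-shelf keypoint predictor

    # Driving signal: keypoint map L' from the source frame
    # (or from an additional NIR image in a VR setting).
    L_prime = E_L(source_frame)

    # Geometry: FLAME pose/expression parameters regressed from L'.
    theta_p, psi_p = E_R(L_prime)

    # Texture: warp T0 toward the target expression, driven by the residual
    # between the two keypoint maps (Section 2.2).
    T_prime = S_T(T0, L0, L_prime)

    # Render the final avatar Y from the geometry parameters and animated texture.
    return D_A(theta_p, psi_p, T_prime)
```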

2.1 Geometry Regressor

Geometry Regressor ($E_R$ in the figure).

Directly synthesizing an entire 3D facial model represented by vertices using single-view images is a very complex problem.

Inspired by previous work, we adopt FLAME [8] as the morphable model, which requires three kinds of parameters:

  • Pose $\theta$
  • Expression $\psi$
  • Shape $\beta$

to generate a 3D facial mesh. Compared to modeling the complex geometry of the entire face, using a morphable model like FLAME has the advantage of a lower-degree-of-freedom representation, allowing us to design a lightweight geometry regressor $E_R$ to estimate the FLAME parameters and generate avatars in real time.

  1. Pose: the pose parameter $\theta$ describes the rotation and translation of the 3D facial mesh in space. It controls the orientation of the avatar's head and face, enabling different head poses and face orientations.
  2. Expression: the expression parameter $\psi$ describes the facial expressions of the 3D facial mesh, such as smiling, anger, or sadness. It controls changes in the avatar's facial expression.
  3. Shape: the shape parameter $\beta$ describes the overall shape of the 3D facial mesh. It controls changes in the avatar's facial shape, capturing the personalized characteristics of different users.

In addition to reducing the model size, using the FLAME morphable model makes it possible to generate high-quality facial meshes without the need for 3D ground-truth data.

It is worth noting that

  1. The geometry regressor $E_R$ only estimates the pose parameters $\theta'$ and the expression parameters $\psi'$.
  2. The shape parameters $\beta'$ are estimated by the avatar estimator $E_T$ from the complete user face image $I_0$.

In Section 3.3, we show that the geometry regressor $E_R$ works better when it does not regress the shape parameters.
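
A minimal sketch of such a lightweight regressor is given below, assuming the ResNet-18 backbone with two MLP heads described in Section 3.1. The parameter dimensions (`pose_dim`, `expr_dim`) and the 3-channel rendering of the keypoint map are illustrative assumptions, not values stated in the paper.

```python
# Hedged sketch of the geometry regressor E_R: a ResNet-18 backbone with two
# MLP branches for pose θ' and expression ψ' (shape β comes from E_T instead).
import torch
import torch.nn as nn
import torchvision

class GeometryRegressor(nn.Module):
    def __init__(self, pose_dim=6, expr_dim=50):  # dimensions are assumptions
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()               # keep the 512-d pooled feature
        self.backbone = backbone
        self.pose_head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, pose_dim))
        self.expr_head = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, expr_dim))

    def forward(self, landmark_map):              # L' rendered as a 3-channel image
        feat = self.backbone(landmark_map)
        return self.pose_head(feat), self.expr_head(feat)
```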

2.2 Style-Based Texture Translator

Style-based Texture Translator ($S_T$ in the figure).

The style-based texture translator $S_T$ receives a style code $\Delta S$, the residual information of the keypoint maps, and uses it to estimate the animated texture map $T'$.

  • A mapping network $M$ maps the 2D keypoint map $L'$ to a code $S'$, which contains information about the subject identity and the source expression.
  • Similarly, the mapping network $M$ is applied to the 2D keypoint map $L_0$ to extract $S_0$, which contains information about the subject identity and the neutral expression.

To reduce dependence on the subject identity and keep only the expression information, we encode the residual between $S'$ and $S_0$ as the style code, namely:
$$\Delta S = S' - S_0 \tag{1}$$
As shown in Figure 2(b), the texture translator $S_T$ consists of $N$ encoding blocks $\lbrace E_i \rbrace^{N}_{i=1}$ and $N$ style-based stacked warping blocks $\lbrace D_i \rbrace^{N}_{i=1}$, connected with skip connections similar to the U-Net architecture.

Conditioned on $\Delta S$, each style-based stacked warping block $D_i$ takes the output features of the previous block $D_{i+1}$ and of the encoder block $E_i$ as input.

More specifically, each warping block $D_i$ is a StyleGAN generator with modulated convolutional layers, formulated as:
$$f_{D_i} = \mathrm{Upsample}(\mathrm{convm}(D_i(f_{D_{i+1}}, f_{E_i}), \Delta S)) \tag{2}$$
Note that $f_{D_0}$ is the final animated texture map $T'$.

In Section 3.2, we verify that applying PixelShuffle [14] as the upsampling operation improves fine-grained generation quality compared to using deconvolutional layers. By providing the style code at different receptive fields, the texture translator $S_T$ is able to generate global representations with a specific style.
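
The sketch below illustrates what one such style-based warping block might look like, assuming a StyleGAN2-style modulated convolution and PixelShuffle upsampling; the channel sizes, the demodulation step, and the way the skip feature is fused are implementation assumptions, not details confirmed by the paper.

```python
# Hedged sketch of one decoding block D_i (Eq. 2): fuse f_{D_{i+1}} with the
# skip feature f_{E_i}, modulate a convolution with the style code ΔS, then
# upsample with PixelShuffle.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    """StyleGAN2-style modulated convolution (simplified, grouped-conv form)."""
    def __init__(self, in_ch, out_ch, style_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.affine = nn.Linear(style_dim, in_ch)   # ΔS -> per-input-channel scales
        self.pad = k // 2

    def forward(self, x, style):
        b, c, h, w = x.shape
        s = self.affine(style).view(b, 1, c, 1, 1)
        w_mod = self.weight.unsqueeze(0) * s                         # modulate
        demod = torch.rsqrt((w_mod ** 2).sum(dim=[2, 3, 4]) + 1e-8)  # demodulate
        w_mod = w_mod * demod.view(b, -1, 1, 1, 1)
        w_mod = w_mod.view(-1, c, *self.weight.shape[2:])
        out = F.conv2d(x.view(1, b * c, h, w), w_mod, padding=self.pad, groups=b)
        return out.view(b, -1, h, w)

class StyleWarpBlock(nn.Module):
    def __init__(self, dec_ch, enc_ch, out_ch, style_dim):
        super().__init__()
        self.fuse = nn.Conv2d(dec_ch + enc_ch, out_ch, 3, padding=1)
        self.mod = ModulatedConv2d(out_ch, out_ch * 4, style_dim)  # 4x channels for PixelShuffle(2)
        self.up = nn.PixelShuffle(2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, f_dec, f_enc, delta_s):
        x = self.act(self.fuse(torch.cat([f_dec, f_enc], dim=1)))
        return self.up(self.act(self.mod(x, delta_s)))             # f_{D_i}
```

In this sketch the style code would simply be `delta_s = M(L_prime) - M(L0)` for some mapping network `M`, mirroring Eq. (1).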

2.3 Avatar Generator

Avatar generator ($D_A$ in the figure).

Since our facial model is built on top of FLAME, we can apply predefined UV mappings to the estimated avatar for rendering.

To improve rendering quality, we employ a photometric loss based on the L2 norm, which measures the difference between face images rendered with the estimated texture and with the real texture (see Section 2.4 for details).

Furthermore, we employ differentiable rendering so that the proposed texture translator $S_T$ can be trained end to end. It is worth noting that, at inference time, other non-differentiable engines can be used to render the estimated avatar.

Our estimated 3D virtual characters are not constrained by the original camera angles of the input images during rendering.

2.4 Learning Objectives

During the training phase, the geometry regressor $E_R$ and the texture translator $S_T$ are trained separately.

  • For the geometry regressor $E_R$, we minimize the geometry loss, defined as:
    $$L_g = \lambda_F L_{FLAME} + \lambda_l L_l \tag{3}$$

    • $L_{FLAME}$ is the L2 loss between the estimated and ground-truth parameters, i.e., $||\theta'-\theta||^2$ and $||\psi'-\psi||^2$.

    • $L_l$ is the L2 loss between the 3D keypoints of the estimated mesh and those of the ground-truth mesh. Note that the 3D keypoints of a mesh are extracted by FLAME.

  • For the texture translator $S_T$, we minimize the texture loss, defined as:
    $$L_T = \lambda_i L_i + \lambda_r L_r + \lambda_p L_p \tag{4}$$

    • $L_i$ denotes the L2 loss between the estimated texture map and the ground-truth texture map.
    • $L_r$ is the aforementioned photometric loss.
    • $L_p$ is the perceptual loss [6] between the estimated texture map and the ground-truth texture map.

$\lambda_F$, $\lambda_l$, $\lambda_i$, $\lambda_r$, and $\lambda_p$ are predefined hyperparameters.
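
A hedged sketch of the two objectives is shown below; it assumes that ground-truth FLAME parameters and texture maps are available from an existing face-image fitter, that `render` is a differentiable renderer, and that `perceptual` implements the loss of [6]. The weight values are placeholders.

```python
# Hedged sketch of the training objectives in Eqs. (3) and (4).
import torch
import torch.nn.functional as F

def geometry_loss(theta_hat, psi_hat, theta_gt, psi_gt,
                  lmk3d_hat, lmk3d_gt, lam_F=1.0, lam_l=1.0):
    # L_FLAME: L2 loss on the estimated pose/expression parameters.
    L_flame = F.mse_loss(theta_hat, theta_gt) + F.mse_loss(psi_hat, psi_gt)
    # L_l: L2 loss on the 3D keypoints extracted by FLAME from both meshes.
    L_lmk = F.mse_loss(lmk3d_hat, lmk3d_gt)
    return lam_F * L_flame + lam_l * L_lmk                      # Eq. (3)

def texture_loss(T_hat, T_gt, render, perceptual,
                 lam_i=1.0, lam_r=1.0, lam_p=1.0):
    L_i = F.mse_loss(T_hat, T_gt)                  # UV-space texture L2
    L_r = F.mse_loss(render(T_hat), render(T_gt))  # photometric loss on rendered faces
    L_p = perceptual(T_hat, T_gt)                  # perceptual loss [6]
    return lam_i * L_i + lam_r * L_r + lam_p * L_p              # Eq. (4)
```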

3 Experiments

3.1 Experimental setup

  • Data collection. To demonstrate that our 3D model-based approach can operate without 3D face ground truth, we collect a 2D face dataset consisting of 792 video sequences:

    • It contains 6 basic emotions (including surprise, fear, disgust, happiness, sadness, anger) and 12 compound facial expressions (composed of these 6 basic emotions) to cover rich natural expressions.
    • We invited 22 subjects to collect the video dataset, and each subject performed each expression twice.
    • Videos of two of the subjects are used as the test set.
  • Data processing. For the collected raw video sequences, we use an off-the-shelf face detection model [10] to crop the face region in each frame. Then, the cropped face region is resized to 256×256 and used as input to our network.

  • Training details. The geometry regressor $E_R$ and the texture translator $S_T$ are trained separately.

    • For the geometry regressor $E_R$, we use ResNet-18 as the backbone network to extract features, and then apply two MLP branches to predict pose and expression, respectively. In our experiments, we train the model for 50 epochs using the Adam optimizer with a learning rate of 0.0002 (see the training sketch after this list).

    • The texture translator $S_T$ consists of 3 encoding blocks and 3 decoding blocks connected in a U-Net architecture. In our experiments, we train the model for 100 epochs using the Adam optimizer with a learning rate of 0.0002.
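
The following minimal training sketch for the geometry stage reflects only the optimizer (Adam), learning rate (0.0002), and epoch count stated above; the `geo_loader` data pipeline and the use of parameter labels as supervision targets are assumptions, and only the $L_{FLAME}$ term of Eq. (3) is shown.

```python
# Hedged sketch of the geometry-stage training loop (L_FLAME term only).
import torch
import torch.nn.functional as F

def train_geometry_regressor(E_R, geo_loader, epochs=50, lr=2e-4, device="cuda"):
    E_R.to(device).train()
    opt = torch.optim.Adam(E_R.parameters(), lr=lr)
    for _ in range(epochs):
        for lmk_map, theta_gt, psi_gt in geo_loader:   # hypothetical batch layout
            lmk_map = lmk_map.to(device)
            theta_gt, psi_gt = theta_gt.to(device), psi_gt.to(device)
            theta_hat, psi_hat = E_R(lmk_map)
            loss = F.mse_loss(theta_hat, theta_gt) + F.mse_loss(psi_hat, psi_gt)
            opt.zero_grad()
            loss.backward()
            opt.step()
```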

3.2 Texture Translation

We compare style-based decoding blocks to a baseline method that fuses features directly using 2D convolutional layers.

In practice, we applied the same training setup, except for the architecture of the decoder.

In the baseline model, instead of mapping the extracted keypoint features to a style code, we concatenate the keypoint features at the bottleneck with the encoder's output 2D feature map.

  • In UV space, reconstruction metrics such as L1, PSNR, SSIM, and FID are commonly used.

  • In our experiments, we found that L1, PSNR, and SSIM differ only slightly between the compared methods, so we use FID as the performance metric to demonstrate the effectiveness of each method (see the sketch below).
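
As a concrete illustration, FID between real and generated texture maps can be computed with an off-the-shelf implementation such as `torchmetrics`; this is an assumed evaluation setup, and the paper does not specify which FID implementation was used.

```python
# Hedged sketch: FID between real and generated texture maps via torchmetrics.
from torchmetrics.image.fid import FrechetInceptionDistance

def texture_fid(pairs):
    """pairs: iterable of (real_tex, fake_tex) float tensors in [0, 1], shape (B, 3, H, W)."""
    fid = FrechetInceptionDistance(feature=2048, normalize=True)
    for real_tex, fake_tex in pairs:
        fid.update(real_tex, real=True)
        fid.update(fake_tex, real=False)
    return fid.compute().item()
```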

Table 1 shows the quantitative comparison between methods, including the baseline, our proposed method, and ablation studies on applying PixelShuffle (denoted as Pix), encoding the residual information as style (denoted as Res), and whether the perceptual loss is applied (denoted as PLoss).

Table 1: Texture conversion comparison results. Pix uses PixelShuffle for upsampling, Res uses residual information as style encoding, and PLoss applies perceptual loss.


We find that texture reconstruction quality can be significantly improved using our proposed style-based decoder. Encoding the residual information as style further improves the visual quality.

Figure 3: Qualitative comparison between different methods. For each method, the left column is the texture map, and the right column is the rendered result.

Figure 3 shows that the baseline model cannot well reconstruct detailed eye expressions such as blinking, while our proposed style-based texture converter can better reconstruct facial details. Furthermore, our model is lightweight and enables real-time inference (about 20 fps).

  • Table 1 shows that applying perceptual loss reduces FID during training.
  • Figure 3 shows that perceptual loss helps preserve high-frequency details (such as wrinkles or lighting) in texture reconstruction.

We also tried applying the patch-GAN [5] loss to further improve the visual quality, but the FID performance dropped significantly.

3.3 Geometry Estimation

We compare the results between estimating all FLAME parameters and only estimating expression/pose without estimating shape.

Table 2: Comparison of geometric estimates (MSE ↓)

Table 2 shows the corresponding MSE (mean square error), which measures the distance between the estimated parameters and the true parameters. We can observe that models that do not estimate shape information perform better.

In addition, we compare the MSE of the 3D keypoints obtained by FLAME, and the results likewise show that the model that does not estimate shape information achieves better performance.

4 Conclusion

  • This study proposes a novel framework to model 3D facial animation using 2D keypoints without using 3D facial datasets as ground truth.
  • We provide a flexible solution that can drive a 3D avatar as long as 2D facial keypoints can be obtained, together with a single complete image of the user's face (one-shot setup).
  • We show that by using the proposed style-based framework, the visual quality of the reconstructed characters outperforms baseline methods.
  • In the future, we will validate the proposed framework based on different control inputs and demonstrate the generalization ability of the model.


REFERENCES

[1] Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques.

[2] Chen Cao, Tomas Simon, Jin Kyu Kim, Gabe Schwartz, Michael Zollhoefer, Shun-Suke Saito, Stephen Lombardi, Shih-En Wei, Danielle Belko, Shoou-I Yu, et al. 2022. Authentic volumetric avatars from a phone scan. ACM Transactions on Graphics (TOG) (2022).

[3] Yao Feng, Haiwen Feng, Michael J Black, and Timo Bolkart. 2021. Learning an animatable detailed 3D face model from in-the-wild images. ACM Transactions on Graphics (ToG) (2021).

[4] Kuangxiao Gu, Yuqian Zhou, and Thomas Huang. 2020. Flnet: Landmark-driven fetching and learning network for faithful talking facial animation synthesis. In Proceedings of the AAAI conference on artificial intelligence.

[5] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-Image Translation with Conditional Adversarial Networks. CVPR (2017).

[6] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision. Springer.

[7] Reinhard Knothe, Brian Amberg, Sami Romdhani, Volker Blanz, and Thomas Vetter. 2011. Morphable Models of Faces. In Handbook of Face Recognition. Springer.

[8] Tianye Li, Timo Bolkart, Michael J Black, Hao Li, and Javier Romero. 2017. Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. (2017).

[9] Stephen Lombardi, Jason Saragih, Tomas Simon, and Yaser Sheikh. 2018. Deep appearance models for face rendering. ACM Transactions on Graphics (ToG) (2018).

[10] Camillo Lugaresi, Jiuqiang Tang, Hadon Nash, Chris McClanahan, Esha Uboweja, Michael Hays, Fan Zhang, Chuo-Ling Chang, Ming Guang Yong, Juhyun Lee, et al. 2019. Mediapipe: A framework for building perception pipelines. arXiv preprint arXiv:1906.08172 (2019).

[11] Safa C Medin, Bernhard Egger, Anoop Cherian, Ye Wang, Joshua B Tenenbaum, Xiaoming Liu, and Tim K Marks. 2022. MOST-GAN: 3D morphable StyleGAN for disentangled face image manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence.

[12] Moustafa Meshry, Saksham Suri, Larry S Davis, and Abhinav Shrivastava. 2021. Learned Spatial Representations for Few-shot Talking-Head Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision.

[13] Alexander Richard, Colin Lea, Shugao Ma, Jurgen Gall, Fernando De la Torre, and Yaser Sheikh. 2021. Audio-and gaze-driven facial animation of codec avatars. In Proceedings of the IEEE/CVF winter conference on applications of computer vision.

[14] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. 2016. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition.

[15] Jiale Tao, Biao Wang, Borun Xu, Tiezheng Ge, Yuning Jiang, Wen Li, and Lixin Duan. 2022. Structure-Aware Motion Transfer with Deformable Anchor Model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[16] Shih-En Wei, Jason Saragih, Tomas Simon, Adam W Harley, Stephen Lombardi, Michal Perdoch, Alexander Hypes, Dawei Wang, Hernan Badino, and Yaser Sheikh. 2019. VR facial animation via multiview image translation. ACM Transactions on Graphics (TOG) (2019).

[17] Zili Yi, Qiang Tang, Vishnu Sanjay Ramiya Srinivasan, and Zhan Xu. 2020. Animating through warping: An efficient method for high-quality facial expression animation. In Proceedings of the 28th ACM international conference on multimedia.

[18] Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. 2019. Few-shot adversarial learning of realistic neural talking head models. In Proceedings of the IEEE/CVF international conference on computer vision.

[19] Ruiqi Zhao, Tianyi Wu, and Guodong Guo. 2021. Sparse to dense motion transfer for face image animation. In Proceedings of the IEEE/CVF International Conference on Computer Vision.


Origin blog.csdn.net/I_am_Tony_Stark/article/details/132031840