Paper: "Depth-Aware Generative Adversarial Network for Talking Head Video Generation" (CVPR 2022)
github: https://github.com/harlanhong/CVPR2022-DaGAN
Problem addressed
Existing problems:
Existing talking-head video generation schemes rely mainly on 2D representations, yet 3D facial geometry is crucial for this task, and annotating it is expensive.
Solution:
The authors propose a self-supervised scheme that recovers dense 3D geometry (depth maps) from face videos without any annotated data. From this geometry, sparse facial keypoints are estimated to capture the important head movements. The depth information is also used to learn a 3D cross-modal (appearance and depth) attention mechanism that guides the generation of a motion field for warping the source image.
The proposed DaGAN generates high-fidelity talking faces and performs well even on unseen identities.
The contributions of this paper are threefold:
1. A self-supervised method that recovers depth maps from videos and uses them to improve generation quality;
2. A novel depth-aware GAN that injects depth information into the generation network via depth-guided facial keypoint estimation and a cross-modal (depth and image) attention mechanism;
3. Extensive experiments showing accurate depth recovery on face images, with generation quality surpassing SOTA.
Algorithm
The DaGAN architecture is shown in Figure 2; it consists of a generator and a discriminator.
The generator has three parts:
1. A self-supervised depth estimation sub-network F_d, trained on pairs of consecutive frames from the videos; F_d is then frozen while the rest of the network is trained;
2. A depth-guided sparse keypoint detection sub-network F_kp;
3. A feature warping module that uses the keypoints to generate a motion field and warps the source image features, combining appearance and motion information into the warped feature F_w. To make the model attend to fine details and facial micro-expressions, a depth-aware attention map is further learned to refine F_w into F_g, which is used to generate the output image I_g.
Self-supervised Face Depth Learning
The authors build on SfM-Learner with some modifications. Given consecutive frames, I_{i+1} serves as the source image and I_i as the target image, and the network learns the depth map D_{I_i}, the camera intrinsics K_{I_i→I_{i+1}}, the rotation R_{I_i→I_{i+1}}, and the translation t_{I_i→I_{i+1}}. Unlike SfM-Learner, the camera intrinsics K must also be learned here, since they are not available for in-the-wild face videos.
The process is shown in Figure 3:
1. F_d predicts the depth map D_{I_i} of the target frame I_i;
2. The pose network F_p predicts the learnable parameters R, t, and K;
3. Following Equations 3 and 4, the source image I_{i+1} is geometrically warped to reconstruct I'_i, where q_k denotes the warped sampling position in the source image I_{i+1} and p_j denotes the original pixel in the target image I_i.
The photometric loss P_e, given in Equation 5, combines an L1 loss with an SSIM loss.
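The L1 + SSIM combination of Equation 5 can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it uses a global (single-window) SSIM rather than the usual local-window version, and the weighting `alpha` is an assumed value.

```python
import numpy as np

def l1_loss(a, b):
    return np.abs(a - b).mean()

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global SSIM over the whole image; real implementations compute
    # it per local window and average.
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def photometric_loss(target, warped, alpha=0.85):
    # alpha balances the SSIM and L1 terms; 0.85 is a common choice
    # in depth-estimation work, not necessarily the paper's value.
    return alpha * 0.5 * (1.0 - ssim(target, warped)) + \
           (1.0 - alpha) * l1_loss(target, warped)
```

A perfectly reconstructed frame yields zero loss, which is why minimizing this objective drives the predicted depth, pose, and intrinsics toward geometrically consistent values.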
Sparse Keypoint Motion Modeling
1. Concatenate the RGB image with the depth map extracted by F_d;
2. Feed the result to the keypoint estimation module F_kp to obtain the sparse facial keypoints, as in Equation 6. The added depth channel makes the keypoint predictions more accurate.
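The two steps above amount to stacking a depth channel onto the RGB input and reading keypoint coordinates out of predicted heatmaps. A minimal sketch, assuming a soft-argmax readout (the standard way sparse keypoints are extracted from heatmaps; the paper's exact head may differ):

```python
import numpy as np

def soft_argmax(heatmap):
    # Turn one H×W heatmap into (x, y) coordinates via a spatial
    # softmax, so keypoint extraction stays differentiable.
    h, w = heatmap.shape
    p = np.exp(heatmap - heatmap.max())
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return (p * xs).sum(), (p * ys).sum()

# Depth-guided input: RGB (3 channels) concatenated with depth (1 channel)
rgb = np.random.rand(3, 64, 64)
depth = np.random.rand(1, 64, 64)
x = np.concatenate([rgb, depth], axis=0)  # 4-channel input to F_kp
```

The 4-channel tensor `x` is what the keypoint sub-network consumes; a heatmap with a single sharp peak decodes to that peak's coordinates.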
The feature warping strategy is shown in Figure 4:
1. As in Equation 7, compute the initial offsets {O_n} between the source and driving keypoints;
2. Generate a 2D coordinate map z;
3. Apply the offsets {O_n} to z to obtain the motion fields w_m;
4. Warp the downsampled source image with w_m to obtain the initial warped feature map;
5. From the warped feature map, an occlusion estimator τ predicts a motion-flow mask M_m and an occlusion map M_o;
6. Use M_m to warp the appearance feature map produced by the encoder ε_I from the source image I_s, then fuse the result with M_o to obtain F_w, as in Equation 8. F_w preserves the appearance of the source image while encoding the motion between the two faces.
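Steps 2–4 above can be sketched in NumPy. This is an illustration under simplifying assumptions: a single constant offset stands in for the per-keypoint fields {O_n}, and nearest-neighbour sampling replaces the bilinear grid sampling used in practice.

```python
import numpy as np

def warp_feature(feat, flow):
    # feat: (C, H, W); flow: (2, H, W) absolute sampling coordinates
    # (x in flow[0], y in flow[1]). Nearest-neighbour gather.
    c, h, w = feat.shape
    xs = np.clip(np.round(flow[0]), 0, w - 1).astype(int)
    ys = np.clip(np.round(flow[1]), 0, h - 1).astype(int)
    return feat[:, ys, xs]

h, w = 8, 8
ys, xs = np.mgrid[0:h, 0:w].astype(float)
z = np.stack([xs, ys])                      # step 2: 2D coordinate map
offset = np.zeros_like(z)
offset[0] += 1.0                            # toy stand-in for an offset O_n
w_m = z + offset                            # step 3: motion field
feat = np.random.rand(4, h, w)
warped = warp_feature(feat, w_m)            # step 4: initial warped features
```

With a zero offset the motion field is the identity and the features come back unchanged, which is a handy sanity check for any warping layer.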
Cross-modal attention mechanism
To exploit the learned depth maps effectively for generation, the authors propose a cross-modal attention mechanism, shown in Figure 5:
1. A depth encoder ε_d extracts a feature map F_d from the depth map D_s of the source image;
2. Three 1×1 convolutional layers map F_d and F_w to three latent feature maps: the query F_q (from depth), and the key F_k and value F_v (from appearance);
3. As in Equation 9, attention over these features produces F_g;
4. A decoder refines F_g to generate the final image I_g.
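The steps above can be sketched as plain NumPy. This is a schematic of dot-product attention with depth-derived queries, not the paper's code: the 1×1 convolutions are written as per-pixel channel maps, and all shapes are toy values.

```python
import numpy as np

def conv1x1(x, weight):
    # A 1x1 convolution is a per-pixel linear map over channels.
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, -1)).reshape(weight.shape[0], h, w)

def cross_modal_attention(f_d, f_w, w_q, w_k, w_v):
    # Queries from the depth features, keys/values from the warped
    # appearance features; attention runs over spatial positions.
    c, h, w = f_w.shape
    q = conv1x1(f_d, w_q).reshape(w_q.shape[0], -1)   # (d, HW)
    k = conv1x1(f_w, w_k).reshape(w_k.shape[0], -1)   # (d, HW)
    v = conv1x1(f_w, w_v).reshape(w_v.shape[0], -1)   # (c, HW)
    logits = q.T @ k / np.sqrt(q.shape[0])            # (HW, HW)
    att = np.exp(logits - logits.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)            # softmax over keys
    f_g = v @ att.T                                   # (c, HW)
    return f_g.reshape(w_v.shape[0], h, w)
```

Because the queries come from depth, each spatial position of F_g is a geometry-guided mixture of appearance features, which is what lets the module emphasize depth-salient regions such as expression details.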
Training
During training, the source image and the driving image are sampled from the same video of the same person; the total loss is given in Equation 10, where:
L_P is the perceptual loss;
L_G is the least-squares GAN loss;
L_E is the equivariance loss, which ensures the keypoints transform consistently when the source image is transformed;
L_D is the keypoint distance loss, which prevents the facial keypoints from clustering together.
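The keypoint distance term L_D can be illustrated with a pairwise hinge penalty. This is a sketch of the idea, not the paper's exact formulation; the `margin` value is an assumption.

```python
import numpy as np

def keypoint_distance_loss(kps, margin=0.2):
    # Penalize every pair of keypoints closer than `margin`, which
    # discourages the detector from collapsing all keypoints onto
    # a single location.
    loss = 0.0
    n = len(kps)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(kps[i] - kps[j])
            loss += max(0.0, margin - d)
    return loss
```

Well-spread keypoints incur no penalty, while fully collapsed ones are penalized once per pair, so the gradient pushes them apart during training.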
Experiments
Comparison with SOTA methods
Results on the VoxCeleb1 dataset compared against SOTA methods are shown in Tables 1 and 2.
Cross-identity reenactment on VoxCeleb1 is shown in Figure 6.
On the CelebV dataset, the comparison with SOTA methods is shown in Table 3, and cross-identity reenactment in Figure 7.
Ablation study
FDN: face depth network;
CAM: cross-modal attention mechanism.
The quantitative results are shown in Table 4, and the qualitative generation results in Figure 8.
DaGAN demo video
Conclusion
DaGAN learns facial depth maps in a self-supervised way. The depth is used, on the one hand, for more accurate facial keypoint estimation, and on the other, in a cross-modal (depth and RGB) attention mechanism that captures micro-expression changes. As a result, DaGAN produces more realistic and natural talking-head videos.