Interpretation of the DaGAN Paper


Paper: Depth-Aware Generative Adversarial Network for Talking Head Video Generation
GitHub: https://github.com/harlanhong/CVPR2022-DaGAN

Problem Addressed

Existing problem:
Existing talking-head video generation methods rely mainly on 2D representations, yet 3D facial geometry is crucial for this task and annotating it is very costly.

Solution:
The authors propose a self-supervised scheme that automatically recovers dense 3D geometry (depth) from face videos without any annotated data. Based on this depth, sparse facial keypoints are further estimated to capture the important movements of the head. The depth information is also used to learn a 3D cross-modal (appearance and depth) attention mechanism that guides the generation of the motion field used to warp the source image.
The proposed DaGAN generates high-fidelity faces and performs well even on unseen identities.

The contributions of this paper are threefold:
1. A self-supervised method is introduced to recover depth maps from videos and use them to improve generation quality;
2. A novel depth-aware GAN is proposed, which injects depth information into the generation network through depth-guided facial keypoint estimation and a cross-modal (depth and image) attention mechanism;
3. Extensive experiments show accurate depth recovery for face images, and generation quality that surpasses the SOTA.

Algorithm

The DaGAN architecture is shown in Figure 2 and consists of a generator and a discriminator.
The generator has three parts:
1. A self-supervised depth estimation sub-network $F_d$, which learns depth from pairs of consecutive video frames in a self-supervised way; $F_d$ is then frozen while the rest of the network is trained;
2. A depth-guided sparse keypoint detection sub-network $F_{kp}$;
3. A feature warping module, which uses the keypoints to generate the motion field and warps the source image features, combining appearance and motion information into the warped feature $F_w$. To make the model attend to fine details and facial micro-expressions, a depth-aware attention map is further learned and used to refine $F_w$ into $F_g$, from which the output image $I_g$ is generated (a minimal code sketch of this data flow follows Figure 2).
[Figure 2: Overview of the DaGAN framework]
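Below is a minimal PyTorch-style sketch of this generator data flow. The sub-module names (`depth_net`, `keypoint_detector`, `motion_module`, `attention`, `decoder`) are placeholders for illustration and do not correspond to the repository's actual classes.

```python
import torch
import torch.nn as nn

class DaGANGenerator(nn.Module):
    """Illustrative wiring of the three generator parts described above."""

    def __init__(self, depth_net, keypoint_detector, motion_module, attention, decoder):
        super().__init__()
        self.depth_net = depth_net                    # F_d, pretrained and frozen
        self.keypoint_detector = keypoint_detector    # F_kp, depth-guided
        self.motion_module = motion_module            # feature warping (Figure 4)
        self.attention = attention                    # cross-modal attention (Figure 5)
        self.decoder = decoder                        # produces I_g from F_g

    def forward(self, source, driving):
        # 1. Depth maps for both frames from the frozen depth network F_d
        with torch.no_grad():
            d_source = self.depth_net(source)
            d_driving = self.depth_net(driving)
        # 2. Depth-guided sparse keypoints for the source and driving frames
        kp_source = self.keypoint_detector(torch.cat([source, d_source], dim=1))
        kp_driving = self.keypoint_detector(torch.cat([driving, d_driving], dim=1))
        # 3. Warp the source features with the estimated motion field -> F_w
        f_w = self.motion_module(source, kp_source, kp_driving)
        # 4. Refine F_w with depth-aware attention -> F_g, then decode to I_g
        f_g = self.attention(f_w, d_source)
        return self.decoder(f_g)
```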

Self-supervised Face Depth Learning

The authors build on SfM-Learner with some modifications. Consecutive frames are used, with $I_{i+1}$ as the source image and $I_i$ as the target image, to learn the depth map $D_{I_i}$, the camera intrinsics $K_{I_i \rightarrow I_{i+1}}$, the rotation $R_{I_i \rightarrow I_{i+1}}$, and the translation $t_{I_i \rightarrow I_{i+1}}$. The difference from SfM-Learner is that the camera intrinsics $K$ also have to be learned.
The process is shown in Figure 3:
1. $F_d$ predicts the depth map $D_{I_i}$ of the target image $I_i$;
2. A pose network $F_p$ predicts the learnable parameters $R$, $t$, and $K$;
3. Following Equations 3 and 4, the source image $I_{i+1}$ is geometrically warped to obtain the reconstruction $I'_i$.
[Figure 3 and Equations 3–4]
$q_k$ denotes the warped pixel on the source image $I_{i+1}$;
$p_j$ denotes the original pixel on the target image $I_i$;
the photometric loss $P_e$ is given in Equation 5 and combines an L1 term with an SSIM term (a minimal sketch of this loss follows Equation 5).
[Equation 5]
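A minimal sketch of this photometric objective, assuming the common L1 + SSIM combination used in SfM-Learner-style training; the 3×3 window and the weight `alpha` are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def ssim_distance(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Simplified SSIM dissimilarity with a 3x3 average-pooling window.
    mu_x, mu_y = F.avg_pool2d(x, 3, 1), F.avg_pool2d(y, 3, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(target, reconstruction, alpha=0.85):
    # target: I_i, reconstruction: I'_i warped from I_{i+1}
    l1 = (target - reconstruction).abs().mean()
    return alpha * ssim_distance(target, reconstruction).mean() + (1 - alpha) * l1
```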

Sparse Keypoint Motion Modeling

1. The RGB image and the depth map produced by $F_d$ are concatenated;
2. The keypoint estimation module $F_{kp}$ then predicts the sparse facial keypoints, as in Equation 6. Because the depth map is included, the keypoint predictions are more accurate (a small sketch of this step follows Equation 6).
[Equation 6]
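An illustrative sketch of depth-guided keypoint detection, assuming an RGB-D concatenation followed by per-keypoint heatmaps and a soft-argmax readout; the `backbone` and the number of keypoints are stand-ins, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthGuidedKeypointDetector(nn.Module):
    def __init__(self, backbone, num_kp=15):
        super().__init__()
        self.backbone = backbone  # any encoder-decoder taking 4-channel RGB-D input
        self.heatmap_head = nn.Conv2d(backbone.out_channels, num_kp, kernel_size=7, padding=3)

    def forward(self, image, depth):
        x = torch.cat([image, depth], dim=1)             # (B, 3+1, H, W)
        heatmaps = self.heatmap_head(self.backbone(x))   # (B, K, H', W')
        b, k, h, w = heatmaps.shape
        prob = F.softmax(heatmaps.view(b, k, -1), dim=-1).view(b, k, h, w)
        # Soft-argmax: expected (x, y) coordinate under each heatmap, in [-1, 1]
        ys = torch.linspace(-1, 1, h, device=image.device).view(1, 1, h, 1)
        xs = torch.linspace(-1, 1, w, device=image.device).view(1, 1, 1, w)
        kp_x = (prob * xs).sum(dim=(2, 3))
        kp_y = (prob * ys).sum(dim=(2, 3))
        return torch.stack([kp_x, kp_y], dim=-1)         # (B, K, 2) keypoints
```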
The feature warping strategy is shown in Figure 4:
1. Following Equation 7, compute the initial offsets $O_n$ between the source and driving images;
[Equation 7]
2. Generate a 2D coordinate map $z$;
3. Apply the offsets $O_n$ to $z$ to obtain the motion fields $w_m$;
4. Use $w_m$ to warp the downsampled source image, producing the initial warped feature maps;
5. From the warped feature maps, the occlusion estimator $\tau$ predicts a motion flow mask $M_m$ and an occlusion map $M_o$;
6. Use $M_m$ to warp the appearance feature map of $I_s$ obtained by the encoder $\epsilon_I$, then fuse the result with $M_o$ to produce $F_w$, as in Equation 8 (a sketch of this step follows Equation 8 below). $F_w$ preserves the appearance of the source image while capturing the motion between the two faces.
[Figure 4 and Equation 8]
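A minimal sketch of the warp-and-fuse step of Equation 8, assuming the motion flow $M_m$ is expressed as a normalized sampling grid and the occlusion map $M_o$ is a soft mask in [0, 1]; names and shapes are illustrative rather than the repository's actual interfaces.

```python
import torch.nn.functional as F

def warp_and_fuse(appearance_features, motion_flow, occlusion_map):
    """
    appearance_features: (B, C, H, W) output of the appearance encoder on I_s
    motion_flow:         (B, H, W, 2) sampling grid M_m in normalized coordinates
    occlusion_map:       (B, 1, H, W) soft mask M_o with values in [0, 1]
    """
    warped = F.grid_sample(appearance_features, motion_flow, align_corners=True)
    # F_w: source appearance is kept where visible and suppressed where occluded
    return occlusion_map * warped
```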

Cross-Modal Attention Mechanism

To use the learned depth map effectively for better generation, the authors propose a cross-modal attention mechanism, shown in Figure 5.
1. The depth encoder $\epsilon_d$ extracts a depth feature map $F_d$ from the depth map $D_{sz}$;
2. Three 1×1 convolutional layers map $F_d$ and $F_w$ to three latent feature maps $F_q$, $F_k$, and $F_v$;
3. As in Equation 9, attention over these features produces $F_g$ (a hedged sketch of this step follows Figure 5 below).
[Equation 9]
4. $F_g$ is refined by the decoder to generate the final image $I_g$.
[Figure 5]
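A hedged sketch of the cross-modal attention, under one plausible reading in which the depth features provide the query and the warped appearance features provide the key and value; which branch feeds which 1×1 projection, and the channel sizes, are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Three 1x1 convolutions producing F_q, F_k, F_v
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)  # from depth features
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)  # from warped features
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)  # from warped features

    def forward(self, f_d, f_w):
        # f_d: depth feature map, f_w: warped appearance feature map (same shape assumed)
        b, c, h, w = f_w.shape
        q = self.to_q(f_d).flatten(2).transpose(1, 2)    # (B, HW, C)
        k = self.to_k(f_w).flatten(2)                    # (B, C, HW)
        v = self.to_v(f_w).flatten(2).transpose(1, 2)    # (B, HW, C)
        attn = torch.softmax(q @ k / c ** 0.5, dim=-1)   # (B, HW, HW)
        f_g = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return f_g                                       # refined feature passed to the decoder
```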

Training

During training, the source image and the driving image are frames of the same person (sampled from the same video); the overall loss function is given in Equation 10,
[Equation 10]
$L_P$ is the perceptual loss;
$L_G$ is a least-squares GAN loss;
$L_E$ is an equivariance loss, which ensures that the keypoints transform consistently when the source image undergoes a known transformation;
$L_D$ is a keypoint distance loss that prevents the facial keypoints from clustering together.
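A minimal sketch of how these four terms might be combined into the overall objective of Equation 10; the weights `lambda_*` are placeholders, not the paper's values.

```python
def total_loss(l_perceptual, l_gan, l_equivariance, l_distance,
               lambda_p=10.0, lambda_g=1.0, lambda_e=10.0, lambda_d=10.0):
    # Weighted sum of L_P, L_G, L_E and L_D (weights are illustrative)
    return (lambda_p * l_perceptual
            + lambda_g * l_gan
            + lambda_e * l_equivariance
            + lambda_d * l_distance)
```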

Experiments

Comparison with SOTA Methods

Results of the comparison with SOTA methods on the VoxCeleb1 dataset are shown in Tables 1 and 2.
[Tables 1 and 2]
On the VoxCeleb1 dataset, cross-identity reenactment results are shown in Figure 6.
[Figure 6]
On the CelebV dataset, the comparison with SOTA methods is shown in Table 3, and cross-identity reenactment results are shown in Figure 7.
[Table 3 and Figure 7]

Ablation Study

FDN: face depth network;
CAM: cross-modal attention mechanism.
The quantitative results are shown in Table 4.
[Table 4]
The qualitative generation results are shown in Figure 8.
[Figure 8]

Conclusion

DaGAN learns facial depth maps with a self-supervised method. On the one hand, the depth is used for more accurate facial keypoint estimation; on the other hand, a cross-modal (depth and RGB) attention mechanism is designed to capture micro-expression changes. As a result, DaGAN produces more realistic and natural results.

Source: blog.csdn.net/qq_41994006/article/details/125586789