A simple and complete explanation of the new face-swapping model FaceShifter

Today, deep learning can produce amazing results in the field of image synthesis and manipulation. We have already seen websites that show faces of imaginary people, videos of celebrities saying things they never said, and tools that make anyone dance. These examples are realistic enough to fool most of us. One of the latest feats is FaceShifter [1], a deep learning model that swaps faces in images better than previous state-of-the-art methods. In this article, we will understand how it works.

Problem statement

  We have a source face image Xₛ and a target face image Xₜ. We want to generate a new face image Yₛₜ that has the attributes of Xₜ (pose, lighting, glasses, etc.) but the identity of the person in Xₛ. Figure 1 summarizes this problem statement. Now, let us move on to explaining the model.

Figure 1. The face-swapping problem statement. The results shown are from the FaceShifter model. Adapted from [1].

FaceShifter model

  FaceShifter consists of two networks, called the AEI network and the HEAR network. The AEI network produces a preliminary face-swapping result, and the HEAR network refines that output. Let us analyze the two separately.

AEI Network

AEI network is an abbreviation of "Adaptive Embedding Integration Network". The name comes from the fact that the AEI network consists of 3 sub-networks:

  1. Identity encoder: an encoder that embeds Xₛ into a space describing the identity of the face in the image.
  2. Multi-level attribute encoder: an encoder that embeds Xₜ into a space describing the attributes we want to preserve when swapping faces.
  3. AAD generator: a generator that integrates the outputs of the two encoders to produce a face with the attributes of Xₜ and the identity of Xₛ.

The AEI network is shown in Figure 2. Let us go through its details.

figure 2. The architecture of the AEI network. Adapted from [1].

Identity encoder

  This sub-network projects the source image Xₛ into a low-dimensional feature space. The output is just a vector, which we call zᵢ, as shown in Figure 3. This vector encodes the identity of the face in Xₛ, which means that it should extract the features that we humans use to distinguish the faces of different people, such as the shape of the eyes, the distance between the eyes and the mouth, the curvature of the mouth, and so on.

   The authors did not train this encoder themselves; they used an already-trained face recognition network. This is expected to meet our requirement, because a network that distinguishes faces must be extracting identity-related features.

Figure 3. Identity encoder. Adapted from [1].
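To make this concrete, below is a minimal PyTorch sketch of what such an identity encoder could look like. The paper uses a dedicated pre-trained face recognition model; the torchvision ResNet here is only a stand-in backbone, and the 512-dimensional embedding size is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class IdentityEncoder(nn.Module):
    """Frozen embedding network standing in for the pre-trained face
    recognition model used in the paper."""
    def __init__(self, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet18()                       # stand-in backbone
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone
        for p in self.backbone.parameters():               # not trained with AEI-Net
            p.requires_grad = False

    def forward(self, x_source: torch.Tensor) -> torch.Tensor:
        # x_source: (B, 3, H, W) cropped source face.
        z_id = self.backbone(x_source)
        return F.normalize(z_id, dim=1)                    # identity vector z_i

z_id = IdentityEncoder()(torch.randn(2, 3, 112, 112))
print(z_id.shape)  # torch.Size([2, 512])
```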

Multi-level attribute encoder

  This sub-network encodes the target image Xₜ. It generates multiple feature vectors, each describing the attributes of Xₜ at a different spatial resolution; there are 8 of them in total, collectively called zₐ. The attributes here refer to the facial structure in the target image, such as the pose of the face, its outline, facial expression, hairstyle, skin color, background, and scene lighting. As shown in Figure 4, it is a ConvNet with a U-Net-like structure, where the output vectors are simply the feature maps of each level in the upscaling/decoding part. Note that this sub-network is not pre-trained.

Figure 4. Multi-level attribute encoder architecture. Adapted from [1].
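As an illustration, here is a much-simplified U-Net-style sketch of such an encoder. The real network has 8 levels and different channel widths; the 4 levels and the channel sizes below are assumptions made only to keep the example short.

```python
import torch
import torch.nn as nn

class MultiLevelAttributeEncoder(nn.Module):
    """Simplified sketch: a small U-Net-like encoder-decoder whose output is
    the list of decoder feature maps, one per spatial resolution (z_att)."""
    def __init__(self):
        super().__init__()
        chs = [3, 32, 64, 128, 256]
        # Downsampling / encoding path.
        self.down = nn.ModuleList([
            nn.Sequential(nn.Conv2d(chs[i], chs[i + 1], 4, 2, 1), nn.LeakyReLU(0.1))
            for i in range(4)
        ])
        # Upsampling / decoding path; input widths grow because of skip connections.
        self.up = nn.ModuleList([
            nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, 2, 1), nn.LeakyReLU(0.1))
            for c_in, c_out in [(256, 128), (256, 64), (128, 32)]
        ])

    def forward(self, x_target):
        skips = []
        h = x_target
        for down in self.down:
            h = down(h)
            skips.append(h)                 # 1/2, 1/4, 1/8, 1/16 resolution
        z_att = [h]                         # coarsest attribute feature map
        for i, up in enumerate(self.up):
            h = up(h)
            h = torch.cat([h, skips[-(i + 2)]], dim=1)  # U-Net skip connection
            z_att.append(h)                 # one attribute map per level
        return z_att

maps = MultiLevelAttributeEncoder()(torch.randn(1, 3, 256, 256))
print([tuple(m.shape[1:]) for m in maps])
# [(256, 16, 16), (256, 32, 32), (128, 64, 64), (64, 128, 128)]
```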

Representing Xₜ with multiple embeddings is necessary because using a single embedding at a single spatial resolution would lose information needed in the output image (i.e., we want to retain so much fine detail from Xₜ that compressing the image into one low-dimensional vector is not feasible). This is evident in an ablation study done by the authors, who tried representing Xₜ with only the first 3 zₐ embeddings instead of all 8, which caused the blurrier outputs shown in Figure 5.

Figure 5. The effect of using multiple embeddings to represent the target. "Compressed" is the output when only the first 3 zₐ embeddings are used; "AEI Net" is the output when all 8 embeddings are used. Adapted from [1].

AAD generator

  The AAD generator is an abbreviation of "Adaptive Attentional Denormalization generator". It integrates the outputs of the first two sub-networks at increasing spatial resolutions, producing the final output of the AEI network. It does this by stacking a novel block called the AAD ResBlock, shown in Figure 6.

Figure 6. The architecture of the AAD generator (left) and of the AAD ResBlock (right). Adapted from [1].

The novel part of this block is the AAD layer. We will divide it into 3 parts, as shown in Figure 7. At a high level, Part 1 tells us how to edit the input feature map hᵢₙ to make it more like Xₜ in terms of attributes. Specifically, it outputs two tensors of the same size as hᵢₙ: one containing scale values that multiply each cell of hᵢₙ, and one containing shift values. The input of Part 1 is one of the attribute embeddings. Similarly, Part 2 tells us how to edit the feature map hᵢₙ to make it more like Xₛ in terms of identity; its input is the identity vector zᵢ.

Figure 7. AAD layer architecture. Adapted from [1].

The task of Part 3 is to select which of the other two parts (1 or 2) we should listen to at each cell/pixel. For example, at cells/pixels related to the mouth, the network will tell us to pay more attention to Part 2, because the mouth is strongly related to identity. This is shown empirically by the experiment in Figure 8.

Figure 8. An experiment showing what Part 3 of the AAD layer learns. The images on the right show the output of Part 3 at different steps/spatial resolutions throughout the AAD generator. Bright areas indicate cells where we should attend to identity (i.e., Part 2); black areas mean we attend to Part 1. Note that at high spatial resolutions, the focus is mainly on Part 1. Adapted from [1].

In this way, the AAD generator is able to construct the final image step by step, at each step determining the best way to upscale the current feature map given the identity and attribute embeddings.
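Below is a minimal PyTorch sketch of a single AAD layer following the three parts described above. The channel sizes are illustrative, and zₐ is assumed to have already been brought to the spatial size of hᵢₙ (in the real generator, each level receives the attribute map of the matching resolution).

```python
import torch
import torch.nn as nn

class AADLayer(nn.Module):
    """Sketch of one Adaptive Attentional Denormalization (AAD) layer."""
    def __init__(self, h_ch: int, att_ch: int, id_dim: int = 512):
        super().__init__()
        self.norm = nn.InstanceNorm2d(h_ch, affine=False)
        # Part 1: attribute branch -- per-pixel scale and shift from z_att.
        self.gamma_att = nn.Conv2d(att_ch, h_ch, 3, padding=1)
        self.beta_att = nn.Conv2d(att_ch, h_ch, 3, padding=1)
        # Part 2: identity branch -- per-channel scale and shift from z_id.
        self.gamma_id = nn.Linear(id_dim, h_ch)
        self.beta_id = nn.Linear(id_dim, h_ch)
        # Part 3: attention mask choosing, per pixel, identity vs. attributes.
        self.mask = nn.Conv2d(h_ch, 1, 3, padding=1)

    def forward(self, h_in, z_att, z_id):
        h = self.norm(h_in)
        a = self.gamma_att(z_att) * h + self.beta_att(z_att)     # Part 1
        gamma = self.gamma_id(z_id)[:, :, None, None]
        beta = self.beta_id(z_id)[:, :, None, None]
        i = gamma * h + beta                                      # Part 2
        m = torch.sigmoid(self.mask(h))                           # Part 3
        return (1 - m) * a + m * i   # bright mask cells favour identity

layer = AADLayer(h_ch=64, att_ch=128)
out = layer(torch.randn(1, 64, 32, 32), torch.randn(1, 128, 32, 32),
            torch.randn(1, 512))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```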

We now have a network, the AEI network, that can embed Xₛ & Xₜ and integrate them in a way that achieves our goal. We call the output of the AEI network Yₛₜ*.
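Putting the pieces together, the forward pass of the AEI network is simply the composition of the three sub-networks. The sketch below reuses the hypothetical module names from the earlier snippets:

```python
import torch

def aei_net_forward(identity_encoder, attribute_encoder, aad_generator,
                    x_source, x_target):
    """Sketch of how AEI-Net composes its three sub-networks; `aad_generator`
    is assumed to stack AAD ResBlocks and consume one attribute map per level."""
    z_id = identity_encoder(x_source)    # identity embedding of X_s
    z_att = attribute_encoder(x_target)  # multi-level attribute embeddings of X_t
    y_st = aad_generator(z_id, z_att)    # preliminary swapped face Y_st*
    return y_st, z_id, z_att
```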

Training loss function

  Generally speaking, a loss is a mathematical formulation of an objective we want the network to achieve. There are 4 losses for training the AEI network:

  1. We want it to output a realistic human face, so we have an adversarial loss, just like any GAN.
  2. We want the generated face to have the identity of Xₛ. The only mathematical object we have that represents identity is zᵢ, so this goal is expressed as an identity loss that pushes the zᵢ of the output towards the zᵢ of Xₛ (maximizing their cosine similarity).
  3. We want the output to have the attributes of Xₜ. This is expressed as an attribute loss: the L2 distance between the multi-level attribute embeddings zₐ of the output and those of Xₜ.
  4. The authors add one more loss, a reconstruction loss, based on the idea that the network should simply output Xₜ whenever Xₜ and Xₛ are actually the same image; in that case it is the L2 distance between the output and Xₜ.

I believe this last loss is necessary to drive zₐ to actually encode the attributes, since it is not pre-trained like zᵢ. Without it, the AEI network could ignore Xₜ and make zₐ produce nothing but zeros.

Our total loss is just a weighted sum of previous losses.
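For concreteness, here is a sketch of how the non-adversarial terms could be computed in PyTorch. The adversarial term comes from a separate multi-scale discriminator and is omitted, and the weights and reductions are placeholders rather than the paper's exact choices.

```python
import torch
import torch.nn.functional as F

def aei_losses(y_st, x_s, x_t, z_id_fn, z_att_fn, same_pair,
               w_id=1.0, w_att=1.0, w_rec=1.0):
    """Non-adversarial AEI-Net losses (sketch). `z_id_fn` / `z_att_fn` are the
    identity and attribute encoders, `same_pair` is a (B,) bool tensor marking
    batch entries where X_s and X_t are the same image."""
    # Identity loss: make z_id of the output match z_id of the source.
    l_id = (1 - F.cosine_similarity(z_id_fn(y_st), z_id_fn(x_s), dim=1)).mean()

    # Attribute loss: L2 distance between every attribute map of Y_st* and X_t.
    l_att = sum(0.5 * F.mse_loss(a, b)
                for a, b in zip(z_att_fn(y_st), z_att_fn(x_t)))

    # Reconstruction loss: only for pairs where source and target coincide.
    per_image = 0.5 * ((y_st - x_t) ** 2).flatten(1).mean(dim=1)
    l_rec = (per_image * same_pair.float()).mean()

    # The adversarial loss would be added to this weighted sum.
    return w_id * l_id + w_att * l_att + w_rec * l_rec
```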

HEAR Network

  The AEI network is already a complete face-swapping network on its own. However, it is not good enough at preserving occlusions. Specifically, whenever something in the target image occludes part of the face that should appear in the final output (such as glasses, a hat, hair, or a hand), the AEI network removes it. Such things should remain, since they have nothing to do with the identity being swapped. Therefore, the authors added an extra network called the "Heuristic Error Acknowledging Refinement Network" (HEAR network), whose single task is to recover such occlusions.

They noticed that when the inputs of the AEI network (i.e. Xₛ & Xₜ) are set to the same image, it still does not preserve occlusions, as shown in Figure 9.

Figure 9. The output of AEI Net when we input the same image as Xₛ & Xₜ. Notice how the chain on the headscarf is lost in the output. Adapted from [1].

Therefore, instead of feeding Yₛₜ* and Xₜ to the HEAR network, they feed it Yₛₜ* & (Xₜ − Yₜₜ*), where Yₜₜ* is the output of the AEI network when both Xₛ and Xₜ are set to the same image Xₜ. The difference directs the HEAR network to the pixels where occlusions were not preserved, as shown in Figure 10.

Figure 10. The structure of the HEAR network. Adapted from [1].
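The sketch below shows how the HEAR network's input could be assembled, assuming `aei_net(x_s, x_t)` returns the preliminary swap Yₛₜ* and `hear_net` is a U-Net-like refiner that takes the concatenated 6-channel input:

```python
import torch

def refine_with_hear_net(aei_net, hear_net, x_s, x_t):
    """Sketch of the HEAR network refinement step (names follow the text above)."""
    with torch.no_grad():                  # the AEI network is kept fixed here
        y_st = aei_net(x_s, x_t)           # preliminary swapped face Y_st*
        y_tt = aei_net(x_t, x_t)           # AEI network with X_t as both inputs
    delta_y_t = x_t - y_tt                 # heuristic error: what AEI-Net dropped,
                                           # e.g. occlusions such as glasses or hair
    return hear_net(torch.cat([y_st, delta_y_t], dim=1))
```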

Training loss function

The HEAR network is trained with the following losses:

  1. An identity preservation loss (the same identity loss as before, computed on the HEAR network's output).
  2. A change loss that prevents the output from deviating substantially from Yₛₜ*.
  3. A reconstruction loss: if Xₛ & Xₜ are the same image, then the output of the HEAR network should be Xₜ.

The total loss is the sum of these losses.
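A sketch of these three terms, using the same hypothetical helpers as before (`z_id_fn` is the identity encoder, `same_pair` marks batch entries where Xₛ & Xₜ are the same image); the equal weighting is an assumption:

```python
import torch
import torch.nn.functional as F

def hear_losses(y_refined, y_st_coarse, x_s, x_t, z_id_fn, same_pair):
    """Sketch of the three HEAR network losses listed above."""
    # 1. Keep the identity of X_s in the refined output.
    l_id = (1 - F.cosine_similarity(z_id_fn(y_refined), z_id_fn(x_s), dim=1)).mean()

    # 2. Change loss: do not drift far from the AEI network's result Y_st*.
    l_chg = F.l1_loss(y_refined, y_st_coarse)

    # 3. Reconstruction loss: when X_s and X_t are the same image,
    #    the refined output should simply be X_t.
    per_image = 0.5 * ((y_refined - x_t) ** 2).flatten(1).mean(dim=1)
    l_rec = (per_image * same_pair.float()).mean()

    return l_id + l_chg + l_rec
```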

Results

  The results of FaceShifter are impressive. In Figure 11 you can find some examples of how well it generalizes to images outside the datasets it was trained on (i.e. images in the wild). Notice how it works correctly under a variety of difficult conditions.

Figure 11. Results showing FaceShifter's good performance. Adapted from [1].

  1. L. Li, J. Bao, H. Yang, D. Chen, F. Wen, FaceShifter: Towards High Fidelity And Occlusion Aware Face Swapping (2019), arXiv.

Author: Ahmed Maher

Translated by: tensor-zhang

 


Origin blog.csdn.net/m0_46510245/article/details/105679690