Let Hermione sing Tanya Tanya's song without any sense of incongruity

With just a single photo of a person's face, the photo can be turned into a 3D talking head whose mouth shape roughly matches the song. How is this done?

The answer is SadTalker. What is the technical principle behind it? As usual, we let AI help us read the CVPR 2023 paper: SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation.

In brief, the basic principle is as follows:

1. First, reconstruct a 3D face model from the picture. This model captures the face's shape, expression, and other information.

2. Then design two networks. The first network extracts mouth-movement information from the speech, and the second extracts head-rotation information from the speech.

3. Finally, the mouth-movement and head-rotation information produced by these two networks drives the 3D face model to perform the corresponding motions, and the result is rendered into a talking video. Learning mouth movements and head movements separately in this way is more accurate and natural than learning them together as a whole. In addition, the method proposes a novel face renderer that turns the 3D model into video very realistically.
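To make this three-step flow concrete, here is a minimal Python sketch of the pipeline. Every function name, coefficient dimension, and shape here is an illustrative assumption, not SadTalker's actual API; each function is a dummy stand-in for the corresponding network.

```python
import numpy as np

def extract_3dmm_coeffs(image):
    """Stand-in for the 3D face reconstruction step: returns identity (shape)
    coefficients plus the first frame's expression coefficients."""
    shape = np.zeros(80)  # identity coefficient dimension is assumed
    exp0 = np.zeros(64)   # expression coefficient dimension is assumed
    return shape, exp0

def expnet(audio_frames, exp0):
    """Stand-in for ExpNet: audio features + reference expression ->
    per-frame expression coefficients (here it just repeats the reference)."""
    return np.tile(exp0, (len(audio_frames), 1))

def posevae(audio_frames, style_id, rng):
    """Stand-in for PoseVAE: audio features + style -> per-frame head pose
    (3 rotation + 3 translation); here sampled as small random motion."""
    del style_id  # unused in this dummy
    return rng.normal(scale=0.01, size=(len(audio_frames), 6))

def render(image, shape, exps, poses):
    """Stand-in for the face renderer: coefficients -> video frames."""
    del shape, poses
    return [image for _ in exps]

rng = np.random.default_rng(0)
image = np.zeros((256, 256, 3))  # the single input photo
audio = np.zeros((100, 80))      # 100 frames of audio features (e.g. mel bins)

shape, exp0 = extract_3dmm_coeffs(image)
exps = expnet(audio, exp0)
poses = posevae(audio, style_id=0, rng=rng)
frames = render(image, shape, exps, poses)
print(len(frames), "frames rendered")
```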

The paper introduces two networks.

One network is ExpNet, which learns the coefficients that map audio to expressions. The main idea: first, a pre-trained network extracts only the lip movements (mouth shapes) from the speech. ExpNet then looks at the speech signal and these lip movements together and learns to generate the remaining expression movements of the whole face, such as blinking.

(Figure: ExpNet)

To make ExpNet learn the correct expressions, the method uses two techniques (a toy training sketch follows the list):

a. Provide the initial (reference) facial expression to ExpNet, which reduces identity uncertainty: the network knows whose face it is generating.

b. Compute the eye and mouth shapes on the generated 3D face model and compare them against the real video, which makes the generated expressions more accurate.

In this way, ExpNet can focus on learning the expression movements other than the mouth and still produce realistic expression results.
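As a hedged illustration of those two techniques, here is a toy PyTorch training step. The architecture, dimensions, loss weights, and the helper project_eye_mouth_landmarks are all assumptions made for illustration; they only mirror the two ideas above (condition on a reference expression; compare eye/mouth shapes with the real video), not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpNet(nn.Module):
    def __init__(self, audio_dim=80, exp_dim=64):
        super().__init__()
        # Conditioning on the reference expression reduces identity ambiguity.
        self.net = nn.Sequential(
            nn.Linear(audio_dim + exp_dim, 256), nn.ReLU(),
            nn.Linear(256, exp_dim),
        )

    def forward(self, audio_feat, ref_exp):
        return self.net(torch.cat([audio_feat, ref_exp], dim=-1))

def project_eye_mouth_landmarks(exp):
    """Stand-in for rendering the 3DMM and projecting eye/mouth landmarks."""
    return exp.new_zeros(exp.shape[0], 68, 2)

model = ExpNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

audio_feat = torch.randn(8, 80)  # batch of audio features
ref_exp = torch.randn(8, 64)     # first-frame (reference) expression coefficients
lip_coeff = torch.randn(8, 64)   # lip-only coefficients from a pretrained lip network
gt_lmk = torch.randn(8, 68, 2)   # eye/mouth landmarks from the real video

pred_exp = model(audio_feat, ref_exp)
loss = (F.mse_loss(pred_exp, lip_coeff)                              # follow the lip motion
        + F.l1_loss(project_eye_mouth_landmarks(pred_exp), gt_lmk))  # match real landmarks
loss.backward()
opt.step()
print(f"loss: {loss.item():.4f}")
```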

The other network is PoseVAE, which learns to infer the head rotation pose from speech.

PoseVAE uses a structure called a variational autoencoder (VAE). Simply put, it contains an encoder and a decoder: the encoder encodes the head pose into a vector, and the decoder decodes the original head pose back from that vector.
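Here is a minimal PyTorch sketch of that encoder/decoder idea, with single linear layers and assumed dimensions. It is only meant to show the VAE mechanics; the real PoseVAE works on pose sequences and conditions on audio, as described next.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoseVAE(nn.Module):
    def __init__(self, pose_dim=6, latent_dim=32):
        super().__init__()
        self.encoder = nn.Linear(pose_dim, 2 * latent_dim)  # outputs mean and log-variance
        self.decoder = nn.Linear(latent_dim, pose_dim)

    def forward(self, pose):
        mu, logvar = self.encoder(pose).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return self.decoder(z), mu, logvar

vae = PoseVAE()
pose = torch.randn(4, 6)  # head pose: 3 rotation angles + 3 translations (assumed)
recon, mu, logvar = vae(pose)

# Standard VAE objective: reconstruction + KL regularization of the latent.
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()
loss = F.mse_loss(recon, pose) + 1e-3 * kl
print(f"loss: {loss.item():.4f}")
```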

(Figure: PoseVAE)

During training, PoseVAE takes head-pose sequences from real videos as input and learns the relationship between speech features and head movements. At inference time, you only need to input speech, and PoseVAE outputs a natural head-rotation sequence.

To generate head movements in different styles, the network also takes a style input, so the same voice can correspond to head movements in different styles, as sketched below.
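A hedged sketch of that style conditioning at inference time: the decoder consumes a sampled latent, the audio features, and a learned style embedding, so swapping the style id changes the generated pose sequence for the same audio. All names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Assumed dimensions; none of these come from the paper.
latent_dim, audio_dim, n_styles, style_dim, pose_dim = 32, 80, 8, 16, 6
style_emb = nn.Embedding(n_styles, style_dim)
decoder = nn.Linear(latent_dim + audio_dim + style_dim, pose_dim)

T = 100                                 # number of video frames
audio_feat = torch.randn(T, audio_dim)  # audio features, one row per frame
z = torch.randn(T, latent_dim)          # at test time, sample the latent from the prior

with torch.no_grad():
    for style_id in (0, 1):             # same audio, two different styles
        s = style_emb(torch.full((T,), style_id))
        poses = decoder(torch.cat([z, audio_feat, s], dim=-1))
        print(f"style {style_id}: pose sequence shape {tuple(poses.shape)}")
```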

The authors have also open-sourced the whole project: OpenTalker/SadTalker: [CVPR 2023] SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation (github.com). It can also be used as a plug-in for the stable-diffusion webui: just search for sadtalker in the sd webui extensions tab and click install. Note that sd webui will not download the models for you; you need to download them manually. Open the git repository to see how to download the specific models.

(Screenshot: model download instructions in the SadTalker repository)

The downloaded models can be placed directly in the checkpoints directory under the sadtalker folder in extensions; if that directory does not exist, create it yourself. Start the webui, then open the SadTalker tab to use it. We upload a photo of Hermione and then an audio clip. Because the machine's GPU memory was insufficient, only the 25-second video above was generated.
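For reference, a tiny Python helper that creates the expected checkpoints directory and shows where to drop the models. The path is an assumption based on a default sd webui layout and the extension folder name; adjust it to your own install.

```python
import os

# Assumed default sd webui layout; the extension folder name may differ.
ckpt_dir = os.path.join("extensions", "SadTalker", "checkpoints")
os.makedirs(ckpt_dir, exist_ok=True)  # create the directory if the extension didn't
print("Place the downloaded models here:", os.path.abspath(ckpt_dir))
print("Currently present:", os.listdir(ckpt_dir))
```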


What do you think? If you want to try it yourself, please click follow.


Origin: blog.csdn.net/wutao22/article/details/132061608