(3) SadTalker lets Stable Diffusion characters speak

Function Description

The GitHub project SadTalker synthesizes a video of a talking face from a single picture and a piece of audio. The image should show a real person, or something close to one. The project now supports the Stable Diffusion WebUI as an extension: after generating a portrait with SD, you can combine it with an audio clip to produce a talking-head video (the kind of digital human commonly seen on Douyin).

SadTalker installation process

Access to GitHub from within China is relatively slow, so use ghproxy to accelerate downloads; the format is https://ghproxy.com/{github url}

https://ghproxy.com/https://github.com/OpenTalker/SadTalker

In the WebUI's Extensions tab, fill in the address and the local directory name as shown in the figure. The download will be stored under {your project}/stable-diffusion-webui/extensions, and the folder name matches what you enter on the page: SadTalker
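
If you prefer the command line, the same install can be done by cloning through ghproxy directly into the extensions directory (a minimal sketch; {project} stands for your install path):

cd {project}/stable-diffusion-webui/extensions
# clone through the ghproxy mirror; the trailing argument keeps the folder named SadTalker
git clone https://ghproxy.com/https://github.com/OpenTalker/SadTalker SadTalker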


Once the extension from the previous step is installed, two more archives of model weights must be downloaded and placed in the corresponding directories:

checkpoints: the entire checkpoints folder goes into the extension directory {project}/stable-diffusion-webui/extensions/SadTalker;

gfpgan: unzip it and place the 4 files alignment_WFLW_4HG.pth, detection_Resnet50_Final.pth, GFPGANv1.4.pth and parsing_parsenet.pth in {project}/stable-diffusion-webui/models/GFPGAN
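
A sketch of the placement, assuming the two archives were saved as checkpoints.zip and gfpgan.zip (hypothetical names; adjust to the files you actually downloaded):

cd {project}/stable-diffusion-webui
# the checkpoints folder lives inside the extension directory
unzip checkpoints.zip -d extensions/SadTalker/
# the four GFPGAN weight files go under models/GFPGAN;
# if the archive unpacks into a subfolder, move the .pth files up one level
mkdir -p models/GFPGAN
unzip gfpgan.zip -d models/GFPGAN/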

Continue with the environment configuration.
ffmpeg: needed for video generation (choose the method that suits your environment); the following is the installation procedure for CentOS 8

dnf install epel-release
dnf config-manager --set-enabled PowerTools
dnf config-manager --add-repo=https://negativo17.org/repos/epel-multimedia.repo
dnf install ffmpeg
ffmpeg -version
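
Since the extension shells out to ffmpeg when assembling the final video, a quick conversion is an easy sanity check that the install works (input.mp3 and the output settings here are hypothetical):

# convert a test clip to mono 16 kHz wav; any successful run confirms ffmpeg is usable
ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav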

Restart the program: python3 launch.py --enable-insecure-extension-access --xformers --server-name 0.0.0.0

Using a picture generated with the model and plug-ins installed in tutorial (1), which deploys sd-webui under Linux, try the effect; the parameters are explained below:

  • For the picture, a large head shot works best; otherwise the result looks unnatural
  • Audio file: tested here with the audio from the SadTalker examples
  • Preprocessing method: crop, resize and full (original image). With crop, the expressions and animation generated from the facial keypoints are fairly realistic; feeding in the full image instead tends to look weird (see the CLI sketch after this list)
  • Remove head motion (works better with preprocess full): essential when using the full original image, as it tames the character's head movement and makes the generated video more natural; since crop is used here, it is left unchecked
  • Face enhancement: check it for better facial quality (uses the GFPGAN weights installed above)
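
For reference, the same options can also be driven from SadTalker's standalone CLI; below is a minimal sketch based on the flags in the project README (input.wav, face.png and the result directory are hypothetical):

# crop preprocessing plus GFPGAN face enhancement, matching the WebUI settings above
python inference.py --driven_audio input.wav --source_image face.png \
    --preprocess crop --enhancer gfpgan --result_dir ./results
# when feeding the full original image, add --still to reduce head motion:
# python inference.py ... --still --preprocess full --enhancer gfpgan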


The video was re-encoded by CSDN and looks a little unnatural here, but the actual result is quite good.

[Video demo: https://live.csdn.net/v/embed/293783]

Text-to-speech generation involves too many non-technical issues, so I won't expand on it here; take a look at the TTS-Vue project yourself.


Origin: blog.csdn.net/q116975174/article/details/130445569