Generative AI (1) - "Smart Lecturer" Lip Sync

Background: 2023 is widely regarded as the breakout year of generative AI. This article explains how to use an open source project to turn a voice recording into a lip-synced video.

Purpose: introduce, build, and apply the open source project Wav2Lip.

Result (the generated lecturer appears in the bottom-right corner): [screenshot in the original post]
Applicable fields: content creation, education, corporate training

Features: the presenter does not need to appear on camera, and the original voice is preserved.

1. Introduction to the open source project Wav2Lip

Address: the Wav2Lip GitHub repository (linked in the original post)

Principle: [Wav2Lip architecture diagram in the original post]

1.1 Model training

Training data source: LRS2 dataset

Paper: "A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild"

Characteristics: the model takes audio and face images as input and directly outputs frames with regenerated lip regions.

1.2 Code implementation

1. Take the virtual lecturer's photo (or video) and the voice track as input, and run image-level face detection and lip detection with OpenCV.

2. Feed the speaker's original audio into the inference model to generate the matching mouth shapes frame by frame.

3. Run the generated frames through the visual quality check, blend them and paste them back onto the original image, and build the video animation from the original frames.

4. Use FFmpeg to merge the audio and video into the final output (a sketch of this flow follows below).
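
The four steps above can be condensed into a short Python sketch. Everything in it is illustrative only: the whole-image face box, the placeholder generated frames, and the temp/output file names are assumptions made for this article, and the real implementation lives in the project's inference.py.

# Illustrative sketch of the Wav2Lip inference flow (not the project's actual code).
import subprocess
import cv2

def run_lip_sync(face_path, audio_path, out_path="results/result_voice.mp4", fps=25):
    # 1. Read the lecturer photo (for a video input, each frame would be read instead).
    frame = cv2.imread(face_path)
    h, w = frame.shape[:2]

    # 2. Face / lip-region detection would run here (the project uses an s3fd
    #    detector). A whole-image box is used as a placeholder.
    y1, y2, x1, x2 = 0, h, 0, w

    # 3. The audio would be converted to mel-spectrogram chunks and fed, together
    #    with the cropped face, to the Wav2Lip model, which returns generated lip
    #    frames. Placeholder: reuse the original crop.
    generated = [frame[y1:y2, x1:x2] for _ in range(fps * 2)]

    # 4. Paste each generated lip region back onto the original frame and write a
    #    silent intermediate video.
    writer = cv2.VideoWriter("temp/result.avi",
                             cv2.VideoWriter_fourcc(*"DIVX"), fps, (w, h))
    for g in generated:
        out = frame.copy()
        out[y1:y2, x1:x2] = cv2.resize(g, (x2 - x1, y2 - y1))
        writer.write(out)
    writer.release()

    # 5. Merge the original audio back in with FFmpeg, as the project does.
    subprocess.check_call(["ffmpeg", "-y", "-i", "temp/result.avi",
                           "-i", audio_path, "-c:v", "libx264", out_path])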

2. Project Compilation

2.1 Dependency installation

1. Download FFmpeg from the official website

On Windows, download the prebuilt ffmpeg EXE file.

2. Configure environment variables

Add the directory containing ffmpeg.exe to the system PATH environment variable, then run ffmpeg in a console to confirm it is recognized:

C:\Users>ffmpeg
ffmpeg version N-104695-g86a2123a6e-20211129 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 10-win32 (GCC) 20210610
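
If you prefer to check this from Python (the language the project itself uses), a minimal sketch such as the following confirms that an ffmpeg executable is visible on the PATH:

# Minimal check that FFmpeg is reachable from the PATH.
import shutil
import subprocess

ffmpeg = shutil.which("ffmpeg")
if ffmpeg is None:
    raise SystemExit("ffmpeg not found on PATH; check the environment variable.")
print("Found ffmpeg at:", ffmpeg)
# Print the version banner, equivalent to running `ffmpeg -version` in a console.
banner = subprocess.run([ffmpeg, "-version"], capture_output=True, text=True).stdout
print(banner.splitlines()[0])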

3. Install the latest version of VSCode

4. Install Python 3.8

2.2 Prepare the project

1. Download the project source code and unzip it.

2. Download the pre-trained inference model file wav2lip_gan.pth (linked in the original post) and copy it to: checkpoints\wav2lip_gan.pth

3. Download the pre-trained face detection model file s3fd.pth (linked in the original post) and copy it to: face_detection\detection\sfd\s3fd.pth

4. Create a dist folder in the root directory of the source code, and ensure that it has the following structure

Directory    Subfolder                        File
dist         checkpoints                      wav2lip_gan.pth
dist         face_detection\detection\sfd     s3fd.pth
dist         results                          (empty)
dist         temp                             (empty)
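
As a convenience, the layout can be created and checked with a few lines of Python; this script is not part of the project, just a sketch that mirrors the table above and reports any missing model files:

# Convenience sketch: create the dist layout and verify the model files are in place.
from pathlib import Path

dist = Path("dist")
for sub in ["checkpoints", "face_detection/detection/sfd", "results", "temp"]:
    (dist / sub).mkdir(parents=True, exist_ok=True)

for model in [dist / "checkpoints" / "wav2lip_gan.pth",
              dist / "face_detection" / "detection" / "sfd" / "s3fd.pth"]:
    print(model, "OK" if model.exists() else "MISSING - copy the downloaded file here")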

2.3 Compile the project

Run the following command in the VSCode terminal:

pip install -r requirements.txt

If your local environment is missing any library, install it separately.

Once the dependencies are installed, run:

python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source> 

where:

  1. ckpt is the relative path to the wav2lip_gan.pth file
  2. video.mp4 is the full path to the virtual lecturer's face photo or video
  3. an-audio-source is the full path to the speaker's audio file

3. Generate EXE

Required: run the console as administrator.

1. Install pipenv

cd [project source root directory]
pip install pipenv

2. Run the packaging build

# Enter the virtual environment (the previous step can be skipped: if no virtual environment exists, pipenv creates one automatically)
pipenv shell
# Install the required modules
#pip install "opencv-python-headless<4.3"
pip install requests librosa==0.7.0 numpy==1.20.0 opencv-contrib-python>=4.2.0.34 opencv-python==4.1.2.30 torch==1.8.0 torchvision==0.9.0 tqdm==4.45.0 numba==0.48 face_detection
# The packaging tool itself must also be installed
pip install pyinstaller
# Start packaging. The dependencies are loaded dynamically via __import__, so the dependent modules must be added manually with --hidden-import
pyinstaller -F --hidden-import face_detection.detection --hidden-import face_detection.detection.sfd inference.py

3. Test

cd dist
inference --checkpoint_path checkpoints\wav2lip_gan.pth --face D:\test\face.jpg --audio D:\test\audio.mp3

Summary

By compiling the project from source, you can learn how to use generative AI to produce a lip-synced presenter video from a voice recording and how to combine it with other functions.

You can also consider integrating the generated EXE file into other applications, for example by launching it as a subprocess, as sketched below.
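
A minimal sketch of such an integration follows; the dist folder location and the test file paths are placeholders, and the output name results\result_voice.mp4 is assumed from the project's default settings.

# Sketch: call the packaged inference EXE from another Python application.
# All paths below are placeholders; adjust them to your environment.
import subprocess
from pathlib import Path

DIST_DIR = Path(r"D:\wav2lip\dist")  # folder produced by PyInstaller (assumed location)

def make_lecture_video(face: str, audio: str) -> Path:
    """Run the packaged Wav2Lip inference and return the expected output path."""
    cmd = [
        str(DIST_DIR / "inference.exe"),
        "--checkpoint_path", r"checkpoints\wav2lip_gan.pth",
        "--face", face,
        "--audio", audio,
    ]
    # The EXE reads checkpoints\ and writes to results\ and temp\ relative to
    # its working directory, so run it from the dist folder.
    subprocess.run(cmd, cwd=DIST_DIR, check=True)
    return DIST_DIR / "results" / "result_voice.mp4"

if __name__ == "__main__":
    print(make_lecture_video(r"D:\test\face.jpg", r"D:\test\audio.mp3"))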

  • Performance:

A short audio clip takes about 10 seconds to process on a CPU; due to the scope of this write-up, it was not tested on a GPU.

  • Results:

    1. Cartoon avatar: lifelike lip sync.

    2. Photo of a model: good image quality and lifelike lip sync.

    3. Gibberish invented by the author: the lip sync is still realistic. A monkey avatar was not tested, but the result is expected to be decent.

    4. A salesperson's short promotional video with body language: realistic lip sync.

Note:

Project source code: linked in the original post. For porting the code from Python to C#, refer to the previous C# AI project article to understand the transcoding approach.


Source: https://blog.csdn.net/black0707/article/details/129687345