Background: 2023 is widely regarded as the first year of generative AI. This article explains how to use an open source project to convert voice into video.
Purpose: introduce, compile, and apply the open source project Wav2Lip.
Result preview (bottom right corner):
Applicable fields: content creation, education, corporate training.
Features: the presenter does not need to appear in person, and the original voice is preserved.
1. Introduction to the open source project Wav2Lip
Address: Wav2Lip GitHub repository
Principle:
1.1 Model training
Training data source: LRS2 dataset
Paper: Paper
Characteristics: the model takes audio and face images as separate inputs and directly outputs lip-synced face frames.
1.2 Code implementation
1. Input a picture (or video) of the virtual lecturer; OpenCV performs image-level face detection and lip localization.
2. Input the speaker's original audio; the inference model generates dynamic mouth shapes from it.
3. After a visual quality check, blur the patch edges, paste the patch back into the original image, and assemble the frames into a video animation.
4. Use ffmpeg to mux the audio and video into the final output.
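Step 3's "paste back" can be sketched with NumPy. This is a minimal illustration, not Wav2Lip's actual code: the `paste_back` helper and its `box` tuple are this article's own stand-ins for the project's face-detection output, and the real implementation also resizes and smooths the patch before pasting.

```python
import numpy as np

def paste_back(frame, patch, box):
    """Paste a generated lip patch back into the original frame.

    box = (y1, y2, x1, x2) is a hypothetical bounding box as returned
    by a face detector.
    """
    y1, y2, x1, x2 = box
    out = frame.copy()
    out[y1:y2, x1:x2] = patch  # overwrite the mouth region
    return out

# Toy example: a 4x4 black frame with a 2x2 white patch pasted into the middle
frame = np.zeros((4, 4), dtype=np.uint8)
patch = np.full((2, 2), 255, dtype=np.uint8)
result = paste_back(frame, patch, (1, 3, 1, 3))
```

In the real pipeline this runs per frame, so keeping the original frame untouched (note the `copy()`) matters for later quality inspection.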
2. Project Compilation
2.1 Dependency installation
1. Download FFmpeg from the official website
On Windows, download the ffmpeg EXE build.
2. Configure environment variables
Add the directory containing ffmpeg to the system PATH, then run ffmpeg to check that the configuration is recognized:
C:\Users>ffmpeg
ffmpeg version N-104695-g86a2123a6e-20211129 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 10-win32 (GCC) 20210610
3. Install the latest version of VSCode
4. Install Python 3.8
2.2 Prepare the project
1. Download the project source code and unzip it.
2. Download the pretrained inference model file (Link) and copy it to: checkpoints\wav2lip_gan.pth
3. Download the pretrained face-detection model file (pre-trained model) and copy it to: face_detection\detection\sfd\s3fd.pth
4. Create a dist folder in the root directory of the source code and make sure it has the following structure:
| Directory | Subfolder | File |
|---|---|---|
| dist | checkpoints | wav2lip_gan.pth |
| dist | face_detection\detection\sfd | s3fd.pth |
| dist | results | |
| dist | temp | |
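The layout above can be created in one step. A POSIX shell sketch follows (on Windows cmd, use backslashes and plain `mkdir`, which creates intermediate directories automatically):

```shell
# Create the dist layout that the packaged build expects
mkdir -p dist/checkpoints
mkdir -p dist/face_detection/detection/sfd
mkdir -p dist/results dist/temp
```

Remember to copy wav2lip_gan.pth and s3fd.pth into their subfolders afterwards; the commands only create the empty directories.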
2.3 Compile the project
Run the following command in VSCode:
pip install -r requirements.txt
If your local environment is missing any library, install it separately.
Once the dependencies are installed, run the following command:
python inference.py --checkpoint_path <ckpt> --face <video.mp4> --audio <an-audio-source>
where:
- ckpt is the relative path to the wav2lip_gan.pth file
- video.mp4 is the full path to a face photo or video of the virtual lecturer
- an-audio-source is the full path to the speaker's audio file
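The invocation can also be driven from Python, which is convenient when batching many audio files. A minimal sketch using the standard `subprocess` module; the `build_wav2lip_cmd` helper and the example paths are this article's own illustrations, not part of the project:

```python
import subprocess

def build_wav2lip_cmd(ckpt, face, audio, python="python"):
    """Assemble the Wav2Lip inference command line as an argument list."""
    return [python, "inference.py",
            "--checkpoint_path", ckpt,
            "--face", face,
            "--audio", audio]

cmd = build_wav2lip_cmd(r"checkpoints\wav2lip_gan.pth",
                        r"D:\test\face.jpg",
                        r"D:\test\audio.mp3")
# subprocess.run(cmd, check=True)  # run from the project root directory
```

Passing an argument list (rather than one shell string) avoids quoting problems with paths that contain spaces.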
3. Generate EXE
Required: run the console as administrator.
1. Install pipenv
cd [project source root]
pip install pipenv
2. Run the packaging build
# Enter the virtual environment (step 1 can be skipped: if no virtual environment exists, one is created automatically)
pipenv shell
# Install the required modules
#pip install "opencv-python-headless<4.3"
pip install requests librosa==0.7.0 numpy==1.20.0 opencv-contrib-python>=4.2.0.34 opencv-python==4.1.2.30 torch==1.8.0 torchvision==0.9.0 tqdm==4.45.0 numba==0.48 face_detection
# The packaging tool itself must also be installed
pip install pyinstaller
# Start packaging; some dependencies are loaded dynamically via __import__, so they must be declared manually with --hidden-import
pyinstaller -F --hidden-import face_detection.detection --hidden-import face_detection.detection.sfd inference.py
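Why `--hidden-import` is needed: PyInstaller's static analysis only sees literal `import` statements in the source, while face_detection selects its backend at runtime via `__import__`. A small stand-in demonstration, using the stdlib `json` module in place of the real face_detection submodule:

```python
# PyInstaller scans source code for literal "import x" statements at
# build time. A module name held in a variable is invisible to that scan:
module_name = "json"  # stands in for "face_detection.detection.sfd"
mod = __import__(module_name)

# The import still works at runtime...
payload = mod.dumps({"hidden": True})
# ...but a frozen EXE would fail here unless the module was bundled
# explicitly via --hidden-import.
```

This is why the pyinstaller command above names `face_detection.detection` and `face_detection.detection.sfd` explicitly.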
3. Test
cd dist
inference --checkpoint_path checkpoints\wav2lip_gan.pth --face D:\test\face.jpg --audio D:\test\audio.mp3
Summary
By compiling the project from source, you have learned how to use generative AI to turn speech into a lip-synced broadcast video and to combine this with other features.
You can also consider integrating the generated EXE file into other applications.
- Performance:
A short audio clip takes about 10 seconds to process on CPU; due to the scope of this write-up, no GPU test was performed.
- Effect:
1. Cartoon avatar: lifelike lip sync.
2. Portrait of a model: good image quality and lifelike lip sync.
3. A made-up "nonsense" language invented by the author: realistic lip sync; monkey sounds were not tested, but the effect is expected to be good as well.
4. Professional sales short video with body language: realistic lip sync.
Appendix:
Project source code. For the idea of transcoding from Python to C#, please refer to the previous C# AI project.