foreword
1. This is a demo of a virtual digital human that can talk in real time. It is built on NeRF (Neural Radiance Fields). For the training method, please refer to my previous blog.
2. Text-to-speech uses VITS speech synthesis, project git: https://github.com/jaywalnut310/vits
3. The language model is the recently open-sourced ChatGLM2-6B (GitHub: THUDM/ChatGLM2-6B, an open bilingual chat LLM). The current project does not integrate it yet.
4. PaddleSpeech is used for voice cloning. Training the cloned voice is fast and needs only a small dataset. The current project does not include voice cloning yet.
5. The current effect:
Real-time dialogue digital human
speech synthesis
1. VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a highly expressive speech synthesis model that combines variational inference, normalizing flows and adversarial training. Instead of a spectrogram, VITS uses latent variables to connect the acoustic model and the vocoder. It models the latent variables stochastically and uses a stochastic duration predictor to increase the diversity of the synthesized speech, so the same input text can be synthesized with different pitch and rhythm.
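To make the "same text, different speech" point concrete, here is a minimal sketch of sampling a latent variable from a predicted mean and log standard deviation, scaled by a noise factor. It uses only the standard library. The function name `sample_latent` and the numbers are made up for illustration. The `noise_scale` argument plays the same role as the `noise_scale=.667` passed to `net_g.infer` in the demo code, but this is not the VITS implementation itself.

```python
import math
import random


def sample_latent(mean, log_std, noise_scale, rng):
    """Draw z = mean + eps * exp(log_std) * noise_scale, with eps ~ N(0, 1)."""
    return [m + rng.gauss(0.0, 1.0) * math.exp(s) * noise_scale
            for m, s in zip(mean, log_std)]


# Hypothetical prior statistics for three latent dimensions.
mean = [0.2, -0.5, 1.0]
log_std = [0.0, -1.0, 0.5]

rng = random.Random(0)
z_a = sample_latent(mean, log_std, noise_scale=0.667, rng=rng)
z_b = sample_latent(mean, log_std, noise_scale=0.667, rng=rng)
z_det = sample_latent(mean, log_std, noise_scale=0.0, rng=rng)

print(z_a != z_b)     # two draws for the same "text" differ
print(z_det == mean)  # with noise_scale=0 the sample collapses to the mean
```

A larger noise scale gives more variation between runs, and a value of 0 makes the output deterministic.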
2. A traditional speech synthesis system chains two separately trained parts: an acoustic model that turns text into acoustic features (typically a spectrogram), and a vocoder that turns those features into a waveform. VITS trains both parts jointly, end to end.
3. The workflow of VITS at inference time is as follows:
- The input text is cleaned and converted into a sequence of pronunciation symbols (for example phonemes).
- The text encoder maps the symbol sequence to a distribution over latent variables, and the stochastic duration predictor decides how long each symbol lasts.
- A sample of the latent variables is passed through the normalizing flow and then to the decoder, which plays the role of the vocoder and generates the speech waveform directly.
The advantage of VITS is that the generated speech is of high quality and sounds fluent. The disadvantage is that it needs a large training corpus and a more complicated training process.
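The text-to-symbol step at the start of the workflow can be sketched as follows. This is a simplified illustration under stated assumptions. The symbol table is hypothetical, and `text_to_ids` is a stand-in for the project's `text_to_sequence`, which also runs text cleaners. `intersperse` mirrors what `commons.intersperse(text_norm, 0)` is used for in the demo code: it puts a blank ID between the symbol IDs when `add_blank` is set.

```python
def text_to_ids(text, symbols):
    """Map each character to its index in the symbol table, skipping unknowns."""
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    return [symbol_to_id[ch] for ch in text if ch in symbol_to_id]


def intersperse(ids, blank_id=0):
    """Insert blank_id before, between and after the given IDs."""
    result = [blank_id] * (len(ids) * 2 + 1)
    result[1::2] = ids
    return result


symbols = ["_", "a", "b", "c"]  # hypothetical table; index 0 is the blank
ids = text_to_ids("cab", symbols)
print(ids)               # [3, 1, 2]
print(intersperse(ids))  # [0, 3, 0, 1, 0, 2, 0]
```

The resulting ID list is what gets wrapped in a `torch.LongTensor` and fed to the model.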
4. After cloning the project, let's try using VITS for speech synthesis. Here we use gradio to build a small demo.
import os
from datetime import datetime

current_path = os.path.dirname(os.path.abspath(__file__))
# Put the bundled ffmpeg on PATH before pydub is imported, so that pydub can find it.
os.environ["PATH"] = os.path.join(current_path, "ffmpeg/bin/") + os.pathsep + os.environ["PATH"]

import re

import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence_with_hps as text_to_sequence
from scipy.io.wavfile import write
from pydub import AudioSegment
import gradio as gr

dir = "data/video/results/"  # output directory for the generated audio
device = torch.device("cpu")  # cpu or mps
hps = utils.get_hparams_from_file("{}/configs/finetune_speaker.json".format(current_path))
hps.data.text_cleaners[0] = 'my_infer_ce_cleaners'
hps.data.n_speakers = 2
symbols = hps.symbols

net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).to(device)
_ = net_g.eval()

# G_latest G_trilingual G_930000 G_953000 G_984000 G_990000 G_1004000 G_1021000
_ = utils.load_checkpoint("C:/code/vrh/models/G_1/G_1.pth", net_g, None)


def add_laug_tag(text):
    '''
    Add language tags to each segment of the text.
    '''
    pattern = r'([\u4e00-\u9fa5,。!?;:、——……()]+|[a-zA-Z,.:()]+|\d+)'
    segments = re.findall(pattern, text)
    for i in range(len(segments)):
        segment = segments[i]
        if re.match(r'^[\u4e00-\u9fa5,。!?;:、——……()]+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)
        elif re.match(r'^[a-zA-Z,.:()]+$', segment):
            segments[i] = "[EN]{}[EN]".format(segment)
        elif re.match(r'^\d+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)  # digits are treated as Chinese
        else:
            segments[i] = "[JA]{}[JA]".format(segment)  # Japanese
    return ''.join(segments)


def get_text(text, hps):
    # Convert the text to a tensor of symbol IDs, optionally interleaved with blanks.
    text_cleaners = ['my_infer_ce_cleaners']
    text_norm = text_to_sequence(text, hps.symbols, text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


def infer_one_audio(text, speaker_id=94, length_scale=1):
    '''
    text: input text (already language-tagged, Chinese, or mixed Chinese and English)
    length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    with torch.no_grad():
        stn_tst = get_text(text, hps)
        x_tst = stn_tst.to(device).unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        sid = torch.LongTensor([speaker_id]).to(device)  # speaker id
        audio = net_g.infer(
            x_tst, x_tst_lengths, sid=sid,
            noise_scale=.667, noise_scale_w=0.8, length_scale=length_scale
        )[0][0, 0].data.cpu().float().numpy()
    return audio


def infer_one_wav(text, speaker_id, length_scale, wav_name):
    '''
    Synthesize the text and write the result to the WAV file wav_name.
    length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    audio = infer_one_audio(text, speaker_id, length_scale)
    write(wav_name, hps.data.sampling_rate, audio)
    print('task done!')


def add_slience(wav_path, slience_len=100):
    # Append slience_len milliseconds of silence to the WAV file, in place.
    silence = AudioSegment.silent(duration=slience_len)
    wav_audio = AudioSegment.from_wav(wav_path)
    wav_audio = wav_audio + silence
    wav_audio.export(wav_path, format="wav")


# if __name__ == '__main__':
#     infer_one_wav(
#         '觉得本教程对你有帮助的话,记得一键三连哦!',
#         speaker_id=0,
#         length_scale=1.2)


def vits(text):
    now = datetime.now()
    timestamp = datetime.timestamp(now)
    # Keep the integer part of timestamp % 20, so file names cycle through 0..19.
    file_name = str(timestamp%20).split('.')[0]
    audio_path = dir + file_name + ".wav"
    infer_one_wav(text, 0, 1.2, audio_path)  # speech synthesis
    return audio_path


inputs = gr.Text()
outputs = gr.Audio(label="Output")
demo = gr.Interface(fn=vits, inputs=inputs, outputs=outputs)
demo.launch()
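The language-tagging idea behind `add_laug_tag` can be tried on its own, without the model. The sketch below is a simplified variant, not the project's function. The name `tag_languages` is made up, and the set of full-width punctuation marks (written as Unicode escapes) is an assumption. Characters that match none of the three classes, such as spaces, are dropped.

```python
import re

# CJK ideographs plus a few full-width punctuation marks (assumed set).
ZH = r"[\u4e00-\u9fa5\u3001\u3002\uff01\uff0c\uff1f]+"
EN = r"[a-zA-Z,.:()]+"
NUM = r"\d+"
SEGMENT = re.compile("({}|{}|{})".format(ZH, EN, NUM))


def tag_languages(text):
    """Wrap each matched segment in [ZH]...[ZH] or [EN]...[EN] tags."""
    tagged = []
    for segment in SEGMENT.findall(text):
        if re.fullmatch(EN, segment):
            tagged.append("[EN]{}[EN]".format(segment))
        else:
            # Chinese text and digit runs are both read as Chinese.
            tagged.append("[ZH]{}[ZH]".format(segment))
    return "".join(tagged)


print(tag_languages("你好world123"))
# [ZH]你好[ZH][EN]world[EN][ZH]123[ZH]
```

Because every segment returned by `findall` already matches one of the three alternatives, a fallback branch for other languages is never reached with a pattern of this shape.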
video synthesis
1. RAD-NeRF synthesizes a talking portrait of the speaker in a video in real time. NeRF itself is a neural-network method that reconstructs a 3D volumetric scene from a set of 2D images.
RAD-NeRF network overview
RAD-NeRF uses a network to predict the color and density of every pixel for the camera viewpoint being rendered. As the camera rotates around the subject, it does this for every viewpoint that needs to be displayed, so many parameters are learned to predict each image coordinate. In this case it is also not just a NeRF generating a 3D scene: the audio input must be matched as well, so that the lips, mouth, eyes and head movements fit what the person is saying.
Instead of directly predicting the density and color of all pixels to match a particular audio frame, the network uses two separate compressed spaces called grid spaces (a grid-based NeRF). The spatial coordinates are encoded into a smaller 3D grid space and the audio into a smaller 2D grid space, and both are then sent to the rendering head. This means the network never merges the audio data with the spatial data in a single grid, which would blow up the size by adding a 2D audio input to every coordinate. Reducing the size of the audio features while keeping audio and spatial features separate is what makes the method more efficient.
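A small back-of-the-envelope calculation shows why keeping the two grids separate helps. The resolution below is a made-up number, and a dense grid is assumed purely for illustration. The real model's feature grids are organized differently, so only the scaling trend carries over.

```python
def joint_grid_cells(resolution, spatial_dims=3, audio_dims=2):
    """Cells in one dense grid over all spatial and audio dimensions."""
    return resolution ** (spatial_dims + audio_dims)


def decomposed_grid_cells(resolution, spatial_dims=3, audio_dims=2):
    """Cells when the spatial grid and the audio grid are stored separately."""
    return resolution ** spatial_dims + resolution ** audio_dims


R = 64  # hypothetical resolution per dimension
print(joint_grid_cells(R))       # 1073741824
print(decomposed_grid_cells(R))  # 266240
```

With these assumed numbers, the separate 3D and 2D grids need about four thousand times fewer cells than a single 5D grid.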
But how can the result be better when a compressed space carries less information? By adding a few controllable features to the NeRF, such as blink control, the model learns more realistic eye behavior than previous methods. This matters a great deal for how authentic the person looks.
The second improvement made by RAD-NeRF (the green rectangle in the model overview) is to model the torso with a separate NeRF instead of the one used for the head. Using the same NeRF for both would require many more parameters and would serve a different need, since the goal here is to animate a moving head rather than a whole body. Because the torso is fairly static in these cases, they use a simpler and more efficient NeRF-based module that works only in 2D, directly in image space, instead of casting camera rays to render many different angles as a NeRF usually does, which the torso does not need. The head and torso are then recombined to produce the final video.
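The recombination of head and torso can be pictured as ordinary alpha blending of layers. The sketch below is an illustration of that idea for a single pixel, under the assumption that the head is layered over the torso and the torso over a background. It is not RAD-NeRF's rendering code, and the function name and values are hypothetical.

```python
def composite_pixel(head_rgb, head_alpha, torso_rgb, torso_alpha, background_rgb):
    """Layer head over torso over background with standard alpha blending."""
    behind = [torso_alpha * t + (1.0 - torso_alpha) * b
              for t, b in zip(torso_rgb, background_rgb)]
    return [head_alpha * h + (1.0 - head_alpha) * x
            for h, x in zip(head_rgb, behind)]


head = [1.0, 1.0, 1.0]
torso = [0.0, 0.0, 0.0]
background = [0.5, 0.5, 0.5]

print(composite_pixel(head, 1.0, torso, 1.0, background))  # opaque head wins
print(composite_pixel(head, 0.0, torso, 1.0, background))  # torso shows through
print(composite_pixel(head, 0.0, torso, 0.0, background))  # only background left
```

Applying the same blend to every pixel of a frame gives the combined image.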
Avatar Adjustment Demonstration Operation
2. After the model is trained, you only need the transforms_train.json file in the data directory and the model file obtained after fine-tuning the torso to start writing the inference code. The steps are as follows:
- Input voice or text (here, to keep the demonstration simple, only the text input interface is implemented).
- Take the input and call an LLM (Large Language Model) to answer it (the project does not include an LLM yet and only has a few fixed answers; the LLM and a local knowledge base will be added later when there is time).
- Run speech synthesis on the answer and generate an .npy file to drive the video.
- Use the data in the .npy file and transforms_train.json to synthesize the video and output it.
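The answering step can be written so that an LLM is easy to plug in later. The sketch below is one possible shape for it, not the project's code. The English replies and the `llm` parameter are placeholders, where `llm` stands for any callable that takes a message and returns a reply.

```python
# Placeholder replies; the project itself uses fixed Chinese answers.
FIXED_ANSWERS = {
    "hello": "Hello, how can I help you?",
    "who are you": "I am a virtual digital human.",
}
FALLBACK = "That question is beyond what I understand right now."


def answer_text(message, llm=None):
    """Return a fixed answer if one exists, else ask the LLM, else fall back."""
    key = message.strip().lower()
    if key in FIXED_ANSWERS:
        return FIXED_ANSWERS[key]
    if llm is not None:
        return llm(message)  # hypothetical hook for a language model
    return FALLBACK


print(answer_text("Hello"))
print(answer_text("What is NeRF?"))
print(answer_text("What is NeRF?", llm=lambda m: "LLM reply to: " + m))
```

The returned text is then what would be passed on to speech synthesis and video generation.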
import gradio as gr
import base64
import time
import json
import gevent
from gevent import pywsgi
from geventwebsocket.handler import WebSocketHandler
from tools import audio_pre_process, video_pre_process, generate_video, audio_process
import os
import re
import numpy as np
import threading
import websocket
from pydub import AudioSegment
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips
import cv2
import pygame
from datetime import datetime

dir = "data/video/results/"  # output directory for the generated audio and video
audio_pre_process()
video_pre_process()

current_path = os.path.dirname(os.path.abspath(__file__))
# Put the bundled ffmpeg on PATH.
os.environ["PATH"] = os.path.join(current_path, "ffmpeg/bin/") + os.pathsep + os.environ["PATH"]

import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence_with_hps as text_to_sequence
from scipy.io.wavfile import write

device = torch.device("cpu")  # cpu or mps
hps = utils.get_hparams_from_file("{}/configs/finetune_speaker.json".format(current_path))
hps.data.text_cleaners[0] = 'my_infer_ce_cleaners'
hps.data.n_speakers = 2
symbols = hps.symbols

net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).to(device)
_ = net_g.eval()

# G_latest G_trilingual G_930000 G_953000 G_984000 G_990000 G_1004000 G_1021000
_ = utils.load_checkpoint("C:/code/vrh/models/G_1/G_1.pth", net_g, None)


def add_laug_tag(text):
    '''
    Add language tags to each segment of the text.
    '''
    pattern = r'([\u4e00-\u9fa5,。!?;:、——……()]+|[a-zA-Z,.:()]+|\d+)'
    segments = re.findall(pattern, text)
    for i in range(len(segments)):
        segment = segments[i]
        if re.match(r'^[\u4e00-\u9fa5,。!?;:、——……()]+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)
        elif re.match(r'^[a-zA-Z,.:()]+$', segment):
            segments[i] = "[EN]{}[EN]".format(segment)
        elif re.match(r'^\d+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)  # digits are treated as Chinese
        else:
            segments[i] = "[JA]{}[JA]".format(segment)  # Japanese
    return ''.join(segments)


def get_text(text, hps):
    # Convert the text to a tensor of symbol IDs, optionally interleaved with blanks.
    text_cleaners = ['my_infer_ce_cleaners']
    text_norm = text_to_sequence(text, hps.symbols, text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


def infer_one_audio(text, speaker_id=94, length_scale=1):
    '''
    text: input text (already language-tagged, Chinese, or mixed Chinese and English)
    length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    with torch.no_grad():
        stn_tst = get_text(text, hps)
        x_tst = stn_tst.to(device).unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        sid = torch.LongTensor([speaker_id]).to(device)  # speaker id
        audio = net_g.infer(
            x_tst, x_tst_lengths, sid=sid,
            noise_scale=.667, noise_scale_w=0.8, length_scale=length_scale
        )[0][0, 0].data.cpu().float().numpy()
    return audio


def infer_one_wav(text, speaker_id, length_scale, wav_name):
    '''
    Synthesize the text and write the result to the WAV file wav_name.
    length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    audio = infer_one_audio(text, speaker_id, length_scale)
    write(wav_name, hps.data.sampling_rate, audio)
    print('task done!')


def add_slience(wav_path, slience_len=100):
    # Append slience_len milliseconds of silence to the WAV file, in place.
    silence = AudioSegment.silent(duration=slience_len)
    wav_audio = AudioSegment.from_wav(wav_path)
    wav_audio = wav_audio + silence
    wav_audio.export(wav_path, format="wav")


def play_audio(audio_file):
    # Play the file through pygame and block until playback has finished.
    pygame.mixer.init()
    pygame.mixer.music.load(audio_file)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.Clock().tick(10)
    pygame.mixer.music.stop()

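
def answer(message, history):
    global dir
    history = history or []
    message = message.lower()
    # Fixed answers stand in for the LLM, which is not integrated yet.
    if message=="你好":  # "Hello"
        response = "你好,有什么可以帮到你吗?"  # "Hello, how can I help you?"
    elif message=="你是谁":  # "Who are you?"
        # "I am the virtual digital human Huanjing; you can call me Xiaojing or Jingjing."
        response = "我是虚拟数字人幻静,你可以叫我小静或者静静。"
    elif message=="你能做什么":  # "What can you do?"
        # "I can chat with you, answer your questions, and do many other things!"
        response = "我可以陪你聊天,回答你的问题,我还可以做很多很多事情!"
    else:
        # "This question is beyond what I understand; I will answer once I have learned more."
        response = "你的这个问题超出了我的理解范围,等我学习后再来回答你。"
    history.append((message, response))
    save_path = text2video(response, dir)
    return history, history, save_path


def text2video(text, dir):
    now = datetime.now()
    timestamp = datetime.timestamp(now)
    # Keep the integer part of timestamp % 20, so file names cycle through 0..19.
    file_name = str(timestamp%20).split('.')[0]
    audio_path = dir + file_name + ".wav"
    infer_one_wav(text, 0, 1.1, audio_path)  # speech synthesis
    audio_process(audio_path)  # extract the audio features that drive the video
    audio_path_eo = dir + file_name + "_eo.npy"
    save_path = generate_video(audio_path_eo, dir, file_name, audio_path)
    return save_path
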

# The .style() calls below follow the Gradio 3.x API.
with gr.Blocks(css="#chatbot{height:300px} .overflow-y-auto{height:500px}") as rxbot:
    with gr.Row():
        video = gr.Video(label="数字人", autoplay=True)  # label: "Digital human"
        with gr.Column():
            state = gr.State([])
            # label: "Message history"
            chatbot = gr.Chatbot(label="消息记录").style(color_map=("green", "pink"))
            # placeholder: "Please enter your question"
            txt = gr.Textbox(show_label=False, placeholder="请输入你的问题").style(container=False)
    txt.submit(fn=answer, inputs=[txt, state], outputs=[chatbot, state, video])

rxbot.launch()
Run the code, open http://127.0.0.1:7860/ and enter some text to get the synthesized video of the answer.
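One detail of the code above is worth knowing when testing: `text2video` names its output files with `str(timestamp % 20).split('.')[0]`, so only 20 different names exist and newer results overwrite older ones. The sketch below reproduces that scheme and adds a hypothetical integer-based variant, `slot_name_int`, which avoids going through the string form of a float.

```python
def slot_name(timestamp, slots=20):
    """Reproduce the demo's naming: text before the '.' in str(timestamp % slots)."""
    return str(timestamp % slots).split('.')[0]


def slot_name_int(timestamp, slots=20):
    """Hypothetical alternative: integer seconds modulo the number of slots."""
    return str(int(timestamp) % slots)


ts = 1689912345.678  # example POSIX timestamp
print(slot_name(ts))      # 5
print(slot_name_int(ts))  # 5
```

When the remainder is very small, Python prints the float in scientific notation, so the string-splitting version can return a name that does not match the integer part. The integer version does not have this edge case.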
source code
1. The current source code includes two models, one for speech synthesis and one for video synthesis. The hardest dependency to install is probably pytorch3d. For that, please refer to my previous blog.
2. The source code was tested successfully on Windows 10 with CUDA 11.7, cuDNN 8.5 and Python 3.10 in a conda environment. Source code download address:
https://download.csdn.net/download/matt45m/88078575
After downloading the source code, create a conda environment:
cd vrh
# Create the virtual environment
conda create --name vrh python=3.10
conda activate vrh
# PyTorch must be installed separately to match your CUDA version, otherwise the GPU cannot be used for training
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
# Install the remaining dependencies
pip install -r requirements.txt
On Windows, pytorch3d still has to be built and installed into the conda environment just created:
git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d
python setup.py install
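Once the installation has finished, a short script can confirm that the key packages are visible from the new environment. This is a convenience sketch, not part of the project. The package list is an assumption based on the imports used in this post.

```python
import importlib.util


def check_environment(packages=("torch", "pytorch3d", "gradio")):
    """Return {package: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


def cuda_available():
    """Report whether PyTorch sees a GPU; False if torch is not installed."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch
    return torch.cuda.is_available()


if __name__ == "__main__":
    for name, found in check_environment().items():
        print("{:10s} {}".format(name, "ok" if found else "missing"))
    print("CUDA available:", cuda_available())
```

If pytorch3d shows as missing, the build step above probably failed and its error output is the place to look.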
If downloading pytorch3d is very slow, you can get it from this Baidu Netdisk link instead: https://pan.baidu.com/s/1z29IgyviQe2KQa6DilnRSA Extraction code: dm4q
If the installation exits with an error, it is recommended to install the Visual Studio build tools. Microsoft C++ Build Tools - Visual Studio: https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/
3. If you are interested in this project or run into any errors during installation, you can join my QQ group: 487350510, and we can discuss it together.