foreword

1. This is a demo of a virtual digital human that can hold a conversation in real time. The avatar is rendered with NeRF (Neural Radiance Fields). For the training method, please refer to my previous blog post.

2. Text-to-speech uses VITS speech synthesis. Project repository: https://github.com/jaywalnut310/vits

3. The language model is the recently open-sourced ChatGLM2-6B (GitHub: THUDM/ChatGLM2-6B, an open bilingual chat LLM). The current project does not integrate this interface yet.

4. PaddleSpeech is used for voice cloning. Its voice-cloning training is very fast and needs only a relatively small dataset. The current project does not include voice cloning yet.

PaddleSpeech on GitHub: https://github.com/PaddlePaddle/PaddleSpeech

5. The current result:

Real-time dialogue digital human

speech synthesis

1. VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) is a highly expressive speech synthesis model that combines variational inference, normalizing flows, and adversarial training. Instead of passing a spectrogram between the acoustic model and the vocoder, VITS connects the two through latent variables. It models those latent variables stochastically and uses a stochastic duration predictor, which increases the diversity of the synthesized speech: the same input text can be synthesized with different pitch and rhythm.
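To make the "same text, different speech" point concrete, here is a minimal sketch of sampling a latent vector from a Gaussian prior. It follows the general form `mean + noise * std * noise_scale`, which is how I understand the `noise_scale` argument used later in this post to work; the function name `sample_latent` and the toy numbers are my own assumptions, not part of VITS.

```python
import math
import random


def sample_latent(mean, log_std, noise_scale, rng):
    """Draw one latent vector: mean + eps * exp(log_std) * noise_scale.

    `mean` and `log_std` stand in for the prior statistics predicted from
    the text; `rng` is a random.Random instance (hypothetical helper).
    """
    return [
        m + rng.gauss(0.0, 1.0) * math.exp(s) * noise_scale
        for m, s in zip(mean, log_std)
    ]


mean = [1.0, 2.0, 3.0]
log_std = [0.0, 0.0, 0.0]

# With noise_scale = 0 the sample collapses to the mean (deterministic output).
flat = sample_latent(mean, log_std, 0.0, random.Random(0))

# With noise_scale > 0, different random draws give different latents,
# which is what lets the same text come out with different prosody.
take_a = sample_latent(mean, log_std, 0.667, random.Random(1))
take_b = sample_latent(mean, log_std, 0.667, random.Random(2))
```

Lowering `noise_scale` trades variety for stability, which is usually what you want for a demo voice.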

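2. The acoustic model is an important part of a speech synthesis system:

In a conventional two-stage pipeline, the acoustic model converts text into acoustic features and a separate vocoder converts those features into a waveform. VITS trains both parts together as one end-to-end model.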

3. The workflow of VITS is as follows:

The input text is cleaned and converted into a sequence of pronunciation symbols (for example, phonemes).
The text encoder maps the symbol sequence to a prior distribution over latent variables, and the stochastic duration predictor decides how long each symbol lasts.
A latent sequence is sampled from that prior, passed through the inverse normalizing flow, and fed to the decoder (which plays the role of the vocoder) to generate the speech waveform.
The advantage of VITS is that the generated speech is high quality and fluent. The disadvantage is that it needs a large training corpus and a fairly complicated training process.
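The duration step in the workflow is what the `length_scale` argument in the demo code controls. The sketch below shows the idea under simplifying assumptions: each symbol has a predicted duration in frames, the duration is multiplied by `length_scale`, rounded up, and the symbol is repeated that many times. The function name `expand_by_duration` and the sample durations are hypothetical.

```python
import math


def expand_by_duration(tokens, durations, length_scale=1.0):
    """Repeat each token for ceil(duration * length_scale) frames."""
    frames = []
    for token, duration in zip(tokens, durations):
        n_frames = math.ceil(duration * length_scale)
        frames.extend([token] * n_frames)
    return frames


tokens = ["n", "i"]
durations = [2, 3]  # predicted frames per token (made-up values)

normal = expand_by_duration(tokens, durations, 1.0)
faster = expand_by_duration(tokens, durations, 0.5)
```

A smaller `length_scale` yields fewer frames and therefore faster speech, which matches the comment in the demo code that a smaller value speaks faster.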

4. After cloning the project, let's try speech synthesis with VITS. Here we use gradio to build a demo.

import os
from datetime import datetime
current_path = os.path.dirname(os.path.abspath(__file__))
# Put the bundled ffmpeg on PATH; os.pathsep is ";" on Windows and ":" elsewhere
os.environ["PATH"] = os.path.join(current_path, "ffmpeg/bin/") + os.pathsep + os.environ["PATH"]
import torch
import commons
import utils
import re
from models import SynthesizerTrn
from text import text_to_sequence_with_hps as text_to_sequence
from scipy.io.wavfile import write
from pydub import AudioSegment
import gradio as gr

dir = "data/video/results/"

device = torch.device("cpu")  # cpu  mps
hps = utils.get_hparams_from_file("{}/configs/finetune_speaker.json".format(current_path))
hps.data.text_cleaners[0] = 'my_infer_ce_cleaners'
hps.data.n_speakers = 2
symbols = hps.symbols
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).to(device)
_ = net_g.eval()
#    G_latest  G_trilingual  G_930000  G_953000 G_984000 G_990000 G_1004000 G_1021000
_ = utils.load_checkpoint("C:/code/vrh/models/G_1/G_1.pth", net_g, None)


def add_laug_tag(text):
    '''
    Wrap each run of text in language tags, e.g. [ZH]...[ZH] or [EN]...[EN].
    '''
    pattern = r'([\u4e00-\u9fa5，。！？；：、——……（）]+|[a-zA-Z,.:()]+|\d+)'
    segments = re.findall(pattern, text)
    for i in range(len(segments)):
        segment = segments[i]
        if re.match(r'^[\u4e00-\u9fa5，。！？；：、——……（）]+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)
        elif re.match(r'^[a-zA-Z,.:()]+$', segment):
            segments[i] = "[EN]{}[EN]".format(segment)
        elif re.match(r'^\d+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)  # treat digits as Chinese
        else:
            # Japanese; never reached, since the pattern only yields the three cases above
            segments[i] = "[JA]{}[JA]".format(segment)

    return ''.join(segments)


def get_text(text, hps):
    text_cleaners = ['my_infer_ce_cleaners']
    text_norm = text_to_sequence(text, hps.symbols, text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


def infer_one_audio(text, speaker_id=0, length_scale=1):
    '''
        text: input text for the configured cleaner
        speaker_id: speaker index; must be smaller than hps.data.n_speakers (2 here)
        length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    with torch.no_grad():
        stn_tst = get_text(text, hps)
        x_tst = stn_tst.to(device).unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        sid = torch.LongTensor([speaker_id]).to(device)  # speaker id
        audio = \
            net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8, length_scale=length_scale)[
                0][0, 0].data.cpu().float().numpy()
        return audio


def infer_one_wav(text, speaker_id, length_scale, wav_name):
    '''
        Synthesize text and write the result to wav_name as a WAV file.
        length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    audio = infer_one_audio(text, speaker_id, length_scale)
    write(wav_name, hps.data.sampling_rate, audio)
    print('task done!')

def add_slience(wav_path, slience_len=100):
    # Append slience_len milliseconds of silence to the WAV file, in place
    silence = AudioSegment.silent(duration=slience_len)
    wav_audio = AudioSegment.from_wav(wav_path)
    wav_audio = wav_audio + silence
    wav_audio.export(wav_path, format="wav")


# if __name__ == '__main__':
#     infer_one_wav(
#         '觉得本教程对你有帮助的话，记得一键三连哦！',
#         speaker_id=0,
#         length_scale=1.2)

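def vits(text):
    # Name the file by the current time (in seconds) modulo 20,
    # so at most 20 output files exist and old ones are overwritten
    file_name = str(int(datetime.now().timestamp()) % 20)
    os.makedirs(dir, exist_ok=True)  # make sure the output directory exists
    audio_path = dir + file_name + ".wav"
    infer_one_wav(text, 0, 1.2, audio_path)  # speech synthesis
    return audio_path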
    
inputs = gr.Text()
outputs = gr.Audio(label="Output")

demo = gr.Interface(fn=vits, inputs=inputs, outputs=outputs)

demo.launch()
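Two small steps in the script above are easier to follow in isolation: tagging each run of text with its language, and interleaving a blank symbol between token ids when `add_blank` is set. The sketch below re-implements both in plain Python as an illustration only. `tag_languages` is a simplified stand-in for `add_laug_tag` (no punctuation handling), and `intersperse` is my own version of what I understand `commons.intersperse` to do; the token ids are made up.

```python
import re

# Runs of CJK characters, ASCII letters, or digits (punctuation is ignored here)
SEGMENT = re.compile(r'([\u4e00-\u9fa5]+|[a-zA-Z]+|\d+)')


def tag_languages(text):
    """Wrap letter runs in [EN] tags and everything else in [ZH] tags."""
    tagged = []
    for segment in SEGMENT.findall(text):
        tag = "EN" if segment.isascii() and segment.isalpha() else "ZH"
        tagged.append("[{0}]{1}[{0}]".format(tag, segment))
    return "".join(tagged)


def intersperse(sequence, item):
    """Place `item` before, between, and after the elements of `sequence`."""
    result = [item] * (len(sequence) * 2 + 1)
    result[1::2] = sequence
    return result


tagged = tag_languages("你好world2023")
padded = intersperse([5, 7, 9], 0)
```

Here `tagged` is `[ZH]你好[ZH][EN]world[EN][ZH]2023[ZH]` and `padded` is `[0, 5, 0, 7, 0, 9, 0]`.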

video synthesis

1. RAD-NeRF can synthesize a talking portrait of the speaker in a video in real time. It builds on NeRF, a neural-network-based method that reconstructs a 3D volumetric scene from a set of 2D images.

RAD-NeRF network overview

RAD-NeRF uses a network to predict the color and density of every pixel for the camera viewpoint being rendered. This is repeated for each viewpoint you want to show as the camera moves around the subject, so many parameters are learned and every coordinate in the image is predicted each time. In this case, a NeRF that only generates a 3D scene is not enough: the audio input must also be matched, so that the lips, mouth, eyes, and movements agree with what the person is saying.

Instead of predicting the density and color of every pixel directly for a given audio frame, the network uses two separate compressed spaces called grid spaces (hence grid-based NeRF). The spatial coordinates are encoded into a small 3D grid space and the audio is encoded into a small 2D grid space, and both are then sent to the rendering head. The network therefore never merges the audio data with the spatial data into one joint space, which would grow the size exponentially by adding the 2D audio input to every coordinate. Shrinking the audio features while keeping audio and spatial features separate is what makes the method more efficient.
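A quick count shows why keeping the two spaces separate matters. The sketch below compares the number of cells in one joint 5D dense grid with the total for a separate 3D grid plus a 2D grid at the same resolution. This is only arithmetic on dense grids for illustration; the resolution of 64 is an arbitrary assumption and the actual RAD-NeRF grid encoding may be organized differently.

```python
def joint_grid_cells(resolution, spatial_dims=3, audio_dims=2):
    """Cells in a single dense grid over all spatial and audio dimensions."""
    return resolution ** (spatial_dims + audio_dims)


def decomposed_grid_cells(resolution, spatial_dims=3, audio_dims=2):
    """Cells in a spatial grid plus a separate audio grid."""
    return resolution ** spatial_dims + resolution ** audio_dims


resolution = 64  # illustrative value only
joint = joint_grid_cells(resolution)
separate = decomposed_grid_cells(resolution)
ratio = joint // separate
```

At this resolution the joint grid has 1,073,741,824 cells while the decomposed pair has 266,240, a difference of more than three orders of magnitude.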

But how can the result be better when the compressed space holds less information? RAD-NeRF adds a few controllable features to the NeRF, such as blink control, so the model learns more realistic eye behavior than previous methods. This matters a great deal for making the rendered person look authentic.

The second improvement in RAD-NeRF (the green rectangle in the model overview) is to model the torso with a separate NeRF instead of the one used for the head. The torso has different requirements and needs far fewer parameters, since the goal here is to animate a moving head rather than a whole body. Because the torso is fairly static in these videos, they use a simpler and more efficient NeRF-based module that works only in 2D, directly in image space, instead of casting camera rays to render many different angles as NeRF usually does, which the torso does not need. Finally, the head and torso are recombined to produce the final video.
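The final recombination can be pictured as ordinary alpha compositing in image space. The sketch below blends one pixel: the torso is placed over the background, then the head is placed over that result. This is a simplified illustration under my own assumptions (per-pixel alpha values in the range 0 to 1, RGB tuples), not the project's rendering code, and `composite_pixel` is a hypothetical name.

```python
def blend(front_rgb, front_alpha, back_rgb):
    """Standard 'over' blend of one RGB pixel onto another."""
    return tuple(
        front_alpha * f + (1.0 - front_alpha) * b
        for f, b in zip(front_rgb, back_rgb)
    )


def composite_pixel(head_rgb, head_alpha, torso_rgb, torso_alpha, background_rgb):
    """Torso over background first, then head over the combined result."""
    behind_head = blend(torso_rgb, torso_alpha, background_rgb)
    return blend(head_rgb, head_alpha, behind_head)


white, black, grey = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.5, 0.5, 0.5)

opaque_head = composite_pixel(white, 1.0, grey, 1.0, black)   # head hides everything
torso_only = composite_pixel(white, 0.0, grey, 1.0, black)    # head absent at this pixel
soft_edge = composite_pixel(white, 0.5, black, 1.0, black)    # half-transparent head edge
```

Because the torso is blended first, it only shows where the head is transparent, which matches treating the torso as a mostly static layer behind the animated head.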

Avatar Adjustment Demonstration Operation

2. After the model is trained, you only need the transforms_train.json file in the data directory and the model file obtained after fine-tuning the body (torso) to start writing the inference code. The steps are as follows:

Input voice or text (for convenience, this demo only implements the text input interface).
Take the input and call an LLM (Large Language Model) to answer. The project has not integrated an LLM yet and only has a few fixed answers; an LLM and a local knowledge base will be added later when time permits.
Run speech synthesis on the answer and generate an .npy file to drive the video.
Use the data in the .npy file and transforms_train.json to synthesize and output the video.
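The four steps above can be sketched as one small function that wires the stages together. Every stage is passed in as a callable, so the sketch runs with stand-ins and does not depend on the real models. The name `run_pipeline` and its parameters are hypothetical; only the `.wav` and `_eo.npy` naming follows the script in this post.

```python
from pathlib import Path


def run_pipeline(question, answer_fn, tts_fn, feature_fn, render_fn, out_dir, stem):
    """Question -> answer text -> wav -> .npy features -> video path."""
    out_dir = Path(out_dir)
    wav_path = out_dir / (stem + ".wav")
    npy_path = out_dir / (stem + "_eo.npy")

    reply = answer_fn(question)          # step 2: fixed answers or an LLM
    tts_fn(reply, wav_path)              # step 3a: speech synthesis
    feature_fn(wav_path, npy_path)       # step 3b: audio -> .npy features
    video = render_fn(npy_path, wav_path)  # step 4: drive the avatar
    return reply, video


# Stand-in stages that only record what they were asked to do
calls = []
reply, video = run_pipeline(
    "hi",
    answer_fn=lambda q: "hello",
    tts_fn=lambda text, wav: calls.append(("tts", text, wav.name)),
    feature_fn=lambda wav, npy: calls.append(("features", wav.name, npy.name)),
    render_fn=lambda npy, wav: "out.mp4",
    out_dir="results",
    stem="7",
)
```

Passing the stages in as arguments keeps the order of operations visible and makes it easy to swap the fixed answers for an LLM later without touching the rest.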

import gradio as gr
import base64
import time
import json
import gevent
from gevent import pywsgi
from geventwebsocket.handler import WebSocketHandler
from tools import audio_pre_process, video_pre_process, generate_video, audio_process
import os
import re
import numpy as np
import threading
import websocket
from pydub import AudioSegment
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips
import cv2
import pygame
from datetime import datetime

dir = "data/video/results/"


audio_pre_process()
video_pre_process()
current_path = os.path.dirname(os.path.abspath(__file__))
# Put the bundled ffmpeg on PATH; os.pathsep is ";" on Windows and ":" elsewhere
os.environ["PATH"] = os.path.join(current_path, "ffmpeg/bin/") + os.pathsep + os.environ["PATH"]
import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence_with_hps as text_to_sequence
from scipy.io.wavfile import write

device = torch.device("cpu")  # cpu  mps
hps = utils.get_hparams_from_file("{}/configs/finetune_speaker.json".format(current_path))
hps.data.text_cleaners[0] = 'my_infer_ce_cleaners'
hps.data.n_speakers = 2
symbols = hps.symbols
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).to(device)
_ = net_g.eval()
#    G_latest  G_trilingual  G_930000  G_953000 G_984000 G_990000 G_1004000 G_1021000
_ = utils.load_checkpoint("C:/code/vrh/models/G_1/G_1.pth", net_g, None)


def add_laug_tag(text):
    '''
    Wrap each run of text in language tags, e.g. [ZH]...[ZH] or [EN]...[EN].
    '''
    pattern = r'([\u4e00-\u9fa5，。！？；：、——……（）]+|[a-zA-Z,.:()]+|\d+)'
    segments = re.findall(pattern, text)
    for i in range(len(segments)):
        segment = segments[i]
        if re.match(r'^[\u4e00-\u9fa5，。！？；：、——……（）]+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)
        elif re.match(r'^[a-zA-Z,.:()]+$', segment):
            segments[i] = "[EN]{}[EN]".format(segment)
        elif re.match(r'^\d+$', segment):
            segments[i] = "[ZH]{}[ZH]".format(segment)  # treat digits as Chinese
        else:
            # Japanese; never reached, since the pattern only yields the three cases above
            segments[i] = "[JA]{}[JA]".format(segment)

    return ''.join(segments)


def get_text(text, hps):
    text_cleaners = ['my_infer_ce_cleaners']
    text_norm = text_to_sequence(text, hps.symbols, text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    text_norm = torch.LongTensor(text_norm)
    return text_norm


def infer_one_audio(text, speaker_id=0, length_scale=1):
    '''
        text: input text for the configured cleaner
        speaker_id: speaker index; must be smaller than hps.data.n_speakers (2 here)
        length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    with torch.no_grad():
        stn_tst = get_text(text, hps)
        x_tst = stn_tst.to(device).unsqueeze(0)
        x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).to(device)
        sid = torch.LongTensor([speaker_id]).to(device)  # speaker id
        audio = \
            net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8, length_scale=length_scale)[
                0][0, 0].data.cpu().float().numpy()
        return audio


def infer_one_wav(text, speaker_id, length_scale, wav_name):
    '''
        Synthesize text and write the result to wav_name as a WAV file.
        length_scale: speaking rate; the smaller the value, the faster the speech
    '''
    audio = infer_one_audio(text, speaker_id, length_scale)
    write(wav_name, hps.data.sampling_rate, audio)
    print('task done!')


def add_slience(wav_path, slience_len=100):
    # Append slience_len milliseconds of silence to the WAV file, in place
    silence = AudioSegment.silent(duration=slience_len)
    wav_audio = AudioSegment.from_wav(wav_path)
    wav_audio = wav_audio + silence
    wav_audio.export(wav_path, format="wav")

def play_audio(audio_file):
    pygame.mixer.init()
    pygame.mixer.music.load(audio_file)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        pygame.time.Clock().tick(10)
    pygame.mixer.music.stop()

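def answer(message, history):
    history = history or []
    message = message.lower()
    # A few fixed question/answer pairs stand in for the LLM for now
    if message=="你好":  # "Hello"
        response = "你好，有什么可以帮到你吗?"  # "Hello, how can I help you?"

    elif message=="你是谁":  # "Who are you?"
        response = "我是虚拟数字人幻静，你可以叫我小静或者静静。"  # "I am the virtual digital human Huanjing; call me Xiaojing or Jingjing."

    elif message=="你能做什么":  # "What can you do?"
        response = "我可以陪你聊天，回答你的问题，我还可以做很多很多事情！"  # "I can chat with you, answer questions, and much more!"

    else:
        response = "你的这个问题超出了我的理解范围，等我学习后再来回答你。"  # "That is beyond what I understand; I will answer after I learn more."

    history.append((message, response))

    # Synthesize speech for the answer and render the talking-head video
    save_path = text2video(response,dir)

    return history,history,save_path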

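def text2video(text,dir):
    # Name the files by the current time (in seconds) modulo 20,
    # so at most 20 outputs exist and old ones are overwritten
    file_name = str(int(datetime.now().timestamp()) % 20)
    os.makedirs(dir, exist_ok=True)  # make sure the output directory exists
    audio_path = dir + file_name + ".wav"
    infer_one_wav(text,0,1.1,audio_path)  # speech synthesis

    audio_process(audio_path)  # writes the .npy file used to drive the video
    audio_path_eo = dir+file_name+"_eo.npy"

    save_path = generate_video(audio_path_eo, dir, file_name,audio_path)

    return save_path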


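# Written against the Gradio 3.x API; the .style() calls may need to be dropped on newer Gradio releases
with gr.Blocks(css="#chatbot{height:300px} .overflow-y-auto{height:500px}") as rxbot: 
    with gr.Row():
        video = gr.Video(label = "数字人",autoplay = True)  # label: "Digital human"
        with gr.Column():
            state = gr.State([])
            chatbot = gr.Chatbot(label = "消息记录").style(color_map=("green", "pink"))  # label: "Message history"
            txt = gr.Textbox(show_label=False, placeholder="请输入你的问题").style(container=False)  # placeholder: "Please enter your question"
    # On submit, answer() returns the history twice (chat display and state) plus the video path
    txt.submit(fn = answer, inputs = [txt, state], outputs = [chatbot, state,video])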
    
rxbot.launch()

Run the code, open http://127.0.0.1:7860/ , and enter some text to get the synthesized video of the answer.
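Since an LLM is planned for later, one way to prepare for it is to keep the fixed answers in a dictionary and fall through to an optional LLM callable. This is only a sketch of that idea, not the project's code: `reply_to`, `FIXED_ANSWERS`, and the `llm` parameter are hypothetical names, while the question and answer strings are the ones used in the script above.

```python
FIXED_ANSWERS = {
    "你好": "你好，有什么可以帮到你吗?",
    "你是谁": "我是虚拟数字人幻静，你可以叫我小静或者静静。",
    "你能做什么": "我可以陪你聊天，回答你的问题，我还可以做很多很多事情！",
}
FALLBACK = "你的这个问题超出了我的理解范围，等我学习后再来回答你。"


def reply_to(message, llm=None):
    """Return a fixed answer if one exists, else ask `llm`, else fall back."""
    key = message.strip().lower()
    if key in FIXED_ANSWERS:
        return FIXED_ANSWERS[key]
    if llm is not None:
        return llm(message)  # any callable that maps text to text
    return FALLBACK


greeting = reply_to("你好")
unknown = reply_to("今天天气怎么样")
with_llm = reply_to("今天天气怎么样", llm=lambda text: "LLM: " + text)
```

With this shape, adding ChatGLM2-6B later would only mean passing a function that wraps the model as `llm`.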

source code

1. The source code currently includes two models: speech synthesis and video synthesis. The hardest dependency to set up is probably pytorch3d. For this, please refer to my previous blog post:

Digital Human Solution - 3D Reconstruction Digital Human Source Code and Training Method Based on Live Video

2. The source code was tested successfully on Windows 10 with CUDA 11.7, cuDNN 8.5, Python 3.10, and conda. Source code download address:

https://download.csdn.net/download/matt45m/88078575

After downloading the source code, create a conda environment:

cd vrh
# Create the virtual environment
conda create --name vrh python=3.10
conda activate vrh
# Install PyTorch separately, matched to your CUDA version; otherwise the GPU cannot be used for training
conda install pytorch==2.0.0 torchvision==0.15.0 torchaudio==2.0.0 pytorch-cuda=11.7 -c pytorch -c nvidia
conda install -c fvcore -c iopath -c conda-forge fvcore iopath
# Install the remaining dependencies
pip install -r requirements.txt

pytorch3d also has to be installed on Windows, inside the conda environment just created:

git clone https://github.com/facebookresearch/pytorch3d.git
cd pytorch3d
python setup.py install
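After the install steps, a quick check can confirm that the key packages are visible to the interpreter in the new environment. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper, and the list of package names to check is my assumption based on the imports used in this post.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Sanity check of the helper itself with one real and one fake module
probe = missing_modules(["json", "definitely_not_a_real_module_xyz"])

if __name__ == "__main__":
    # Packages this post relies on (adjust to your requirements.txt)
    needed = ["torch", "pytorch3d", "gradio", "pydub", "scipy"]
    missing = missing_modules(needed)
    print("missing:", missing if missing else "none")
```

This only checks that the modules can be located; it does not verify that PyTorch can actually use the GPU.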

If downloading pytorch3d is very slow, you can use this Baidu network disk to download: Link: https://pan.baidu.com/s/1z29IgyviQe2KQa6DilnRSA Extraction code: dm4q

If the installation exits with an error, it is recommended to install the Visual Studio build tools (Microsoft C++ Build Tools): https://visualstudio.microsoft.com/zh-hans/visual-cpp-build-tools/

3. If you are interested in this project or run into any errors during installation, you can join my QQ group: 487350510, and we can discuss it together.

Digital human solution - NeRF realizes real-time dialogue digital human environment configuration and source code
