AI speech synthesis: implementing a Q&A anchor on Bilibili (Station B)

A Q&A anchor at Station B (experimental)

Idea:

  • Reference: "Teach you step-by-step to create your own AI virtual anchor" (https://blog.csdn.net/weixin_53072519/article/details/130872710). Following the instructions in that post and the GPT-vup project, I started the first live broadcast of my life on Bilibili. The broadcast itself ran fine, but controlling the model's expressions, voice, and movements failed, perhaps because the OpenAI key was invalid, perhaps for other reasons. In short, it was not successful.
  • The GPT-vup project on GitHub: https://github.com/jiran214/GPT-vup
  • Another reference: "Write an AI virtual anchor: understand the barrage, witty remarks, joys and sorrows, with a simple implementation" (https://blog.csdn.net/u012419550/article/details/129254795), which I planned to follow in practice. Given how many components that article covers, though, the chance of the project failing rises considerably.
  • Therefore, the plan is to use only three pieces: a danmaku (barrage) reading module, a ChatGPT module, and a text-to-speech module. Read the danmaku, generate a reply, convert the text into speech, and output it to the users in the live broadcast room through the avatar's voice. (A skeleton of this pipeline is sketched after this list.)
  • Once this basic function works, consider using the modules from that article to control the model's mouth shape, movements, expressions, and so on.
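The minimal pipeline I have in mind looks like the skeleton below. This is only a sketch of the plan above: get_reply, text_to_speech, and play_audio are hypothetical placeholders for the chat, TTS, and playback pieces that the log entries below actually implement.

def get_reply(question: str) -> str:
    '''Placeholder: send the danmaku text to the chat model and return its answer'''
    ...

def text_to_speech(text: str, out_path: str = "reply.wav") -> str:
    '''Placeholder: synthesize the answer into a wav file and return its path'''
    ...

def play_audio(path: str) -> None:
    '''Placeholder: play the wav so the virtual sound card / VTS picks it up'''
    ...

def on_danmaku(text: str) -> None:
    '''Glue code: one danmaku message in, one spoken answer out'''
    play_audio(text_to_speech(get_reply(text)))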

Process:


– 20230913

  • To obtain the danmaku (barrage) of the Bilibili live broadcast room, the bilibili-api-python library is used: https://pypi.org/project/bilibili-api-python/; its live-room API examples are at https://nemo2011.github.io/bilibili-api/#/examples/live (a minimal reading sketch follows this list)
  • To reply to the danmaku text I planned to use ChatGPT, but due to network restrictions iFlytek's Spark is used here instead. See the official documentation: https://www.xfyun.cn/doc/platform/quickguide.html. Remember to place an order and buy tokens before using it
  • To convert text to speech (TTS), iFlytek's service is used as well: https://console.xfyun.cn/services/tts
  • To let VTS receive the audio played by the computer, use VB-CABLE to set up a virtual sound card: https://vb-audio.com/Cable/
  • VTS has its own audio driver, so Python just needs to play the converted audio, and both the avatar and the voice come through in the live broadcast room! (A playback sketch also follows this list)
  • For VTS settings, refer to: https://www.bilibili.com/video/BV1jf4y1p743
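A minimal sketch of the danmaku-reading part, following the bilibili-api-python live examples linked above. ROOM_ID is a placeholder, and the exact layout of the DANMU_MSG payload may differ between library versions, so treat the field indexing as an assumption to verify.

from bilibili_api import live, sync

ROOM_ID = 12345  # placeholder: your live room id

room = live.LiveDanmaku(ROOM_ID)

@room.on("DANMU_MSG")
async def on_danmaku(event):
    # The sender name and message text sit inside the raw "info" payload
    info = event["data"]["info"]
    user, text = info[2][1], info[1]
    print(f"{user}: {text}")
    # TODO: hand `text` to the reply + TTS pipeline here

sync(room.connect())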

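For the playback step, a minimal sketch assuming the sounddevice and soundfile packages are installed. The device name is an assumption: check sounddevice.query_devices() for whatever name VB-CABLE registers on your machine, or drop the device argument to use the default output.

import sounddevice as sd
import soundfile as sf

# Assumed device name; route playback into the VB-CABLE input so VTS "hears" it
OUTPUT_DEVICE = "CABLE Input (VB-Audio Virtual Cable)"

def play_audio(path: str) -> None:
    data, samplerate = sf.read(path, dtype="float32")
    sd.play(data, samplerate, device=OUTPUT_DEVICE)
    sd.wait()  # block until playback finishes

play_audio("reply.wav")  # hypothetical wav produced by the TTS step
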
– 20230914

  • iFlytek's free voice is too mechanical to bear, so I plan to set up a TTS model locally and use it for the voice playback.
  • Reference: https://juejin.cn/post/6872260516966842382
  • Real-Time-Voice-Cloning project: https://github.com/CorentinJ/Real-Time-Voice-Cloning
  • Install PyTorch. Since my machine only has Intel integrated graphics, I use the CPU version: pip install torch torchvision torchaudio
  • Install the dependencies required for the project: pip install -r requirements.txt
  • Note that this machine originally had a Python 3.10 environment, while Real-Time-Voice-Cloning requires 3.7; I switched to 3.8.9 here to make it easier to run another project later.
  • Finally, after switching the Python version, reinstalling the dependencies, and many rounds of adjustment, the working requirements.txt settled on the following:
inflect==5.3.0
librosa==0.8.1
matplotlib==3.5.1
numpy==1.20.3
numba==0.51.2
Pillow==10.0.0
PyQt5==5.15.6
scikit-learn==1.0.2
scipy==1.7.3
sounddevice==0.4.3
SoundFile==0.10.3.post1
tqdm==4.66.1
umap-learn==0.5.2
Unidecode==1.3.2
urllib3==1.26.7
visdom==0.1.8.9
webrtcvad==2.0.10
lws==1.2.7
pysoundfile==0.9.0.post1
  • Create a new saved_models\default directory under the project root, download the pre-trained models, and put them in this directory
  • Find a reference voice recording (wav or flac format is required) and write the code that generates the speech:
import os.path

from IPython.core.display_functions import display
from IPython.display import Audio
from IPython.utils import io
from synthesizer.inference import Synthesizer
from encoder import inference as encoder
from vocoder import inference as vocoder
from pathlib import Path
import numpy as np
import librosa
import soundfile as sf

'''Load the pretrained encoder, synthesizer, and vocoder models'''
encoder_weights = Path("../saved_models/default/encoder.pt")
vocoder_weights = Path("../saved_models/default/vocoder.pt")
syn_dir = Path("../saved_models/default/synthesizer.pt")
encoder.load_model(encoder_weights)
synthesizer = Synthesizer(syn_dir)
vocoder.load_model(vocoder_weights)


def synth(text):
    '''Synthesize speech for `text`, using the reference audio as the target voice'''
    #in_fpath = Path("./target-1.wav")
    in_fpath = os.path.join(r'E:\\Real-Time-Voice-Cloning\调用模型合成语音','652-130726-0004.flac')
    # Load the reference audio and compute the speaker embedding
    original_wav, sampling_rate = librosa.load(in_fpath)
    preprocessed_wav = encoder.preprocess_wav(original_wav, sampling_rate)
    embed = encoder.embed_utterance(preprocessed_wav)
    print("Synthesizing new audio...")
    # Text + speaker embedding -> mel spectrogram -> waveform
    with io.capture_output() as captured:
        specs = synthesizer.synthesize_spectrograms([text], [embed])
    generated_wav = vocoder.infer_waveform(specs[0])
    # Pad with one second of silence so the end is not cut off
    generated_wav = np.pad(generated_wav, (0, synthesizer.sample_rate), mode="constant")
    #librosa.output.write_wav("output_trump_voice.wav", generated_wav, synthesizer.sample_rate)
    sf.write('output_trump_voice.wav', generated_wav, synthesizer.sample_rate, 'PCM_24')
    display(Audio(generated_wav, rate=synthesizer.sample_rate))


synth("oh......Hello, hahahahahaha,  haven't seen you for a long time, how are you? what are you busy with recently?")

The final effect: the English speech is acceptable, and its quality depends on the quality of the reference audio you provide.

Chinese, however, falls flat... this model does not understand Chinese...

