A deep dive into the artificial intelligence "reading machine": a new chapter in speech synthesis

Table of contents

 The principle of the artificial intelligence "reading machine"

 Implementation of the artificial intelligence "reading machine"

0 Background introduction

1 Project demonstration

1.1 Real-time display of the AI "reader"

1.2 Recorded demo of the AI "reader"

1.3 Installation diagram of the Intel AI BOX and camera recognition equipment

2 Introduction to the deployment device

3 Deployment process

3.1 Calling the speech synthesis model from PaddleHub

3.2 Calling the speech synthesis model from PaddleSpeech


In modern society, we increasingly rely on voice interaction as the primary way to communicate with computers, mobile devices, smart homes, and more. Among these technologies, the artificial intelligence "reading machine" has become a key one because it can convert text into natural, fluent speech. Let's take a closer look at the principles, implementation, and future development of the AI reader.

 The principle of the artificial intelligence "reading machine"

The core technology of the AI reading machine is speech synthesis (TTS, Text-to-Speech), which converts text information into audible speech. The process can generally be divided into two stages: text analysis and speech synthesis.

  • Text Analysis : At this stage, the AI needs to understand the input text. This includes parsing sentence structure, segmenting words, determining how each word is pronounced, and inferring the emotion and intonation of the sentence (a toy sketch of this stage follows this list).

  • Speech Synthesis : After text analysis, the AI converts the analysis results into audio. This step simulates the human vocal mechanism, including timbre, pitch, speaking rate, and so on.
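
To make the text-analysis stage more concrete, here is a toy sketch (an illustration only, not part of this project) that segments a Chinese sentence into words and converts it to pinyin with tone numbers, which is roughly what a TTS front end does before the acoustic model sees the text. It assumes the jieba and pypinyin packages are installed.

# Toy illustration of the text-analysis stage: word segmentation + grapheme-to-phoneme.
# Assumes `pip install jieba pypinyin`; not the front end actually used in this project.
import jieba
from pypinyin import lazy_pinyin, Style

text = "今天天气真好。"
words = list(jieba.cut(text))                  # word segmentation
phones = lazy_pinyin(text, style=Style.TONE3)  # pinyin with tone numbers

print(words)   # e.g. ['今天', '天气', '真', '好', '。']
print(phones)  # e.g. ['jin1', 'tian1', 'tian1', 'qi4', 'zhen1', 'hao3', '。']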

 Implementation of the artificial intelligence "reading machine"

Most modern AI reading machines are built on deep learning. These models learn to mimic the way humans speak from large amounts of recorded speech data. Two of the best-known models are Google's Tacotron and WaveNet.

  • Tacotron : An end-to-end text-to-speech model. It takes character-level input, passes it through a sequence-to-sequence network, and outputs a spectrogram of the speech.

  • WaveNet : A waveform-generation model. It uses a deep neural network to generate audio waveforms probabilistically, sample by sample.

These two models are often used together: Tacotron first converts the text into a spectrogram, and WaveNet then converts the spectrogram into an audio waveform.
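
As a conceptual sketch of this two-stage pipeline (the tacotron and wavenet objects below are hypothetical placeholders standing in for trained models, not a real library API):

# Conceptual sketch only: `tacotron` and `wavenet` stand for trained model objects.
def synthesize(text, tacotron, wavenet):
    """Text -> spectrogram (Tacotron) -> waveform (WaveNet vocoder)."""
    spectrogram = tacotron.infer(text)      # e.g. an (n_frames, n_mels) array
    waveform = wavenet.infer(spectrogram)   # 1-D array of audio samples
    return waveform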

0 Background introduction

Nowadays, with the rise of all kinds of popular apps, "listening to books" has become a new way of reading. However, compared with e-book software, "listening to" a physical book involves many more difficulties.

For example, e-book software naturally starts from accurate text input, so it only needs to solve the speech synthesis problem. Of course, even this seemingly simple step is not simple at all: word segmentation and sentence segmentation must be handled well, and the speech synthesis model has to be trained on suitable datasets, and so on.
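
For instance, sentence segmentation before synthesis can be as simple as splitting on sentence-ending punctuation. Below is a minimal sketch for illustration; the segmentation logic used by real TTS front ends is usually more involved.

# Minimal sentence segmentation before TTS: split on Chinese/Western sentence-ending
# punctuation so each chunk is short enough for the speech synthesis model.
import re

def split_sentences(text):
    parts = re.split(r'(?<=[。！？!?.])', text)
    return [p.strip() for p in parts if p.strip()]

print(split_sentences("第一句。第二句！第三句？"))
# ['第一句。', '第二句！', '第三句？']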

In contrast, "listening to books" from physical books adds a few more layers of difficulty-how to do accurate OCR recognition? How to reformat the synthesized voice of broken line? How to achieve point reading? How to ensure real-time performance?

In this new series, we will explore how to implement a "point reader" with a good user experience based on the Paddle model library.

Following the usual principle of going from easy to hard, let us first build the simplest AI "reading machine".

1 Project demonstration

In this project, based on PaddleOCR + PaddleSpeech (PaddleHub) + OpenVINO, we deployed a simple AI "reader" on an Intel AI BOX edge device. Whether the input is a physical book or a sheet of A4 paper, the user only needs to place it in front of the camera and adjust its position according to the recognition preview. When the user presses the space bar, the device reads the text aloud and captures the recognition frame at that moment.

1.1 Real-time display of the AI "reader"

1.2 AI "Reader" running effect display

In [1]

from IPython.display import Video

In [2]

Video('2022-05-09 02-16-02.mkv')
<IPython.core.display.Video object>

In [3]

Video('2022-05-09 02-13-59.mkv')
<IPython.core.display.Video object>

1.3 Installation diagram of the Intel AI BOX and camera recognition equipment

2 Introduction to the deployment device

Reference Link: ASB210-953-AI Edge AI Computing System Equipped with 11th Generation Intel® Core™ U-Series CPU and Hailo-8™ M.2 AI Acceleration Module

We can use the Intel AI BOX as the host machine. Since it comes pre-installed with Windows, it can be treated like the PC that most people commonly use.

Of course, since it comes pre-installed with Windows, we will only be able to use PaddleSpeech's basic speech synthesis functionality during deployment.

PaddleSpeech offers three installation methods. By installation difficulty, they can be classified as easy, medium, and hard:

Method | Supported functions | Supported systems
easy | (1) Use the PaddleSpeech command-line tools. (2) Try PaddleSpeech on AI Studio. | Linux, Mac (M1 chip not supported), Windows
medium | Supports the main functions of PaddleSpeech, such as using the models in the existing examples and training your own models with PaddleSpeech. | Linux
hard | Supports all functions of PaddleSpeech, including decoding with kaldi using the join ctc decoder, training language models, using forced alignment, etc. And you become a better developer! | Ubuntu

3 Deployment process

3.1 Calling the speech synthesis model from PaddleHub

  • Install Parakeet first
    • At present, PaddleSpeech has only just merged Parakeet, and the Chinese and English speech synthesis pre-trained models we want to use were originally trained with Parakeet. The difference from PaddleHub is that PaddleSpeech's installation process is essentially "invisible" to the user.
  • Prepare the nltk_data files
    • By contrast, a big step forward in PaddleSpeech is that preparing nltk_data is also "invisible" to the user: it directly pulls Baidu's mirror of the files for installation, which is very considerate!
  • Set up the screenshot and speech synthesis key
    • For printed text, OCR accuracy is still hard to control, so recognizing every frame directly would not give good results. Instead, the user selects a frame whose recognition result looks relatively stable via keyboard control, and that screenshot is passed into the speech synthesis pipeline to start reading aloud (see the sketch after this list).
    • This is also a user-experience consideration; otherwise the whole process would feel very laggy.
  • Play continuous audio with playsound
    • The PaddleOCR pipeline we use still only recognizes text line by line, so we either restore the line breaks (merge broken lines) before sending the text to speech synthesis, or play the per-line audio clips back to back. I personally recommend the latter: the former requires a lot of lexical and layout analysis, and unless the scenario demands very high pronunciation accuracy, the development cost is not worth it.
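
The following is a simplified sketch of the "press space to read aloud" loop described above. It assumes opencv-python and playsound are installed, and recognize_text() and synthesize() are placeholders for the PaddleOCR and speech synthesis calls used in the actual can_new.py script, which may be organized differently.

# Simplified sketch of the capture-and-speak loop; the OCR and TTS calls are placeholders.
import cv2
from playsound import playsound

def recognize_text(frame):
    """Placeholder: run PaddleOCR on the frame and return a list of text lines."""
    raise NotImplementedError

def synthesize(line, index):
    """Placeholder: synthesize one text line and return the path of the wav file."""
    raise NotImplementedError

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv2.imshow("AI reader", frame)
    key = cv2.waitKey(1) & 0xFF
    if key == ord(' '):                     # space bar: freeze this frame and read it
        lines = recognize_text(frame)       # one OCR result per detected text line
        for i, line in enumerate(lines):
            playsound(synthesize(line, i))  # play the per-line wav files back to back
    elif key == 27:                         # Esc: quit
        break
cap.release()
cv2.destroyAllWindows()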

In this project, the relevant deployment code has been packaged in the PaddleCode.zip file. After downloading it and completing the installation, the reader can run the command python can_new.py to start the AI "reader".

3.2 Calling the speech synthesis model from PaddleSpeech

This project also provides another way to output synthesized speech, using PaddleSpeech. The key code is as follows:

from paddlespeech.cli import TTSExecutor

# Pass in a TTSExecutor() instance
run_paddle_ocr(source=0, flip=False, use_popup=True, tts=TTSExecutor())

Inside the run_paddle_ocr() function, the audio files are synthesized by calling the Python API:

wav_file = tts_executor(
    text=txts[i],
    output='wavs/' + str(i) + '.wav',
    am='fastspeech2_csmsc',
    am_config=None,
    am_ckpt=None,
    am_stat=None,
    spk_id=0,
    phones_dict=None,
    tones_dict=None,
    speaker_dict=None,
    voc='pwgan_csmsc',
    voc_config=None,
    voc_ckpt=None,
    voc_stat=None,
    lang='zh',
    device=paddle.get_device())

The advantage of this approach is that more pre-trained models can be selected. We can inspect the source code of TTSExecutor():

class TTSExecutor(BaseExecutor):
    def __init__(self):
        super().__init__()

        self.parser = argparse.ArgumentParser(
            prog='paddlespeech.tts', add_help=True)
        self.parser.add_argument(
            '--input', type=str, default=None, help='Input text to generate.')
        # acoustic model
        self.parser.add_argument(
            '--am',
            type=str,
            default='fastspeech2_csmsc',
            choices=[
                'speedyspeech_csmsc',
                'fastspeech2_csmsc',
                'fastspeech2_ljspeech',
                'fastspeech2_aishell3',
                'fastspeech2_vctk',
                'tacotron2_csmsc',
                'tacotron2_ljspeech',
            ],
            help='Choose acoustic model type of tts task.')
        self.parser.add_argument(
            '--am_config',
            type=str,
            default=None,
            help='Config of acoustic model. Use deault config when it is None.')
        self.parser.add_argument(
            '--am_ckpt',
            type=str,
            default=None,
            help='Checkpoint file of acoustic model.')
        self.parser.add_argument(
            "--am_stat",
            type=str,
            default=None,
            help="mean and standard deviation used to normalize spectrogram when training acoustic model."
        )
        self.parser.add_argument(
            "--phones_dict",
            type=str,
            default=None,
            help="phone vocabulary file.")
        self.parser.add_argument(
            "--tones_dict",
            type=str,
            default=None,
            help="tone vocabulary file.")
        self.parser.add_argument(
            "--speaker_dict",
            type=str,
            default=None,
            help="speaker id map file.")
        self.parser.add_argument(
            '--spk_id',
            type=int,
            default=0,
            help='spk id for multi speaker acoustic model')
        # vocoder
        self.parser.add_argument(
            '--voc',
            type=str,
            default='pwgan_csmsc',
            choices=[
                'pwgan_csmsc',
                'pwgan_ljspeech',
                'pwgan_aishell3',
                'pwgan_vctk',
                'mb_melgan_csmsc',
                'style_melgan_csmsc',
                'hifigan_csmsc',
                'hifigan_ljspeech',
                'hifigan_aishell3',
                'hifigan_vctk',
                'wavernn_csmsc',
            ],
            help='Choose vocoder type of tts task.')

        self.parser.add_argument(
            '--voc_config',
            type=str,
            default=None,
            help='Config of voc. Use deault config when it is None.')
        self.parser.add_argument(
            '--voc_ckpt',
            type=str,
            default=None,
            help='Checkpoint file of voc.')
        self.parser.add_argument(
            "--voc_stat",
            type=str,
            default=None,
            help="mean and standard deviation used to normalize spectrogram when training voc."
        )
        # other
        self.parser.add_argument(
            '--lang',
            type=str,
            default='zh',
            help='Choose model language. zh or en')
        self.parser.add_argument(
            '--device',
            type=str,
            default=paddle.get_device(),
            help='Choose device to execute model inference.')

        self.parser.add_argument(
            '--output', type=str, default='output.wav', help='output file name')
        self.parser.add_argument(
            '-d',
            '--job_dump_result',
            action='store_true',
            help='Save job result into file.')
        self.parser.add_argument(
            '-v',
            '--verbose',
            action='store_true',
            help='Increase logger verbosity of current task.')

The pre-trained models provided by PaddleSpeech that can be used from the command line and the Python API are listed below:

  • Acoustic models

    Model | Language
    speedyspeech_csmsc | zh
    fastspeech2_csmsc | zh
    fastspeech2_aishell3 | zh
    fastspeech2_ljspeech | en
    fastspeech2_vctk | en

  • Vocoders

    Model | Language
    pwgan_csmsc | zh
    pwgan_aishell3 | zh
    pwgan_ljspeech | en
    pwgan_vctk | en
    mb_melgan_csmsc | zh
However, from the source code of TTSExecutor() we can also see that it supports configuring the speech synthesis model with a custom dataset and a fine-tuned checkpoint, which opens up more possibilities for optimizing model performance later on.
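
For example, based on the TTSExecutor arguments quoted above, we could switch to a different pre-trained acoustic model / vocoder pair, or point the config arguments at a fine-tuned checkpoint. The snippet below is a sketch: the file paths are hypothetical placeholders, and exact argument defaults may vary by PaddleSpeech version.

# Sketch: choosing other pre-trained models or a fine-tuned checkpoint via TTSExecutor.
import paddle
from paddlespeech.cli import TTSExecutor

tts_executor = TTSExecutor()

# Another pre-trained Chinese acoustic model / vocoder combination.
tts_executor(
    text='欢迎使用人工智能点读机。',
    output='demo.wav',
    am='speedyspeech_csmsc',
    voc='mb_melgan_csmsc',
    lang='zh',
    device=paddle.get_device())

# A model fine-tuned on a custom dataset (all paths below are placeholders).
tts_executor(
    text='欢迎使用人工智能点读机。',
    output='demo_finetune.wav',
    am='fastspeech2_csmsc',
    am_config='exp/finetune/default.yaml',
    am_ckpt='exp/finetune/snapshot_iter_100000.pdz',
    am_stat='exp/finetune/speech_stats.npy',
    phones_dict='exp/finetune/phone_id_map.txt',
    voc='pwgan_csmsc',
    lang='zh',
    device=paddle.get_device())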

Origin: blog.csdn.net/m0_68036862/article/details/131359196