Technical Deep Dive | The Latest Progress of iFlytek's Voice Technology, Part 2: Speech Recognition and Speech Synthesis

This article focuses on speech recognition and speech synthesis, continuing the series of technical analyses.

"It is advisable to take a broad view of the scenery." Facing a future where human-computer interaction is more natural and smooth, how is the progress of intelligent voice technology? Where to go?

The following content is compiled from the keynote speech "Frontier Progress of iFlytek Voice Technology" delivered by Pan Jia, a distinguished scientist at iFlytek Research Institute, at NCMMSC 2022.


Technical depth: ⭐⭐⭐⭐⭐

Table of Contents

Speech Recognition

1. Mainstream frameworks rely on autoregressive end-to-end modeling

2. A non-autoregressive ASR framework built on a unified representation space for text and speech

3. A multi-task learning framework with semantic evaluation objectives

Speech Synthesis

1. SMART-TTS

2. Virtual timbre generation

Speech Recognition

  • Mainstream frameworks rely on autoregressive end-to-end modeling

Currently, autoregressive end-to-end modeling has become the mainstream framework for speech recognition, chiefly in two forms: the attention-based encoder-decoder (AED) and the Transducer, which introduces a prediction network. The autoregressive approach is essentially a language-model mechanism built into the recognition model: predicting the current token requires waiting for the previously recognized tokens.
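
To make the dependency concrete, here is a minimal sketch (not iFlytek's implementation) of greedy autoregressive decoding. The `toy_decoder` below is a stand-in for an attention-based decoder or a Transducer prediction network; the scripted output and vocabulary are invented for illustration. The point is simply that step t cannot run until all earlier outputs exist.

```python
from typing import List

VOCAB = ["<sos>", "<eos>", "ni", "hao", "shi", "jie"]

def toy_decoder(encoder_states: List[float], history: List[int]) -> int:
    """Stand-in for p(y_t | y_<t, x): returns the next token id."""
    # A real model would attend over encoder_states conditioned on `history`.
    scripted = [2, 3, 4, 5, 1]                        # "ni hao shi jie <eos>"
    return scripted[min(len(history) - 1, len(scripted) - 1)]

def autoregressive_decode(encoder_states: List[float], max_len: int = 10) -> List[str]:
    history = [0]                                     # start with <sos>
    for _ in range(max_len):
        next_id = toy_decoder(encoder_states, history)   # must wait on y_<t
        history.append(next_id)
        if VOCAB[next_id] == "<eos>":
            break
    return [VOCAB[i] for i in history[1:-1]]

print(autoregressive_decode([0.1, 0.2, 0.3]))         # ['ni', 'hao', 'shi', 'jie']
```

Because each step feeds on the previous ones, the loop cannot be parallelized across output positions, which is exactly the deployment bottleneck discussed next.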

However, in large-scale deployment the autoregressive property limits parallelism and inference efficiency. When we considered whether a high-accuracy non-autoregressive framework could be built, we naturally turned to CTC (Connectionist Temporal Classification). As a non-autoregressive framework, CTC produces outputs that appear as isolated spikes among mostly blank frames.
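
A small sketch of greedy CTC decoding shows both properties at once: the per-frame outputs are blank-dominated and spiky, and collapsing them needs no dependence on previously emitted tokens, which is what makes CTC non-autoregressive. The example frames are made up.

```python
BLANK = "-"

def ctc_collapse(frame_labels):
    """Merge repeated frame labels, then drop blanks (greedy CTC decoding)."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Typical CTC behaviour: most frames are blank, characters fire as isolated spikes.
frames = ["-", "-", "科", "-", "-", "大", "大", "-", "讯", "-", "飞", "-", "-"]
print(ctc_collapse(frames))                           # ['科', '大', '讯', '飞']
```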

  • A non-autoregressive ASR framework built on a unified representation space for text and speech

If we build a character-level CTC model for Chinese, its hidden-layer representations capture character-level context. Apart from the difference in sequence length (frames versus characters), this is very close to the masked-recovery or error-correction tasks in natural language processing.

To resolve the length mismatch between speech and text, iFlytek Research Institute designed an effective solution: insert blanks into the text to expand it to frame level. After joint training on massive amounts of plain text together with character-level CTC speech data, the contextual language-model information carried by the text data is absorbed into the model. The results prove to be no worse than autoregressive AED and Transducer models, and in some cases better.
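
The exact expansion policy is not public; the hedged sketch below only illustrates the shape of the idea: a text-only sample is stretched to a frame-level sequence by scattering its characters over the frames and filling the gaps with blanks, mimicking the spiky alignment a character-level CTC model produces, so that pure-text data can share the CTC output space with speech data.

```python
import random

BLANK = "<b>"

def expand_text_to_frames(chars, n_frames, seed=0):
    """Spread characters over n_frames in order, filling the gaps with blanks."""
    rng = random.Random(seed)
    assert n_frames >= len(chars)
    frames = [BLANK] * n_frames
    # One (ordered) frame position per character; every other frame stays blank.
    positions = sorted(rng.sample(range(n_frames), len(chars)))
    for pos, ch in zip(positions, chars):
        frames[pos] = ch
    return frames

print(expand_text_to_frames(list("科大讯飞"), n_frames=12))
```

Collapsing such an expanded sequence with the CTC rule from the previous sketch recovers the original text, which is what lets text-only and speech samples be trained against the same frame-level target space.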

  • A multi-task learning framework with semantic evaluation objectives

Building on this, iFlytek Research Institute further proposed a multi-task learning framework with semantic evaluation objectives to improve the intelligibility of speech recognition. In the example shown in the talk, even though the character accuracy reaches 93%, errors on a few key words are enough to break understanding.

We therefore add extra layers after the character-level CTC output: sentence-level representations are pooled from it and used for objectives such as intent classification and grammaticality evaluation. The goal is that, beyond a high recognition rate, the whole sentence should also carry the correct intent, improving the intelligibility of the recognition system.
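
A hedged sketch of the multi-task idea follows: a shared acoustic encoder feeds a character-level CTC head plus sentence-level heads (intent classification and a grammaticality score), and the losses are summed. The layer sizes, pooling, and loss weighting are illustrative, not iFlytek's actual configuration.

```python
import torch
import torch.nn as nn

class MultiTaskASR(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=5000, n_intents=20):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.ctc_head = nn.Linear(hidden, vocab)           # frame-level characters
        self.intent_head = nn.Linear(hidden, n_intents)    # sentence-level intent
        self.grammar_head = nn.Linear(hidden, 1)           # sentence-level score

    def forward(self, feats):
        enc, _ = self.encoder(feats)                       # (B, T, H)
        ctc_logits = self.ctc_head(enc).log_softmax(-1)    # for CTC loss
        sent = enc.mean(dim=1)                             # crude sentence pooling
        return ctc_logits, self.intent_head(sent), self.grammar_head(sent).squeeze(-1)

model = MultiTaskASR()
feats = torch.randn(2, 120, 80)                            # (batch, frames, features)
ctc_logits, intent_logits, grammar_score = model(feats)

targets = torch.randint(1, 5000, (2, 12))                  # dummy character targets
loss = (nn.CTCLoss(blank=0)(ctc_logits.transpose(0, 1), targets,
                            torch.full((2,), 120, dtype=torch.long),
                            torch.full((2,), 12, dtype=torch.long))
        + nn.CrossEntropyLoss()(intent_logits, torch.tensor([3, 7]))
        + nn.MSELoss()(grammar_score, torch.tensor([0.9, 0.4])))
loss.backward()
```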

Speech Synthesis

  • SMART-TTS (Self-supervised Model Assisted pRosody learning for naTural Text To Speech)

In recent years, a great deal of work has been done on general speech synthesis frameworks, such as VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech), fully end-to-end modeling, and prosodic representation learning.

iFlytek Research Institute proposed the SMART-TTS framework. Its core idea is to modularize the speech synthesis learning process and strengthen each module through pre-training, rather than learning the mapping from text to acoustic features directly from scratch.

The first step is text-encoding pre-training. By pre-training on text and speech jointly, the hope is that the text representation will carry information related to the prosody of pronunciation, which makes the subsequent prosody modeling and extraction of prosody representations considerably easier.

Beyond traditional hand-crafted statistical features such as fundamental frequency, energy, and duration, we use contrastive learning to extract prosodic features, giving the representation a stronger ability to describe speech prosody.
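
The talk does not give details of the contrastive setup, so the following is only a hedged sketch of one common recipe: embeddings of two segments from the same utterance are pulled together while segments from other utterances in the batch act as negatives (an InfoNCE-style loss).

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: (B, D) prosody embeddings from paired segments."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature          # (B, B) similarity matrix
    labels = torch.arange(a.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

anchor = torch.randn(8, 128, requires_grad=True)   # e.g. prosody encoder outputs
positive = torch.randn(8, 128)
loss = info_nce(anchor, positive)
loss.backward()
print(float(loss))
```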

With the prosodic features in hand, the final acoustic features are reconstructed: the acoustic features are encoded with a VAE-style latent code, and a vocoder then recovers the waveform from that code.
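
A minimal VAE-style sketch of that acoustic stage, under the assumption that the combined text and prosody representation predicts a latent code via reparameterisation and a decoder maps the code back to mel-spectrogram frames (a separate vocoder, not shown, would turn the mel frames into a waveform). All dimensions are illustrative.

```python
import torch
import torch.nn as nn

class AcousticVAEHead(nn.Module):
    def __init__(self, in_dim=384, latent=64, mel_dim=80):
        super().__init__()
        self.to_stats = nn.Linear(in_dim, latent * 2)   # predicts mean and log-variance
        self.decoder = nn.Linear(latent, mel_dim)       # latent code -> mel frame

    def forward(self, text_plus_prosody):               # (B, T, in_dim)
        mu, logvar = self.to_stats(text_plus_prosody).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterisation
        mel = self.decoder(z)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).mean()
        return mel, kl

head = AcousticVAEHead()
mel, kl = head(torch.randn(2, 200, 384))
print(mel.shape, float(kl))                             # torch.Size([2, 200, 80]) ...
```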

SMART-TTS has already been launched on the iFlytek Open Platform, and its synthesis quality can be experienced directly in the Xuexi Qiangguo and iFlytek Audio apps.

For more on applying iFlytek's online speech synthesis technology, see:

Online Speech Synthesis_Free Trial-iFlytek Open Platform

  • Virtual timbre generation

In addition to SMART-TTS, iFlytek Research Institute has carried out another line of work in speech synthesis: virtual timbre generation.

The Metaverse is a very hot topic right now, and NPCs (non-player characters) are everywhere in metaverse spaces. If an NPC's voice does not match its character, the experience obviously suffers, yet finding a suitable voice actor for every one of a huge number of NPCs is extremely time-consuming and laborious.

The same problem appears in audio novels: if multiple characters are read in the same voice, it quickly becomes dull. How can the voice "play the role" according to each character's personality?

Virtual timbre generation trains the synthesis model on the voices of a large number of speakers. First, speaker representations are extracted by a timbre encoding module. These representations are designed for speaker recognition: they are discriminative coordinates in the timbre space, but unlike the latents of a generative model they lack properties such as smooth interpolation. We therefore project the timbre representation into a new hidden representation space through a flow model, and combine this new representation with the text and prosody representations for synthesis.
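
The details of the flow are not given in the talk; as a hedged sketch, a single RealNVP-style affine coupling layer below maps a discriminative speaker embedding into a latent space Z that is friendlier to generative operations such as interpolation. A production system would stack many such layers and train them jointly with the synthesiser.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 256), nn.ReLU(),
                                 nn.Linear(256, self.half * 2))

    def forward(self, x):                          # speaker embedding -> z
        a, b = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, b * log_s.exp() + t], dim=-1)

    def inverse(self, z):                          # z -> speaker-embedding space
        a, zb = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(a).chunk(2, dim=-1)
        return torch.cat([a, (zb - t) * (-log_s).exp()], dim=-1)

flow = AffineCoupling()
spk_a, spk_b = torch.randn(1, 256), torch.randn(1, 256)
z = 0.5 * flow(spk_a) + 0.5 * flow(spk_b)          # interpolate in Z, not raw space
blended_timbre = flow.inverse(z)                   # representation fed to synthesis
```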

Because a large amount of speaker data is available during training, and some speakers are annotated with timbre attribute labels such as age, gender, and character (sweet, rich, and so on), the resulting timbre-space representation Z, guided by these labels, is highly controllable and also interpolates well.

With these models in place, usage becomes simple. We can input a description of the voice we want, such as "a young, sweet female voice", train a mapping from that description to Z through a semantic encoding module, and then sample a timbre that matches the control labels.
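
As a hedged sketch of that usage pattern, a toy semantic encoder below maps a voice description to a distribution over the timbre latent space Z, and sampling from it yields new timbres matching the control labels. The label set and mapping are invented for illustration; the real module would be trained against the labelled speakers.

```python
import torch
import torch.nn as nn

LABELS = ["young", "middle-aged", "female", "male", "sweet", "rich"]

class DescriptionToTimbre(nn.Module):
    def __init__(self, latent=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(len(LABELS), 128)   # mean-pools the label tokens
        self.to_stats = nn.Linear(128, latent * 2)       # mean and log-variance over Z

    def forward(self, description):
        ids = torch.tensor([[LABELS.index(w) for w in description]])
        mu, logvar = self.to_stats(self.embed(ids)).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # sample a timbre z

sampler = DescriptionToTimbre()
z = sampler(["young", "sweet", "female"])        # one new virtual timbre in Z
print(z.shape)                                   # torch.Size([1, 256])
```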

So far, this model has been used to generate more than 500 virtual timbres, and the naturalness of the synthesized speech exceeds a MOS of 4.0.


Source: blog.csdn.net/AI_Platform/article/details/129753551