Introduction to Tencent AI Lab Voice Technology Center Application and Research

The tenth "Going into Tencent" seminar of "CCF Voice Dialogue and Hearing Professional Group Entering Enterprise Series Activities" was successfully concluded last Saturday. This seminar was hosted by Associate Professor Qian Yanmin of Shanghai Jiaotong University and invited to the fourth An expert introduced the latest achievements in Tencent's voice and dialogue field, which are:

Dr. Su Dan, Deputy Director of the Tencent AI Lab Voice Technology Center; Dr. Heng Lu, Senior Algorithm Expert at Tencent AI Lab; Dr. Huang Shen, Tencent Speech Algorithm Expert; and Shang Shidong, Senior Director of the Tencent Multimedia Lab.

Among them, Dr. Su Dan, Deputy Director of the Tencent AI Lab Voice Technology Center, gave a report titled "Introduction to the Application and Research of the Tencent AI Lab Voice Technology Center". The talk covered the main applications delivered by the center and its research progress in several directions, including the array front-end, speech recognition, speech separation and multi-modal interaction technology. It also previewed PiKa, a speech technology tool platform to be opened to the industry in the second half of the year, and a large-scale multi-modal dataset.

Tencent AI Lab is Tencent's enterprise-level AI laboratory. Established in Shenzhen in April 2016, it currently has more than 100 top research scientists and more than 300 application engineers in China and the United States. Drawing on Tencent's long accumulation of rich application scenarios, big data, computing power and first-class talent, AI Lab looks to the future, cooperates openly, and is committed to continuously improving AI's cognition, decision-making and creativity, moving toward the vision of "Make AI Everywhere".

Tencent AI Lab emphasizes both research and application. Its basic research focuses on four major directions: machine learning, computer vision, speech recognition and natural language processing, while its technology applications focus on four areas: social, gaming, content and medical AI. In the speech direction, we are actively exploring cutting-edge technologies; in recent years we have published many papers at speech and audio conferences covering a range of technical topics, some of them in collaboration with university researchers.

In recent years the forms of voice interaction have continued to expand, and the overall line of development is fairly clear: from near-field voice interaction to far-field voice interaction, and further on to multi-modal human-computer interaction. The emergence of a new form does not mean that the problems of the earlier forms have been solved; the original application scenarios are still expanding and improvements are still being made on harder, more complex problems. We have continued to push research and applications along this main line, and around 2017-2018 we built and deployed a technology chain that fully covers near-field and far-field voice interaction.

In terms of smart hardware, between 2017 and 2018 we built a self-developed front-end system covering a variety of array types (including 2-mic, 4-mic, 6-mic, circular, linear and other array forms). Its full-stack capabilities, from microphone hardware design to far-field signal enhancement, voice wake-up, recognition and synthesis, support a variety of Tencent's self-developed smart speakers, TVs and in-car products. Among smart speakers, these include the early Tencent Tingting, as well as the screen-equipped Tencent Dingdong speaker and Tencent Dingdong smart screen, which obtained L7 certification from the China Academy of Information and Communications Technology, placing them among the products with the highest intelligence-evaluation level. The Honor of Kings robot speaker is a distinctive product: it not only takes the appearance of a game character but also communicates with the game backend, so it can intelligently guide and accompany players as they play.

In terms of smart TVs, we cooperated with Tencent Video to support the Penguin Aurora smart box internally, and externally, in 2020-2021, supported a variety of high-end Sony and Philips models equipped with AI Lab's far-field voice interaction capabilities.

In terms of smart vehicles, here we list various distributions of in-car microphone arrays together with their advantages and disadvantages. After a year of polishing, we delivered an in-vehicle voice front-end solution called VoiceInCar, which contains innovative algorithms for in-car echo cancellation and microphone-array beamforming. Through different configurations it can meet the in-vehicle voice requirements of different car makers with different numbers of microphones, different microphone layouts, different numbers of sound zones and different hardware computing capabilities, and it provides an overall solution from project design guidance and hardware testing to the in-vehicle voice algorithms themselves. The Internet of Vehicles business unit has cooperated with a large number of pre-installed head units, and many models from Liuzhou Motor, Great Wall Motor, Changan Automobile and other automakers have successively gone into production.

After completing the polishing and deployment of far-field voice interaction, we expanded into research and development of multi-modal human-computer interaction. Multi-modal interaction is the future trend: fusing information from multiple modalities can make interaction more efficient and more natural. We have implemented and refined new modules around multi-modal human-computer interaction, and after multiple iterations these modules have reached the stage of project deployment.

Multi-modal interaction can be divided into two aspects: input understanding and feedback presentation. The feedback presentation part, i.e. multi-modal generation and synthesis, is currently a hot spot in both research and industrial applications. In 2018 we made AI digital humans, which focus on multi-modal generation and synthesis, one of our main directions. AI digital humans rely on the AI Lab vision center, voice center and NLP center to form a complete technology chain, making good use of many of our basic research capabilities and presenting the results to users in a multi-modal way. We want a digital human to have the following elements: anthropomorphic or cartoon images for different scenarios, industrial-grade high-fidelity modeling and rendering, a flexible and lightweight capture and generation pipeline, a richer interactive environment (including virtual scene generation, augmented reality, virtual reality and holographic technology), and more natural speech synthesis, singing-voice synthesis, text semantic analysis and natural language generation for different scenarios.

In 2019 we made a number of research advances in multi-modal interaction, including the highly natural DurIAN speech synthesis technology as well as leading lip-sync synthesis technology that can automatically drive mouth shape and movement from text. On this basis we created different types of digital humans, including photorealistically rendered virtual humans that support multiple emotions and languages, self-driven 2D virtual anchors that support text-driven motion and dance, and highly natural digital humans created with neural rendering.

In 2020 we have continued to accelerate the application of digital-human technology across industries, including exploring AI for large-scale game content and IP ecosystem construction. Voice/text-driven lip-sync technology has been applied in multiple game projects, including "Mirror" and character lip-sync for the Tianmei (TiMi) Wedo project, improving the efficiency of art production.

We also launched pilot AI anchors with Douyu and Penguin Esports to provide users with 24-hour commentary, song requests and other interactive functions. In addition, we have built a complete pipeline from lyric creation to real-time singing-voice synthesis/conversion, and singing synthesis has begun to land in the entertainment industry, for example the Honor of Kings IP theme song and Wang Junkai's interactive H5, where AI singing is generated in real time. It will be further applied in scenarios such as vocal correction in WeSing (Quanmin K Ge) and song creation by a wide range of users.

We have also launched the 24-hour AI anchor Ai Ling on Bilibili; you can visit her channel to interact with her. Since launch, Ai Ling's track list has kept growing, currently at a rate of 15 to 18 new songs per week, with more than 140 popular songs already supported. Our singing-voice synthesis pipeline is highly automated and requires very little post-editing. In the two months since going online we have gained more than 20,000 followers. We will continue to polish the AI anchor Ai Ling and use her to explore and validate AI interactive capabilities, gameplay and user experience.

In terms of research work, today we mainly share our recent progress in several basic directions, including the array front-end, speech recognition, speech separation and multi-modal interaction.

1. Array Front-End

1.1 Voice wake up

For the array front-end, we first introduce voice wake-up. Wake-up performance is the most intuitive measure by which users experience the effect of the front-end system, so we keep polishing it. The main problem of voice wake-up is the tension between low power consumption and high accuracy, and the main challenge is speech quality in complex acoustic environments with noise and interfering speakers. This is the evolution of our wake-up technology: after building fixed wake-up and customizable wake-up systems, we carried out in-depth optimization of the combination of the front-end array and the wake-up model.

A challenge for voice wake-up is that, under noisy conditions with a lot of interfering speech, the direction of the target voice cannot be determined accurately, so it can be better to let the system focus on several directions at the same time. We proposed using multiple fixed beams plus one processed microphone channel; the disadvantage is that wake-up must be detected on every beam, so the amount of computation increases.

To address this problem, we introduced a self-attention mechanism to automatically fuse the multiple outputs of the fixed beams. This method not only improves wake-up performance but also reduces the amount of computation by about 70% under the fixed-beam configuration shown in Figure 4. Going further, we proposed joint optimization of multi-channel, multi-zone neural network enhancement and the wake-up model. The first step is to replace traditional fixed-beam enhancement with a neural network: specifically, a multi-channel neural enhancement model that takes directional features for several specified directions, i.e. spatial-orientation features, and simulates wake-up speech and background interference during training, so that the model can enhance the sound source closest to each specified direction. The multi-zone enhancement model and the wake-up model are then jointly optimized. From the experimental results at the bottom right, it can be seen that the overall performance of this method is good, and the wake-up performance is greatly improved especially at low signal-to-interference ratios.
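
To make the fusion idea concrete, here is a minimal PyTorch sketch of attention-based pooling over features from several fixed beams, so that a single wake-word classifier replaces per-beam detection. Module names, feature sizes and the shared GRU encoder are illustrative assumptions, not the actual Tencent model.

```python
import torch
import torch.nn as nn

class BeamAttentionFusion(nn.Module):
    """Fuse features from several fixed beams with attention so that a single
    keyword-spotting head replaces per-beam detection (illustrative sketch)."""
    def __init__(self, feat_dim=40, num_beams=4, num_keywords=2):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)   # shared per-beam encoder
        self.score = nn.Linear(128, 1)                            # attention score per beam
        self.classifier = nn.Linear(128, num_keywords + 1)        # keyword / filler posteriors

    def forward(self, x):
        # x: (batch, num_beams, time, feat_dim) - e.g. log-mel features of each beam
        b, m, t, f = x.shape
        h, _ = self.encoder(x.reshape(b * m, t, f))               # encode every beam
        h = h[:, -1].reshape(b, m, -1)                            # last-frame state per beam
        w = torch.softmax(self.score(h), dim=1)                   # attention weights over beams
        fused = (w * h).sum(dim=1)                                # weighted beam summary
        return self.classifier(fused)                             # wake-word posteriors

logits = BeamAttentionFusion()(torch.randn(8, 4, 100, 40))
```

Because only one classifier runs on the fused representation, the per-beam detection cost disappears, which is consistent with the reported reduction in computation.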

1.2 ADL-MVDR (All-Deep-Learning MVDR)

Our latest work is a fully neural MVDR. The traditional MVDR formula is shown below; it requires good estimates of the speech and noise components, so neural networks have been partially incorporated to estimate time-frequency masks and improve performance. However, it still needs to invert the noise covariance matrix, and computing the steering vector requires PCA; both operations can be numerically unstable. Moreover, if the covariance matrices are estimated recursively frame by frame, the weighting coefficient is chosen in an empirical, heuristic way.
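
For reference, a commonly used mask-based form of the MVDR solution (the slide's exact formulation is not reproduced here) is

$$
\mathbf{w}(f) \;=\; \frac{\boldsymbol{\Phi}_{NN}^{-1}(f)\,\mathbf{v}(f)}{\mathbf{v}^{H}(f)\,\boldsymbol{\Phi}_{NN}^{-1}(f)\,\mathbf{v}(f)},
\qquad
\hat{S}(t,f) \;=\; \mathbf{w}^{H}(f)\,\mathbf{Y}(t,f),
$$

where $\boldsymbol{\Phi}_{NN}$ is the noise covariance matrix estimated from the time-frequency masks, $\mathbf{v}$ is the steering vector (typically the principal eigenvector of the speech covariance matrix, i.e. obtained via PCA), and $\mathbf{Y}$ is the multi-channel mixture. This makes explicit the matrix inversion and PCA steps that ADL-MVDR replaces with recurrent networks.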

We proposed an all-deep-learning MVDR method that uses RNNs to replace the covariance-matrix inversion and the PCA operation respectively. In addition, we use complex ratio filtering instead of a time-frequency mask, which makes training more stable and the estimates more accurate.
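
The following is a minimal sketch, under assumed shapes and sizes, of the idea of letting recurrent networks map frame-wise covariance statistics to the quantities that the classical formula obtains by inversion and PCA; it is not the published ADL-MVDR architecture.

```python
import torch
import torch.nn as nn

class CovGRU(nn.Module):
    """Map a sequence of frame-wise covariance matrices (real/imag stacked) to a
    frame-wise matrix or vector estimate, in the spirit of ADL-MVDR (illustrative)."""
    def __init__(self, mics=8, out_dim=None):
        super().__init__()
        in_dim = 2 * mics * mics                  # flattened complex covariance
        out_dim = out_dim or in_dim               # e.g. inverse covariance or steering vector
        self.rnn = nn.GRU(in_dim, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, out_dim)

    def forward(self, cov):
        # cov: (batch*freq, time, 2*mics*mics)
        h, _ = self.rnn(cov)
        return self.proj(h)

mics = 8
phi_nn = torch.randn(4 * 257, 100, 2 * mics * mics)   # noise covariances per (batch, freq)
phi_ss = torch.randn(4 * 257, 100, 2 * mics * mics)   # speech covariances
inv_nn = CovGRU(mics)(phi_nn)                          # stands in for explicit matrix inversion
steer = CovGRU(mics, out_dim=2 * mics)(phi_ss)         # stands in for the PCA steering vector
# Frame-wise MVDR weights would then be formed from `inv_nn` and `steer` with
# complex matrix-vector products, exactly as in the classical formula above.
```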

We conducted experiments on a more complex multi-channel, multi-modal, multi-speaker overlapped-speech task. The table shows PESQ, Si-SNR, SDR and WER results in various scenarios, such as different spatial angles and different numbers of overlapping speakers. The conclusions are that a generic purely neural method introduces severe nonlinear distortion, which hurts ASR, while the conventional MVDR method leaves large residual noise; our proposed ADL-MVDR method is clearly better than the ordinary mask-based MVDR on all objective metrics and on WER, and is also significantly better than the generic purely neural method.

2. Speech Recognition

In speech recognition, our main improvements can be summarized along two lines: training criteria and model structure. Last year we made several improvements on top of the RNN Transducer model, mainly introducing discriminative training and external language models on a validated RNNT system; at the time there were no published reports of discriminative-training results on the RNN Transducer model.

2.1 RNNT model improvement

The main problems here are familiar to everyone. First, there is a mismatch between the RNNT training criterion and the final WER/CER metric; also, the decoder uses teacher forcing during RNNT training (its input is the ground-truth label sequence), whereas during inference the decoder relies on the symbols it has already emitted. Second, the text data seen during end-to-end RNNT training is limited, so the ability to recognize long-tail words is weak.

For the first problem, we use minimum Bayes risk (MBR) training to minimize the expected Levenshtein distance between the reference sequence and an N-best list generated on the fly, while keeping the original RNNT criterion for multi-task training.
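
In generic notation (not copied from the talk), the multi-task objective can be written as

$$
\mathcal{L}_{\mathrm{MBR}} \;=\; \sum_{y \in \mathcal{N}(x)} \hat{P}(y \mid x)\, R(y, y^{*}),
\qquad
\hat{P}(y \mid x) \;=\; \frac{P(y \mid x)}{\sum_{y' \in \mathcal{N}(x)} P(y' \mid x)},
\qquad
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{MBR}} + \lambda\, \mathcal{L}_{\mathrm{RNNT}},
$$

where $\mathcal{N}(x)$ is the N-best list generated on the fly, $R(y, y^{*})$ is the Levenshtein distance between hypothesis $y$ and the reference $y^{*}$, and $\lambda$ weights the original RNNT loss that is retained for multi-task training.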

For the second problem, we introduce an external language model in two ways: an external neural or n-gram language model is used for on-the-fly rescoring during RNNT beam-search decoding, and the external language model is also applied to the N-best lists generated online during minimum Bayes risk (MBR) training, so that its information is injected into model training. We ran experiments on a strong baseline system whose model structure is a hybrid of TDNN and Transformer, and obtained clear gains on both test sets. We will continue with follow-up work, including low-latency streaming RNNT end-to-end recognition combined with second-pass LAS rescoring.
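
As a minimal illustration of the on-the-fly rescoring idea (generic shallow fusion with an external LM; the weights, function names and toy LM below are assumptions, not PiKa or production code):

```python
import math

def rescore(hyp_tokens, rnnt_logprob, ext_lm, lm_weight=0.3, len_bonus=0.5):
    """Combine the RNNT score with an external language model score during
    beam search (generic shallow-fusion sketch)."""
    lm_logprob = ext_lm.score(hyp_tokens)              # external NN or n-gram LM
    return rnnt_logprob + lm_weight * lm_logprob + len_bonus * len(hyp_tokens)

class UniformLM:
    """Stand-in external LM with a uniform distribution over a toy vocabulary."""
    def __init__(self, vocab_size=4000):
        self.logp = -math.log(vocab_size)
    def score(self, tokens):
        return self.logp * len(tokens)

beam = [(["ni", "hao"], -3.2), (["ni", "hao", "ma"], -4.1)]     # (tokens, RNNT log-prob)
ranked = sorted(beam, key=lambda h: rescore(h[0], h[1], UniformLM()), reverse=True)
```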

2.2 DFSMN-SAN-MEM model structure

Our recent work on model structure starts from the observation that there are two typical connection patterns for sequence classification tasks: one is RNNs, such as LSTM or GRU; the other is convolution-like models, such as FSMN and TDNN. The self-attention structure can also be regarded as the latter type, except that its connections use a more complex attention mechanism and it must process the whole utterance. The advantages of convolution-like models are good parallelism, flexible adjustment of the context range, and the ability to stack models deeper. In industrial applications the training data can reach tens of thousands or even hundreds of thousands of hours, at which point parallelism and training speed become the more important considerations.

We introduced self-attention (SAN) into the FSMN network, ran extensive experiments, and explored an optimal network structure. The model contains 3 DFSMN-SAN blocks, each consisting of 10 DFSMN layers and 1 self-attention layer. The main conclusions are that the proposed DFSMN-SAN structure is significantly better than a pure DFSMN model and also better than a pure SAN model; a pure SAN network is sensitive to hyperparameters and computationally expensive, while interleaving SAN layers into the higher FSMN layers achieves better results, indicating that low-level features can be extracted with a simple model structure and that SAN is best interleaved at the middle and higher layers. In the SAN experiments we also found that, because SAN is good at integrating context, the longer the context used, the better the performance. A natural idea is therefore how to exploit context beyond the current utterance. We proposed adding a memory structure to the SAN layer, with two specific ways of letting the model use more global information, and in experiments this gave a further significant improvement over the model without memory.
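
A minimal PyTorch sketch of interleaving self-attention with stacks of memory-style feed-forward layers, in the spirit of the DFSMN-SAN block described above; the FSMN layer here is simplified to a depthwise temporal convolution with a residual connection, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SimpleFSMNLayer(nn.Module):
    """Simplified FSMN-style layer: a depthwise temporal convolution acts as the
    memory block, followed by a feed-forward layer and a residual connection."""
    def __init__(self, dim=512, context=11):
        super().__init__()
        self.mem = nn.Conv1d(dim, dim, kernel_size=context,
                             padding=context // 2, groups=dim)
        self.ff = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):                       # x: (batch, time, dim)
        m = self.mem(x.transpose(1, 2)).transpose(1, 2)
        return x + self.ff(m)

class DFSMNSANBlock(nn.Module):
    """Ten FSMN-style layers followed by one self-attention layer."""
    def __init__(self, dim=512, fsmn_layers=10, heads=8):
        super().__init__()
        self.fsmn = nn.Sequential(*[SimpleFSMNLayer(dim) for _ in range(fsmn_layers)])
        self.san = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        x = self.fsmn(x)
        attn, _ = self.san(x, x, x)
        return x + attn

model = nn.Sequential(*[DFSMNSANBlock() for _ in range(3)])   # 3 blocks as in the talk
out = model(torch.randn(2, 200, 512))
```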

2.3 Application of NAS in large-scale speech recognition system

On model structure we also highlight a recent piece of work: the application of NAS (Neural Architecture Search) to large-scale speech recognition systems, where "large-scale" means at least tens of thousands of hours of training data, the magnitude used in industrial products. NAS has been successful in the vision field and its technology keeps evolving rapidly: from the early reinforcement-learning frameworks requiring thousands of GPU-days to only a few GPU-days today, the efficiency has improved enormously. Since speech tasks are more complex than image recognition in input dimensionality, number of output classes and sample scale, we first examined the feasibility of NAS for speech recognition, including whether the search time and GPU memory are affordable. Our approach is to search for a model architecture on a small dataset and then scale and transfer it to training on large-scale data.

Among specific NAS methods, DARTS proposes a differentiable architecture-search framework: during training, the network parameters (ordinary neural-network weights) and the structure coefficients are learned at the same time, and the structure coefficients determine the importance of the candidate operations and hence the final searched network. PDARTS builds on DARTS with a progressive approach: the search is divided into multiple stages, the number of network layers is gradually increased, and candidate operations with low importance are pruned at the end of each stage, to ease the performance loss caused by the depth mismatch between the search network and the evaluation network.
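
The core of the DARTS idea, a softmax-weighted mixture of candidate operations with learnable structure coefficients, can be sketched as follows; the candidate set below is a toy example, not the search space used in our experiments.

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """DARTS-style mixed operation: the output is a softmax-weighted sum of all
    candidate ops, and the structure coefficients `alpha` are trained jointly
    with the network weights (toy candidate set for illustration)."""
    def __init__(self, channels=64):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Identity(),                                         # skip connection
            nn.Conv2d(channels, channels, 3, padding=1),           # 3x3 convolution
            nn.Conv2d(channels, channels, 5, padding=2),           # 5x5 convolution
            nn.AvgPool2d(3, stride=1, padding=1),                  # average pooling
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))      # structure coefficients

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

y = MixedOp()(torch.randn(1, 64, 40, 100))   # e.g. (batch, channels, freq, time)
```

After the search converges, the op with the largest coefficient on each edge is kept to form the final cell, which is what PDARTS prunes progressively, stage by stage.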

For the speech recognition task, we first ran a large number of candidate-set search experiments on a 150-hour dataset, AISHELL-1. One of our main goals was a better balance between recognition accuracy and model complexity, so we improved the search candidate space. The figure above shows the Normal Cell and Reduction Cell structures obtained by the final search. We then transferred this structure to a larger dataset for model-training experiments; the results on AISHELL-2, a 1,000-hour dataset, show a clear improvement in recognition accuracy, and the complexity of the model obtained with the improved candidate-set search is also nearly halved compared with the original search space.

We further applied the NAS-searched model to training on an industrial-scale dataset of tens of thousands of hours, containing multiple domains, diverse noise, far-field simulation and different speaking styles. Based on the searched cell structure we deepened the network and adjusted the number of initial channels; the overall model has 32 cells, uses mixed FP32/FP16 precision, and is trained in parallel on 24 V100 GPUs at about 8.5 hours per epoch, so the whole training finishes in roughly a week, an entirely acceptable cycle. Compared with our current best hand-designed DFSMN-SAN-MEM model, it achieved relative improvements of 12% to 18% on multiple test sets. This is a very encouraging result: it suggests that NAS still has plenty of headroom in recognition systems and that we may soon be able to move away from hand-crafted model refinement. We are also introducing a latency constraint into the NAS framework to search for network architectures suitable for streaming recognition tasks.

Here is an introduction to our speech tool platform. Although the Kaldi toolkit is a comprehensive and widely used set of speech tools, its neural-network part is less flexible and efficient than mainstream deep-learning frameworks. We therefore combined the PyTorch and Kaldi frameworks to build our own training tool platform, which brings together Kaldi's complete speech functionality and the flexibility and efficiency of a mainstream deep-learning framework. The platform is also promoted through the company's open-source collaboration and is continuously improved together with sibling departments; all directions, including speech recognition, speech synthesis, speaker (voiceprint) recognition, speech separation and keyword spotting, are integrated into it.

Our platform is named PiKa, a combination of PyTorch and Kaldi; "pika" is also the English name of a small animal, and the name happens to combine two animals, suggesting lightness and flexibility. We plan to gradually open the tool platform to the industry from the end of the year. Its main functions include support for various traditional systems as well as various new end-to-end systems. Its features include a focus on Chinese tasks; an efficient DataLoader with on-the-fly noise augmentation and a multi-machine, multi-GPU distributed Trainer; performance verification and speed optimization on tens of thousands of hours of data, with training speed 4-5 times that of Kaldi; and integration of new AI Lab algorithms, such as a low-level PyTorch implementation of LSTMP (LSTM with projection), SpecSwap, DFSMN-SAN-MEM, and so on.
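
As an illustration of what an on-the-fly augmenting DataLoader does, here is a minimal PyTorch sketch that mixes a random noise clip into each utterance at a random SNR when it is fetched; class names, SNR range and the toy data are assumptions and this is not the actual PiKa dataloader.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class NoisyMixDataset(Dataset):
    """On-the-fly augmentation: every fetched utterance gets a random noise clip
    mixed in at a random SNR, so no augmented copies are stored on disk."""
    def __init__(self, speech, noise, snr_db_range=(0.0, 20.0)):
        self.speech, self.noise, self.snr_db_range = speech, noise, snr_db_range

    def __len__(self):
        return len(self.speech)

    def __getitem__(self, idx):
        clean = self.speech[idx]
        noise = self.noise[torch.randint(len(self.noise), (1,)).item()][: clean.numel()]
        snr_db = torch.empty(1).uniform_(*self.snr_db_range)
        # choose the noise gain so that the clean/noise power ratio matches snr_db
        gain = (clean.pow(2).mean() / (noise.pow(2).mean() * 10 ** (snr_db / 10))).sqrt()
        return clean + gain * noise, clean                      # (noisy input, clean target)

speech = [torch.randn(16000) for _ in range(32)]                # fake 1-second utterances
noise = [torch.randn(16000) for _ in range(8)]
loader = DataLoader(NoisyMixDataset(speech, noise), batch_size=4, num_workers=0)
noisy, clean = next(iter(loader))
```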

3. Speech Separation

Our work on speech separation is mainly single-channel speech separation, and we focus on three aspects: the performance of speech separation itself, the generalization of speech separation, and improving recognition performance on separated speech.

3.1 A separation model combining local recurrence and global attention

The first aspect is improving the performance of speech separation itself. Performance here refers not only to objective metrics such as SI-SNR but also to computational complexity, because the current best models such as Conv-TasNet or DPRNN are actually quite expensive: DPRNN is equivalent to 20-30 layers of unidirectional LSTM, and training on roughly 100 hours of data takes about a week. After extensive experiments we proposed the GALR (Globally Attentive Locally Recurrent Networks) model; its key points are as follows (a minimal sketch is given after the list):

(1) Use a recurrent neural network to memorize and process the information within local segments of the waveform;

(2) Use an attention mechanism to extract the global correlations of the signal across segments.
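
The sketch below shows the two key points in PyTorch: a bidirectional RNN runs inside each local segment and self-attention runs across segments. Dimensions, normalization and other details are illustrative assumptions and differ from the published GALR model.

```python
import torch
import torch.nn as nn

class GALRBlockSketch(nn.Module):
    """Illustrative block in the spirit of GALR: a bidirectional RNN models each
    local segment, and self-attention models dependencies across segments."""
    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.local_rnn = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, num_segments, segment_len, dim)
        b, s, k, d = x.shape
        local, _ = self.local_rnn(x.reshape(b * s, k, d))          # recurrence inside segments
        x = x + local.reshape(b, s, k, d)
        g = x.permute(0, 2, 1, 3).reshape(b * k, s, d)             # same position across segments
        attn, _ = self.global_attn(g, g, g)                        # attention across segments
        return x + attn.reshape(b, k, s, d).permute(0, 2, 1, 3)

y = GALRBlockSketch()(torch.randn(2, 50, 16, 128))
```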

From the experimental results, a 1.5M-parameter model achieves separation performance equivalent to a 2.6M-parameter DPRNN, while reducing GPU memory by 36.1% and computation by 49.4%. On the public WSJ0-2mix data it outperforms DPRNN under the same configuration, and on 2,000 hours of Chinese data the SI-SNR of the separated target speech is 9% higher than DPRNN.

3.2 Speech separation semi-supervised learning algorithm

As we all know, the generalization ability of separation models has long been a problem that both academia and industry want to solve, and generalization is especially serious for separation: combinations of multiple sounds have many more possibilities, so mismatches are more likely, and there is no effective way to label data that is already mixed. MBT (Mixup-Breakdown Training) is an easy-to-implement consistency-based semi-supervised learning algorithm we proposed, a "mixup-breakdown" training method that can be used for speech separation tasks. MBT first introduces a mean-teacher model to predict the separation results of the input mixed signals, where the input mixtures include both labeled and unlabeled data;

these intermediate outputs (the so-called "breakdown" results) are then randomly interpolated and remixed to obtain pseudo-"labeled" mixtures (the so-called "mixup" step); finally, the student model is updated by optimizing the prediction consistency between the teacher model and the student model. To our knowledge, this is the first work to propose using semi-supervised learning on speech separation tasks to effectively improve generalization to mismatched application scenarios.
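
A minimal sketch of one such consistency training step is given below; the loss, mixing scheme and toy separator are simplified assumptions relative to the published MBT algorithm.

```python
import copy
import torch

def mbt_step(student, teacher, mixture, optimizer, ema_decay=0.999):
    """One Mixup-Breakdown-style consistency step (simplified illustration)."""
    with torch.no_grad():
        s1, s2 = teacher(mixture)                    # "breakdown": teacher's separation
        lam = torch.rand(mixture.size(0), 1)
        remix = lam * s1 + (1 - lam) * s2            # "mixup": pseudo-labeled mixture
        t1, t2 = teacher(remix)                      # teacher targets on the remix
    p1, p2 = student(remix)
    loss = ((p1 - t1) ** 2 + (p2 - t2) ** 2).mean()  # student-teacher consistency
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                            # teacher = EMA of student weights
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(ema_decay).add_(sp, alpha=1 - ema_decay)
    return loss.item()

class ToySep(torch.nn.Module):                       # stand-in two-source separator
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Linear(16000, 32000)
    def forward(self, x):
        out = self.net(x)
        return out[:, :16000], out[:, 16000:]

student = ToySep()
teacher = copy.deepcopy(student)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)
mbt_step(student, teacher, torch.randn(4, 16000), opt)
```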

3.3 Improve the recognition performance of separated speech

In practical application scenarios, speech separation is usually followed by speech recognition, and the ultimate goal of separation is higher recognition accuracy. However, the separation process inevitably introduces signal errors and distortions that degrade recognition performance. The common solution is to train the acoustic model and the speech separation model jointly. Here we have two main conclusions:

One is that joint training can be carried out with a lighter recognition model, and the jointly optimized separation module can then be plugged into a large online recognition system and still yield a significant improvement.

The second is that introducing a loss at the fbank feature level for multi-task optimization also helps reduce the distortion caused by separation.

We also proposed another end-to-end neural network framework, EAR (Extract, Adapt and Recognize), which inserts an adaptor directly between separation and recognition. The role of the adaptor is to explicitly learn, through a neural network, a transitional representation that adapts the masked spectrum to the recognition features. Comparisons with other methods on the test sets show that the EAR framework is highly robust and still performs very well on noisy speech; across multiple test sets, the proposed acoustic model improves substantially on every one.

We integrate the above speech separation technologies and apply them to separation and recognition against complex music backgrounds, around the task of video speech transcription and subtitle generation. Background music is a particularly typical problem: it is widespread in short videos and very diverse, and the recognition performance of existing speech recognition systems degrades significantly under strong background music. By training with our separation and joint-optimization techniques on large amounts of speech and background-music data, the recognition accuracy on multiple music-background test sets improves by more than 20% relative, with no need for a background-music detection module, and test sets without background music also obtain a relative improvement of 1% to 3%.

4. Multi-Modal Technology

4.1 Multimodal speech separation

This is our multi-modal speech separation system. The input devices are a microphone array and a camera.

First, the system detects n people; take the person in the red box as the target speaker. The first modality, face detection, tells us the direction of the target speaker; the second modality, facial keypoints, tells us the target speaker's lip shape; and the third modality, if the target has registered a voice, gives us their voiceprint information. Next, the three streams of multi-modal information are fed into three feature-extraction networks to extract the target speaker's information, and this information, together with the multi-channel speech signals, is sent to the separation network, which outputs the target speaker's voice. When we built this system, it was the first work we know of that uses three modalities for speech separation. For the specific model structure, modality-fusion method and joint training algorithm, please refer to our paper.
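
The sketch below illustrates the general shape of such a system, fusing three target-speaker cues (direction, lips, voiceprint) with multi-channel audio to estimate a mask for the target speech. All module choices, feature sizes and the fusion-by-concatenation scheme are assumptions for illustration; the actual architecture and fusion strategy are described in the cited paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiModalSpeakerExtractor(nn.Module):
    """Illustrative target-speech extractor driven by direction, lip and
    voiceprint cues plus multi-channel audio (not the published model)."""
    def __init__(self, dim=256):
        super().__init__()
        self.audio_enc = nn.Conv1d(8 * 2, dim, kernel_size=1)     # 8-mic STFT (real/imag)
        self.dir_enc = nn.Linear(1, dim)                          # target direction (angle)
        self.lip_enc = nn.GRU(40, dim, batch_first=True)          # lip landmark features
        self.spk_enc = nn.Linear(128, dim)                        # voiceprint embedding
        self.mask_net = nn.Sequential(nn.Conv1d(4 * dim, dim, 1), nn.ReLU(),
                                      nn.Conv1d(dim, 8 * 2, 1), nn.Sigmoid())

    def forward(self, mix, doa, lips, dvec):
        t = mix.size(-1)
        a = self.audio_enc(mix)                                   # (B, dim, T)
        d = self.dir_enc(doa).unsqueeze(-1).expand(-1, -1, t)     # broadcast over time
        l, _ = self.lip_enc(lips)
        l = F.interpolate(l.transpose(1, 2), size=t)              # align video rate to audio
        s = self.spk_enc(dvec).unsqueeze(-1).expand(-1, -1, t)
        mask = self.mask_net(torch.cat([a, d, l, s], dim=1))      # fuse the four streams
        return mask * mix                                         # masked multi-channel spec

out = MultiModalSpeakerExtractor()(torch.randn(2, 16, 200), torch.randn(2, 1),
                                   torch.randn(2, 50, 40), torch.randn(2, 128))
```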

By experimenting with different combinations of modality information, we can evaluate the importance of the different modalities: direction information, lip information and voiceprint information contribute differently to the system. Overall, the direction information is the strongest, the lip information is the most complementary to the other modalities, and the three modalities complement one another: using all three together reduces WER from 19% to 10%. Below is an example of the three modalities complementing each other; it shows that using all three can handle almost all corner cases, such as the target speaking with the face turned sideways, or opening the mouth without speaking, and so on.

We also tested the robustness of each modality, for example the case where the lips cannot be detected: using different dropout rates at test time, we can see that the system is fairly robust to missing lip information. Another test is robustness to the direction information. Errors in the direction may come from false alarms caused by reflections on glass in face detection, or from the face direction during side-facing speech not matching the audio direction. From the dark green curve, when the angle between the target and the interferer is less than 15 degrees, artificially adding about 5 degrees of error to the direction information causes a drop of less than 0.5 dB; when the angle between the target and the interferer is greater than 15 degrees, adding about 5 degrees of error causes no performance degradation.

Here we compare the separation effect of two single-modality systems and our multi-modal system: the first single-modality system is Google's VoiceFilter, the second is Google's Looking-to-Listen, and the third is our multi-modal system.

4.2 Multi-modal speech recognition

We have also done some multi-modal audio-visual ASR work. The latest work combines multi-modal separation + beamforming + multi-modal ASR for joint training. This table shows only a small part of our results; for more details you are welcome to read the papers we list. The baseline here is a common approach: use the ground-truth direction, do delay-and-sum beamforming, and then run audio-only conventional ASR (the AM is a TDNN; results for an E2E model are worse, see the paper for details). The second row is the multi-modal separation mentioned earlier followed by audio-only conventional ASR, which gives a huge 50% relative WER improvement. The third row is the multi-modal separation + MVDR + audio-visual ASR system, which gives a further 24% relative WER improvement. We also proposed a new multi-modal fusion scheme for ASR, discussed in the paper. As far as we know, this is the first multi-channel, multi-modal speech recognition work.

In the near future we will open-source a 3,500-hour multi-modal dataset (Chinese data from Tencent Video, etc.) to help everyone tackle the cocktail-party problem. It will be the largest labeled multi-channel audio + video dataset; the annotations include human transcriptions of the text, speaker labels, sound-source directions, face-detection bounding boxes and landmarks, and so on.

The test set is a real-environment multi-channel audio + video recording made with our AI Lab self-developed device. This dataset can help everyone tackle the three key problems within the cocktail-party problem: diarization, separation and ASR.

Stay tuned!

Reply with the keyword [AI Lab] in the official account backend to get the speakers' slides.


Source: blog.csdn.net/Tencent_TEG/article/details/108570724