GitHub 3.1K stars: open-sourcing the industry's first streaming speech synthesis system

Intelligent voice technology is everywhere in daily life: voice assistants, voice broadcasts, and the virtual digital humans that have become popular in recent years all rely on it. Intelligent speech is a composite technology built from speech recognition, speech synthesis, natural language processing, and more; it places high demands on developers and has long been a difficult area for enterprise applications.

PaddleSpeech, the PaddlePaddle speech model library, provides developers with speech recognition, speech synthesis, voiceprint recognition, sound classification, and other speech processing capabilities. All of the code is open source and every service can be deployed with one click, so developers can easily build industrial applications.

Since being open-sourced, PaddleSpeech has attracted wide attention from developers, and that attention keeps growing.

Along the way, we have continuously upgraded it based on user feedback, adding new features and improving the user experience.

This time, PaddleSpeech version 1.0 is officially released, bringing four major upgrades to developers:

Newly released PP-TTS: the industry's first open-source end-to-end streaming speech synthesis system, with streaming acoustic models, streaming vocoders, and an open-source one-click streaming speech synthesis service deployment solution.

Newly released PP-ASR: an open-source streaming speech recognition system trained on tens of thousands of hours of data, with an open-source one-click streaming speech recognition service deployment solution, supporting language-model decoding and personalized speech recognition.

Newly released PP-VPR: an open-source full-pipeline voiceprint extraction and retrieval system, letting you build an industrial-grade system in 10 minutes.

One-click service capabilities: deploy the five core speech services (speech recognition, speech synthesis, voiceprint recognition, sound classification, and punctuation restoration) with a single command.

★ Project Portal ★
GitHub - PaddlePaddle/PaddleSpeech: Easy-to-use Speech Toolkit including SOTA/Streaming ASR with punctuation, influential TTS with text frontend, Speaker Verification System and End-to-End Speech Simultaneous Translation.

The following is a detailed walkthrough of this release.

1. PP-TTS: the industry's first open-source end-to-end streaming speech synthesis system

Speech synthesis gives the machine a "mouth" to speak with. With the development of deep learning, end-to-end neural speech synthesis has greatly improved on traditional techniques in quality, but its response time is long, which makes it hard to deploy in scenarios with strict real-time requirements and hard to meet business needs.

For example, in real-time interactive virtual digital human applications, the virtual human must respond to user instructions quickly, or it will wear down the user's patience and degrade the experience. This calls for a streaming speech synthesis system that improves response speed and interactivity while preserving synthesis quality.

The newly released PP-TTS provides a one-click deployable streaming speech synthesis solution, addressing the long response times that have made speech synthesis technology hard to put into production.

Streaming inference architecture to reduce average response latency

Taking the acoustic model FastSpeech2 and the vocoder HiFi-GAN as an example, PP-TTS redesigns the FastSpeech2 decoder, replacing its FFT blocks with a convolutional structure, and proposes a novel streaming inference architecture that combines FastSpeech2 with HiFi-GAN. Inference proceeds chunk by chunk, yet the outputs of the acoustic model and the vocoder remain consistent with non-streaming inference.
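The key property of chunk-wise streaming inference can be illustrated with a minimal causal 1-D convolution. This is a simplified stand-in for the convolutional decoder blocks described above, not PaddleSpeech's actual implementation: by caching the last `kernel_size - 1` inputs between chunks, the chunked output exactly matches the full-sequence output.

```python
# Minimal illustration of chunk-wise streaming inference with a causal
# 1-D convolution. A simplified stand-in for the convolutional decoder
# blocks described above, not PaddleSpeech's actual code.

def causal_conv(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[t-k+1..t]."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [sum(kernel[j] * padded[t + j] for j in range(k))
            for t in range(len(x))]

def streaming_conv(chunks, kernel):
    """Process chunks one by one, caching the last k-1 inputs as context."""
    k = len(kernel)
    cache = [0.0] * (k - 1)
    out = []
    for chunk in chunks:
        padded = cache + list(chunk)
        out.extend(sum(kernel[j] * padded[t + j] for j in range(k))
                   for t in range(len(chunk)))
        cache = padded[len(padded) - (k - 1):]  # carry context to next chunk
    return out

signal = [0.5, 1.0, -0.5, 2.0, 0.0, 1.5, -1.0, 0.25]
kernel = [0.25, 0.5, 0.25]

full = causal_conv(signal, kernel)
chunked = streaming_conv([signal[:3], signal[3:5], signal[5:]], kernel)
assert chunked == full  # streaming output matches non-streaming output
```

Because each chunk carries the cached context of the previous one, the first audio chunk can be emitted as soon as the first text chunk is processed, which is where the latency reduction comes from.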

PP-TTS streaming speech synthesis greatly reduces average response latency while preserving synthesis quality:

Test environment: the last 100 utterances of the CSMSC dataset; CPU: Intel(R) Core(TM) i5-8250U @ 1.60GHz

Compared with end-to-end non-streaming synthesis, PP-TTS streaming synthesis reduces average response latency by 97.4% and can respond in real time even on an ordinary laptop CPU.

Text front-end optimization

PP-TTS provides an optimized text front end for Chinese scenarios: text normalization for common non-standard words such as times, dates, phone numbers, and temperatures, plus an open-source grapheme-to-phoneme (G2P) solution that handles neutral-tone sandhi, third-tone sandhi, and the tone sandhi of "一" and "不". On a self-built text normalization test set, the CER is as low as 0.73%; with the pinyin annotations of the CSMSC dataset as ground truth, the G2P WER is as low as 2.6%.
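CER figures like these come from the edit distance between hypothesis and reference strings. A minimal character error rate function, illustrative of how such numbers are scored rather than the exact evaluation script used above:

```python
# Minimal character error rate (CER) computation via Levenshtein edit
# distance. Illustrative of how text-normalization / G2P accuracy is
# scored, not the exact evaluation script behind the numbers above.

def edit_distance(ref, hyp):
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / len(ref)

# e.g. a reference reading of a date vs. a hypothesis with one wrong character
ref = "二零二二年三月一日"
hyp = "二零二二年三月七日"
print(f"CER = {cer(ref, hyp):.2%}")  # one substitution over nine characters
```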

2. PP-ASR: a streaming speech recognition system trained on tens of thousands of hours of data

If speech synthesis is the machine's "mouth", then speech recognition is its "ear"; only with an accurate ear can the machine become smarter. End-to-end non-streaming speech recognition models recognize more accurately, but their high latency cannot meet the needs of real-time interaction. To solve this, PaddleSpeech 1.0 brings PP-ASR: a streaming speech recognition system trained on the tens of thousands of hours of the WenetSpeech dataset.

While preserving recognition accuracy, PP-ASR streaming recognition significantly reduces response latency and returns results in real time, improving the user experience.

Test setup: Conformer model; test set: AIShell-1; streaming recognition chunk length: 640 ms; GPU: Tesla V100-SXM2-32GB; CPU: 80-core Intel(R) Xeon(R) Gold 6271C @ 2.60GHz
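At a 16 kHz sampling rate, a 640 ms chunk is 10,240 samples. A minimal sketch of slicing a PCM buffer into such chunks for streaming recognition (the model call itself, e.g. a Conformer chunk decode, is omitted):

```python
# Slice a 16 kHz PCM buffer into 640 ms chunks for streaming recognition.
# Illustrative only: the per-chunk model call is not shown.

SAMPLE_RATE = 16000
CHUNK_MS = 640
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 10240 samples per chunk

def iter_chunks(pcm):
    """Yield successive fixed-size chunks; the last one may be shorter."""
    for start in range(0, len(pcm), CHUNK_SAMPLES):
        yield pcm[start:start + CHUNK_SAMPLES]

# Simulate 2 seconds of audio and count the resulting chunks.
audio = [0] * (SAMPLE_RATE * 2)  # 32000 samples
chunks = list(iter_chunks(audio))
print(len(chunks))  # 4 chunks: 3 full chunks plus 1 partial chunk
```

The latency of a streaming recognizer is bounded by the chunk length plus per-chunk compute, rather than by the full utterance length, which is why results appear in real time.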

Personalized recognition solution

The WFST-based personalized recognition scheme supports speech recognition for specific scenarios. For example, in a taxi-fare reimbursement scenario, general-purpose recognition performs poorly on entities such as POIs, dates, and times; WFST-based personalization improves accuracy on them. On an internal taxi-reimbursement test set, the CER drops from 5.4% with general recognition to 1.32% after optimization, an absolute reduction of 4.08 percentage points.
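The WFST machinery is beyond a short example, but its effect can be caricatured as giving a score bonus to hypotheses containing domain entities. The snippet below is purely illustrative, with a made-up POI list; the real system composes WFSTs during decoding rather than rescoring finished strings:

```python
# Caricature of personalized recognition: add a score bonus when a
# hypothesis contains a domain hotword (e.g. a POI name). Purely
# illustrative; PaddleSpeech's WFST-based scheme works inside the
# decoder, not by rescoring whole strings like this.

HOTWORDS = {"虹桥火车站": 3.0, "浦东机场": 3.0}  # hypothetical POI list

def rescore(hypotheses):
    """hypotheses: list of (text, acoustic_score); return the best text."""
    def score(item):
        text, s = item
        bonus = sum(b for w, b in HOTWORDS.items() if w in text)
        return s + bonus
    return max(hypotheses, key=score)[0]

beams = [("到红桥火车站", -4.0),   # acoustically likely but wrong entity
         ("到虹桥火车站", -5.5)]   # correct POI, slightly worse score
print(rescore(beams))  # the hotword bonus flips the ranking
```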

For a demonstration of the effect, see the example at the end of the article.

3. PP-VPR: a full-pipeline voiceprint recognition and audio retrieval system

As a biometric feature, the voiceprint is hard to forge, tamper with, or steal; combined with speech recognition and dynamic-password technology, it is well suited to remote identity authentication. Building on voiceprint recognition, audio retrieval technology (for speech, music, speakers, and so on) makes it possible to quickly find similar voice clips, or clips from the same speaker, in massive audio collections.

Voiceprint recognition itself is a typical pattern recognition problem; its basic system architecture is as follows:

PaddleSpeech's open-source PP-VPR voiceprint recognition and audio retrieval system integrates an industry-leading voiceprint model: it extracts voiceprint features with ECAPA-TDNN, achieving an equal error rate (EER) as low as 0.83%, and by chaining MySQL and Milvus together it builds a complete audio retrieval system with millisecond-level retrieval.
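Once embeddings are extracted, speaker verification reduces to comparing them, typically by cosine similarity against a threshold tuned to the EER operating point. A sketch with made-up vectors (real ECAPA-TDNN embeddings are much higher-dimensional, and the extraction step is omitted):

```python
import math

# Voiceprint verification on top of extracted embeddings: accept the
# pair if the cosine similarity of two speaker embeddings exceeds a
# threshold. The vectors and threshold below are made up for illustration.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_speaker(emb1, emb2, threshold=0.7):
    """Threshold would be tuned on a dev set to the EER operating point."""
    return cosine_similarity(emb1, emb2) >= threshold

enroll = [0.8, 0.1, 0.55, -0.2]         # enrollment embedding
probe_same = [0.75, 0.12, 0.5, -0.18]   # near-duplicate of enroll
probe_diff = [-0.4, 0.9, -0.1, 0.3]     # a different speaker

print(same_speaker(enroll, probe_same))  # True
print(same_speaker(enroll, probe_diff))  # False
```

In the retrieval setting, Milvus performs this same similarity search at scale over stored embeddings, while MySQL keeps the mapping from vector IDs back to audio metadata.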

4. One-click deployment of the five core speech services: speech recognition, speech synthesis, voiceprint recognition, sound classification, and punctuation restoration

In industrial applications, it is often most convenient to expose trained models to others as services. Since building a complete network service from scratch is tedious, PaddleSpeech provides one-click deployment: a single command starts the speech recognition, speech synthesis, voiceprint recognition, sound classification, and punctuation restoration services at the same time.

Demo usage and results

Enter the demos/speech_server directory and start the speech recognition, speech synthesis, voiceprint recognition, sound classification, and punctuation restoration services with one click.
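Assuming PaddleSpeech has been installed via pip, this looks roughly as follows (the config path follows the repository's demo layout; adjust it to your checkout):

```shell
# application.yaml declares which engines (asr, tts, cls, vector, text)
# to load and the port to listen on.
cd demos/speech_server
paddlespeech_server start --config_file ./conf/application.yaml
```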

At this point, the services are listening on the configured port 8090, and we can call them from the command line.

Client call, taking speech recognition as an example:
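A representative client invocation with the bundled `paddlespeech_client` tool (the input file name here is a placeholder; point it at any 16 kHz WAV file):

```shell
# Send a WAV file to the running ASR service and print the transcript.
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```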

Recognition result:

The speech synthesis, voiceprint recognition, sound classification, and punctuation restoration services are called similarly; see the corresponding documentation.


Origin blog.csdn.net/weixin_41888295/article/details/125066775