Open source (offline) Chinese speech recognition ASR (speech-to-text) tool arrangement

Table of contents

  • OpenAI's open-source tool: Whisper
  • ASRT Speech Recognition Project
  • Microsoft Speech Service (paid)
  • PaddleSpeech

OpenAI's open-source tool: Whisper

Introduction

On September 21, 2022, OpenAI open-sourced the Whisper neural network, claiming that its English speech recognition ability reaches human level and that it also supports automatic speech recognition in 98 other languages. The automatic speech recognition (ASR) models provided by Whisper are trained to perform both speech recognition and speech translation: they can convert speech in many languages into text and can also translate that speech into English.

Whisper's core function is speech recognition. For most people, it helps turn recordings of meetings, lectures, and classes into transcripts more quickly; for film and TV fans, it can automatically generate subtitles for videos that have none, with no need to wait for subtitle groups to release them; for foreign language learners, running your pronunciation practice recordings through Whisper is a good way to check how intelligible your spoken language is. The major cloud platforms all offer speech recognition services, but they generally require a network connection, which always carries some risk to personal privacy. Whisper is different: it runs entirely locally with no network access, which fully protects personal privacy, and its recognition accuracy is quite high.
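For a sense of how little code this takes, here is a minimal sketch using the open-source openai-whisper Python package (pip install -U openai-whisper; ffmpeg must be on the PATH). The file name and model size are placeholders.

```python
# Minimal sketch: offline transcription and translation with openai-whisper.
# "meeting.wav" and the "small" model size are placeholders.
import whisper

# Load a pretrained model; "small" balances speed and accuracy.
model = whisper.load_model("small")

# Transcribe a local recording entirely offline; the language can be set
# explicitly (e.g. "zh" for Chinese) or left to automatic detection.
result = model.transcribe("meeting.wav", language="zh")
print(result["text"])

# task="translate" asks Whisper to translate the speech into English.
english = model.transcribe("meeting.wav", task="translate")
print(english["text"])
```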

ASRT Speech Recognition Project

Introduction to ASRT

ASRT is a deep-learning-based speech recognition tool that can be used to build state-of-the-art speech recognition systems. It has been developed since 2016 by the AI Lemon blogger (Xidian University, Xi'an Key Laboratory of Big Data and Visual Intelligence). The open-source project reports a baseline recognition accuracy of 85%, and about 95% accuracy under certain conditions. ASRT includes a speech recognition algorithm server (for training models or deploying an API service) and client SDKs for multiple platforms and programming languages. It supports both single-utterance recognition and real-time streaming recognition. The code is open-sourced on GitHub and Gitee.
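As a rough illustration of talking to a locally deployed ASRT API server, here is a hypothetical client sketch. The host, port, URL path, and JSON field names are assumptions made for illustration and must be checked against the documentation of the ASRT release you actually deploy.

```python
# Hypothetical sketch of calling a locally deployed ASRT API server over HTTP.
# The port, path, and JSON field names are assumptions -- verify them against
# the documentation of the ASRT version you are running.
import base64
import json
import urllib.request
import wave

with wave.open("speech.wav", "rb") as f:  # 16 kHz, 16-bit mono PCM assumed
    payload = {
        "channels": f.getnchannels(),
        "sample_rate": f.getframerate(),
        "byte_width": f.getsampwidth(),
        "samples": base64.b64encode(f.readframes(f.getnframes())).decode("ascii"),
    }

req = urllib.request.Request(
    "http://127.0.0.1:20001/all",            # assumed default host/port/path
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))           # recognized text in the response body
```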

Microsoft Speech Service (paid)

Introduction to Microsoft Speech Services

Microsoft Speech Services provides speech-to-text and text-to-speech capabilities through an Azure Speech resource. You can transcribe speech to text with high accuracy, generate natural-sounding text-to-speech voices, translate spoken audio, and use speaker recognition during conversations. According to Microsoft, the service offers speech recognition (speech-to-text), speech synthesis (text-to-speech), real-time speech translation, conversation transcription, and the ability to integrate speech into bot experiences.

The speech-to-text service mainly includes the following capabilities:

Real-time speech-to-text

  • When using real-time speech-to-text, audio is transcribed as speech is recognized from a microphone or from a file. Use real-time speech-to-text for applications that need audio transcribed as it arrives, for example (a code sketch follows this list):

    • Transcriptions, captions, or subtitles for live meetings

    • Contact center agent assist

    • Dictation

    • Voice agents

    • Pronunciation assessment
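To make the real-time scenarios above concrete, here is a minimal sketch using the Azure Speech SDK for Python (pip install azure-cognitiveservices-speech). The subscription key, region, and file name are placeholders for your own Azure Speech resource.

```python
# Minimal sketch: real-time speech-to-text with the Azure Speech SDK.
# The subscription key and region are placeholders for your Azure Speech resource.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastasia")
speech_config.speech_recognition_language = "zh-CN"

# Recognize from a WAV file; omit audio_config to use the default microphone.
audio_config = speechsdk.audio.AudioConfig(filename="meeting.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
else:
    print("Recognition failed:", result.reason)
```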

Batch transcription

Batch transcription is used to transcribe large volumes of audio in storage. You point the service at audio files using a Shared Access Signature (SAS) URI and receive the transcriptions asynchronously. Use batch transcription for applications that need to transcribe audio in bulk, such as (a sketch of the REST request follows this list):

  • Transcriptions, captions, or subtitles for pre-recorded audio
  • Contact center post-call analytics
  • Diarization (distinguishing speakers)
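Below is a hedged sketch of submitting a batch transcription job through the Speech-to-text REST API. The key, region, API version in the URL, and the SAS URL are placeholders; check the current Azure documentation for the exact endpoint and request schema.

```python
# Hedged sketch: submitting a batch transcription job via the Azure
# Speech-to-text REST API. Key, region, API version, and the SAS URL are
# placeholders; confirm the exact endpoint and fields in the Azure docs.
import json
import urllib.request

region = "eastasia"                    # placeholder Azure region
key = "YOUR_SPEECH_RESOURCE_KEY"       # placeholder subscription key

body = {
    "displayName": "My batch job",
    "locale": "zh-CN",
    "contentUrls": ["https://<storage-account>.blob.core.windows.net/audio/meeting.wav?<sas-token>"],
}

req = urllib.request.Request(
    f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions",
    data=json.dumps(body).encode("utf-8"),
    headers={
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    },
)
with urllib.request.urlopen(req) as resp:
    job = json.loads(resp.read())
    print(job["self"])  # URL to poll for the finished transcription
```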

Custom speech

Using Custom Speech, you can evaluate and improve speech recognition accuracy for your applications and products. Custom speech models are available for real-time speech-to-text, speech translation, and batch transcription.

Out-of-the-box speech recognition uses a universal language model as the base model, trained on Microsoft-owned data that reflects commonly used spoken language. The base model is pre-trained with dialects and phonetics representing a variety of common domains. When you make a speech recognition request, the latest base model for each supported language is used by default. The base model works well in most speech recognition scenarios.

Custom models augment the base model in two ways: you can provide text data to improve recognition of domain-specific vocabulary used in your application, and you can provide audio data with reference transcriptions to improve recognition under your application's specific audio conditions.
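As an illustrative sketch, and assuming a custom model has already been trained and deployed, the real-time SDK can be pointed at it by setting the endpoint ID on the speech configuration; the key, region, and endpoint ID below are placeholders.

```python
# Sketch: using a deployed Custom Speech model with the real-time SDK.
# Key, region, and endpoint ID are placeholders.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="eastasia")
speech_config.speech_recognition_language = "zh-CN"
speech_config.endpoint_id = "YOUR_CUSTOM_ENDPOINT_ID"  # routes requests to the custom model

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config)
print(recognizer.recognize_once().text)
```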

PaddleSpeech

Description of PaddleSpeech

PaddleSpeech is an open-source speech model library built on the PaddlePaddle deep-learning platform, used for developing a variety of key tasks in speech and audio. It contains a large number of state-of-the-art and influential deep-learning models, including speech recognition (ASR) models. You can use PaddleSpeech to train and test Chinese speech recognition models.
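As a quick sketch of offline Chinese recognition with PaddleSpeech's Python interface (assuming paddlepaddle and paddlespeech are installed, and zh.wav is a 16 kHz mono Mandarin recording; the first run downloads pretrained weights):

```python
# Minimal sketch: Chinese ASR with PaddleSpeech's Python CLI executor.
# Assumes paddlepaddle and paddlespeech are installed and zh.wav is a
# 16 kHz mono WAV recording of Mandarin speech.
from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
# The default model targets Mandarin Chinese; the first call downloads weights.
text = asr(audio_file="zh.wav", lang="zh", sample_rate=16000)
print(text)
```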

Origin: blog.csdn.net/guigenyi/article/details/130605249