Simple usage of six open-source projects that support Chinese speech recognition

Foreword

Excerpted from Baidu Encyclopedia

Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences. This differs from speaker recognition and speaker verification, which attempt to identify or verify the speaker rather than the lexical content of the speech.

Speech recognition is one of the application areas of deep learning, and there are many projects on GitHub that implement ASR. Several that support Chinese are listed below; the rest of this post demonstrates basic usage of each.

  1. https://github.com/PaddlePaddle/PaddleSpeech
  2. https://github.com/nl8590687/ASRT_SpeechRecognition
  3. https://github.com/nobody132/masr
  4. https://github.com/espnet/espnet
  5. https://github.com/wenet-e2e/wenet
  6. https://github.com/mozilla/DeepSpeech

1. PaddleSpeech

PaddleSpeech is an open-source speech toolkit built on PaddlePaddle, used for developing a wide range of speech and audio tasks, and it contains a large number of cutting-edge, influential deep-learning models.

PaddleSpeech won the NAACL 2022 Best Demo Award; see the arXiv paper for details.

1.1 Installation

Just follow the official documentation: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/README_cn.md#%E5%AE%89%E8%A3%85

The project strongly recommends installing PaddleSpeech with Python 3.7 or above in a Linux environment.

But my computer runs Windows and had Python 3.6.5, so I installed Python 3.7.0 for Windows from https://www.python.org/downloads/windows/ . The following errors appeared while installing:

pip install paddlespeech

Problem 1: no matching version of paddlespeech-ctcdecoders

 Could not find a version that satisfies the requirement paddlespeech-ctcdecoders (from paddlespeech) (from versions: )
No matching distribution found for paddlespeech-ctcdecoders (from paddlespeech)

It turns out that paddlespeech-ctcdecoders has no Windows release, although someone has already compiled a Windows build. More importantly, the official docs note that it does not matter if paddlespeech-ctcdecoders fails to install: the package is optional.

Problem 2: gbk encoding error

UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 5365: illegal multibyte sequence

This error disappeared when I ran the installation again.
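The likely cause: on Chinese Windows, Python's default text codec is gbk, so reading a UTF-8 file during installation can fail on bytes that are not valid GBK. A minimal sketch of the failure and a general workaround (the sample bytes are made up for illustration; PYTHONUTF8 is a real Python 3.7+ switch):

```python
# The Windows default text codec is 'gbk'; a file containing bytes that
# are not valid GBK (such as the 0x80 reported above) fails to decode:
bad = b"\x80"
try:
    bad.decode("gbk")
except UnicodeDecodeError as err:
    print(err)  # 'gbk' codec can't decode byte 0x80 ...

# The same bytes may be perfectly valid in another codec; decoding
# UTF-8 content with the UTF-8 codec works fine:
text = "百度飞桨"
assert text.encode("utf-8").decode("utf-8") == text

# A general workaround: set the environment variable PYTHONUTF8=1
# (Python 3.7+), which makes open() default to UTF-8 on Windows
# regardless of the system locale.
```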

1.2 Run

File not found error:

FileNotFoundError: [WinError 2] 系统找不到指定的文件。

The Chinese message means "The system cannot find the file specified." As a quick workaround, open the Python\Python37\lib\subprocess.py file and change shell=False to shell=True in Popen.__init__ (around line 684); note that this edits the standard library and affects every script on the machine.

    _child_created = False  # Set here since __del__ checks it

    def __init__(self, args, bufsize=-1, executable=None,
                 stdin=None, stdout=None, stderr=None,
                 preexec_fn=None, close_fds=True,
                 shell=True, cwd=None, env=None, universal_newlines=None,
                 startupinfo=None, creationflags=0,
                 restore_signals=True, start_new_session=False,
                 pass_fds=(), *, encoding=None, errors=None, text=None):

Test with the official example:

paddlespeech asr --lang zh --input C:\Users\supre\Desktop\sound\PaddleSpeech-develop\zh.wav

paddle_zh_demo_baidu

I tested it with two of my own recordings and found it very accurate.

paddle_zh_demo_1

paddle_zh_demo_2

1.3 More functions

In addition to speech recognition, PaddleSpeech supports several other functions:

  1. Sound classification
paddlespeech cls --input input.wav
  2. Voiceprint recognition
paddlespeech vector --task spk --input input_16k.wav
  3. Speech translation (English to Chinese; not yet supported on Windows)
paddlespeech st --input input_16k.wav
  4. Speech synthesis
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav

The synthesized audio said "Hello, aabond, welcome to Baidu Flying Paddle", with silence where "aabond" should be. My guess is that it cannot synthesize the English part, or that extra parameters are needed.

  5. Punctuation restoration
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
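All of these subcommands process one file at a time. To run one of them over a folder of recordings, a small wrapper around the CLI can help (a sketch; run_cli_on_wavs and the "{}" placeholder convention are my own, not part of PaddleSpeech):

```python
import pathlib
import subprocess

def run_cli_on_wavs(folder, cmd_template):
    """Run a command line once per .wav file in `folder`.

    cmd_template is a list of arguments where "{}" marks the spot for
    the wav path, e.g.
    ["paddlespeech", "asr", "--lang", "zh", "--input", "{}"].
    Returns a {filename: stdout} dict.
    """
    results = {}
    for wav in sorted(pathlib.Path(folder).glob("*.wav")):
        cmd = [arg.replace("{}", str(wav)) for arg in cmd_template]
        proc = subprocess.run(cmd, capture_output=True, text=True)
        results[wav.name] = proc.stdout.strip()
    return results
```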

2. ASRT

ASRT is a Chinese speech recognition system based on deep learning. It is implemented with TensorFlow.Keras, combining deep convolutional neural networks, long short-term memory (LSTM) networks, an attention mechanism, and CTC.

2.1 Installation

Download the server code and install the dependencies:

$ pip install tensorflow==2.5.2
$ pip install wave
$ pip install matplotlib
$ pip install requests
$ pip install scipy
$ pip install flask
$ pip install waitress

Then download the client.

2.2 Run

python asrserver_http.py
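Once asrserver_http.py is up, it exposes an HTTP API that the bundled client calls. A sketch of assembling the request body yourself (the field names and the :20001/all endpoint follow my reading of the ASRT client code and may need adjusting against the ASRT docs):

```python
import base64
import json
import wave

def build_asrt_payload(wav_path):
    """Pack a WAV file into the JSON body the ASRT HTTP server expects
    (field names are assumptions based on the ASRT client)."""
    with wave.open(wav_path, "rb") as f:
        frames = f.readframes(f.getnframes())
        body = {
            "channels": f.getnchannels(),
            "sample_rate": f.getframerate(),
            "byte_width": f.getsampwidth(),
            "samples": base64.b64encode(frames).decode("ascii"),
        }
    return json.dumps(body)

# Hypothetical usage once the server is running:
# import requests
# r = requests.post("http://127.0.0.1:20001/all",
#                   data=build_asrt_payload("my.wav"),
#                   headers={"Content-Type": "application/json"})
# print(r.json())
```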

Using the Baidu demo as an example, the result is displayed correctly.

asrt_zh_demo_1

Using my own recording as an example, "The Art of Speaking" was recognized as "Up", which is wrong. Perhaps I spoke too fast; it probably needs to be trained with more data.

asrt_zh_demo_2

Real-time speech recognition is also available; I found its accuracy to be average.
asrt_zh_demo_3

3. MASR

MASR is a Mandarin Chinese speech recognition project based on end-to-end deep neural networks.

MASR uses a gated convolutional network (Gated Convolutional Network) whose structure is similar to the Wav2Letter model Facebook proposed in 2016, except that the activation function is neither ReLU nor HardTanh but GLU (gated linear unit), hence the name. According to the author's experiments, GLU converges faster than HardTanh. If you want to study how well convolutional networks perform at speech recognition, this project is a good reference.
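For intuition, GLU splits the layer output in half along the channel dimension and uses one half, passed through a sigmoid, as a gate on the other half. A minimal NumPy sketch:

```python
import numpy as np

def glu(x, axis=-1):
    """Gated linear unit: GLU(a, b) = a * sigmoid(b), where a and b are
    the two halves of x along `axis`. Unlike ReLU or HardTanh, the
    nonlinearity acts as a learned gate rather than a fixed clipping."""
    a, b = np.split(x, 2, axis=axis)
    return a * (1.0 / (1.0 + np.exp(-b)))

x = np.array([[2.0, -4.0, 0.0, 100.0]])  # halves: a=[2,-4], b=[0,100]
print(glu(x))  # sigmoid(0)=0.5, sigmoid(100)≈1 → [[1.0, -4.0]]
```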

3.1 Installation

Download the source code and the pre-trained model data, create a pretrained directory under the source directory, and put the model file into it.

3.2 Run

python examples/demo-record-recognize.py

Recognition result:

"The Art of Speaking" was recognized as "Speaking and Yishu"

masr_zh_demo_1

"Just because you are so beautiful" was recognized as "Just because of Lipaiyun"

masr_zh_demo_2

4. ESPnet

ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/formatting, and recipes to provide a complete setup for various speech processing experiments.

4.1 Installation

pip install espnet
pip install espnet_model_zoo
pip install kenlm

4.2 Run

Run the following code. The automatic download may be very slow; to speed things up, you can manually download the Chinese model data and place it under the Python installation path Python36\Lib\site-packages\espnet_model_zoo\a1dd2b872b48358daa6e136d4a5ab08b

import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack("kamo-naoyuki/aishell_conformer"),
    # Decoding parameters are not included in the model file
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1,
)

audio, rate = soundfile.read("zh.wav")
nbests = speech2text(audio)
text, *_ = nbests[0]
print(text)

The Baidu demo is recognized correctly.

espnet_zh_demo_1

Testing with my own recording, "The Art of Speaking" was recognized as "Constraints of Speaking".

espnet_zh_demo_2

5. WeNet

WeNet is an open-source speech recognition toolkit for industrial applications, developed by the speech team at Mobvoi and the speech lab of Northwestern Polytechnical University. It provides a simple, one-stop solution covering speech recognition from training to deployment. Its main features are as follows:

  • Uses the Conformer network structure with joint CTC/attention loss optimization, achieving industry-leading recognition accuracy.
  • Provides solutions for direct deployment in the cloud and on devices, minimizing the engineering work between model training and productization.
  • The framework is simple: model training is built entirely on the PyTorch ecosystem and does not depend on installing complex tools such as Kaldi.
  • Detailed annotations and documentation, ideal for learning the fundamentals and implementation details of end-to-end speech recognition.

5.1 Installation

pip install wenetruntime

5.2 Run

import sys
import wenetruntime as wenet

wav_file = sys.argv[1]
decoder = wenet.Decoder(lang='chs')
ans = decoder.decode_wav(wav_file)
print(ans)

If the automatic download is too slow, you can manually download the file into the C:\Users\supre\.wenet\chs directory; remember to decompress chs.tar.gz, not final.zip.

Testing with the Baidu demo: very accurate.

wenet_zh_demo_1

Testing with my own recording gave the following error:

wenet_error_1

It seems only mono input is supported, but my recording was stereo. Solution: record with Audacity, set the track to mono and the sampling rate to 16000 Hz, and keep the default encoder, otherwise the following error appears:
wenet_error_2
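The same conversion can also be scripted instead of done in Audacity. A sketch with the standard wave module plus NumPy (it downmixes by averaging and resamples by linear interpolation, which is cruder than Audacity's resampler but fine for a quick test; assumes 16-bit PCM input):

```python
import wave
import numpy as np

def to_mono_16k(src_path, dst_path, target_rate=16000):
    """Downmix a 16-bit PCM WAV to mono and resample to 16 kHz."""
    with wave.open(src_path, "rb") as f:
        assert f.getsampwidth() == 2, "expects 16-bit PCM"
        rate = f.getframerate()
        raw = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
        mono = raw.reshape(-1, f.getnchannels()).mean(axis=1)  # downmix

    # Naive linear-interpolation resampling to the target rate.
    n_out = int(len(mono) * target_rate / rate)
    idx = np.linspace(0, len(mono) - 1, n_out)
    out = np.interp(idx, np.arange(len(mono)), mono).astype(np.int16)

    with wave.open(dst_path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(target_rate)
        f.writeframes(out.tobytes())
```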

The end result is still very accurate.

wenet_zh_demo_2

6. DeepSpeech

DeepSpeech is an open-source speech-to-text (STT) engine that uses a model trained with machine learning techniques, based on Baidu's Deep Speech research papers. The project uses Google's TensorFlow to make the implementation easier.

6.1 Installation

pip install deepspeech

6.2 Run

Download the Chinese model: deepspeech-0.9.3-models-zh-CN.pbmm , deepspeech-0.9.3-models-zh-CN.scorer

deepspeech --model deepspeech-0.9.3-models-zh-CN.pbmm --scorer deepspeech-0.9.3-models-zh-CN.scorer --audio zh.wav
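DeepSpeech also exposes a Python API in addition to the CLI. The model expects a 16-bit, 16 kHz mono audio buffer; here is a sketch of loading audio for it (the Model/enableExternalScorer/stt calls follow the deepspeech 0.9 docs and are shown commented out since they need the model files):

```python
import wave
import numpy as np

def load_audio_for_stt(path):
    """Read a 16-bit PCM WAV into the int16 NumPy buffer that
    DeepSpeech's Model.stt() expects (mono, 16 kHz assumed)."""
    with wave.open(path, "rb") as f:
        assert f.getsampwidth() == 2 and f.getnchannels() == 1
        return np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

# Usage with the deepspeech package (requires the downloaded models):
# from deepspeech import Model
# model = Model("deepspeech-0.9.3-models-zh-CN.pbmm")
# model.enableExternalScorer("deepspeech-0.9.3-models-zh-CN.scorer")
# print(model.stt(load_audio_for_stt("zh.wav")))
```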

Testing with the Baidu demo produced an error.

deepspeeh_zh_demo_1

Using my own recording, "The Art of Speaking" was recognized as "A Cup of Orchids".

deepspeeh_zh_demo_2

Testing with DeepSpeech's own sample audio, "he has two older brothers" was recognized as "there will be two older brothers".

deepspeeh_zh_demo_3


Origin: blog.csdn.net/qq_23091073/article/details/126627958