Foreword
Excerpted from Baidu Encyclopedia
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the lexical content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. This differs from speaker recognition and speaker verification, which attempt to identify or verify the speaker rather than the lexical content of the speech.
Speech recognition is one of the application areas of deep learning, and there are many projects on GitHub that implement ASR. Some projects that support Chinese ASR are listed below; the sections that follow demonstrate their basic usage.
- https://github.com/PaddlePaddle/PaddleSpeech
- https://github.com/nl8590687/ASRT_SpeechRecognition
- https://github.com/nobody132/masr
- https://github.com/espnet/espnet
- https://github.com/wenet-e2e/wenet
- https://github.com/mozilla/DeepSpeech
1. PaddleSpeech
PaddleSpeech is an open-source speech model library based on PaddlePaddle. It is used for developing a variety of key tasks in speech and audio, and contains a large number of influential, cutting-edge deep-learning models.
PaddleSpeech won the NAACL 2022 Best Demo Award; see the arXiv paper for details.
1.1 Installation
Simply follow the official documentation to install it: https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/README_cn.md#%E5%AE%89%E8%A3%85
The project officially recommends installing PaddleSpeech with Python 3.7 or later in a Linux environment.
My machine runs Windows, however, with Python 3.6.5, so I installed Python 3.7.0 for Windows from https://www.python.org/downloads/windows/ . The following errors came up during installation:
pip install paddlespeech
Problem 1: no matching version of paddlespeech-ctcdecoders
Could not find a version that satisfies the requirement paddlespeech-ctcdecoders (from paddlespeech) (from versions: )
No matching distribution found for paddlespeech-ctcdecoders (from paddlespeech)
It turns out that paddlespeech-ctcdecoders has no Windows build, although someone has already compiled a Windows version of it. The official note, however, is that it does not matter if paddlespeech-ctcdecoders fails to install: the package is optional.
Problem 2: gbk encoding error
UnicodeDecodeError: 'gbk' codec can't decode byte 0x80 in position 5365: illegal multibyte sequence
This error disappeared when I ran the command again.
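The likely root cause is that on Chinese-locale Windows, Python's `open()` defaults to the gbk codec, so any dependency that reads a UTF-8 file without specifying an encoding can hit this error. A minimal reproduction-and-fix sketch (the file name is illustrative):

```python
from pathlib import Path

# Write a UTF-8 file containing multibyte characters.
path = Path("example.txt")
path.write_text("你好, PaddleSpeech", encoding="utf-8")

# On a Chinese-locale Windows machine, reading this file without an
# encoding argument falls back to gbk and raises UnicodeDecodeError.
# Passing encoding="utf-8" explicitly avoids the locale default.
text = path.read_text(encoding="utf-8")
print(text)
```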
1.2 Run
File not found error:
FileNotFoundError: [WinError 2] 系统找不到指定的文件。 (The system cannot find the file specified.)
Workaround: open the Python\Python37\lib\subprocess.py file and change shell=False to shell=True on line 684:
_child_created = False # Set here since __del__ checks it
def __init__(self, args, bufsize=-1, executable=None,
stdin=None, stdout=None, stderr=None,
preexec_fn=None, close_fds=True,
shell=True, cwd=None, env=None, universal_newlines=None,
startupinfo=None, creationflags=0,
restore_signals=True, start_new_session=False,
pass_fds=(), *, encoding=None, errors=None, text=None):
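Note that patching subprocess.py changes the default for every Python program on the machine. A more contained alternative, if you can edit the failing call site instead, is to pass shell=True just for that call, which routes the command through the system shell so it can resolve commands on PATH and shell built-ins:

```python
import subprocess

# With shell=True the command string is handed to the system shell
# (cmd.exe on Windows, /bin/sh elsewhere), which locates the program;
# with shell=False the executable must be found directly.
result = subprocess.run("echo hello", shell=True,
                        capture_output=True, text=True)
print(result.stdout.strip())  # hello
```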
Test with the official example:
paddlespeech asr --lang zh --input C:\Users\supre\Desktop\sound\PaddleSpeech-develop\zh.wav
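The pretrained Chinese model generally expects 16 kHz mono PCM input (the later examples also use a file named input_16k.wav). A small stdlib sketch for checking a recording before feeding it to the CLI; a silent clip is generated here only to keep the example self-contained:

```python
import wave

def check_wav(path):
    """Return (channels, sample_rate) for a PCM wav file."""
    with wave.open(path, "rb") as f:
        return f.getnchannels(), f.getframerate()

# Generate one second of 16 kHz mono silence as a stand-in recording;
# in practice, point check_wav at your own wav file.
with wave.open("probe.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setsampwidth(2)      # 16-bit samples
    f.setframerate(16000)  # 16 kHz
    f.writeframes(b"\x00\x00" * 16000)

print(check_wav("probe.wav"))  # (1, 16000)
```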
I tested with two of my own recordings and found the recognition to be very accurate.
1.3 More functions
In addition to speech recognition, PaddleSpeech supports several other functions:
- Sound classification
paddlespeech cls --input input.wav
- Voiceprint recognition
paddlespeech vector --task spk --input input_16k.wav
- Speech translation (English to Chinese; not yet supported on Windows)
paddlespeech st --input input_16k.wav
- Speech synthesis
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
In one test of "Hello, aabond, welcome to Baidu PaddlePaddle", the word "aabond" produced no sound. My guess is that this model cannot synthesize English and may need other parameters.
- Punctuation restoration
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
2. ASRT
ASRT is a deep-learning-based Chinese speech recognition system, implemented with TensorFlow (Keras) using deep convolutional neural networks, long short-term memory networks, an attention mechanism, and CTC.
2.1 Installation
Download the server code and install the dependencies:
$ pip install tensorflow==2.5.2
$ pip install wave
$ pip install matplotlib
$ pip install requests
$ pip install scipy
$ pip install flask
$ pip install waitress
Download the client.
2.2 Run
python asrserver_http.py
Using the Baidu demo as input, the result is displayed correctly.
Using my own recording, "The Art of Speaking" was recognized as "Up" and not displayed correctly. Is it because I speak too fast? It probably needs more training data.
Real-time speech recognition is also available; I found its accuracy to be average.
3. MASR
MASR is an end-to-end deep neural network based Mandarin Chinese speech recognition project.
MASR uses a gated convolutional neural network (Gated Convolutional Network), with a structure similar to the Wav2Letter model proposed by Facebook in 2016. The activation function, however, is neither ReLU nor HardTanh but GLU (gated linear unit), hence the name "gated convolutional network". According to my experiments, GLU converges faster than HardTanh. This project can serve as a reference if you want to study how well convolutional networks work for speech recognition.
3.1 Installation
Download the source code and the pre-trained model data, create a pretrained directory under the source directory, and put the model files into it.
3.2 Run
python examples/demo-record-recognize.py
Recognition result:
"The Art of Speaking" was recognized as "Speaking and Yishu"
"Just because you are so beautiful" was recognized as "Just because of Lipaiyun"
4. ESPnet
ESPnet is an end-to-end speech processing toolkit covering end-to-end speech recognition, text-to-speech, speech translation, speech enhancement, speaker diarization, spoken language understanding, and more. ESPnet uses PyTorch as its deep learning engine and follows Kaldi-style data processing, feature extraction/formatting, and recipes to provide a complete setup for a wide range of speech processing experiments.
4.1 Installation
pip install espnet
pip install espnet_model_zoo
pip install kenlm
4.2 Run
Run the following code. The model download may be slow; to speed things up, you can manually download the Chinese model data and place it under the Python installation path Python36\Lib\site-packages\espnet_model_zoo\a1dd2b872b48358daa6e136d4a5ab08b
import soundfile
from espnet_model_zoo.downloader import ModelDownloader
from espnet2.bin.asr_inference import Speech2Text

d = ModelDownloader()
speech2text = Speech2Text(
    **d.download_and_unpack("kamo-naoyuki/aishell_conformer"),
    # Decoding parameters are not included in the model file
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1
)

audio, rate = soundfile.read("zh.wav")
nbests = speech2text(audio)
text, *_ = nbests[0]
print(text)
The Baidu demo is recognized correctly.
Testing my own recording, "The Art of Speaking" was recognized as "Constraints of Speaking".
5. WeNet
WeNet is an open-source speech recognition toolkit for industrial applications, developed by Mobvoi's speech team and the speech lab of Northwestern Polytechnical University. It provides a simple, one-stop solution covering speech recognition from training to deployment. Its main features are as follows:
- Uses a Conformer network structure with joint CTC/attention loss optimization, achieving industry-leading recognition accuracy.
- Provides solutions for direct deployment in the cloud and on devices, minimizing the engineering work between model training and productization.
- The framework is simple: the model training part is built entirely on the PyTorch ecosystem and does not require installing complex tools such as Kaldi.
- Detailed annotations and documentation, ideal for learning the fundamentals and implementation details of end-to-end speech recognition.
5.1 Installation
pip install wenetruntime
5.2 Run
import sys
import wenetruntime as wenet
wav_file = sys.argv[1]
decoder = wenet.Decoder(lang='chs')
ans = decoder.decode_wav(wav_file)
print(ans)
If downloading the file is too slow, you can manually download the file to the C:\Users\supre\.wenet\chs directory, remember to decompress chs.tar.gz instead of final.zip.
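The extraction step can also be scripted with the standard library. The sketch below builds a stand-in archive so it runs end to end; in practice you would download chs.tar.gz into the .wenet\chs directory and extract it there (the file names inside the archive are illustrative):

```python
import tarfile
from pathlib import Path

# Stand-in for the downloaded model archive.
Path("final.txt").write_text("dummy model file")
with tarfile.open("chs.tar.gz", "w:gz") as tar:
    tar.add("final.txt")

# Unpack the .tar.gz; wenetruntime reads the unpacked model directory,
# not the archive itself.
with tarfile.open("chs.tar.gz", "r:gz") as tar:
    tar.extractall("chs")

print(sorted(p.name for p in Path("chs").iterdir()))  # ['final.txt']
```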
Testing with the Baidu demo, the result is very accurate.
Testing with my own recording, the following error appeared:
The decoder apparently accepts only mono input, but my recording was stereo. Solution: record with Audacity, set the channel to mono and the sampling rate to 16000 Hz, and keep the default encoder; otherwise the error above appears.
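If you would rather not re-record, the channel fix can also be done in code. A stdlib sketch that averages the two channels of a 16-bit stereo wav into mono (it does not resample, so the 16 kHz requirement still has to be handled separately; file names are illustrative):

```python
import array
import wave

def stereo_to_mono(src, dst):
    """Average the two channels of a 16-bit stereo wav into a mono wav."""
    with wave.open(src, "rb") as f:
        assert f.getnchannels() == 2 and f.getsampwidth() == 2
        rate = f.getframerate()
        samples = array.array("h", f.readframes(f.getnframes()))
    # Samples are interleaved L, R, L, R, ... -> average each pair.
    mono = array.array("h", ((samples[i] + samples[i + 1]) // 2
                             for i in range(0, len(samples), 2)))
    with wave.open(dst, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(mono.tobytes())

# Build a small stereo clip so the example is self-contained.
with wave.open("stereo.wav", "wb") as f:
    f.setnchannels(2)
    f.setsampwidth(2)
    f.setframerate(16000)
    f.writeframes(array.array("h", [100, 200] * 1000).tobytes())

stereo_to_mono("stereo.wav", "mono.wav")
```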
The end result is still very accurate.
6. DeepSpeech
DeepSpeech is an open-source speech-to-text (STT) engine whose model is trained with machine learning techniques based on Baidu's Deep Speech research paper. The project uses Google's TensorFlow to simplify the implementation.
6.1 Installation
pip install deepspeech
6.2 Run
Download the Chinese model: deepspeech-0.9.3-models-zh-CN.pbmm , deepspeech-0.9.3-models-zh-CN.scorer
deepspeech --model deepspeech-0.9.3-models-zh-CN.pbmm --scorer deepspeech-0.9.3-models-zh-CN.scorer --audio zh.wav
Testing with the Baidu demo produced an error.
Using my own recording, "The Art of Speaking" was recognized as "A Cup of Orchids".
Using DeepSpeech's own sample data, "he has two older brothers" was recognized as "there will be two older brothers".