Kaldi audio tools: deployment and modeling research notes

 

About Speech Recognition

Speech recognition technology, also known as automatic speech recognition (ASR), computer speech recognition, or speech-to-text (STT), aims to have a computer automatically convert the content of human speech into the corresponding text.

Speech recognition tasks can be classified along several dimensions:

  • By vocabulary size:
    • Small vocabulary: dozens of words
    • Medium vocabulary: hundreds to thousands of words
    • Large vocabulary: thousands to tens of thousands of words
  • By speaking style:
    • Isolated words
    • Continuous speech
  • By acoustic environment:
    • Studio recordings
    • Various levels of environmental noise
  • By speaker:
    • Speaker-dependent
    • Speaker-independent

  • Phonemes (Phoneme): the pronunciation of a word is built from phonemes. For English, the commonly used set is the CMU phoneme set, consisting of 39 phonemes. For Chinese, all initials and finals are generally used directly as phonemes, and tonal speech recognition additionally considers tones.

The CMU Pronouncing Dictionary.
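For reference, this is what entries in the CMU dictionary look like; the query below assumes a local copy of the cmudict-0.7b file (the path is an assumption), and the digits on vowels mark stress:

[root@localhost ~]# grep -E "^(HELLO|WORLD) " cmudict-0.7b
HELLO  HH AH0 L OW1
WORLD  W ER1 L D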

  • Acoustic model  : integrates knowledge of acoustics and phonetics (Phonetics); it takes the features produced by feature extraction as input and generates an acoustic model score for a variable-length feature sequence.
  • Language model  : learns the relationships between words from a training corpus (usually text) to estimate the likelihood of hypothesized word sequences, also called the language model score.
  • GMM  : Gaussian Mixture Model, a statistical model that describes speech features based on the Fourier spectrum; the conventional choice for acoustic modeling.
  • HMM  : Hidden Markov Model, describes a Markov process with hidden, unknown parameters. The difficulty is determining the hidden parameters of the process from the observable parameters; these parameters are then used for further analysis, such as pattern recognition.
  • MFCCs  : Mel-Frequency Cepstral Coefficients, the coefficients that make up the mel-frequency cepstrum, which is derived from the cepstrum of an audio clip. The difference between the mel cepstrum and the ordinary cepstrum is that the band division of the mel cepstrum is equally spaced on the mel scale, which approximates the human auditory system more closely than the linearly spaced bands of the normal cepstrum. Widely used in speech recognition; see the feature-extraction sketch after this list.
  • Fbank  : Mel-frequency filter bank features.
  • WER  : Word Error Rate, the most common metric for measuring the performance of a speech recognition system.
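As a concrete sketch of feature extraction, Kaldi ships the compute-mfcc-feats and compute-fbank-feats binaries; the wav.scp path and the output archive names below are illustrative assumptions:

[root@localhost s5]# compute-mfcc-feats scp:data/train/wav.scp ark:mfcc.ark
[root@localhost s5]# compute-fbank-feats scp:data/train/wav.scp ark:fbank.ark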

GMM-HMM basic principles

[Figure: schematic block diagram of a speech recognition system]

A speech recognition system consists of several core components: signal processing and feature extraction, acoustic model training, language model training, and the recognition (decoding) engine. The figure shows a schematic block diagram of speech recognition:

  • Feature extraction  : the first step of speech recognition. It removes information in the speech signal that is useless for recognition (such as background noise) and keeps the information that reflects the essential characteristics of the speech, extracting feature vectors suitable for the acoustic model that follows; MFCC is the most common feature extraction algorithm.
  • Acoustic model training  : trains the acoustic model parameters on the feature parameters of a speech training database. At recognition time, the feature parameters of the speech to be recognized are matched against the acoustic model to obtain the recognition result. Current mainstream speech recognition systems use HMM-based acoustic modeling.
  • Language model training  : the language model computes the probability that a sentence occurs and is mainly used to decide which word sequence is more likely. Language knowledge is divided into three levels: dictionary, grammar, and syntax. The language model is obtained by statistical training on a text corpus with grammatical and semantic analysis.
  • Speech decoding and search  : the decoder takes the input speech signal, builds a recognition network from the trained acoustic model, language model, and dictionary, and then uses a search algorithm to find the best path through that network, i.e. the word string that outputs the speech signal with maximum probability, thereby determining the text for this speech sample. A minimal decoding sketch follows this list.
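As a rough sketch of the decoding step, Kaldi's GMM decoder combines exactly these pieces: the acoustic model (final.mdl), the graph built from the language model and dictionary (HCLG.fst), the word symbol table (words.txt), and precomputed features. The file paths here are illustrative assumptions:

[root@localhost s5]# gmm-latgen-faster --word-symbol-table=words.txt final.mdl HCLG.fst ark:feats.ark ark:lat.ark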

DNN-HMM (Neural Network)

With the development of neural network technology, speech recognition models based on DNN-HMMs emerged later; in short, the GMMs used before are replaced by DNNs, which have better modeling capability.

Mainstream DNN models include FFDNN, CDDNN, TDNN, RNN, etc., along with some training tricks that can be applied.

https://blog.csdn.net/Magical_Bubble/article/details/90674521

This document uses the GMM-HMM approach for model training.

Why study the Kaldi toolkit


Kaldi's architecture as shown below:

[Figure: Kaldi architecture diagram]

kaldi deployment

Environment dependencies

  • gcc 4.8 or above
  • patch
  • make
  • automake
  • autoconf
  • zlib, zlib-devel
  • gdbm
  • bzip2
  • sqlite
  • openssl-devel
  • readline
  • python3

Here python3 was compiled and installed from source; it does not replace the existing Python 2.7. A python3 symlink is created directly under /usr/bin.
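A minimal sketch of installing these dependencies on CentOS and creating the python3 symlink; the package names and the /usr/local/bin install prefix for python3 are assumptions that may need adjusting:

[root@localhost ~]# yum install -y gcc patch make automake autoconf zlib zlib-devel gdbm bzip2 sqlite openssl-devel readline
[root@localhost ~]# ln -s /usr/local/bin/python3 /usr/bin/python3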


Kaldi installation

Unzip kaldi-master.zip and enter the tools directory:

[root@localhost mnt]# cd kaldi-master/tools/

Check the environment requirements:

[root@localhost tools]# cat INSTALL


Run the dependency check; when the environment is satisfied, the script reports OK as shown below.
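The check referred to here is the script shipped with Kaldi in tools/:

[root@localhost tools]# extras/check_dependencies.sh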


View the number of CPU cores.
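For example, either of these standard commands reports the core count:

[root@localhost tools]# nproc
[root@localhost tools]# grep -c ^processor /proc/cpuinfo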


Start compiling, with the number of parallel jobs matching the core count:

[root@localhost tools]# make -j 8


[root@localhost tools]# ./extras/install_irstlm.sh

Install the language model toolkit.


Next, configure and compile Kaldi itself. Compilation takes a while, and some build errors may appear; investigate them one by one according to the actual error messages.

[root@localhost tools]# cd ../src/

[root@localhost src]# ./configure

[root@localhost src]# make depend

[root@localhost src]# make

Check the executables produced by a successful build:

[root@localhost kaldi-master]# cd src/bin

[root@localhost bin]# ls


Kaldi directory layout


  • ./tools contains the packages Kaldi depends on
  • ./src holds the Kaldi source code
  • ./egs holds training recipes (shell scripts) for public speech datasets, along with test results
    • s5/run.sh contains all the training steps on that dataset: data preprocessing, training and testing of gmm/dnn/lstm/tdnn models, and scripts for collecting experiment statistics. In theory, as long as the environment is configured correctly, running run.sh completes the entire training procedure.
    • s5/RESULTS stores the most recent experiment results
    • s5/conf holds the configuration files used for training
    • s5/{local, steps, utils} hold the script files used by run.sh

Kaldi currently provides three code bases for deep neural networks. The first is "nnet1" (under nnet/ and nnetbin/), originally maintained by Karel Vesely; the second, "nnet2" (under nnet2/ and nnet2bin/), was originally maintained by Daniel Povey; the third, "nnet3" (under nnet3/ and nnet3bin/), evolved from Daniel's nnet2.




Verifying the basic demo

[root@localhost kaldi-master]# cd egs/yesno/s5/

[root@localhost s5]# ./run.sh

This downloads the training set and starts building.


When the output below appears, the run has reached this point and Kaldi is installed correctly.


WER (Word Error Rate) measures the accuracy of a speech recognition system. The formula is WER = (I + D + S) / N, where I is the number of inserted words, D the number of deleted words, and S the number of substituted words: add up everything the recognizer inserted, missed, or got wrong, and divide by the total number of words in the reference. Lower is better. For example, with a 100-word reference, I = 3, D = 5, and S = 7 give WER = 15 / 100 = 15%.
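Kaldi also ships a compute-wer tool for scoring; a minimal sketch, assuming ref.txt and hyp.txt each contain lines of the form "utt-id word1 word2 ...":

[root@localhost s5]# compute-wer --text --mode=present ark:ref.txt ark:hyp.txt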

[root@localhost s5]# cd waves_yesno/

[root@localhost waves_yesno]# ll

The generated audio files:


Building the first training model: thchs30

There are (as far as I know) four public Chinese speech datasets with Kaldi recipes:

1. aishell: 178 hours of open-source Chinese speech from the AISHELL company, with basic training scripts; see kaldi-master/egs/aishell

2. gale_mandarin: Chinese broadcast news dataset (LDC2013S08, LDC2013S08)

3. hkust: Chinese telephone dataset (LDC2005S15, LDC2005T32)

4. thchs30: Tsinghua University's 30-hour dataset, downloadable from http://www.openslr.org/18/


The thchs30 dataset is used for training here.

The downloaded data package:

 

Dataset             Audio duration (h)    Sentences    Words
train (training)    25                    10000        198252
dev (development)   2:14                  893          17743
test                6:15                  2495         49085

Data package overview

The package also contains the trained language models word.3gram.lm and phone.3gram.lm, together with the corresponding lexicon lexicon.txt.
The dev set is used for cross-validation against train in certain steps; for example, local/nnet/run_dnn.sh uses both exp/tri4b_ali and exp/tri4b_ali_cv. The training and test targets also come in two flavors: word and phone (phoneme).
1. local/thchs-30_data_prep.sh mainly generates, from $thchs/data_thchs30 (the downloaded data), for each of the three parts: word.txt (word sequences), phone.txt (phoneme sequences), text (identical to word.txt), wav.scp (audio list), utt2spk (utterance-to-speaker mapping), and spk2utt (speaker-to-utterance mapping); see the format sketch after this list.
2. #produce MFCC features extracts MFCC features in two steps: steps/make_mfcc.sh extracts the MFCC features, then steps/compute_cmvn_stats.sh computes cepstral mean and variance normalization.
3. #prepare language stuff builds a lexicon containing the words used for training and decoding. The language models have already been prepared by Professor Wang Dong; if you do not plan to change the language model, this code needs no modification.
a) The word-based language model contains 48k words with trigrams, trained on text randomly selected from the gigaword corpus; the training text contains 772,000 sentences, 18 million words, and 115 million Chinese characters in total.
b) The phone-based language model contains 218 tonal Chinese units with triphones, trained from a sample of only 2 million characters. Such a small sample was chosen deliberately so that the model retains as little linguistic information as possible, letting the measured performance reflect the quality of the acoustic model more directly.
c) Both language models were trained with the SRILM toolkit.
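For reference, a minimal sketch of what the generated mapping files look like (the utterance and speaker IDs below are illustrative, not actual thchs30 IDs):

wav.scp   one line per utterance: <utt-id> <wav-path>
          A11_0  /DATA/works/data_thchs30/train/A11_0.wav
utt2spk   <utt-id> <speaker-id>
          A11_0  A11
spk2utt   <speaker-id> <utt-id> <utt-id> ...
          A11  A11_0 A11_1 A11_2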

Starting the build

[root@localhost ~]# mkdir -p /DATA/works/

Upload the training package and unpack it.


[root@localhost ~]# cd /mnt/kaldi-master/egs/thchs30/s5/

Modify cmd.sh to use run.pl, which runs jobs on the local machine rather than submitting them to a cluster queue. The modified content is as follows:

export train_cmd=run.pl

export decode_cmd="run.pl --mem 4G"

export mkgraph_cmd="run.pl --mem 8G"

export cuda_cmd="run.pl --gpu 1"

[root@localhost s5]# vim cmd.sh


[root@localhost s5]# vim run.sh


Start building the model:

[root@localhost s5]# ./run.sh

We have no GPU to run the DNN stages, so the run will be fairly slow. // Try larger data later when resources are available.


Model output directory

After training, the model is stored under thchs30/s5/exp/tri1.

 

final.mdl is the trained, usable model. In addition, under graph_word, words.txt and HCLG.fst are the dictionary (word symbol table) and the finite state transducer, respectively. These three files are singled out because the demo below performs recognition based mainly on them.
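As a quick sanity check on these files, Kaldi's gmm-info prints a summary of the acoustic model, and words.txt is plain text (the paths assume the current directory is exp/tri1):

[root@localhost tri1]# gmm-info final.mdl
[root@localhost tri1]# head graph_word/words.txt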


Validating the model

Copy the trained model to the following path:

/mnt/kaldi-master/egs/thchs30/online_demo/online-data/models/tri1


/mnt/kaldi-master/egs/thchs30/online_demo

[root@localhost online_demo]# vim run.sh


[root@localhost online_demo]# ./run.sh
The model recognizes all the mono audio files under /mnt/kaldi-master/egs/thchs30/online_demo/online-data/audio.


The recognition results:


Algorithm pipeline

There are roughly these stages: data preparation; monophone training; tri1 triphone training; tri2b LDA+MLLT feature transforms; tri3b SAT (speaker adaptive training); tri4b quick training (which I don't fully understand either); and after that the DNN stages. The corresponding scripts are sketched below.
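These stages correspond to the standard Kaldi training scripts; a rough sketch of the sequence (the leaf and Gaussian counts are illustrative, not the exact values in the thchs30 run.sh, and the alignment steps between stages are omitted):

steps/train_mono.sh data/mfcc/train data/lang exp/mono
steps/train_deltas.sh 2500 20000 data/mfcc/train data/lang exp/mono_ali exp/tri1
steps/train_lda_mllt.sh 2500 20000 data/mfcc/train data/lang exp/tri1_ali exp/tri2b
steps/train_sat.sh 2500 20000 data/mfcc/train data/lang exp/tri2b_ali exp/tri3b
steps/train_quick.sh 4200 40000 data/mfcc/train data/lang exp/tri3b_ali exp/tri4b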

 

Run script notes


This script takes three input arguments: 1. data/mfcc/train  2. exp/make_mfcc/train  3. mfcc/train

1. data/mfcc/train contains the files produced by data preprocessing: phone.txt spk2utt text utt2spk wav.scp word.txt

2. exp/make_mfcc/train is where the run's log files are saved

3. mfcc/train holds the extracted feature files
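Putting the three arguments together, a typical invocation (the --nj parallel job count is an assumption) is:

steps/make_mfcc.sh --nj 8 --cmd "$train_cmd" data/mfcc/train exp/make_mfcc/train mfcc/train
steps/compute_cmvn_stats.sh data/mfcc/train exp/make_mfcc/train mfcc/train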

Directions for further study

 

Kaldi online recognition

Running in a virtual machine with no audio input.

To be added later.

Wrapping Kaldi for real-time speech translation

To be added later.

Building an online real-time speech recognizer with Kaldi + GStreamer

https://www.jianshu.com/p/ef7326b27786

 

Kaldi model training with GPUs

To be added later.

 

Kaldi model training with CNNs

https://blog.csdn.net/DuishengChen/article/details/50085707?locationNum=11&fps=1

To be added later.

 

Kaldi speaker recognition with i-vector models

To be added later.

https://blog.csdn.net/u011930705/article/details/85340905

https://blog.csdn.net/monsieurliaxiamen/article/details/79638227

 

CVTE open-source Chinese recognition model for Kaldi

To be added later.

https://www.jianshu.com/p/d64e70faaf1d

https://blog.csdn.net/snowdroptulip/article/details/78952428

 

 

Training a wake-word model with Kaldi

To be added later.

https://blog.csdn.net/cj1989111/article/details/88017908

FAQ

Model building error: /lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found

 

[root@localhost ~]# find / -name libstdc++.so.6    # check whether this library exists locally

 

As seen here, the library does exist locally; the problem is probably that the loader cannot find it.
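To confirm which GLIBCXX versions a given copy of the library actually provides, a standard check is:

[root@localhost ~]# strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX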

[root@localhost ~]# vim /etc/ld.so.conf

/usr/lib64/

/usr/local/lib

/usr/local/lib64

/usr/lib

 

Add the library paths above to the file, then reload the linker cache:

[root@localhost ~]# ldconfig   

 

If that does not work, create a symlink pointing to the library inside the compiled gcc tree:

[root@localhost online_demo]# ln -s /mnt/gcc/gcc-5.4.0/gcc-build/x86_64-unknown-linux-gnu/libstdc++-v3/src/.libs/libstdc++.so.6 /usr/lib64/libstdc++.so.6   
[root@localhost online_demo]# ll /usr/lib64/libstdc++.so.6   

 
