Article directory
Introduction - Speech Recognition ASR
Reference blog .
In traditional speech recognition based on GMM-HMM, the unit smaller than phone is state. Generally, each phoneme consists of three states, and in particular, silence (SIL) consists of five states. The state mentioned here refers to the hidden state in the HMM, and each frame of data refers to the observed value in the HMM. Each state can be represented by a GMM model (the parameters of this GMM model are obtained through training). When identifying, put the eigenvalue corresponding to each frame of data into the GMM of each state to calculate the probability, and the one with the highest probability is the state corresponding to this frame. Then get the phoneme from the state (HMM is responsible), get the word from the phoneme (the dictionary model is responsible), get the sentence from the word (the language model is responsible), and finally complete the recognition.
1. Kaldi compilation process
Compiling kaldi for the first time is likely to lack various things, it is best to have administrator privileges to install.
## 下载
git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream
cd kaldi
## 编译tools
cd tools
extras/check_dependencies.sh //缺什么就安装什么,需要管理员权限
make
## 编译src
cd ../src
./configure --shared
make depend -j 8
make -j 8
2. librispeech example
Kaldi itself has built-in asr examples of many corpora. The librispeech example is a common English corpus with a total of 960 hours of data. In addition, the commonly used Chinese corpus is aishell2, which needs to be applied for. Follow the training process to view the generated files.
Open kaldi/egs/librispeech/s5, where cmd.sh is the configuration related to the cluster. If it is a stand-alone training, change it to
export train_cmd=run.pl
export decode_cmd=run.pl
export mkgraph_cmd=run.pl
Then there is the main training script, run.sh, modify the data in the first line to the path of the corpus that you are going to store.
The script consists of 20 stages, which can be run on the command line one by one to observe what is generated.
Step 1, download the corpus and dictionary, or you can download it yourself in openslr , there are many open source ASR corpora.
if [ $stage -le 1 ]; then
for part in dev-clean test-clean dev-other test-other train-clean-100; do
local/download_and_untar.sh $data $data_url $part
done
local/download_lm.sh $lm_url data/local/lm
fi
Step 2, to restructure the data into the form required by kaldi, will generate a folder for each set.
if [ $stage -le 2 ]; then
for part in dev-clean test-clean dev-other test-other train-clean-100; do
local/data_prep.sh $data/LibriSpeech/$part data/$(echo $part | sed s/-/_/g)
done
fi
In each folder, the more important files are text, wav.scp, utt2spk, spk2utt, feats.scp, cmvn.scp.
Among them, the first three items need to be prepared manually, and the latter items can be automatically generated according to the first three items.
$ls data/train_clean_100
cmvn.scp conf feats.scp frame_shift spk2gender spk2utt split20 text utt2dur utt2num_frames utt2spk wav.scp
# text <utterance-id> <text>
# 第一个为句子的id,若有说话人信息应该把说话人的编号(speaker-id)作为话语编号的前缀,以便排序;
# 第二个为转录文本,这些词不一定都在词典里,不在的词会被映射到data/lang/oov.txt文件的特定词。
$head -3 train_clean_100/text
103-1240-0000 CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED
103-1240-0001 THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT
103-1240-0002 FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR
# wav.scp <recording-id> <extended-filename>
# 第一个为记录的语音id,当没有segments文件时它等于utterance-id;
# 第二个为文件路径,也可以是提取路径的命令。
$head -3 train_clean_100/wav.scp
103-1240-0000 flac -c -d -s /home/fwq/Project/kaldi/kaldi/data/LibriSpeech/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac |
103-1240-0001 flac -c -d -s /home/fwq/Project/kaldi/kaldi/data/LibriSpeech/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac |
103-1240-0002 flac -c -d -s /home/fwq/Project/kaldi/kaldi/data/LibriSpeech/LibriSpeech/train-clean-100/103/1240/103-1240-0002.flac |
# utt2spk <utterance-id> <speaker-id>
# 若无说话人信息,让speaker-id=utterance-id,但不要设置一个全局的speaker-id,会导致训练时倒谱均值归一化无效。
$head -3 train_clean_100/utt2spk
103-1240-0000 103-1240
103-1240-0001 103-1240
103-1240-0002 103-1240
# spk2utt <speaker-id> <utterance-id1> <utterance-id2> ....
# 可通过如下命令提取,行数一般比utt2spk少,为说话人个数。
# $utils/utt2spk_to_spk2utt.pl data/train_clean_100/utt2spk > data/train_clean_100/spk2utt
$head -3 train_clean_100/spk2utt
103-1240 103-1240-0000 103-1240-0001 103-1240-0002 103-1240-0003 ....
103-1241 103-1241-0000 103-1241-0001 103-1241-0002 103-1241-0003 ....
1034-121119 1034-121119-0000 1034-121119-0001 1034-121119-0002 1034-121119-0003 ....
# feats.scp <utterance-id> <extended-filename-of-features>
# 提取的mfcc路径,第一行的14表示从14个位置读起
# $steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train_clean_100 exp/make_mfcc/train_clean_100 $mfccdir
$head -3 train_clean_100/feats.scp
103-1240-0000 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/raw_mfcc_train_clean_100.1.ark:14
103-1240-0001 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/raw_mfcc_train_clean_100.1.ark:18444
103-1240-0002 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/raw_mfcc_train_clean_100.1.ark:39292
# cmvn.scp <speaker-id> <extended-filename-of-cmvn>
# 说话人的倒谱归一化均值和方差的统计信息
# $steps/compute_cmvn_stats.sh data/train_clean_100 exp/make_mfcc/train_clean_100 $mfccdir
$head -3 train_clean_100/cmvn.scp
103-1240 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/cmvn_train_clean_100.ark:9
103-1241 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/cmvn_train_clean_100.ark:257
1034-121119 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/cmvn_train_clean_100.ark:508
步骤3,准备词典,并生成语言模型保存于 data/lang_nosp。
if [ $stage -le 3 ]; then
local/prepare_dict.sh --stage 3 --nj 30 --cmd "$train_cmd" \
data/local/lm data/local/lm data/local/dict_nosp
utils/prepare_lang.sh data/local/dict_nosp \
"<UNK>" data/local/lang_tmp_nosp data/lang_nosp
local/format_lms.sh --src-dir data/lang_nosp data/local/lm
fi
Focus on the language model folder.
$ls lang_nosp
L.fst L_disambig.fst oov.int oov.txt phones phones.txt topo words.txt
# 包含音素集的信息,用 utils/prepare_lang.sh 生成。
$ls lang_nosp/phones
align_lexicon.int context_indep.txt extra_questions.int nonsilence.txt roots.int silence.csl wdisambig_phones.int
align_lexicon.txt disambig.csl extra_questions.txt optional_silence.csl roots.txt silence.int wdisambig_words.int
context_indep.csl disambig.int nonsilence.csl optional_silence.int sets.int silence.txt word_boundary.int
context_indep.int disambig.txt nonsilence.int optional_silence.txt sets.txt wdisambig.txt word_boundary.txt
# 音素和单词,在整数和文本形式之间来回映射。
$head -3 lang_nosp/phones.txt
<eps> 0
SIL 1
SIL_B 2
$head -5 lang_nosp/words.txt
<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
A 4
# L.fst 是有限状态机形式的词典,输入音素符号,输出词符号。
# L_disambig.fst 是包含了歧义符号`#1, #2`等的词典。
# 仅一行,超出词典范围的符号及其对应的整数形式
$cat lang_nosp/oov.txt
<UNK>
$cat lang_nosp/oov.int
3
# 定义了HMM的拓扑。
$cat lang_nosp/topo
....
<TopologyEntry>
<ForPhones>
1 2 3 4 5 6 7 8 9 10
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
....
Step 4, expand into ternary and quaternary language models, and generate two new language model folders.
if [ $stage -le 4 ]; then
# Create ConstArpaLm format language model for full 3-gram and 4-gram LMs
utils/build_const_arpa_lm.sh data/local/lm/lm_tglarge.arpa.gz \
data/lang_nosp data/lang_nosp_test_tglarge
utils/build_const_arpa_lm.sh data/local/lm/lm_fglarge.arpa.gz \
data/lang_nosp data/lang_nosp_test_fglarge
fi
$ls lang_nosp_test_tglarge/
G.carpa L.fst L_disambig.fst oov.int oov.txt phones phones.txt topo words.txt
$ls lang_nosp_test_fglarge/
G.carpa L.fst L_disambig.fst oov.int oov.txt phones phones.txt topo words.txt
Then look at the model file.
Step 8, train the single factor model and decode it.
if [ $stage -le 8 ]; then
# 训单音素HMM模型,保存于exp/mono
steps/train_mono.sh --boost-silence 1.25 --nj 20 --cmd "$train_cmd" \
data/train_2kshort data/lang_nosp exp/mono
# decode using the monophone model
(
# 构建HCLG解码图
# data/lang_nosp_test_tgsmall/L_disambig.fst + data/lang_nosp_test_tgsmall/G.fst -> data/lang_nosp_test_tgsmall/tmp/LG.fst
# data/lang_nosp_test_tgsmall/tmp/LG.fst + data/lang_nosp_test_tgsmall/tmp/ilabels_3_1(消歧符) -> data/lang_nosp_test_tgsmall/tmp/CLG_3_1.fst
# data/lang_nosp_test_tgsmall/tmp/CLG_3_1.fst + exp/mono/graph_nosp_tgsmall/Ha.fst(由make-h-transducer形成) -> exp/mono/graph_nosp_tgsmall/HCLGa.fst
# exp/mono/graph_nosp_tgsmall/HCLGa.fst + 自循环add_self_loops -> exp/mono/graph_nosp_tgsmall/HCLG.fst
utils/mkgraph.sh data/lang_nosp_test_tgsmall \
exp/mono exp/mono/graph_nosp_tgsmall
for test in test_clean test_other dev_clean dev_other; do
steps/decode.sh --nj 20 --cmd "$decode_cmd" exp/mono/graph_nosp_tgsmall \
data/$test exp/mono/decode_nosp_tgsmall_$test
done
)&
fi
Step 9: After aligning with the monophone model, train the triphone model.
if [ $stage -le 9 ]; then
# 将每一个特征向量都对应到了具体的 phone 的状态上,每一段utt对应一串表示状态变化的 transition_id
# exp/mono -> exp/mono_ali_5k
steps/align_si.sh --boost-silence 1.25 --nj 10 --cmd "$train_cmd" \
data/train_5k data/lang_nosp exp/mono exp/mono_ali_5k
# 训三音素模型,保存于exp/tri1
# 一个音素在不同上下文会有不同的发音,三音素模型对l_a_i和l_a_n用不同的GMM建模,并同样映射到a。
# 2000为决策树的叶子数,10000为总高斯数。
steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" \
2000 10000 data/train_5k data/lang_nosp exp/mono_ali_5k exp/tri1
fi
Step 10, after aligning with the triphone model, train the LDA+MLLT model.
if [ $stage -le 10 ]; then
# 继续对齐
steps/align_si.sh --nj 10 --cmd "$train_cmd" \
data/train_10k data/lang_nosp exp/tri1 exp/tri1_ali_10k
# 训LDA+MLLT模型,在mfcc特征提取出来后,将相邻几个帧拼接起来,降到40维,用LDA 去评估,经过多次迭代,最后使用对角变换,用转换后的特征去训练。
steps/train_lda_mllt.sh --cmd "$train_cmd" \
--splice-opts "--left-context=3 --right-context=3" 2500 15000 \
data/train_10k data/lang_nosp exp/tri1_ali_10k exp/tri2b
fi
Step 11, after aligning with the LDA+MLLT model, train the LDA+MLLT+SAT model.
if [ $stage -le 11 ]; then
### 继续对齐
steps/align_si.sh --nj 10 --cmd "$train_cmd" --use-graphs true \
data/train_10k data/lang_nosp exp/tri2b exp/tri2b_ali_10k
### 训LDA+MLLT+SAT模型,是训说话人自适应(Speaker Adaptive Training)的,同样是特征转换后再训。
steps/train_sat.sh --cmd "$train_cmd" 2500 15000 \
data/train_10k data/lang_nosp exp/tri2b_ali_10k exp/tri3b
fi
Step 12: On the 100h clean data, after aligning with the LDA+MLLT+SAT model, train the LDA+MLLT+SAT model.
if [ $stage -le 12 ]; then
# 先预对齐一次,计算 fmllr transforms, 再用预对齐和 fmllr 一起计算最终对齐。
steps/align_fmllr.sh --nj 20 --cmd "$train_cmd" \
data/train_clean_100 data/lang_nosp \
exp/tri3b exp/tri3b_ali_clean_100
# 训LDA+MLLT+SAT模型。
steps/train_sat.sh --cmd "$train_cmd" 4200 40000 \
data/train_clean_100 data/lang_nosp \
exp/tri3b_ali_clean_100 exp/tri4b
fi
The complete process can refer to the previous blog .