[Speech recognition] Detailed explanation of kaldi's data and model files - librispeech


Introduction - Speech Recognition ASR


In traditional GMM-HMM speech recognition, the modelling unit below the phone is the state. Each phone typically consists of three states, while silence (SIL) gets five. A state here is a hidden state of the HMM, and each frame of acoustic features is an observation. Each state is represented by a GMM whose parameters are learned during training. At recognition time, the feature vector of each frame is scored against the GMM of every state, and the state with the highest likelihood is the one assigned to that frame. The HMM then maps states to phones, the pronunciation lexicon maps phones to words, and the language model maps words to a sentence, completing the recognition.
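The per-frame scoring step above can be sketched as a toy argmax over a table of (frame, state) scores. The frame names, state names, and log-likelihoods below are invented for illustration; a real system evaluates trained GMMs for every state.

```shell
# Toy per-frame state selection. Each input line is "frame state log-likelihood";
# the scores are made up. Pick the highest-scoring state for each frame.
printf '%s\n' \
  "f1 s1 -12.0" "f1 s2 -8.5" "f1 s3 -15.1" \
  "f2 s1 -9.9"  "f2 s2 -11.2" "f2 s3 -7.3" |
awk '{ if (!($1 in best) || $3 > score[$1]) { best[$1] = $2; score[$1] = $3 } }
     END { for (f in best) print f, best[f] }' | sort
```

This prints the winning state per frame (f1 s2, f2 s3); the state-to-phone, phone-to-word, and word-to-sentence steps then run on top of such per-frame decisions.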

1. Kaldi compilation process

The first compilation of Kaldi is likely to be missing various dependencies; it is best to have administrator privileges so they can be installed.

## Download
git clone https://github.com/kaldi-asr/kaldi.git kaldi --origin upstream
cd kaldi

## Build tools
cd tools
extras/check_dependencies.sh  # install whatever it reports missing; requires admin privileges
make

## Build src
cd ../src
./configure --shared
make depend -j 8
make -j 8

2. librispeech example

Kaldi ships with ASR examples for many corpora. The librispeech example uses a common English corpus with 960 hours of data in total. The commonly used Chinese corpus is aishell2, which requires an application to obtain. Below we follow the training process and look at the files it generates.

Open kaldi/egs/librispeech/s5. Here cmd.sh holds the cluster-related configuration; for single-machine training, change it to

export train_cmd=run.pl
export decode_cmd=run.pl
export mkgraph_cmd=run.pl

Then comes the main training script, run.sh. Modify the data variable in the first line to the path where you plan to store the corpus.
The script consists of 20 stages, which can be run one by one on the command line to observe what each generates.

Step 1, download the corpus and dictionary; you can also download them yourself from openslr, which hosts many open-source ASR corpora.

if [ $stage -le 1 ]; then
  for part in dev-clean test-clean dev-other test-other train-clean-100; do
    local/download_and_untar.sh $data $data_url $part
  done
  local/download_lm.sh $lm_url data/local/lm
fi

Step 2, restructure the data into the form required by Kaldi; this generates a folder for each set.

if [ $stage -le 2 ]; then
  for part in dev-clean test-clean dev-other test-other train-clean-100; do
    local/data_prep.sh $data/LibriSpeech/$part data/$(echo $part | sed s/-/_/g)
  done
fi

In each folder, the most important files are text, wav.scp, utt2spk, spk2utt, feats.scp and cmvn.scp.
The first three must be prepared manually; the remaining ones can be generated automatically from them.

$ls data/train_clean_100
cmvn.scp  conf  feats.scp  frame_shift  spk2gender  spk2utt  split20  text  utt2dur  utt2num_frames  utt2spk  wav.scp

# text <utterance-id> <text>
# The first field is the utterance id; if speaker information is available, the speaker-id should be
# used as a prefix of the utterance id so that sorting works correctly.
# The second field is the transcript. Not every word has to be in the lexicon; out-of-vocabulary
# words are mapped to the special word in data/lang/oov.txt.
$head -3 train_clean_100/text
103-1240-0000 CHAPTER ONE MISSUS RACHEL LYNDE IS SURPRISED MISSUS RACHEL LYNDE LIVED 
103-1240-0001 THAT HAD ITS SOURCE AWAY BACK IN THE WOODS OF THE OLD CUTHBERT
103-1240-0002 FOR NOT EVEN A BROOK COULD RUN PAST MISSUS RACHEL LYNDE'S DOOR

# wav.scp <recording-id> <extended-filename>
# The first field is the recording id; when there is no segments file, it equals the utterance-id.
# The second field is a file path, or a command that produces the audio.
$head -3 train_clean_100/wav.scp
103-1240-0000 flac -c -d -s /home/fwq/Project/kaldi/kaldi/data/LibriSpeech/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac |
103-1240-0001 flac -c -d -s /home/fwq/Project/kaldi/kaldi/data/LibriSpeech/LibriSpeech/train-clean-100/103/1240/103-1240-0001.flac |
103-1240-0002 flac -c -d -s /home/fwq/Project/kaldi/kaldi/data/LibriSpeech/LibriSpeech/train-clean-100/103/1240/103-1240-0002.flac |

# utt2spk <utterance-id> <speaker-id>
# If there is no speaker information, set speaker-id = utterance-id; do not use a single global
# speaker-id, which would make cepstral mean normalization ineffective during training.
$head -3 train_clean_100/utt2spk
103-1240-0000 103-1240
103-1240-0001 103-1240
103-1240-0002 103-1240

# spk2utt <speaker-id> <utterance-id1> <utterance-id2> ....
# Can be generated with the command below; it usually has fewer lines than utt2spk, one per speaker.
# $utils/utt2spk_to_spk2utt.pl data/train_clean_100/utt2spk > data/train_clean_100/spk2utt
$head -3 train_clean_100/spk2utt
103-1240 103-1240-0000 103-1240-0001 103-1240-0002 103-1240-0003 ....
103-1241 103-1241-0000 103-1241-0001 103-1241-0002 103-1241-0003 ....
1034-121119 1034-121119-0000 1034-121119-0001 1034-121119-0002 1034-121119-0003 ....
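The inversion that utils/utt2spk_to_spk2utt.pl performs can be sketched with plain awk; the three input lines below are sample utt2spk entries, and the perl helper remains the supported tool.

```shell
# Group utterance-ids by speaker-id: invert utt2spk into spk2utt.
# Input lines are "<utterance-id> <speaker-id>"; utt2spk is assumed sorted.
printf '%s\n' \
  "103-1240-0000 103-1240" \
  "103-1240-0001 103-1240" \
  "103-1241-0000 103-1241" |
awk '{ utts[$2] = utts[$2] " " $1 }
     END { for (s in utts) print s utts[s] }' | sort
```

Each output line starts with a speaker-id followed by all of that speaker's utterance-ids, which is exactly the spk2utt layout shown above.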

# feats.scp <utterance-id> <extended-filename-of-features>
# Paths to the extracted MFCCs; the 14 in the first line means reading starts at byte offset 14 of the archive.
# $steps/make_mfcc.sh --nj 20 --cmd "$train_cmd" data/train_clean_100 exp/make_mfcc/train_clean_100 $mfccdir
$head -3 train_clean_100/feats.scp
103-1240-0000 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/raw_mfcc_train_clean_100.1.ark:14
103-1240-0001 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/raw_mfcc_train_clean_100.1.ark:18444
103-1240-0002 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/raw_mfcc_train_clean_100.1.ark:39292
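Each scp line is just "<key> <archive-path>:<byte-offset>", where the offset marks where that utterance's matrix starts inside the .ark file. Splitting such a line can be sketched with shell parameter expansion; the sample line mirrors the format above with a shortened path.

```shell
# Split an scp entry into key, archive path, and byte offset.
line="103-1240-0000 /path/to/raw_mfcc_train_clean_100.1.ark:14"
key=${line%% *}        # text before the first space
rest=${line#* }        # text after the first space
ark=${rest%:*}         # archive path before the last colon
offset=${rest##*:}     # byte offset after the last colon
echo "$key $ark $offset"
```

Kaldi tools read scp entries this way internally, seeking directly to the stored offset rather than scanning the archive.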

# cmvn.scp <speaker-id> <extended-filename-of-cmvn>
# Per-speaker cepstral mean and variance normalization statistics.
# $steps/compute_cmvn_stats.sh data/train_clean_100 exp/make_mfcc/train_clean_100 $mfccdir
$head -3 train_clean_100/cmvn.scp
103-1240 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/cmvn_train_clean_100.ark:9
103-1241 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/cmvn_train_clean_100.ark:257
1034-121119 /home/fwq/Project/kaldi/kaldi/egs/librispeech/s5/mfcc/cmvn_train_clean_100.ark:508

Step 3, prepare the dictionary and generate the language model, saved in data/lang_nosp.

if [ $stage -le 3 ]; then
  local/prepare_dict.sh --stage 3 --nj 30 --cmd "$train_cmd" \
   data/local/lm data/local/lm data/local/dict_nosp

  utils/prepare_lang.sh data/local/dict_nosp \
   "<UNK>" data/local/lang_tmp_nosp data/lang_nosp

  local/format_lms.sh --src-dir data/lang_nosp data/local/lm
fi

Focus on the language model folder.

$ls lang_nosp
L.fst  L_disambig.fst  oov.int  oov.txt  phones  phones.txt  topo  words.txt

# Contains information about the phone set; generated by utils/prepare_lang.sh.
$ls lang_nosp/phones
align_lexicon.int  context_indep.txt  extra_questions.int  nonsilence.txt        roots.int  silence.csl    wdisambig_phones.int
align_lexicon.txt  disambig.csl       extra_questions.txt  optional_silence.csl  roots.txt  silence.int    wdisambig_words.int
context_indep.csl  disambig.int       nonsilence.csl       optional_silence.int  sets.int   silence.txt    word_boundary.int
context_indep.int  disambig.txt       nonsilence.int       optional_silence.txt  sets.txt   wdisambig.txt  word_boundary.txt

# Phones and words, mapping back and forth between integer and text form.
$head -3 lang_nosp/phones.txt
<eps> 0
SIL 1
SIL_B 2
$head -5 lang_nosp/words.txt
<eps> 0
!SIL 1
<SPOKEN_NOISE> 2
<UNK> 3
A 4

# L.fst is the lexicon as a finite-state transducer: phone symbols in, word symbols out.
# L_disambig.fst is the lexicon including the disambiguation symbols `#1, #2`, etc.

# Only one line: the out-of-vocabulary symbol and its corresponding integer form
$cat lang_nosp/oov.txt
<UNK>
$cat lang_nosp/oov.int
3
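The value in oov.int is simply the row for `<UNK>` in words.txt. The lookup that utils/sym2int.pl performs can be sketched with awk; the symbol-table lines below are the truncated sample shown earlier.

```shell
# Look up the integer id of a word in a words.txt-style symbol table.
printf '%s\n' "<eps> 0" "!SIL 1" "<SPOKEN_NOISE> 2" "<UNK> 3" "A 4" |
awk -v w="<UNK>" '$1 == w { print $2 }'
```

This prints 3, matching oov.int above; during training, every out-of-vocabulary word in the transcripts is mapped to this id.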

# Defines the HMM topology.
$cat lang_nosp/topo
....
<TopologyEntry>
<ForPhones>
1 2 3 4 5 6 7 8 9 10
</ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
....

Step 4, build the full 3-gram and 4-gram language models, generating two new language model folders.

if [ $stage -le 4 ]; then
  # Create ConstArpaLm format language model for full 3-gram and 4-gram LMs
  utils/build_const_arpa_lm.sh data/local/lm/lm_tglarge.arpa.gz \
    data/lang_nosp data/lang_nosp_test_tglarge
  utils/build_const_arpa_lm.sh data/local/lm/lm_fglarge.arpa.gz \
    data/lang_nosp data/lang_nosp_test_fglarge
fi
$ls lang_nosp_test_tglarge/
G.carpa  L.fst  L_disambig.fst  oov.int  oov.txt  phones  phones.txt  topo  words.txt
$ls lang_nosp_test_fglarge/
G.carpa  L.fst  L_disambig.fst  oov.int  oov.txt  phones  phones.txt  topo  words.txt

Then look at the model files.
Step 8, train the monophone model and decode with it.

if [ $stage -le 8 ]; then
  # Train the monophone HMM model, saved in exp/mono
  steps/train_mono.sh --boost-silence 1.25 --nj 20 --cmd "$train_cmd" \
                      data/train_2kshort data/lang_nosp exp/mono

  # decode using the monophone model
  (
    # Build the HCLG decoding graph:
    # data/lang_nosp_test_tgsmall/L_disambig.fst + data/lang_nosp_test_tgsmall/G.fst -> data/lang_nosp_test_tgsmall/tmp/LG.fst
    # data/lang_nosp_test_tgsmall/tmp/LG.fst + data/lang_nosp_test_tgsmall/tmp/ilabels_3_1 (disambiguation symbols) -> data/lang_nosp_test_tgsmall/tmp/CLG_3_1.fst
    # data/lang_nosp_test_tgsmall/tmp/CLG_3_1.fst + exp/mono/graph_nosp_tgsmall/Ha.fst (built by make-h-transducer) -> exp/mono/graph_nosp_tgsmall/HCLGa.fst
    # exp/mono/graph_nosp_tgsmall/HCLGa.fst + self-loops (add_self_loops) -> exp/mono/graph_nosp_tgsmall/HCLG.fst
    utils/mkgraph.sh data/lang_nosp_test_tgsmall \
                     exp/mono exp/mono/graph_nosp_tgsmall

    for test in test_clean test_other dev_clean dev_other; do
      steps/decode.sh --nj 20 --cmd "$decode_cmd" exp/mono/graph_nosp_tgsmall \
                      data/$test exp/mono/decode_nosp_tgsmall_$test
    done
  )&
fi
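After decoding, each decode directory contains wer_* files for different LM weights, and Kaldi's utils/best_wer.sh picks the lowest. The selection logic amounts to a numeric sort on the %WER field, sketched here on made-up `grep WER` output (the paths and numbers are invented for illustration).

```shell
# Pick the line with the lowest %WER from "grep WER exp/mono/decode_*/wer_*"-style output.
# The sample lines below are fabricated; real wer_* files carry full error counts.
printf '%s\n' \
  "exp/mono/decode_nosp_tgsmall_test_clean/wer_7:%WER 42.10 [ ... ]" \
  "exp/mono/decode_nosp_tgsmall_test_clean/wer_9:%WER 40.52 [ ... ]" \
  "exp/mono/decode_nosp_tgsmall_test_clean/wer_11:%WER 41.03 [ ... ]" |
sort -t' ' -k2,2n | head -n 1
```

In a real run you would use `grep WER exp/mono/decode_nosp_tgsmall_test_clean/wer_* | utils/best_wer.sh` to get the same answer.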

Step 9: After aligning with the monophone model, train the triphone model.

if [ $stage -le 9 ]; then
  # Maps every feature vector to a concrete phone state; each utterance gets a sequence of
  # transition_ids describing the state changes.
  # exp/mono -> exp/mono_ali_5k
  steps/align_si.sh --boost-silence 1.25 --nj 10 --cmd "$train_cmd" \
                    data/train_5k data/lang_nosp exp/mono exp/mono_ali_5k

  # Train the triphone model, saved in exp/tri1.
  # A phone is pronounced differently in different contexts; the triphone model uses separate GMMs
  # for l_a_i and l_a_n while still mapping both to a.
  # 2000 is the number of decision-tree leaves, 10000 the total number of Gaussians.
  steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" \
                        2000 10000 data/train_5k data/lang_nosp exp/mono_ali_5k exp/tri1
fi
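The "separate GMM per context" idea means each phone is modelled within a left_center_right window. Expanding a phone sequence into such triphone contexts can be sketched in awk; the phone string and the `<s>`/`</s>` boundary padding here are illustrative, since Kaldi builds contexts internally when compiling the decision tree.

```shell
# Expand a phone sequence into center phones with left/right context,
# using <s> and </s> as boundary placeholders.
echo "l a i" | awk '{
  for (i = 1; i <= NF; i++) {
    left  = (i > 1)  ? $(i-1) : "<s>"
    right = (i < NF) ? $(i+1) : "</s>"
    print left "_" $i "_" right
  }
}'
```

The middle line of output is l_a_i: the same center phone a would get l_a_n, and thus a different GMM, in a word like "lan".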

Step 10, after aligning with the triphone model, train the LDA+MLLT model.

if [ $stage -le 10 ]; then
  # Continue aligning
  steps/align_si.sh --nj 10 --cmd "$train_cmd" \
                    data/train_10k data/lang_nosp exp/tri1 exp/tri1_ali_10k


  # Train the LDA+MLLT model: after MFCC extraction, splice several adjacent frames together,
  # reduce to 40 dimensions with LDA, iterate several times, and finally apply a diagonalizing
  # transform; training then runs on the transformed features.
  steps/train_lda_mllt.sh --cmd "$train_cmd" \
                          --splice-opts "--left-context=3 --right-context=3" 2500 15000 \
                          data/train_10k data/lang_nosp exp/tri1_ali_10k exp/tri2b
fi
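With --left-context=3 --right-context=3, each frame is spliced with 3 neighbours on each side, so a 13-dimensional MFCC frame becomes 7 x 13 = 91 dimensions before LDA projects it down to 40. The splicing itself can be sketched on toy one-dimensional "frames" (the values are invented, and the context is shortened to +/-1).

```shell
# Splice each frame with +/-1 frames of context (toy version of --left-context/--right-context).
# Edge frames repeat the boundary frame, as Kaldi's splice-feats does.
printf '%s\n' 10 20 30 40 | awk '
{ v[NR] = $1 }
END {
  for (i = 1; i <= NR; i++) {
    l = (i > 1)  ? v[i-1] : v[1]
    r = (i < NR) ? v[i+1] : v[NR]
    print l, v[i], r
  }
}'
```

Each output line is one spliced frame; with real 13-dimensional MFCCs and +/-3 context, the same pattern yields the 91-dimensional vectors that LDA then compresses.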

Step 11, after aligning with the LDA+MLLT model, train the LDA+MLLT+SAT model.

if [ $stage -le 11 ]; then
  ### Continue aligning
  steps/align_si.sh  --nj 10 --cmd "$train_cmd" --use-graphs true \
                     data/train_10k data/lang_nosp exp/tri2b exp/tri2b_ali_10k

  ### Train the LDA+MLLT+SAT model, i.e. Speaker Adaptive Training; again the features are
  ### transformed before training.
  steps/train_sat.sh --cmd "$train_cmd" 2500 15000 \
                     data/train_10k data/lang_nosp exp/tri2b_ali_10k exp/tri3b
fi

Step 12: On the 100 h clean data, after aligning with the LDA+MLLT+SAT model, train a larger LDA+MLLT+SAT model.

if [ $stage -le 12 ]; then
  # First do a preliminary alignment and compute the fMLLR transforms, then compute the final
  # alignment using both.
  steps/align_fmllr.sh --nj 20 --cmd "$train_cmd" \
    data/train_clean_100 data/lang_nosp \
    exp/tri3b exp/tri3b_ali_clean_100

  # Train the LDA+MLLT+SAT model.
  steps/train_sat.sh  --cmd "$train_cmd" 4200 40000 \
                      data/train_clean_100 data/lang_nosp \
                      exp/tri3b_ali_clean_100 exp/tri4b
fi

For the complete process, refer to the previous blog post.

Origin blog.csdn.net/tobefans/article/details/125434241