[Speech recognition] Kaldi installation and use case (LibriSpeech)


1. Kaldi installation

Following the official tutorial, Kaldi is installed by first cloning the project with git and then compiling it.

git clone https://github.com/kaldi-asr/kaldi.git
cd kaldi/tools/; make; cd ../src; ./configure; make depend; make

If an error is reported, the relevant dependencies are probably not installed; you can follow the prompts and install them step by step (root permission is required).

sudo apt-get install zlib1g-dev automake autoconf sox subversion
sudo bash extras/install_mkl.sh

2. Use of kaldi

2.1 ASR model training of librispeech

The example recipes for each corpus live in the egs directory, one folder per corpus, and the coverage is very comprehensive. Enter egs/librispeech/s5/. Each recipe has a cmd.sh (selects single-machine multi-GPU run.pl or multi-machine queue.pl execution), a path.sh (sets the various Kaldi paths), and a run.sh (the entire main training and testing pipeline). The following focuses on run.sh. The overall flow is: import parameters -> download part of the data and preprocess it -> prepare and build the language model -> extract features -> take subsets of the training data -> train monophone and triphone models with transform training -> add more data -> transform training -> add all the data -> transform training -> decode -> train a TDNN model. The details are as follows:
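Before reading the full script, it helps to know the stage-gating idiom run.sh relies on: every step is wrapped in an `if [ $stage -le N ]` guard, so rerunning with a larger stage value (set on the command line via parse_options.sh) resumes the pipeline after completed steps. A minimal standalone sketch, not part of the recipe:

```shell
# Minimal sketch of the stage-gating pattern used throughout run.sh:
# setting stage=N skips every step whose guard number is below N.
stage=2                 # e.g. passed on the command line via parse_options.sh
ran=""

if [ $stage -le 1 ]; then ran="$ran download"; fi   # skipped: 2 > 1
if [ $stage -le 2 ]; then ran="$ran prepare";  fi
if [ $stage -le 3 ]; then ran="$ran features"; fi

echo "steps run:$ran"   # -> steps run: prepare features
```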

#!/usr/bin/env bash

## Import parameters
data=/home/fwq/Project/kaldi/kaldi/data

data_url=www.openslr.org/resources/12
lm_url=www.openslr.org/resources/11
mfccdir=mfcc
stage=1

. ./cmd.sh
. ./path.sh
. parse_options.sh

set -e

## Download the data
if [ $stage -le 1 ]; then
  for part in dev-clean test-clean dev-other test-other train-clean-100; do
    local/download_and_untar.sh $data $data_url $part
  done


  local/download_lm.sh $lm_url data/local/lm
fi

## Generate the data files for each subset, such as wav.scp, text, utt2spk, spk2gender, utt2dur
if [ $stage -le 2 ]; then
  for part in dev-clean test-clean dev-other test-other train-clean-100; do
    local/data_prep.sh $data/LibriSpeech/$part data/$(echo $part | sed s/-/_/g)
  done
fi

## Prepare the language model: build the dictionary (local/prepare_dict.sh), prepare language data (utils/prepare_lang.sh), format the LMs (local/format_lms.sh)
if [ $stage -le 3 ]; then
  local/prepare_dict.sh --stage 3 --nj 30 --cmd "$train_cmd" \
   data/local/lm data/local/lm data/local/dict_nosp

  utils/prepare_lang.sh data/local/dict_nosp \
   "<UNK>" data/local/lang_tmp_nosp data/lang_nosp

  local/format_lms.sh --src-dir data/lang_nosp data/local/lm
fi

## Build ConstArpaLm-format language models from the 3-gram and 4-gram ARPA LMs
if [ $stage -le 4 ]; then
  utils/build_const_arpa_lm.sh data/local/lm/lm_tglarge.arpa.gz \
    data/lang_nosp data/lang_nosp_test_tglarge
  utils/build_const_arpa_lm.sh data/local/lm/lm_fglarge.arpa.gz \
    data/lang_nosp data/lang_nosp_test_fglarge
fi

## Feature extraction: compute MFCCs and the mean/variance (CMVN) statistics for each wav file
if [ $stage -le 5 ]; then
  if [[  $(hostname -f) ==  *.clsp.jhu.edu ]]; then
    utils/create_split_dir.pl /export/b{02,11,12,13}/$USER/kaldi-data/egs/librispeech/s5/$mfcc/storage \
     $mfccdir/storage
  fi
fi

if [ $stage -le 6 ]; then
  for part in dev_clean test_clean dev_other test_other train_clean_100; do
    steps/make_mfcc.sh --cmd "$train_cmd" --nj 40 data/$part exp/make_mfcc/$part $mfccdir
    steps/compute_cmvn_stats.sh data/$part exp/make_mfcc/$part $mfccdir
  done
fi

## Take small subsets of the 100-hour training set
if [ $stage -le 7 ]; then

  utils/subset_data_dir.sh --shortest data/train_clean_100 2000 data/train_2kshort
  utils/subset_data_dir.sh data/train_clean_100 5000 data/train_5k
  utils/subset_data_dir.sh data/train_clean_100 10000 data/train_10k
fi

## Train a monophone model (mono)
if [ $stage -le 8 ]; then
  steps/train_mono.sh --boost-silence 1.25 --nj 20 --cmd "$train_cmd" \
                      data/train_2kshort data/lang_nosp exp/mono
fi

## Align, then train a triphone model (tri1)
if [ $stage -le 9 ]; then
  steps/align_si.sh --boost-silence 1.25 --nj 10 --cmd "$train_cmd" \
                    data/train_5k data/lang_nosp exp/mono exp/mono_ali_5k

  steps/train_deltas.sh --boost-silence 1.25 --cmd "$train_cmd" \
                        2000 10000 data/train_5k data/lang_nosp exp/mono_ali_5k exp/tri1
fi

## Align, then apply the LDA+MLLT transform to the triphones (tri2b)
if [ $stage -le 10 ]; then
  steps/align_si.sh --nj 10 --cmd "$train_cmd" \
                    data/train_10k data/lang_nosp exp/tri1 exp/tri1_ali_10k


  steps/train_lda_mllt.sh --cmd "$train_cmd" \
                          --splice-opts "--left-context=3 --right-context=3" 2500 15000 \
                          data/train_10k data/lang_nosp exp/tri1_ali_10k exp/tri2b
fi

## Align, then apply LDA+MLLT+SAT to the triphones (tri3b)
if [ $stage -le 11 ]; then
  steps/align_si.sh  --nj 10 --cmd "$train_cmd" --use-graphs true \
                     data/train_10k data/lang_nosp exp/tri2b exp/tri2b_ali_10k

  steps/train_sat.sh --cmd "$train_cmd" 2500 15000 \
                     data/train_10k data/lang_nosp exp/tri2b_ali_10k exp/tri3b

fi

## Align, then apply LDA+MLLT+SAT on the full 100-hour set (tri4b)
if [ $stage -le 12 ]; then
  steps/align_fmllr.sh --nj 20 --cmd "$train_cmd" \
    data/train_clean_100 data/lang_nosp \
    exp/tri3b exp/tri3b_ali_clean_100

  steps/train_sat.sh  --cmd "$train_cmd" 4200 40000 \
                      data/train_clean_100 data/lang_nosp \
                      exp/tri3b_ali_clean_100 exp/tri4b
fi

## Compute pronunciation and silence probabilities from the training data, and rebuild the lang directory
if [ $stage -le 13 ]; then
  steps/get_prons.sh --cmd "$train_cmd" \
                     data/train_clean_100 data/lang_nosp exp/tri4b
  utils/dict_dir_add_pronprobs.sh --max-normalize true \
                                  data/local/dict_nosp \
                                  exp/tri4b/pron_counts_nowb.txt exp/tri4b/sil_counts_nowb.txt \
                                  exp/tri4b/pron_bigram_counts_nowb.txt data/local/dict

  utils/prepare_lang.sh data/local/dict \
                        "<UNK>" data/local/lang_tmp data/lang
  local/format_lms.sh --src-dir data/lang data/local/lm

  utils/build_const_arpa_lm.sh \
    data/local/lm/lm_tglarge.arpa.gz data/lang data/lang_test_tglarge
  utils/build_const_arpa_lm.sh \
    data/local/lm/lm_fglarge.arpa.gz data/lang data/lang_test_fglarge
fi

## Align and train an nnet2 model; this is no longer used, hence the "&& false"
if [ $stage -le 14 ] && false; then
  steps/align_fmllr.sh --nj 30 --cmd "$train_cmd" \
    data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100

  local/nnet2/run_5a_clean_100.sh
fi

## Add the 360-hour set, giving 460 hours
if [ $stage -le 15 ]; then
  local/download_and_untar.sh $data $data_url train-clean-360

  local/data_prep.sh \
    $data/LibriSpeech/train-clean-360 data/train_clean_360
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 40 data/train_clean_360 \
                     exp/make_mfcc/train_clean_360 $mfccdir
  steps/compute_cmvn_stats.sh \
    data/train_clean_360 exp/make_mfcc/train_clean_360 $mfccdir

  utils/combine_data.sh \
    data/train_clean_460 data/train_clean_100 data/train_clean_360
fi

## Align, then apply LDA+MLLT+SAT (tri5b)
if [ $stage -le 16 ]; then
  steps/align_fmllr.sh --nj 40 --cmd "$train_cmd" \
                       data/train_clean_460 data/lang exp/tri4b exp/tri4b_ali_clean_460

  steps/train_sat.sh  --cmd "$train_cmd" 5000 100000 \
                      data/train_clean_460 data/lang exp/tri4b_ali_clean_460 exp/tri5b
fi
#local/nnet2/run_6a_clean_460.sh

## Add the 500-hour set, giving 960 hours
if [ $stage -le 17 ]; then
  local/download_and_untar.sh $data $data_url train-other-500

  local/data_prep.sh \
    $data/LibriSpeech/train-other-500 data/train_other_500
  steps/make_mfcc.sh --cmd "$train_cmd" --nj 40 data/train_other_500 \
                     exp/make_mfcc/train_other_500 $mfccdir
  steps/compute_cmvn_stats.sh \
    data/train_other_500 exp/make_mfcc/train_other_500 $mfccdir

  utils/combine_data.sh \
    data/train_960 data/train_clean_460 data/train_other_500
fi

## Align, apply LDA+MLLT+SAT (tri6b), then decode
if [ $stage -le 18 ]; then
  steps/align_fmllr.sh --nj 40 --cmd "$train_cmd" \
                       data/train_960 data/lang exp/tri5b exp/tri5b_ali_960

  steps/train_quick.sh --cmd "$train_cmd" \
                       7000 150000 data/train_960 data/lang exp/tri5b_ali_960 exp/tri6b

  utils/mkgraph.sh data/lang_test_tgsmall \
                   exp/tri6b exp/tri6b/graph_tgsmall
  for test in test_clean test_other dev_clean dev_other; do
      steps/decode_fmllr.sh --nj 20 --cmd "$decode_cmd" \
                            exp/tri6b/graph_tgsmall data/$test exp/tri6b/decode_tgsmall_$test
      steps/lmrescore.sh --cmd "$decode_cmd" data/lang_test_{tgsmall,tgmed} \
                         data/$test exp/tri6b/decode_{tgsmall,tgmed}_$test
      steps/lmrescore_const_arpa.sh \
        --cmd "$decode_cmd" data/lang_test_{tgsmall,tglarge} \
        data/$test exp/tri6b/decode_{tgsmall,tglarge}_$test
      steps/lmrescore_const_arpa.sh \
        --cmd "$decode_cmd" data/lang_test_{tgsmall,fglarge} \
        data/$test exp/tri6b/decode_{tgsmall,fglarge}_$test
  done
fi

## Select the "good" segments of the data for training (tri6b_cleaned)
if [ $stage -le 19 ]; then
  local/run_cleanup_segmentation.sh
fi

## Train and test the nnet3 TDNN model
if [ $stage -le 20 ]; then
  local/chain/run_tdnn.sh
fi

2.2 Test your own dataset using the pre-trained model

First, create a folder for your own corpus and add soft links to the steps, utils, and rnnlm directories.

ln -s /home/fwq/Project/kaldi/kaldi/egs/wsj/s5/utils utils
ln -s /home/fwq/Project/kaldi/kaldi/egs/wsj/s5/steps steps
ln -s /home/fwq/Project/kaldi/kaldi/scripts/rnnlm rnnlm

Then prepare your own corpus. The files Kaldi requires are listed below. This part must be written and generated according to your own corpus format and placed in data/corpus_name/. In the following, the corpus is named test.

  1. wav.scp: a list of utterance IDs and the corresponding WAV locations on the system
  2. utt2spk: a list of utterance IDs and the corresponding speaker IDs. If you have no speaker information, you can copy the utt-id as the spk-id.
  3. text: the transcription of each utterance. This is required for scoring the decoded output.

Then sort and copy the data:
utils/fix_data_dir.sh data/test
utils/utt2spk_to_spk2utt.pl data/test/utt2spk > data/test/spk2utt
for datadir in test; do
    utils/copy_data_dir.sh data/$datadir data/${datadir}_hires
done
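As a concrete illustration, the first two files above can be derived from a flat directory of WAV files with a short script like the following. This is only a sketch: the corpus location, the `<speaker>-<uttid>.wav` naming scheme, and the dummy files are assumptions for the example.

```shell
# Sketch: derive wav.scp and utt2spk from a directory of WAV files named
# <speaker>-<uttid>.wav. All paths here are hypothetical.
corpus=$(mktemp -d)/audio
mkdir -p "$corpus" data/test
touch "$corpus/spk1-0001.wav" "$corpus/spk1-0002.wav" "$corpus/spk2-0001.wav"

: > data/test/wav.scp
: > data/test/utt2spk
for wav in "$corpus"/*.wav; do
  utt=$(basename "$wav" .wav)   # utterance ID, e.g. spk1-0001
  spk=${utt%%-*}                # speaker ID = prefix before the first "-"
  echo "$utt $wav" >> data/test/wav.scp
  echo "$utt $spk" >> data/test/utt2spk
done

# Kaldi requires these files to be sorted by utterance ID.
sort -o data/test/wav.scp data/test/wav.scp
sort -o data/test/utt2spk data/test/utt2spk
```

The text file has the same `utt-id transcription` layout and must be produced from your own label source; utils/fix_data_dir.sh then checks and repairs the sorting and consistency of the whole directory.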

With the data in place, the next step is to generate MFCC features. Create a new conf folder and a new configuration file conf/mfcc_hires.conf with the following contents:

--use-energy=false   # use average of log energy, not energy.
--num-mel-bins=40     # similar to Google's setup.
--num-ceps=40     # there is no dimensionality reduction.
--low-freq=20     # low cutoff frequency for mel bins... this is high-bandwidth data, so
                  # there might be some information at the low end.
--high-freq=-400  # high cutoff frequency, relative to the Nyquist of 8000 (=7600)
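A quick check of the --high-freq convention noted in the comment above: for 16 kHz audio such as LibriSpeech, a negative value is taken relative to the Nyquist frequency.

```shell
# A negative --high-freq is interpreted relative to the Nyquist frequency,
# so for 16 kHz audio, --high-freq=-400 gives a cutoff of 7600 Hz.
sample_rate=16000
nyquist=$((sample_rate / 2))        # 8000 Hz
high_freq=-400
cutoff=$((nyquist + high_freq))     # 8000 - 400 = 7600 Hz
echo "high cutoff: ${cutoff} Hz"    # -> high cutoff: 7600 Hz
```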

Features and CMVN statistics can then be computed for the data.

for datadir in test; do
    steps/make_mfcc.sh --nj 20 --mfcc-config conf/mfcc_hires.conf --cmd "$train_cmd" data/${datadir}_hires
    steps/compute_cmvn_stats.sh data/${datadir}_hires
    utils/fix_data_dir.sh data/${datadir}_hires
done
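To see what the CMVN step does conceptually, here is a toy one-dimensional version (an illustration only; the real compute_cmvn_stats.sh accumulates per-speaker statistics over multi-dimensional MFCC frames):

```shell
# Toy cepstral mean/variance normalization over a single feature dimension:
# subtract the mean, divide by the standard deviation.
norm=$(printf '1\n2\n3\n4\n' | awk '
  { x[NR] = $1; sum += $1; sumsq += $1 * $1 }
  END {
    n = NR
    mean = sum / n                        # 2.5
    sd = sqrt(sumsq / n - mean * mean)    # sqrt(1.25) ~ 1.118
    for (i = 1; i <= n; i++) printf "%.3f\n", (x[i] - mean) / sd
  }')
echo "$norm"   # -> -1.342 -0.447 0.447 1.342 (one value per line)
```

After normalization the feature sequence has zero mean and unit variance, which makes acoustic model training insensitive to per-recording gain and channel offsets.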

Next, download and unpack the pre-trained model. By default the contents extract into the data and exp directories. Two language models are provided: tgsmall (a small trigram model) and an rnnlm (LSTM-based), both trained on the LibriSpeech training transcripts. We will use the tgsmall model for decoding and the RNNLM for rescoring.

wget http://kaldi-asr.org/models/13/0013_librispeech_v1_chain.tar.gz
wget http://kaldi-asr.org/models/13/0013_librispeech_v1_extractor.tar.gz
wget http://kaldi-asr.org/models/13/0013_librispeech_v1_lm.tar.gz
tar -xvzf 0013_librispeech_v1_chain.tar.gz
tar -xvzf 0013_librispeech_v1_extractor.tar.gz
tar -xvzf 0013_librispeech_v1_lm.tar.gz

Use the i-vector extractor to extract i-vectors for the test data. This writes 100-dimensional i-vectors to exp/nnet3_cleaned.

for data in test; do
    nspk=$(wc -l <data/${data}_hires/spk2utt)
    steps/online/nnet2/extract_ivectors_online.sh --cmd "$train_cmd" --nj "${nspk}" \
      data/${data}_hires exp/nnet3_cleaned/extractor \
      exp/nnet3_cleaned/ivectors_${data}_hires
done

Use the tgsmall LM to create the decoding graph.

export dir=exp/chain_cleaned/tdnn_1d_sp
export graph_dir=$dir/graph_tgsmall
utils/mkgraph.sh --self-loop-scale 1.0 --remove-oov \
  data/lang_test_tgsmall $dir $graph_dir

Use the created graph for decoding.

export decode_cmd="run.pl"
for decode_set in test; do
  steps/nnet3/decode.sh --acwt 1.0 --post-decode-acwt 10.0 \
    --nj 8 --cmd "$decode_cmd" \
    --online-ivector-dir exp/nnet3_cleaned/ivectors_${decode_set}_hires \
    $graph_dir data/${decode_set}_hires $dir/decode_${decode_set}_tgsmall
done

Check the WER after decoding. Here we use Kaldi's built-in scoring, which is used for most egs.

for decode_set in test; do
  steps/score_kaldi.sh --cmd "run.pl" data/${decode_set}_hires $graph_dir $dir/decode_${decode_set}_tgsmall
done
cat exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall/scoring_kaldi/best_wer
%WER 57.15 [ 14722 / 25761, 2501 ins, 2559 del, 9662 sub ] exp/chain_cleaned/tdnn_1d_sp/decode_test_tgsmall/wer_17_1.0
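The bracketed numbers in the best_wer line decompose the errors; WER is simply (insertions + deletions + substitutions) divided by the number of reference words, which reproduces the 57.15% figure:

```shell
# Reproduce the WER from the error counts in the best_wer line:
# WER = (ins + del + sub) / reference words.
ins=2501; del=2559; sub=9662; ref_words=25761
errors=$((ins + del + sub))          # 14722
wer=$(awk -v e="$errors" -v n="$ref_words" 'BEGIN { printf "%.2f", 100 * e / n }')
echo "WER = ${wer}%"                 # -> WER = 57.15%
```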

Re-score using RNNLM.

export decode_cmd="run.pl"
for decode_set in test; do
    decode_dir=exp/chain_cleaned/tdnn_1d_sp/decode_${decode_set}_tgsmall;
    rnnlm/lmrescore_pruned.sh \
        --cmd "$decode_cmd" \
        --weight 0.45 --max-ngram-order 4 \
        data/lang_test_tgsmall exp/rnnlm_lstm_1a \
        data/${decode_set}_hires ${decode_dir} \
        exp/chain_cleaned/tdnn_1d_sp/decode_${decode_set}_rescore
done

Scoring is included in the lmrescore_pruned.sh script.

cat exp/chain_cleaned/tdnn_1d_sp/decode_test_rescore/wer_17_1.0
# %WER 56.12 [ 14456 / 25761, 2607 ins, 2452 del, 9397 sub ]

3. Kaldi experience

Unlike the Python workflows I had used before, Kaldi is driven primarily through shell scripts on top of a C++ core. Through years of community development it now has a very large script library, and many functions are efficiently encapsulated; but if you want to train a model on your own features, you need to read the shell code first and then the underlying C++ code.


Origin blog.csdn.net/tobefans/article/details/125434121