Kaldi Speech Recognition Technology (7) ----- Training GMM


Training GMM

In the previous article we discussed the advantages of GMM over DTW; so how do we actually train a GMM?

The training process of GMM is as follows:

insert image description here

The whole process consists of 10 steps, 5 of which are alignment steps. For ease of understanding, only 2 of these 10 steps are discussed here (train_mono, monophone training, and align_si, alignment); the remaining steps are essentially refinements of the same idea, and you can look up the details on your own later.

The general procedure is: train a Gaussian model, use that model to align the training data, and then train a better model on the aligned data. For example (the figure above, right): first train a monophone model, then use it to align the training data; based on those alignments, train a triphone model, and use the triphone model to realign the data; then re-estimate the GMM with algorithms such as LDA and MLLT and align again; finally perform speaker-adaptive training on the model and align the training data one last time. That is the whole GMM training pipeline. In general, the better the model is trained, the more accurate the alignments become, which in turn improves recognition accuracy. A sketch of such a pipeline is shown below.
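
As a rough sketch, a typical Kaldi GMM recipe chains these stages roughly as follows. This is only an illustration with example paths and option values, not the exact recipe used in this series:

# illustrative GMM training pipeline (paths and numbers are examples only)
steps/train_mono.sh --nj 2 --cmd run.pl data/train data/lang exp/mono
steps/align_si.sh --nj 2 --cmd run.pl data/train data/lang exp/mono exp/mono_ali
steps/train_deltas.sh --cmd run.pl 2000 10000 data/train data/lang exp/mono_ali exp/tri1
steps/align_si.sh --nj 2 --cmd run.pl data/train data/lang exp/tri1 exp/tri1_ali
steps/train_lda_mllt.sh --cmd run.pl 2500 15000 data/train data/lang exp/tri1_ali exp/tri2
steps/align_si.sh --nj 2 --cmd run.pl data/train data/lang exp/tri2 exp/tri2_ali
steps/train_sat.sh --cmd run.pl 2500 15000 data/train data/lang exp/tri2_ali exp/tri3
steps/align_fmllr.sh --nj 2 --cmd run.pl data/train data/lang exp/tri3 exp/tri3_ali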

Training GMM—mono training process

Let's look at how the monophone model is trained. To train a model we first need a starting model, and we then iterate from it. Kaldi therefore calls gmm-init-mono to initialise the model, using the features of the training data. Once the model is initialised, the next step is to compile the training graphs. The inputs to this step are the initialised model, L.fst from the lang directory, the lexicon, and the transcripts (text) of the training data; the output is a gzip archive containing, for each utterance, an FST that expands the sentence down to the phoneme level.
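
Inside train_mono.sh, these two steps correspond roughly to the commands below. This is a simplified sketch with illustrative paths; the real script adds options such as --shared-phones and works out the feature dimension automatically:

# sketch: model initialisation and training-graph compilation (simplified)
gmm-init-mono "--train-feats=ark:add-deltas scp:data/train/feats.scp ark:- |" data/lang/topo 39 exp/mono/0.mdl exp/mono/tree
compile-train-graphs exp/mono/tree exp/mono/0.mdl data/lang/L.fst \
  "ark:utils/sym2int.pl -f 2- data/lang/words.txt data/train/text |" \
  "ark:|gzip -c > exp/mono/fsts.1.gz"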

insert image description here

After the first two steps are done, we need alignment to relate the extracted features to the FST graphs we compiled. The third step uses uniform alignment (equal alignment in the figure). As mentioned earlier, the FST expands each sentence of the transcript into words, then characters, then phonemes, and finally HMM states.

For ease of understanding we only go down to the phoneme level in the figure: the word 中国 (China) is split into the characters 中 and 国, and then into the phonemes (zh, ong1, g, uo2); each character has 2 phonemes, 4 phonemes in total. The yellow vertical bars at the top each represent one frame of features (MFCC). Uniform alignment simply assumes that, with 4 phonemes and 100 frames, each phoneme is evenly assigned 25 frames; computing the mean and variance of each group of 25 frames gives the corresponding Gaussian model. A sketch of this step is shown below.
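
In Kaldi, this uniform alignment is produced by align-equal-compiled, which only needs the compiled graphs and the features (no trained Gaussians yet). A minimal sketch with illustrative paths:

# sketch: uniform (equal) alignment of job 1
align-equal-compiled "ark:gunzip -c exp/mono/fsts.1.gz|" \
  "ark:add-deltas scp:data/train/feats.scp ark:- |" \
  "ark,t:|gzip -c > exp/mono/ali.1.gz"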

The lower part of the figure is the triphone model (which takes co-articulation into account). A triphone model has far more units, so Kaldi uses decision-tree clustering to group phones with similar pronunciation into one class and then trains per class.

After uniform alignment we re-estimate the model. The initial model was estimated from only part of the training data; now we have statistics from all of the audio, and uniform alignment tells us which state each frame belongs to and how often each transition is taken. With this information the model is re-estimated. The next step is to align the data with the re-estimated model (step 5). This time we no longer use uniform alignment: the re-estimated model, together with the previously compiled FSTs, produces new alignments. The process can be understood roughly as decoding the training set with the model: each frame's feature is scored against the GMM of each phoneme (or each state), and the phoneme with the highest probability becomes the phoneme for that frame. Earlier each phoneme was assigned 25 frames; after this step the assignment changes (for example, the first phoneme may now correspond to 20 frames).

From these alignments we can count, for each state, the number of times each transition is taken relative to the total number of transitions, and thus re-estimate the transition probabilities of the HMM network. With the updated transition probabilities and the new per-frame alignments we re-estimate the GMM again (going back from step 6 to step 4). The model is thus repeatedly re-estimated: generate new alignments, re-estimate the transition probabilities, re-estimate the model, and so on. The number of iterations is set by us; the default is 40. A simplified sketch of this loop is shown below.
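
The iterate-align-re-estimate loop described above looks roughly like this inside train_mono.sh. This is a heavily simplified sketch (single job, illustrative paths); the real script realigns only on selected iterations and gradually increases the number of Gaussians:

# sketch of the EM loop: align, accumulate statistics, re-estimate
feats="ark:add-deltas scp:data/train/feats.scp ark:- |"
for iter in $(seq 1 40); do
  # realign the training data with the current model
  gmm-align-compiled exp/mono/$iter.mdl "ark:gunzip -c exp/mono/fsts.1.gz|" \
    "$feats" "ark,t:|gzip -c > exp/mono/ali.1.gz"
  # accumulate GMM and transition statistics from the new alignment
  gmm-acc-stats-ali exp/mono/$iter.mdl "$feats" \
    "ark:gunzip -c exp/mono/ali.1.gz|" exp/mono/$iter.acc
  # re-estimate means, variances, weights and transition probabilities
  gmm-est exp/mono/$iter.mdl exp/mono/$iter.acc exp/mono/$((iter+1)).mdl
done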

train_mono.sh is used to train GMM

Let's see how the monophone model is trained in Kaldi by calling the train_mono.sh script in the steps folder. Note in particular that when running multiple parallel jobs, --nj must not exceed the number of speakers (speaker-id), because the data is split into jobs by speaker.

train_mono.sh usage:

./steps/train_mono.sh 
Usage: steps/train_mono.sh [options] <data-dir> <lang-dir> <exp-dir>
 e.g.: steps/train_mono.sh data/train.1k data/lang exp/mono
main options (for others, see top of script file)
  --config <config-file>                           # config containing options
  --nj <nj>                                        # number of parallel jobs
  --cmd (utils/run.pl|utils/queue.pl <queue opts>) # how to run jobs.

Reference: "Getting started with Kaldi in detail" and the official train_mono.sh documentation.

First prepare the kaldi environment

. ~/kaldi/utils/path.sh
mkdir H_learn
cd ~/kaldi/data

then execute the script

./steps/train_mono.sh --nj 2 --cmd "run.pl" H/kaldi_file_test L/lang H/mono

Detailed parameter explanation:
1st parameter: --nj, the number of parallel jobs (note: if features were extracted per speaker, nj must not exceed the number of speakers)
2nd parameter: --cmd "run.pl", run the jobs locally
3rd parameter: the feature folder (containing CMVN and the raw MFCC features, see post five of this column)
4th parameter: the lang folder (the lang folder inside the L folder)
5th parameter: the output folder for the GMM training data (monophone model)

insert image description here

Training GMM—Generated Files

After training the monophone model with train_mono.sh, these files are generated. The first and most important is the model (*.mdl, where 0.mdl is the initialised model, 40.mdl is the result after 40 iterations, and final.mdl is the final model). Next are the files ending in .occs, which can be understood simply as global statistics: occupancy counts for each phoneme, or for each of the states of each phoneme. ali.*.gz holds the alignment information, which is updated on each training iteration. fsts.*.gz holds the graph information, i.e. the FSTs mentioned earlier. tree is the decision tree, which groups phones with similar pronunciation into one class to simplify computation. log contains the logs produced during training; if an error occurs, you can usually find the corresponding message there.
insert image description here
*.mdl: Model 0.mdl represents the initialized model; 40.mdl represents the result of 40 iterations; final.mdl represents the final result.
*.occs: the number of occurrences of each pdf
ali.*.gz: alignment information
fsts.*.gz: training-graph (FST) information
tree: decision tree
log: training process log

Training GMM—final model view

Convert the final.mdl data to text and write it to final_mdl.txt (--binary=false means do not write binary data):

gmm-copy --binary=false final.mdl final_mdl.txt

Open the file with vim final_mdl.txt; it looks like this:

insert image description here

The first 23 lines contain the information from the topo file that was generated earlier (when we built G.fst). From it we can read off the corresponding HMM topologies: phones 6-909 share one 3-state HMM topology, while phones 1-5 (the silence phonemes) share one 5-state topology. Line 24, 2737, is the number of decision-tree classes. In the decision-tree section that follows: the first column is the phoneme index (phone-id), the second column is the HMM state, and the third column is the pdf index (the decision-tree class).
insert image description here

We can see that there are 909 phonemes in total, so why do they cluster into 2737 classes? As can be seen in the figure, lines 25 to 49 (25 lines) all belong to silence phonemes. Why? Because their state indices in the second column run from 0 to 4, i.e. 5 states, and only silence phonemes have 5 states. From this we can derive the formula for the number of classes:

Number of decision-tree classes = number of silence phonemes × 5 + number of non-silence phonemes × 3
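
For example, with the numbers above (909 phonemes in total, of which 5 are silence phonemes and 904 are non-silence), this gives 5 × 5 + 904 × 3 = 25 + 2712 = 2737 decision-tree classes, which matches the count shown in the model file.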

insert image description here

insert image description here

The decision tree contains a lot of other information; you can open the local tree file and inspect it yourself. In the model, <LogProbs> holds the transition (log) probabilities for each class. The number in <DIMENSION> is 39, i.e. three times the basic MFCC dimension, because first-order and second-order differences are appended to the original 13-dimensional features (3 × 13 = 39). <NUMPDFS> is the number of decision-tree classes, and each class corresponds to one Gaussian mixture model (GMM).

What follows is the description of the GMMs (656 of them in the figure). A single Gaussian is described by just its mean and variance; a mixture of Gaussians additionally needs the weight of each component. For example, if <WEIGHTS> contains 2 values, as in the figure above, then the mixture describing this class consists of 2 single Gaussians with those respective weights. The means have 39 columns: one Gaussian corresponds to one 39-dimensional mean vector, so two Gaussians have two mean vectors, and likewise for the variances. Besides the weights there is also <GCONSTS>, a precomputed constant, again one per Gaussian. We will not go deeper into these here.
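
A quicker way to check these global quantities, without dumping the whole model to text, is the gmm-info tool. A minimal sketch (the exact counts will of course match your own model):

gmm-info final.mdl
# prints, among other fields, the number of phones, the number of pdfs
# (decision-tree classes), the feature dimension and the total number of Gaussians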

Training GMM—final.occs view

Next, let's look at the final.occs file, which can be understood simply as global statistics: information about each phoneme, or about each of the states a phoneme corresponds to; in other words, a description of each HMM network. We mentioned earlier that each phoneme can be subdivided into its B/E/I/S variants; we also know that the silence phoneme's HMM has 5 states while non-silence phonemes have 3 states. occs breaks each phoneme down into its states and gives statistics for each. Each Transition-state entry describes one state, and the count next to the pdf indicates how many times that arc appears (its occupancy).

Show the statistics for each phoneme and for each of its states:

show-transitions phones.txt final.mdl final.occs > final_occs.txt

Open the file with vim final_occs.txt; it looks like this:

insert image description here

Training GMM—View Alignment Information
# 1. decompress
gzip -d ali.1.gz

Features are extracted at the frame level, and training is also done at the frame level, so what does the result of training look like? You can use the ali-to-phones command to check which frames correspond to which phoneme. Each number in the figure represents a phoneme id; for easier viewing we then convert the phoneme ids to the corresponding phonemes.

# 2. view with ali-to-phones
ali-to-phones --per-frame=true final.mdl ark:ali.1 ark,t:ali.1.txt

Which phoneme does each phone-id correspond to? Use the int2sym.pl script to convert the phoneme ids to phonemes, after which we can also see the timing information of each phoneme.

# 3. convert phoneme ids to the corresponding phonemes
~/kaldi/data/utils/int2sym.pl -f 2- phones.txt <ali.1.txt >ali.1.phones

How long does each phoneme last? For the example sentence "set twenty-nine degrees", you can check and verify the result with audio-viewing software. Since we are training with a monophone model, the alignment may not be completely accurate, which is why the model needs to be trained and realigned repeatedly.

# 4. per-phoneme alignment time information
ali-to-phones --ctm-output=true final.mdl ark:ali.1 - | ~/kaldi/data/utils/int2sym.pl -f 5 phones.txt >ali.1.time

insert image description here

Training GMM—fsts.*.gz view

What is stored here is the FST network structure corresponding to each utterance, so we will not go into detail here.

# 1. decompress
gzip -d fsts.1.gz
# 2. view
fstcopy ark:fsts.1 ark,t:fsts.1.txt
vim fsts.1.txt

insert image description here

Training GMM—tree (decision tree) view
  • text view
tree-info tree

insert image description here
num-pdfs 683 means the number of decision-tree classes: 683 classes in total.
context-width / central-position describe the phone context window: for a monophone tree the width is 1 and the central position is 0; for a triphone tree the width is 3 and the central position is 1 (0 is the left phone, 1 the middle phone, 2 the right phone).
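
For reference, tree-info prints just three fields; on a monophone tree the output looks roughly like this (the pdf count will match your own tree):

tree-info tree
# num-pdfs 683
# context-width 1
# central-position 0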

  • visual decision tree
draw-tree phones.txt tree | dot -Gsize=8,10.5 -Tps | ps2pdf - tree.pdf

insert image description here

The decision tree can be drawn with the draw-tree command. Since the tree is very large, only part of it is shown here; you can open the PDF to view the whole thing. As can be seen in the figure above, the silence phone SIL corresponds to 5 leaf nodes (nodes 0-4), while the phone ua5 corresponds to 3 leaf nodes (nodes 5-7), which matches what was said above (5 states for silence phonemes, 3 states for non-silence phonemes).

align_si.sh is used for alignment

After the monophone model is trained, we can use it to align the training data. The alignment script is align_si.sh; this is the second step of the GMM training pipeline.

cd ~/kaldi/data
./steps/align_si.sh --nj 2 --cmd "run.pl" H/kaldi_file_test L/lang  H/mono H/mono_ali 

Detailed parameter explanation:
1st parameter: --nj, the number of parallel jobs (note: if features were extracted per speaker, nj must not exceed the number of speakers)
2nd parameter: --cmd "run.pl", run the jobs locally
3rd parameter: the feature folder (containing CMVN and the raw MFCC features)
4th parameter: the lang folder (the lang folder inside the L folder)
5th parameter: the monophone model folder
6th parameter: the output folder for the aligned data

insert image description here

Training GMM—viewing the contents generated in mono_ali

insert image description here

# 1. decompress
gzip -d ali.1.gz
# 2. generate the per-phoneme alignment time information
ali-to-phones --ctm-output=true final.mdl ark:ali.1 - | ~/kaldi/data/utils/int2sym.pl -f 5 phones.txt >ali.1.time

insert image description here

ps: Similarly, we can also use the ali-to-phones command to view the alignment information.

Training GMM—comparing mono and mono_ali alignment information

insert image description here

We can compare this new alignment with the alignment produced during monophone training. The two files have almost the same number of lines, and most of the alignment entries differ very little, which shows that the alignment ability of the monophone model has more or less plateaued; to improve it further we need new algorithms (better models).
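
One way to make such a comparison concrete (a sketch, assuming the folder layout used above) is to dump both alignments to the frame level and diff them:

# dump per-frame phone alignments from both folders and compare
ali-to-phones --per-frame=true H/mono/final.mdl \
  "ark:gunzip -c H/mono/ali.1.gz|" ark,t:- > mono.frames
ali-to-phones --per-frame=true H/mono_ali/final.mdl \
  "ark:gunzip -c H/mono_ali/ali.1.gz|" ark,t:- > mono_ali.frames
diff mono.frames mono_ali.frames | wc -l   # rough count of differing lines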


Original post: blog.csdn.net/yxn4065/article/details/129146948