HTK-based voice dialing system



  NWPU

2011-6-22

Aims:

The system recognizes spoken continuous digit strings and several groups of names. Because it uses sub-word models (e.g. phonemes), it is reasonably extensible: to add a new name, only the pronunciation dictionary and the task grammar need to be modified. The acoustic models are continuous-density Gaussian-mixture HMMs, built as tied-state triphones whose states are clustered with phonetic decision trees.

Contents:

1. Data Preparation

(1) Define the task grammar

(2) Define the dictionary

(3) Record the speech data

(4) Label the data to obtain the ground-truth files

(5) Extract feature data

2. Create monophone HMMs

(6) Create flat-start monophones

(7) Fix the silence models

(8) Realign the training data

3. Create tied-state triphone HMMs

(9) Make triphone HMMs

(10) Tie the triphone states

4. Recognition Evaluation

(11) Verify the test results

Steps:

1. Data Preparation

Both training data and test data must be recorded. For alignment, the data also need text transcriptions; here the ground-truth text is generated from the task grammar. To process the training data, a phone set and a dictionary covering every word in the training and test data must also be defined.

(1) Define the task grammar

The task grammar is defined as regular expressions with variables and is stored in the file Gram (written by hand in Notepad++ or UltraEdit; the file must end with a blank line).


This grammar is a high-level representation; HParse must convert it into the low-level representation that HTK uses.

Run the command: HParse Gram wdnet


The low-level representation is stored in the file wdnet (generated by the HParse tool).
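
As a rough sketch of what such a grammar can look like, the Python helper below writes one in the style of the HTK Book voice-dialing tutorial. The digit words, the DIAL/PHONE/CALL command words, and the example names are assumptions for illustration, not the contents of the author's Gram file.

```python
def build_grammar(names):
    """Return task-grammar text for digit dialing plus calling by name.

    `names` is a list of upper-case name words (hypothetical examples).
    """
    digits = ["ONE", "TWO", "THREE", "FOUR", "FIVE", "SIX",
              "SEVEN", "EIGHT", "NINE", "OH", "ZERO"]
    lines = [
        "$digit = " + " | ".join(digits) + ";",
        "$name = " + " | ".join(names) + ";",
        # <...> means one or more repetitions, | separates alternatives
        "( SENT-START ( DIAL <$digit> | (PHONE|CALL) $name ) SENT-END )",
    ]
    # The file is expected to end with a blank line.
    return "\n".join(lines) + "\n\n"


if __name__ == "__main__":
    with open("Gram", "w") as handle:
        handle.write(build_grammar(["ALICE", "BOB"]))
```

Adding a new name then only means extending the list passed to build_grammar, rerunning HParse, and adding the pronunciation to the dictionary.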

 

(2) Define the dictionary

The BEEP pronunciation dictionary (ready-made) is used, with the stress marks removed.

An sp (short pause) is appended to the end of each pronunciation. If a pronunciation contains silence markers, the MP command merges sil and sp into sil. These processing commands are placed in the hand-written script global.ded.


The file wlist (written by hand here, because the system involves few words) is a sorted list of all the words that appear in the task grammar.


The file names contains the pronunciations of the proper names (written by hand, and including SENT-START and SENT-END).


Run HDMan:

HDMan -m -w lists/wlist -g global.ded -n lists/monophones0 -l dlog dict/dict1 dict/beep dict/names


The resulting file monophones0 is the list of phones used (including sp). The log file dlog contains statistics about the generated dictionary dict1 and reports any missing words. In the generated task dictionary dict1, the entries SENT-END and SENT-START must be edited by hand to add the no-output symbol.


To avoid warnings in dlog, create edit scripts with the same names as names and beep in the same directory; they may be empty.
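
The manual edit of dict1 can also be scripted. The sketch below assumes the no-output symbol is written as [] between the word and its pronunciation, as in the HTK Book tutorial dictionary; the sample entries are made up for illustration.

```python
def add_no_output_flag(dict_lines, words=("SENT-START", "SENT-END")):
    """Insert the [] output symbol after the listed words if it is missing."""
    patched = []
    for line in dict_lines:
        parts = line.split()
        needs_flag = (
            bool(parts)
            and parts[0] in words
            and (len(parts) < 2 or parts[1] != "[]")
        )
        if needs_flag:
            parts.insert(1, "[]")
            patched.append(" ".join(parts))
        else:
            patched.append(line)  # leave every other entry untouched
    return patched


if __name__ == "__main__":
    sample = ["CALL k ao l sp", "SENT-END sil", "SENT-START sil"]
    for entry in add_no_output_flag(sample):
        print(entry)
```

Running the function twice leaves the entries unchanged, so it is safe to apply after every HDMan run.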

 

(3) Record the speech data

The HSGen tool generates sentences that conform to the task grammar; they are used as prompts for recording:

HSGen -l -n 10 wdnet dict/dict1 > labels/trainprompts

HSGen -l -n 10 wdnet dict/dict1 > labels/testprompts

Following the prompt files generated above, record the corresponding 10 training speech files and 10 test speech files. An example recording command:

HSLab ./data/Train/speech/S0001

 

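(4) Label the data to obtain the ground-truth files

The (ready-made) perl script prompts2mlf converts the recording prompts into the word-level ground-truth files trainwords_2.mlf and testwords_2.mlf:

perl scripts/prompts2mlf labels/trainwords_2.mlf labels/trainprompts

perl scripts/prompts2mlf labels/testwords_2.mlf labels/testprompts

Note: append the contents of the generated files trainwords_2.mlf and testwords_2.mlf to the end of trainwords_1.mlf and testwords_1.mlf, using label names of the form "*/S0*.lab", and save the results as trainwords.mlf and testwords.mlf.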
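
For readers without perl, the conversion can be approximated in Python. This is only a sketch of the word-level MLF layout, not the prompts2mlf script itself; it assumes each prompt line starts with an utterance ID such as S0001 followed by the words, and that labels should be named "*/ID.lab".

```python
def prompts_to_mlf(prompt_lines):
    """Turn prompt lines ("ID WORD WORD ...") into word-level MLF text."""
    out = ["#!MLF!#"]                      # mandatory MLF header
    for line in prompt_lines:
        fields = line.split()
        if not fields:
            continue                       # skip blank lines
        utt_id, words = fields[0], fields[1:]
        out.append('"*/%s.lab"' % utt_id)  # label name pattern
        out.extend(words)                  # one word per line
        out.append(".")                    # end of this utterance
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(prompts_to_mlf(["S0001 DIAL ONE TWO", "S0002 CALL BOB"]), end="")
```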

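The label editor HLEd converts the word-level ground-truth text (word-level MLF) into the phone-level ground-truth text (phone-level MLF) phones0.mlf:

HLEd -l * -d dict/dict1 -i labels/phones0.mlf mkphones0.led labels/trainwords.mlf

The edit script mkphones0.led contains three commands:


The EX command expands each word according to the dictionary dict1, IS inserts a silence marker at the start and end of every utterance, and the DE line deletes sp so that the words in phones0.mlf are not separated by sp.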
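
In the HTK Book tutorial the three commands are written as EX, IS sil sil, and DE sp; it is assumed here that the author's script is the same. The sketch below mimics their effect on one utterance in plain Python. The lexicon entries are hypothetical pronunciations used only for illustration.

```python
# Assumed script contents, following the HTK Book tutorial.
MKPHONES0 = ["EX", "IS sil sil", "DE sp"]
MKPHONES1 = MKPHONES0[:-1]  # same script without the final DE line


def expand_to_phones(words, lexicon, keep_sp=False):
    """Mimic EX / IS sil sil / DE sp on a single word sequence."""
    phones = []
    for word in words:
        phones.extend(lexicon[word])       # EX: expand via the dictionary
    phones = ["sil"] + phones + ["sil"]    # IS sil sil: add utterance markers
    if not keep_sp:
        phones = [p for p in phones if p != "sp"]  # DE sp: drop short pauses
    return phones


if __name__ == "__main__":
    lexicon = {"DIAL": ["d", "ay", "ax", "l", "sp"],
               "ONE": ["w", "ah", "n", "sp"]}
    print(expand_to_phones(["DIAL", "ONE"], lexicon))
    print(expand_to_phones(["DIAL", "ONE"], lexicon, keep_sp=True))
```

Calling it with keep_sp=True corresponds to a script without the DE line, which keeps sp between words.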

 

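(5) Extract feature data

The features used here are MFCCs. The HCopy tool performs the feature extraction:

HCopy -T 1 -C config/config1 -S codetr.scp

The configuration file config1 must set the conversion parameters. Its contents are as follows: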

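# Coding parameters

    TARGETKIND = MFCC_0_D_A           // target parameter kind

    TARGETRATE = 100000.0             // target frame period: 10 ms, i.e. 100 frames per second

    SOURCEFORMAT = WAV                // source file format

    SAVECOMPRESSED = T                // store the output in compressed form

    SAVEWITHCRC = T                   // append a checksum to the output parameters

    ZMEANSOURCE = TRUE

    SOURCERATE = 208                  // sample period of the source files

    WINDOWSIZE = 250000.0             // 25 ms analysis window per frame

    USEHAMMING = T                    // apply a Hamming window

    PREEMCOEF = 0.97                  // pre-emphasis coefficient

    NUMCHANS = 26                     // 26 filterbank channels

    CEPLIFTER = 22                    // cepstral liftering coefficient

    NUMCEPS = 12                      // number of cepstral coefficients

    ENORMALISE = F                    // do not normalise the log energy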
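
HTK expresses times in units of 100 ns, which makes the numbers above hard to read at a glance. The short calculation below converts them and also derives the feature vector size implied by MFCC_0_D_A; the helper names are made up for this sketch.

```python
HTK_UNIT_SECONDS = 100e-9  # HTK time values are given in 100 ns units


def htk_to_ms(value):
    """Convert an HTK duration (100 ns units) to milliseconds."""
    return value * HTK_UNIT_SECONDS * 1e3


def sample_rate_hz(source_rate):
    """Convert an HTK sample period (100 ns units) to a rate in Hz."""
    return 1.0 / (source_rate * HTK_UNIT_SECONDS)


def feature_dim(numceps, c0=True, delta=True, accel=True):
    """Vector size: static coefficients (+ c0), times 1, 2 or 3 streams."""
    static = numceps + (1 if c0 else 0)
    return static * (1 + int(delta) + int(accel))


if __name__ == "__main__":
    print("frame period (ms):", htk_to_ms(100000.0))
    print("window size (ms):", htk_to_ms(250000.0))
    print("sample rate (Hz):", round(sample_rate_hz(208)))
    print("vector size:", feature_dim(12))
```

With NUMCEPS = 12 plus the 0th coefficient, and delta and acceleration streams, each frame has 13 x 3 = 39 components, which is the vector size the prototype model has to declare.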

 

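The script file codetr.scp required by this command can be generated as follows. In a DOS prompt, change to the directory containing the wav files and run dir/b/s > wav.scp to write all wav file names into wav.scp (remember to delete the extra line). Then, in Notepad++, build codetr.scp so that each line holds a source wav file followed by its target feature file. (Note: the paths in the generated wav.scp are absolute; they can be changed to relative paths by hand.)


codetr.scp lists the input and output files for training. When the command runs, HCopy extracts features from each speech file on the left of codetr.scp according to the configuration in config1 and stores them in the feature file on the right.

The test data are processed in the same way:

HCopy -T 1 -C config/config1 -S codete.scp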
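
Building the two-column lists by hand is error-prone, so a small script can help. The following is a minimal sketch, assuming forward-slash relative paths, a .mfc extension for feature files, and example directory names that are not taken from the original setup.

```python
from pathlib import PurePosixPath


def make_code_scp(wav_paths, feat_dir):
    """Return HCopy script text: '<source wav> <target feature file>' per line."""
    lines = []
    for wav in sorted(wav_paths):
        stem = PurePosixPath(wav).stem          # e.g. "S0001"
        lines.append("%s %s/%s.mfc" % (wav, feat_dir, stem))
    return "\n".join(lines) + "\n"


def make_train_scp(wav_paths, feat_dir):
    """Return the single-column list of feature files used for training."""
    stems = sorted(PurePosixPath(wav).stem for wav in wav_paths)
    return "".join("%s/%s.mfc\n" % (feat_dir, stem) for stem in stems)


if __name__ == "__main__":
    wavs = ["data/Train/speech/S0002.wav", "data/Train/speech/S0001.wav"]
    print(make_code_scp(wavs, "data/Train/mfcc"), end="")
    print(make_train_scp(wavs, "data/Train/mfcc"), end="")
```

The second function produces the kind of single-column list that the training tools read with -S.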

 

2. Create monophone HMMs

(6) Create flat-start monophones

Define a prototype model proto:
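
A prototype of the usual shape can be generated rather than typed. The sketch below follows the HTK Book tutorial layout (5 states, of which 3 are emitting, left-to-right transitions, zero means and unit variances); the 39-dimensional MFCC_0_D_A vector matches the coding parameters used in this post, while the initial transition values are the tutorial's and are only assumed to match the author's file.

```python
def make_proto(vec_size=39, kind="MFCC_0_D_A", name="proto"):
    """Return text of a 3-emitting-state left-to-right prototype HMM."""
    zeros = " ".join(["0.0"] * vec_size)
    ones = " ".join(["1.0"] * vec_size)
    lines = [
        "~o <VecSize> %d <%s>" % (vec_size, kind),
        '~h "%s"' % name,
        "<BeginHMM>",
        "<NumStates> 5",
    ]
    for state in (2, 3, 4):  # states 1 and 5 are non-emitting
        lines += [
            "<State> %d" % state,
            "<Mean> %d" % vec_size, zeros,
            "<Variance> %d" % vec_size, ones,
        ]
    lines += [
        "<TransP> 5",
        "0.0 1.0 0.0 0.0 0.0",
        "0.0 0.6 0.4 0.0 0.0",
        "0.0 0.0 0.6 0.4 0.0",
        "0.0 0.0 0.0 0.7 0.3",
        "0.0 0.0 0.0 0.0 0.0",
        "<EndHMM>",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    with open("proto", "w") as handle:
        handle.write(make_proto())
```

The actual means and variances do not matter at this point, because the flat-start initialisation overwrites them with global statistics.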

 

 

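The training file list train.scp is generated in the same way: in a DOS prompt, change to the directory containing the MFCC feature files and run dir/b/s > train.scp. Note that the extra line must be deleted in Notepad++ or UltraEdit.


Initialise the Gaussian parameters of the HMM with the global mean and variance:

HCompV -T 1 -C config/config1 -f 0.01 -m -S train.scp -M hmms/hmm0 proto

This creates an updated proto and a variance-floor macro file vFloors in the directory hmm0. From these two files in ./hmms/hmm0/, build by hand the master macro file hmmdefs and the file macros that holds the vFloors macros; see the HTK Book for the details.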
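
The hand-editing described in the HTK Book amounts to two copy operations: macros is the global options header of proto followed by vFloors, and hmmdefs is the body of proto repeated once per phone under that phone's name. The sketch below automates this on plain text; it assumes the header is everything before the ~h line of proto, and the tiny sample strings in the demo are placeholders, not real HCompV output.

```python
def build_macros_and_hmmdefs(proto_text, vfloors_text, phones):
    """Return (macros, hmmdefs) text built from proto and vFloors."""
    lines = proto_text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.startswith("~h"))
    header = lines[:start]       # global options (~o ...)
    body = lines[start + 1:]     # <BEGINHMM> ... <ENDHMM>

    macros = "\n".join(header) + "\n" + vfloors_text
    if not macros.endswith("\n"):
        macros += "\n"

    defs = []
    for phone in phones:
        defs.append('~h "%s"' % phone)  # same parameters, new model name
        defs.extend(body)
    return macros, "\n".join(defs) + "\n"


if __name__ == "__main__":
    proto = '~o\n<VECSIZE> 2<MFCC_0_D_A>\n~h "proto"\n<BEGINHMM>\n<ENDHMM>\n'
    vfloors = '~v "varFloor1"\n<VARIANCE> 2\n 0.1 0.1\n'
    macros, hmmdefs = build_macros_and_hmmdefs(proto, vfloors, ["sil", "ax"])
    print(macros, end="")
    print(hmmdefs, end="")
```

The phone list passed in would be the one without sp, since the sp model is only added later.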

 

 

Since the sp model is not used for now, delete sp from monophones0 to form the file monophones1, then re-estimate the parameters:

HERest -C config/config1 -I labels/phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmms/hmm0/macros -H hmms/hmm0/hmmdefs -M hmms/hmm1 lists/monophones1

As above, re-estimate two more times:

HERest -C ./config/config1 -I ./labels/phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H ./hmms/hmm1/macros -H ./hmms/hmm1/hmmdefs -M ./hmms/hmm2 ./lists/monophones1


HERest -C ./config/config1 -I ./labels/phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H ./hmms/hmm2/macros -H ./hmms/hmm2/hmmdefs -M ./hmms/hmm3 ./lists/monophones1
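
The three passes differ only in the source and target model directories, so the command lines can be generated instead of copied. This sketch only builds the argument lists (it does not run HTK); the function name and defaults are made up here, and the options are the ones used in the commands above.

```python
def herest_commands(first, last, mlf, phone_list,
                    config="config/config1", scp="train.scp"):
    """Build HERest argument lists for passes hmm<first> -> ... -> hmm<last>."""
    commands = []
    for n in range(first, last):
        src = "hmms/hmm%d" % n
        dst = "hmms/hmm%d" % (n + 1)
        commands.append([
            "HERest", "-C", config, "-I", mlf,
            "-t", "250.0", "150.0", "1000.0",   # pruning thresholds
            "-S", scp,
            "-H", src + "/macros", "-H", src + "/hmmdefs",
            "-M", dst, phone_list,
        ])
    return commands


if __name__ == "__main__":
    for cmd in herest_commands(0, 3, "labels/phones0.mlf", "lists/monophones1"):
        print(" ".join(cmd))
```

Each list can be handed to subprocess.run unchanged if the HTK tools are on the path.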

 

(7) Fix the silence models

Copy macros from hmm3 to hmm4. In hmmdefs, copy the sil model to the end of the file, rename the copy to sp, reduce it to 3 states, and save the result in hmm4.
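
The copy-and-trim edit can be sketched on the text of hmmdefs. The version below follows the HTK Book description, in which sp keeps only the centre state of sil; the upper-case tags are those HTK normally writes, and the 3x3 transition matrix values are a common starting choice assumed here, not taken from the author's files.

```python
import re


def add_sp_model(hmmdefs_text):
    """Append a 3-state sp model that reuses the centre state of sil."""
    sil = re.search(r'~h "sil"\s*\n(.*?<ENDHMM>)', hmmdefs_text,
                    re.S | re.I)
    if sil is None:
        raise ValueError("no sil model found")
    centre = re.search(r"<STATE>\s*3\s*\n(.*?)<STATE>\s*4", sil.group(1),
                       re.S | re.I)
    if centre is None:
        raise ValueError("sil has no state 3")
    sp_model = (
        '~h "sp"\n<BEGINHMM>\n<NUMSTATES> 3\n<STATE> 2\n'
        + centre.group(1)                 # mean/variance of sil state 3
        + "<TRANSP> 3\n"
        " 0.0 1.0 0.0\n"
        " 0.0 0.9 0.1\n"                  # assumed initial values
        " 0.0 0.0 0.0\n"
        "<ENDHMM>\n"
    )
    return hmmdefs_text.rstrip("\n") + "\n" + sp_model
```

A call such as add_sp_model(open("hmms/hmm3/hmmdefs").read()) returns the text to save as hmms/hmm4/hmmdefs; the paths are the ones used in this post.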

 

(1) Use HHEd to add the extra back-transitions:

HHEd -T 1 -H hmms/hmm4/macros -H hmms/hmm4/hmmdefs -M hmms/hmm5 sil.hed lists/monophones0

Modify mkphones0.led by removing its last line and save it as mkphones1.led, then use HLEd to obtain the phone-level ground-truth text that contains sp:

HLEd -l * -d ./dict/dict1 -i ./labels/phones1.mlf mkphones1.led ./labels/trainwords.mlf


(2) Re-estimate twice:

HERest -C config/config1 -I labels/phones1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmms/hmm5/macros -H hmms/hmm5/hmmdefs -M hmms/hmm6 lists/monophones0


HERest -C config/config1 -I labels/phones1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmms/hmm6/macros -H hmms/hmm6/hmmdefs -M hmms/hmm7 lists/monophones0

 

(8) Realign the training data

Make sure that the label paths in trainwords.mlf have the form "*/S0*.lab" and that the earlier 140 sentences are included. Modify dict1 by adding the entry silence sil, save it as dict2, and run HVite to perform Viterbi alignment:

HVite -l * -o SWT -b silence -C config/config1 -a -H hmms/hmm7/macros -H hmms/hmm7/hmmdefs -i labels/aligned.mlf -m -t 350.0 -y lab -I labels/trainwords.mlf -S train.scp dict/dict2 lists/monophones0


Re-estimate twice with HERest, saving the final models to hmm9:

HERest_3.4 -C config/config1 -I labels/aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmms/hmm7/macros -H hmms/hmm7/hmmdefs -M hmms/hmm8 lists/monophones0


HERest_3.4 -C config/config1 -I labels/aligned.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmms/hmm8/macros -H hmms/hmm8/hmmdefs -M hmms/hmm9 lists/monophones0

Now check the recognition rate at this stage:

HVite -H ./hmms/hmm9/macros -H ./hmms/hmm9/hmmdefs -S test.scp -l * -i ./results/recout_step9.mlf -w wdnet -p 0.0 -s 5.0 ./dict/dict2 ./lists/monophones0

HResults -I ./labels/testwords.mlf ./lists/monophones0 results/recout_step9.mlf
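
HResults summarises the comparison with two percentages. As a reminder of how they relate to the error counts, the sketch below computes them from the number of reference words N and the deletion, substitution, and insertion counts, using the definitions in the HTK Book (Correct = H/N, Accuracy = (H - I)/N, with H = N - D - S). The numbers in the demo are invented, not results from this system.

```python
def word_scores(n, deletions, substitutions, insertions):
    """Return (percent correct, percent accuracy) from error counts."""
    hits = n - deletions - substitutions          # H = N - D - S
    percent_correct = 100.0 * hits / n
    percent_accuracy = 100.0 * (hits - insertions) / n
    return percent_correct, percent_accuracy


if __name__ == "__main__":
    corr, acc = word_scores(n=100, deletions=2, substitutions=3, insertions=5)
    print("Corr=%.2f%%, Acc=%.2f%%" % (corr, acc))
```

Accuracy is the stricter figure because it also penalises inserted words, so it can be much lower than the correct rate when the recognizer over-generates.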

Reproduced from: https://my.oschina.net/dake/blog/196721


Origin blog.csdn.net/weixin_34355715/article/details/91508814