[Computer Science] [2012.12] Training Deep Neural Networks for Bottleneck Feature Extraction


This is a master's thesis (52 pages) by Jonas Gehring at the Karlsruhe Institute of Technology, Germany.

In automatic speech recognition systems, preprocessing the audio signal to generate features is an important part of achieving a good recognition rate. Previous works have shown that artificial neural networks can be used to extract good, discriminative features that yield better recognition performance than manually engineered feature extraction algorithms. One possible approach is to train a network with a small bottleneck layer, and then use the activations of the units in this layer to produce feature vectors for the remaining parts of the system.

Deep learning is a field of machine learning that deals with efficient training algorithms for neural networks with many hidden layers, and with the automatic discovery of relevant features from data. While most frequently applied in computer vision, multiple recent works have demonstrated the ability of deep networks to achieve superior performance on speech recognition tasks as well.

In this work, a novel approach for extracting bottleneck features from deep neural networks is proposed. A stack of denoising auto-encoders is first trained in a layer-wise, unsupervised manner. Afterwards, the stack is transformed into a feed-forward neural network, and a bottleneck layer, an additional hidden layer, and a classification layer are added. The whole network is then fine-tuned to estimate phonetic target states in order to generate discriminative features in the bottleneck layer. Multiple experiments on conversational telephone speech in Cantonese show that the proposed architecture can effectively leverage the increased capacity introduced by deep neural networks, generating more useful features that result in better recognition performance. Experiments confirm that this ability depends heavily on initializing the stack of auto-encoders with pre-training. Extracting features from log mel scale filterbank coefficients yields additional gains compared to features from cepstral coefficients. Further, small improvements can be achieved by pre-training the auto-encoders with more data, which is an interesting property for settings where only a small amount of transcribed data is available. Evaluations on larger datasets result in significant reductions in recognition error rates (8% to 10% relative) over baseline systems using standard features, demonstrating the general applicability of the proposed architecture.
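The pipeline described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the thesis code: layer sizes, the learning rate, and the 20% masking-noise level are arbitrary demo choices, and the supervised fine-tuning stage (which trains the whole network, including an extra hidden layer and a phone-state classifier, against phonetic targets) is only noted in a comment.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class DenoisingAutoencoder:
    """One layer of the unsupervised pre-training stack: corrupt the input
    with masking noise, encode, reconstruct, and descend the squared
    reconstruction error (tied encoder/decoder weights)."""

    def __init__(self, n_in, n_hidden, corruption=0.2, lr=0.1):
        self.W = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.b_enc = np.zeros(n_hidden)
        self.b_dec = np.zeros(n_in)
        self.corruption = corruption
        self.lr = lr

    def encode(self, x):
        return sigmoid(x @ self.W + self.b_enc)

    def train_step(self, x):
        # Masking noise: zero out a random fraction of the inputs.
        noisy = x * (rng.random(x.shape) > self.corruption)
        h = self.encode(noisy)
        recon = sigmoid(h @ self.W.T + self.b_dec)
        # Backpropagate the squared error through the tied-weight pair.
        d_out = (recon - x) * recon * (1.0 - recon)
        d_hid = (d_out @ self.W) * h * (1.0 - h)
        grad_W = noisy.T @ d_hid + d_out.T @ h
        n = len(x)
        self.W -= self.lr * grad_W / n
        self.b_enc -= self.lr * d_hid.sum(axis=0) / n
        self.b_dec -= self.lr * d_out.sum(axis=0) / n

def pretrain_stack(data, layer_sizes, epochs=50):
    """Greedy layer-wise pre-training: each auto-encoder is trained on the
    hidden representation produced by the layer below it."""
    stack, inp = [], data
    for n_hidden in layer_sizes:
        dae = DenoisingAutoencoder(inp.shape[1], n_hidden)
        for _ in range(epochs):
            dae.train_step(inp)
        stack.append(dae)
        inp = dae.encode(inp)
    return stack

def bottleneck_features(stack, W_bn, b_bn, x):
    """Forward the input through the pre-trained stack and a small
    bottleneck layer; the bottleneck activations are the features.
    (In the thesis, the whole network is first fine-tuned to estimate
    phonetic target states before features are extracted.)"""
    h = x
    for dae in stack:
        h = dae.encode(h)
    return sigmoid(h @ W_bn + b_bn)

# Demo: 100 random 20-dimensional "frames" -> 8-dimensional features.
frames = rng.random((100, 20))
stack = pretrain_stack(frames, [64, 64])
W_bn = rng.normal(0.0, 0.1, (64, 8))
b_bn = np.zeros(8)
feats = bottleneck_features(stack, W_bn, b_bn, frames)
print(feats.shape)  # (100, 8): one feature vector per input frame
```

The bottleneck layer is deliberately much narrower than the other hidden layers, which forces the network to compress the information that is useful for predicting phonetic states into a low-dimensional representation.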

1 Introduction

2 Background

3 Bottleneck Features from Deep Neural Networks

4 Experiments

5 Conclusion

Appendix A Training Algorithms

Appendix B Neural Network Training on Graphics Processors

Download link for the original English thesis:

http://page4.dfpan.com/fs/7lc4j232152931679f1/


Reposted from blog.csdn.net/weixin_42825609/article/details/90045464