Neural Networks: CNN Architecture and Its Application to Speech Recognition

                       

I. Basic structure

Introductory tutorial: https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Reference: Chapter 9 of Deep Learning (Ian Goodfellow et al.)
cross-correlation: $S(i,j) = (I * K)(i,j) = \sum_{m} \sum_{n} I(i+m,\, j+n)\, K(m,n)$
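
To make the cross-correlation formula concrete, here is a minimal pure-Python sketch. It assumes "valid" mode (no padding, stride 1) and plain nested lists; the function name `cross_correlate2d` is hypothetical and chosen only for illustration.

```python
def cross_correlate2d(image, kernel):
    """Valid-mode 2-D cross-correlation: S(i,j) = sum_m sum_n I(i+m, j+n) * K(m,n)."""
    img_h, img_w = len(image), len(image[0])
    ker_h, ker_w = len(kernel), len(kernel[0])
    # Without padding, the output shrinks by (kernel size - 1) along each axis.
    out_h, out_w = img_h - ker_h + 1, img_w - ker_w + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            total = 0.0
            for m in range(ker_h):
                for n in range(ker_w):
                    # The kernel is NOT flipped; flipping it would give true convolution.
                    total += image[i + m][j + n] * kernel[m][n]
            output[i][j] = total
    return output


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = cross_correlate2d(image, kernel)
print(result)  # [[-4.0, -4.0], [-4.0, -4.0]]
```

Each output entry here is the top-left value of a 2x2 patch minus its bottom-right value, which is why every entry equals -4 for this evenly spaced image.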

(II) CTC-CNN

 

Zhang, Y., Pezeshki, M., Brakel, P., Zhang, S., Laurent, C., Bengio, Y., Courville, A. (2016) Towards End-to-End Speech Recognition with Deep Convolutional Neural Networks. Proc. Interspeech 2016, 410-414.

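Performance is comparable to LSTM, with a 2.5x speedup at the same number of parameters.
The earlier LSTM network is replaced with a CNN followed by fully connected layers, and the top layer is trained with the CTC criterion.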
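
As a rough illustration of what the CTC criterion computes on top of the network outputs, the sketch below implements the standard CTC forward recursion in pure Python. It is not the paper's implementation: the function name `ctc_negative_log_likelihood`, the choice of index 0 for the blank symbol, and the toy inputs are all assumptions made for this example. It works in probability space, so it is only suitable for short sequences.

```python
import math


def ctc_negative_log_likelihood(frame_probs, labels, blank=0):
    """CTC loss for one utterance.

    frame_probs: per-frame probability lists (one row per time step,
                 one column per output symbol, blank included).
    labels:      target symbol indices, without blanks.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    extended = [blank]
    for label in labels:
        extended += [label, blank]
    num_states = len(extended)

    # alpha[s] = total probability of all partial alignments ending in state s.
    alpha = [0.0] * num_states
    alpha[0] = frame_probs[0][blank]
    if num_states > 1:
        alpha[1] = frame_probs[0][extended[1]]

    for probs in frame_probs[1:]:
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            if s >= 2 and extended[s] != blank and extended[s] != extended[s - 2]:
                total += alpha[s - 2]             # skip the blank between two different labels
            new_alpha[s] = total * probs[extended[s]]
        alpha = new_alpha

    # Valid alignments end in the last label or the trailing blank.
    likelihood = alpha[-1] + (alpha[-2] if num_states > 1 else 0.0)
    return -math.log(likelihood)


# Toy case: 2 frames, 2 symbols (blank and 'a'), uniform outputs, target "a".
toy_probs = [[0.5, 0.5], [0.5, 0.5]]
loss = ctc_negative_log_likelihood(toy_probs, labels=[1])
```

In the toy case three alignments collapse to "a" (a-a, a-blank, blank-a), each with probability 0.25, so the loss is -log(0.75). A practical implementation would run the same recursion in log space to avoid underflow on long utterances.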

 

W. Song and J. Cai, "End-to-End Deep Neural Network for Automatic Speech Recognition," Technical Report, Stanford, 2015.

CNNs are exceptionally good at capturing high-level features in the spatial domain and have demonstrated unparalleled success in computer vision tasks. One natural advantage of using a CNN is that it is invariant to translations in frequency, which are commonly observed across speakers whose pitch differs because of age or gender.
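A time window is applied over the data frames to obtain a single-channel image, and 5x3 filters are used because the frequency dimension is longer than the time dimension.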
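
The windowing step can be sketched as follows. The report's exact window length, feature dimension, and edge handling are not given here, so the values below (8 frequency bins, a window of 5 frames, repeating edge frames, no padding, stride 1) and the helper names `frames_to_image` and `valid_output_shape` are assumptions for illustration only.

```python
def frames_to_image(frames, center, half_window):
    """Stack the frames around `center` into a single-channel (frequency x time) image."""
    num_frames = len(frames)
    num_bins = len(frames[0])
    columns = []
    for t in range(center - half_window, center + half_window + 1):
        # Assumption: repeat the first/last frame when the window runs past the edges.
        clamped = min(max(t, 0), num_frames - 1)
        columns.append(frames[clamped])
    # Transpose so that rows index frequency and columns index time.
    return [[columns[c][f] for c in range(len(columns))] for f in range(num_bins)]


def valid_output_shape(image_shape, filter_shape=(5, 3)):
    """Feature-map size for one filter with no padding and stride 1 (assumed)."""
    return (image_shape[0] - filter_shape[0] + 1,
            image_shape[1] - filter_shape[1] + 1)


# Toy input: 6 frames with 8 frequency bins each; value = 10 * frame index + bin index.
frames = [[10 * t + f for f in range(8)] for t in range(6)]
image = frames_to_image(frames, center=0, half_window=2)   # 8 x 5 image
shape = valid_output_shape((len(image), len(image[0])))    # 5x3 filter: 5 along frequency, 3 along time
```

With these toy sizes the image is 8 (frequency) by 5 (time), and a 5x3 filter yields a 4x3 feature map. The taller-than-wide filter mirrors the taller-than-wide input.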
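A frame-level classifier is first trained with CNN + softmax. The CNN parameters are then frozen, and the softmax is replaced by DNN + RNN + CTC for CTC training. Pretraining the CNN in this way gives somewhat better results than training with CTC directly.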

           


Reposted from blog.csdn.net/qq_44944990/article/details/89420033