END-TO-END DNN BASED SPEAKER RECOGNITION INSPIRED BY I-VECTOR AND PLDA

Johan Rohdin, Anna Silnova, Mireia Diez, Oldřich Plchot, Pavel Matějka, Lukáš Burget

Brno University of Technology, Brno, Czech Republic (from ICASSP 2018)

Abstract

Recently, several DNN-based end-to-end speaker verification systems have been proposed. These systems have been shown to be competitive on text-dependent tasks as well as on text-independent tasks with short utterances. However, on text-independent tasks with long utterances, end-to-end systems are still outperformed by i-vector+PLDA systems. In this work, we develop an end-to-end speaker verification system that is initialized to mimic an i-vector+PLDA system. The system is then further trained end-to-end, with regularization that keeps it from deviating too far from the initial system. This also reduces overfitting, which usually limits the performance of end-to-end systems. The proposed system outperforms the i-vector+PLDA baseline on tasks with both long and short utterances.

INTRODUCTION

In recent years, there have been many attempts to use neural networks (NNs) in speaker verification (SV). Most of them replace or improve one component of an i-vector+PLDA system (feature extraction, sufficient-statistics collection, i-vector extraction, or PLDA). For example, NN bottleneck features have been used instead of conventional MFCC features [1], an NN acoustic model instead of a GMM for collecting sufficient statistics [2], and NNs to improve [3, 4] or replace [5] PLDA. Going further, NNs can take the frame-level features of an utterance as input and produce an utterance-level representation, usually referred to as an embedding [6, 7, 8, 9, 10, 11]. The embedding is obtained via a pooling mechanism, for example taking the mean over the frame-level outputs of one or more NN layers [6], or using an RNN [7]. One effective approach is to train the NN as a multiclass speaker classifier [6, 10, 11]; for SV, the extracted embeddings are then scored with a standard backend such as PLDA. Ideally, NNs should be trained directly for the SV task, i.e., as a binary accept/reject classifier on pairs of utterances [7, 8, 9]. Such systems are known as end-to-end systems and have proven competitive on text-dependent (TD) tasks [7, 8], as well as on text-independent (TI) tasks with short test utterances and large amounts of training data [9]. However, on TI tasks with long utterances, i-vector+PLDA still outperforms end-to-end systems.
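
To make the pooling idea concrete, below is a minimal sketch of mean-pooling embedding extraction in the spirit of [6]. The single hidden layer, its dimensions, and the untrained random weights are illustrative assumptions, not the architecture of any of the cited systems.

```python
# A minimal mean-pooling sketch (assumed dimensions, untrained random weights):
# frame-level features pass through one feed-forward layer, and the frame-level
# outputs are averaged over time to give a fixed-size utterance embedding.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical frame-level layer: 40-dim features -> 512-dim hidden outputs.
W = rng.standard_normal((40, 512)) * 0.01
b = np.zeros(512)

def extract_embedding(frames):
    """frames: (num_frames, 40) array of acoustic features for one utterance."""
    hidden = relu(frames @ W + b)   # frame-level outputs, (num_frames, 512)
    return hidden.mean(axis=0)      # mean pooling over time -> (512,)

utterance = rng.standard_normal((300, 40))  # e.g. 3 s of features at 100 frames/s
print(extract_embedding(utterance).shape)   # (512,)
```

In a trained system, the pooled vector (or a further layer applied to it) would serve as the embedding that a backend such as PLDA scores.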

One possible reason why end-to-end systems underperform on long-utterance TI tasks is overfitting to the training data. A second possible reason is that previous works trained their NNs on short utterances, which can create a mismatch between training and test data.

In this work, we develop an end-to-end SV system that is initialized to mimic an i-vector+PLDA system. The system consists of an NN module that extracts sufficient statistics (f2s, features to statistics), an NN module that extracts i-vectors (s2i, statistics to i-vectors), and finally a discriminative PLDA model [12, 13] for scoring. These three modules are first trained independently, each to mimic the corresponding part of the i-vector+PLDA baseline. The modules are then combined and further trained end-to-end on both long and short utterances. During this end-to-end training, the model parameters are regularized so that they do not deviate too far from the baseline system, which also reduces the risk of overfitting. Moreover, training the three modules independently makes it easier to find a good architecture for each of them and to identify where the difficulties in end-to-end training arise.
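
To show how the modules could fit together, here is a simplified sketch of the f2s -> s2i -> scoring chain. It rests on stated assumptions rather than the paper's exact parametrization: a softmax layer stands in for the GMM frame posteriors in f2s, s2i is reduced to a single linear projection of count-normalized statistics, and scoring uses a generic quadratic form of the kind used in pairwise discriminative PLDA [12, 13]. All names, dimensions, and initializations are placeholders.

```python
# A minimal sketch of the three-module pipeline: f2s -> s2i -> PLDA-style scoring.
import numpy as np

rng = np.random.default_rng(0)
F, C, D = 40, 64, 100   # feature dim, number of "components", embedding dim

# f2s: frame features -> per-component posteriors -> zero/first-order statistics.
Wp = rng.standard_normal((F, C)) * 0.1

def f2s(frames):
    logits = frames @ Wp
    post = np.exp(logits - logits.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)   # (T, C) soft assignments
    n = post.sum(axis=0)                      # zero-order stats, (C,)
    f = post.T @ frames                       # first-order stats, (C, F)
    return n, f

# s2i: count-normalized first-order stats -> fixed-size, i-vector-like embedding.
T_mat = rng.standard_normal((C * F, D)) * 0.01

def s2i(n, f):
    f_norm = f / (n[:, None] + 1e-6)          # normalize by soft counts
    return f_norm.reshape(-1) @ T_mat         # (D,) embedding

# Scoring: quadratic form resembling pairwise discriminative PLDA [12, 13].
Lam = rng.standard_normal((D, D)) * 0.01
Gam = rng.standard_normal((D, D)) * 0.01
c = rng.standard_normal(D) * 0.01
k = 0.0

def plda_score(x1, x2):
    return (x1 @ Lam @ x2 + x2 @ Lam @ x1
            + x1 @ Gam @ x1 + x2 @ Gam @ x2
            + (x1 + x2) @ c + k)

u1 = rng.standard_normal((200, F))            # two dummy utterances
u2 = rng.standard_normal((350, F))
x1, x2 = s2i(*f2s(u1)), s2i(*f2s(u2))
print(plda_score(x1, x2))
```

Because every step is differentiable, gradients from a trial-level loss can flow back through the scoring, s2i, and f2s modules, which is what enables the joint end-to-end training described above.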

By evaluating on three datasets, we show that the proposed system outperforms i-vector+PLDA baselines, both generatively and discriminatively trained, on TI tasks with long as well as short utterances.

References

[1] A. Lozano-Diez, A. Silnova, P. Matějka, O. Glembek, O. Plchot, J. Pešán, L. Burget, and J. Gonzalez-Rodriguez, “Analysis and optimization of bottleneck features for speaker recognition,” in Proceedings of Odyssey 2016, 2016, vol. 2016, pp. 352–357, International Speech Communication Association.

[2] Y. Lei, N. Scheffer, L. Ferrer, and M. McLaren, “A novel scheme for speaker recognition using a phonetically-aware deep neural network,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1695–1699.

[3] S. Novoselov, T. Pekhovsky, O. Kudashev, V. S. Mendelev, and A. Prudnikov, “Non-linear PLDA for i-vector speaker verification,” in Interspeech 2015, Sept 2015, pp. 214–218.

[4] G. Bhattacharya, J. Alam, P. Kenny, and V. Gupta, “Modelling speaker and channel variability using deep neural networks for robust speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), San Diego, CA, USA, December 13–16, 2016.

[5] O. Ghahabi and J. Hernando, “Deep belief networks for i-vector based speaker recognition,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 1700–1704.

[6] E. Variani, X. Lei, E. McDermott, I. L. Moreno, and J. Gonzalez-Dominguez, “Deep neural networks for small footprint text-dependent speaker verification,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2014, pp. 4052–4056.

[7] G. Heigold, I. Moreno, S. Bengio, and N. Shazeer, “End-to-end text-dependent speaker verification,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), March 2016, pp. 5115–5119.

[8] S. X. Zhang, Z. Chen, Y. Zhao, J. Li, and Y. Gong, “End-to-end attention based text-dependent speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 171–178.

[9] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, “Deep neural network-based speaker embeddings for end-to-end speaker verification,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec 2016, pp. 165–170.

[10] G. Bhattacharya, J. Alam, and P. Kenny, “Deep speaker embeddings for short-duration speaker verification,” in Interspeech 2017, Aug 2017, pp. 1517–1521.

[11] D. Snyder, D. Garcia-Romero, D. Povey, and S. Khudanpur, “Deep neural network embeddings for text-independent speaker verification,” in Interspeech 2017, Aug 2017.

[12] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matějka, and N. Brümmer, “Discriminatively trained probabilistic linear discriminant analysis for speaker verification,” in Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, CZ, May 2011.

[13] S. Cumani, N. Brümmer, L. Burget, P. Laface, O. Plchot, and V. Vasilakis, “Pairwise discriminative speaker verification in the i-vector space,” IEEE Transactions on Audio, Speech and Language Processing, vol. 21, no. 6, pp. 1217–1227, June 2013.

[14] L. Ferrer, H. Bratt, L. Burget, H. Černocký, O. Glembek, M. Graciarena, A. Lawson, Y. Lei, P. Matějka, O. Plchot, et al., “Promoting robustness for speaker modeling in the community: the PRISM evaluation set,” https://code.google.com/p/prism-set/, 2012.

[15] O. Plchot, P. Matějka, A. Silnova, O. Novotný, M. Diez, J. Rohdin, O. Glembek, N. Brümmer, A. Swart, J. Jorrín-Prieto, P. García, L. Buera, P. Kenny, J. Alam, and G. Bhattacharya, “Analysis and Description of ABC Submission to NIST SRE 2016,” in Interspeech 2017, Stockholm, Sweden, 2017.

[16] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, “Front-end factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 4, pp. 788–798, 2011.

[17] “The 2016 NIST speaker recognition evaluation plan (sre16),” https://www.nist.gov/file/325336.

[18] J. Rohdin, A. Silnova, M. Diez, O. Plchot, P. Matějka, and L. Burget, “End-to-end DNN Based Speaker Recognition Inspired by i-vector and PLDA,” arXiv e-prints, Oct. 2017.

[19] Theano Development Team, “Theano: A Python framework for fast computation of mathematical expressions,” arXiv e-prints, vol. abs/1605.02688, May 2016.

[20] M. Karafiát, F. Grézl, K. Veselý, M. Hannemann, I. Szőke, and J. Černocký, “BUT 2014 Babel system: Analysis of adaptation in NN based systems,” in Proceedings of Interspeech 2014, 2014, pp. 3002–3006, International Speech Communication Association.

[21] D. A. Reynolds, T. F. Quatieri, and R. B. Dunn, “Speaker verification using adapted Gaussian mixture models,” Digital Signal Processing, vol. 10, pp. 19–41, 2000.

[22] J. Rohdin, S. Biswas, and K. Shinoda, “Robust discriminative training against data insufficiency in PLDA-based speaker verification,” Computer Speech & Language, vol. 35, pp. 32–57, 2016.

[23] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.

Summary:

1. Speaker recognition systems are sensitive to the duration of the test utterances. What, then, counts as a long or a short utterance here?

    long: > 2 min

    short: < 40 s

2. Jointly training multiple modules gives better results than training each module in isolation.

3. Regularization is applied during training to avoid overfitting; a minimal sketch of such an objective is given below.
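
As a concrete illustration of point 3, here is a minimal sketch of an end-to-end objective with an L2 penalty pulling the parameters toward their i-vector+PLDA-mimicking initialization. The binary cross-entropy over trial scores is one common choice of verification loss and the names below are hypothetical; this is not necessarily the paper's exact objective.

```python
# A sketch of a regularized end-to-end loss (hypothetical names, assumed form):
# binary cross-entropy over same/different-speaker trials, plus an L2 penalty
# that keeps the parameters close to their initialization.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def e2e_loss(scores, labels, params, init_params, lam=1e-3):
    """scores: raw trial scores; labels: 1 = same speaker, 0 = different.
    params / init_params: lists of parameter arrays, aligned by position."""
    p = sigmoid(scores)
    bce = -np.mean(labels * np.log(p + 1e-12)
                   + (1 - labels) * np.log(1 - p + 1e-12))
    reg = sum(np.sum((w - w0) ** 2) for w, w0 in zip(params, init_params))
    return bce + lam * reg

scores = np.array([2.1, -0.7, 0.3])          # dummy trial scores
labels = np.array([1, 0, 1])                 # dummy trial labels
params = [np.ones((2, 2))]                   # current parameters
init_params = [np.ones((2, 2)) * 0.9]        # baseline-mimicking initialization
print(e2e_loss(scores, labels, params, init_params))
```

The weight lam trades off fitting the training trials against staying close to the baseline: larger values keep the end-to-end system closer to its initialization and reduce the risk of overfitting.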

Reposted from blog.csdn.net/qq_40767896/article/details/84971384