【Entropy】LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition (SCIE)

LGCCT: A Light Gated and Crossed Complementation Transformer for Multimodal Speech Emotion Recognition

Video
A video explanation of the article is available on Bilibili: https://www.bilibili.com/video/bv1FU4y1v7yY

Code
The source code is available at: https://github.com/ECNU-Cross-Innovation-Lab/LGCCT

Abstract
Semantically rich speech emotion recognition has attracted wide interest across a range of areas. Speech emotion recognition aims to recognize human emotional states from utterances containing both acoustic and linguistic information. Since both textual and audio patterns play essential roles in speech emotion recognition (SER) tasks, various works have proposed novel modality-fusion methods to exploit text and audio signals effectively. However, the high performance of most existing models depends on a large number of learnable parameters, and such models only work well on data of fixed length. Therefore, minimizing computational overhead and improving generalization to unseen data of various lengths while maintaining a certain level of recognition accuracy is a pressing application problem. In this paper, we propose LGCCT, a light gated and crossed complementation transformer for multimodal speech emotion recognition. First, our model fuses modality information efficiently: the acoustic features are extracted by a CNN-BiLSTM while the textual features are extracted by a BiLSTM, and the modality-fused representation is then generated by a cross-attention module. We apply a gate-control mechanism to achieve a balanced integration of the original modality representation and the modality-fused representation. Second, the degree of attention focus can be taken into account, since the uncertainty, i.e., the entropy, of the attention over the same token should converge to the same value independent of sequence length. To improve the generalization of the model to various test-sequence lengths, we adopt a length-scaled dot product to calculate the attention score, which can be interpreted from a theoretical view of entropy. The length-scaled dot product is cheap to compute yet effective. Experiments are conducted on the benchmark dataset CMU-MOSEI. Compared to the baseline models, our model achieves an 81.0% F1 score with only 0.432 M parameters, showing an improvement in the balance between performance and the number of parameters. Moreover, the ablation study confirms the effectiveness of our model and its scalability to various input-sequence lengths, where the relative improvement is almost 20% over the baseline without the length-scaled dot product.
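To make the gate-control fusion described in the abstract more concrete, here is a minimal PyTorch sketch that blends an original modality representation with the cross-attention-fused one through a learned sigmoid gate. The module name, layer sizes, and gating formula are illustrative assumptions, not the authors' exact implementation; the released repository above contains the real code.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Illustrative gate-controlled fusion (assumed formulation).

    Blends the original modality representation with the
    cross-attention-fused representation via an element-wise gate.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, original: torch.Tensor, fused: torch.Tensor) -> torch.Tensor:
        # g in (0, 1) decides, per feature dimension, how much of the
        # fused representation to keep versus the original one.
        g = torch.sigmoid(self.gate(torch.cat([original, fused], dim=-1)))
        return g * fused + (1.0 - g) * original
```

For example, given text features of shape (batch, seq_len, d_model) and audio-attended text features of the same shape, the gate produces a per-dimension mixture of the two, so neither the original nor the fused view dominates by construction.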

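The length-scaled dot product mentioned in the abstract can likewise be sketched as below, assuming the common formulation in which the attention logits are additionally multiplied by log(n), so that the entropy of the softmax distribution stays roughly stable as the sequence length n varies. The exact scaling used by LGCCT should be checked against the released code; this function and its signature are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def length_scaled_attention(q, k, v, mask=None):
    """Dot-product attention with an extra log-length scaling factor.

    q, k, v: tensors of shape (batch, heads, seq_len, d_k).
    Assumed formulation: logits = log(n) * (q @ k^T) / sqrt(d_k),
    which keeps the sharpness (entropy) of the attention weights
    less sensitive to the sequence length n.
    """
    n, d_k = k.size(-2), k.size(-1)
    scale = math.log(max(n, 2)) / math.sqrt(d_k)  # guard against log(1) = 0
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return torch.matmul(weights, v), weights
```

Compared with standard scaled dot-product attention, the only extra cost is a scalar multiplication, which matches the abstract's claim that the operation is cheap but improves generalization to unseen sequence lengths.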

Origin: blog.csdn.net/lsttoy/article/details/130502793