[FCS] LMR-CBT: learning modality-fused representations with a CB-Transformer for multimodal emotion recognition from unaligned multimodal sequences (CCF Tier 1)

LMR-CBT: Learning Modality-fused Representations with CB-Transformer for Multimodal Emotion Recognition from Unaligned Multimodal Sequences

Abstract
Learning modality-fused representations and processing unaligned multimodal sequences are meaningful and challenging tasks in multimodal emotion recognition. Existing approaches use directional pairwise attention or a message hub to fuse the language, visual, and audio modalities. However, these fusion methods often have quadratic complexity with respect to the modal sequence length, introduce redundant information, and are inefficient. In this paper, we propose an efficient neural network that learns modality-fused representations with a CB-Transformer (LMR-CBT) for multimodal emotion recognition from unaligned multimodal sequences. Specifically, we first perform feature extraction on each of the three modalities to obtain the local structure of the sequences. Then, we design an innovative asymmetric transformer with cross-modal blocks (CB-Transformer) that enables complementary learning across modalities, organized into local temporal learning, cross-modal feature fusion, and global self-attention representations. In addition, we splice the fused features with the original features to classify the emotions of the sequences. Finally, we conduct word-aligned and unaligned experiments on three challenging datasets: IEMOCAP, CMU-MOSI, and CMU-MOSEI. The experimental results show the superiority and efficiency of our proposed method in both settings. Compared with mainstream methods, our approach reaches the state of the art with a minimal number of parameters.
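The pipeline described in the abstract (per-modality local feature extraction, cross-modal fusion blocks, global self-attention, then splicing fused and original features for classification) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: every module choice, dimension, and hyperparameter below is an assumption made for demonstration.

```python
import torch
import torch.nn as nn

class CrossModalBlockSketch(nn.Module):
    """Illustrative cross-modal block: the main modality queries an
    auxiliary modality via cross-attention, with a residual connection."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, main: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(main, aux, aux)  # query = main, key/value = aux
        return self.norm(main + fused)

class LMRCBTSketch(nn.Module):
    """Rough shape of the described pipeline (hypothetical details):
    1-D temporal convs per modality -> cross-modal fusion -> global
    self-attention -> splice fused with original features -> classifier."""
    def __init__(self, d_l: int, d_a: int, d_v: int,
                 dim: int = 32, num_classes: int = 4):
        super().__init__()
        # Step 1: local temporal feature extraction per modality.
        self.proj = nn.ModuleDict({
            m: nn.Conv1d(d, dim, kernel_size=3, padding=1)
            for m, d in {"l": d_l, "a": d_a, "v": d_v}.items()
        })
        # Step 2: cross-modal fusion blocks (audio/visual fused into language).
        self.fuse_a = CrossModalBlockSketch(dim)
        self.fuse_v = CrossModalBlockSketch(dim)
        # Step 3: global self-attention over the fused sequence.
        self.self_attn = nn.TransformerEncoderLayer(
            dim, nhead=4, dim_feedforward=64, batch_first=True)
        # Step 4: classify from spliced (fused + original) features.
        self.head = nn.Linear(2 * dim, num_classes)

    def forward(self, x_l, x_a, x_v):
        # Inputs: (batch, seq_len_m, d_m); sequence lengths may differ,
        # so no word-level alignment is required.
        feats = {}
        for name, x in (("l", x_l), ("a", x_a), ("v", x_v)):
            feats[name] = self.proj[name](x.transpose(1, 2)).transpose(1, 2)
        fused = self.fuse_a(feats["l"], feats["a"])
        fused = self.fuse_v(fused, feats["v"])
        fused = self.self_attn(fused)
        # Temporal pooling, then splice fused with original language features.
        pooled = torch.cat([fused.mean(dim=1), feats["l"].mean(dim=1)], dim=-1)
        return self.head(pooled)

# Feature dimensions loosely follow common CMU-MOSI-style setups
# (300-d text, 74-d audio, 35-d visual); they are assumptions here.
model = LMRCBTSketch(d_l=300, d_a=74, d_v=35)
logits = model(torch.randn(2, 50, 300),   # language
               torch.randn(2, 60, 74),    # audio (unaligned length)
               torch.randn(2, 55, 35))    # visual
print(logits.shape)  # torch.Size([2, 4])
```

Because cross-attention lets the language sequence attend over audio and visual sequences of different lengths, the sketch works on unaligned inputs directly, which mirrors the unaligned setting the abstract emphasizes.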




Reposted from blog.csdn.net/lsttoy/article/details/130503801