[ICASSP 2023] Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-trained Representations — a CCF-B ranked speech conference

Mingling or Misalignment? Temporal Shift for Speech Emotion Recognition with Pre-trained Representations

Abstract
Fueled by recent advances in self-supervised models, pre-trained speech representations have proved effective for the downstream speech emotion recognition (SER) task. Most prior works focus mainly on exploiting pre-trained representations and simply adopt a linear head on top of the pre-trained model, neglecting the design of the downstream network. In this paper, we propose a temporal shift module to mingle channel-wise information without introducing any parameters or FLOPs. With the temporal shift module, three designed baseline building blocks evolve into corresponding shift variants, i.e. ShiftCNN, ShiftLSTM, and Shiftformer. Moreover, to balance the trade-off between mingling and misalignment, we propose two technical strategies: placement of shift and proportion of shift. The family of temporal shift models all outperform the state-of-the-art methods on the benchmark IEMOCAP dataset under both finetuning and feature extraction settings. Our code is available at https://github.com/ECNU-Cross-Innovation-Lab/ShiftSER.
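The core idea described in the abstract — shifting a proportion of channels along the time axis so adjacent frames exchange information at zero parameter and FLOP cost — can be sketched as follows. This is a minimal illustration in NumPy based only on the abstract's description (the function name, tensor layout, and default shift proportion are assumptions, not the paper's actual implementation, which lives in the linked repository):

```python
import numpy as np

def temporal_shift(x, shift_prop=0.25):
    """Shift a proportion of channels along the time axis.

    x: array of shape (time, channels).
    The first `shift_prop` fraction of channels is shifted one step
    forward in time, the next fraction one step backward; vacated
    positions are zero-padded. No parameters or FLOPs are introduced,
    only memory movement.
    """
    t, c = x.shape
    n = int(c * shift_prop)
    out = np.zeros_like(x)
    out[1:, :n] = x[:-1, :n]          # forward shift: frame t sees frame t-1
    out[:-1, n:2 * n] = x[1:, n:2 * n]  # backward shift: frame t sees frame t+1
    out[:, 2 * n:] = x[:, 2 * n:]     # remaining channels untouched
    return out
```

The `shift_prop` argument corresponds to the paper's "proportion of shift" strategy: shifting too few channels limits temporal mingling, while shifting too many misaligns features with their original time steps — the trade-off the title refers to.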


Reposted from blog.csdn.net/lsttoy/article/details/130503619