Reading "NFCMF: Noise Filtering and CrossModal Fusion for Multimodal Sentiment Analysis"

2021
[Figure: NFCMF model architecture]
(It looks like pairwise interaction: v and l interact like co-attention to produce two corresponding outputs, each of which is concatenated with its original features, and then everything is concatenated together. But on the one hand, doesn't that mean v and a never interact? And on the other hand, wouldn't there end up being two copies of l?)

First, we initialize a cross-modal weight matrix $G^M \in \mathbb{R}^{d_h^l \times d_h^m}$ to calculate the combined vector $a_t^M$, where $d_h^l$ is the dimension of the language modality's hidden state and $d_h^m$ is the dimension of the visual or acoustic modality's hidden state. The calculation is shown in Equation (5).
[Equations (5)–(6): computation of the cross-modal weights]
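To make Equations (5)–(6) concrete, here is a minimal PyTorch sketch of the bilinear cross-modal scoring as I read it. All tensor names (`h_l`, `h_m`, `G`), the shapes, and the softmax placement are my assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

d_h_l, d_h_m, T = 64, 32, 10          # hidden sizes and sequence length (made up)
h_l = torch.randn(T, d_h_l)           # language hidden states h_t^L
h_m = torch.randn(T, d_h_m)           # visual (or acoustic) hidden states h_t^M

# Cross-modal weight matrix G^M in R^{d_h^l x d_h^m}
G = torch.randn(d_h_l, d_h_m, requires_grad=True)

# Eq. (5): bilinear score between every language step and every step of the
# other modality
scores = h_l @ G @ h_m.T              # (T, T)

# Eq. (6) (assumed): normalize the scores into directional attention weights
a_l2m = F.softmax(scores, dim=-1)     # a_t^{L->M}: language attending to M
a_m2l = F.softmax(scores.T, dim=-1)   # a_t^{M->L}: M attending to language
```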
Taking the cross-modal interaction between the language and visual modalities as an example: the higher the value of $a_t^{L \to V}$, the higher the correlation between the emotional information captured by the language modality features and the visual modality features. Similarly, the higher the value of $a_t^{V \to L}$, the higher the correlation between the emotional information captured by the visual modality features and the language modality features. The final hidden state output $\hat{h}_t^M$ is then obtained by Equation (7).
[Equation (7): final hidden state output $\hat{h}_t^M$]
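Continuing the sketch above (reusing `h_l`, `h_m`, `a_l2m`, `a_m2l` from it), one plausible reading of Equation (7) is an attention-weighted sum over the other modality's hidden states, with the result concatenated to the original features as in the pairwise-interaction reading of the architecture. The paper's exact combination may differ.

```python
# Assumed reading of Eq. (7): attention-weighted sums over the other
# modality's hidden states.
h_hat_m = a_m2l @ h_l                        # (T, d_h_l): M steps enriched with language context
h_hat_l = a_l2m @ h_m                        # (T, d_h_m): language steps enriched with M context

# Per the pairwise-interaction reading above, each cross-modal output is then
# concatenated with the original per-modality hidden states.
fused_m = torch.cat([h_m, h_hat_m], dim=-1)  # (T, d_h_m + d_h_l)
fused_l = torch.cat([h_l, h_hat_l], dim=-1)  # (T, d_h_l + d_h_m)
```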
(I feel this deepens my understanding of the multi-head attention formula: the softmax essentially renormalizes the projected vectors into weighting coefficients, and the initial a here is likewise an inner product, just with two learnable parameter matrices inserted in the middle. See the sketch below.)
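A quick check of that analogy, assuming the comparison is to standard dot-product attention: projecting with two learned matrices $W_q$ and $W_k$ and taking the inner product is the same bilinear form as inserting a single learned matrix between the raw vectors, which is the role $G^M$ plays in Equation (5). Shapes here are arbitrary.

```python
import torch

d, d_k = 16, 8
x = torch.randn(5, d)                      # "query" sequence
y = torch.randn(7, d)                      # "key" sequence
W_q = torch.randn(d, d_k)
W_k = torch.randn(d, d_k)

# Usual attention logits: project both sides, then inner product.
scores_attn = (x @ W_q) @ (y @ W_k).T      # (5, 7)

# The same quantity as a bilinear form with one matrix in the middle,
# analogous to x G y^T in Eq. (5).
scores_bilinear = x @ (W_q @ W_k.T) @ y.T  # (5, 7)

assert torch.allclose(scores_attn, scores_bilinear, atol=1e-4)
```

So Equation (5) can be seen as an un-factored attention logit, and the softmax in Equation (6) plays the same normalizing role as in multi-head attention.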
