2021
(So it seems to be pairwise interaction: v and l interact, much like co-attention, producing two corresponding outputs, each of which is concatenated with its original input, and then everything is concatenated together. But then, on one hand, don't v and a never interact directly? And on the other hand, wouldn't l appear twice?)
First, we initialize a cross-modal weight matrix $G^M \in \mathbb{R}^{d_h^l \times d_h^m}$ to calculate the combined vector $a_t^M$, where $d_h^l$ is the dimension of the hidden state of the language modality and $d_h^m$ is the dimension of the hidden state of the visual or acoustic modality. The calculation is shown in Equation (5).
Taking the cross-modal interaction between the language and visual modalities as an example, the higher the value of $a_t^{L \to V}$, the stronger the correlation between the emotional information captured by the language-modality features and that captured by the visual-modality features. Similarly, the higher the value of $a_t^{V \to L}$, the stronger the correlation between the emotional information captured by the visual-modality features and that captured by the language-modality features. The final hidden-state output $\hat{h}_t^M$ is then obtained by Equation (7).
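A minimal NumPy sketch of this kind of bilinear cross-modal attention. Since Equations (5)–(7) are not quoted here, the exact forms are assumptions: the score is taken to be the bilinear product $h_t^l \, G^M \, (h_t^m)^\top$, normalized with a softmax over time steps, and the final output is assumed to concatenate the language state with the attention-weighted visual state. The dimensions `d_l`, `d_m`, and `T` are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_l, d_m, T = 8, 6, 5          # hypothetical hidden sizes and sequence length

h_l = rng.standard_normal((T, d_l))  # language hidden states h_t^l
h_m = rng.standard_normal((T, d_m))  # visual (or acoustic) hidden states h_t^m
G = rng.standard_normal((d_l, d_m))  # learned cross-modal weight matrix G^M

# Assumed form of Eq. (5): a bilinear score per time step,
# s_t = h_t^l @ G^M @ (h_t^m)^T
scores = np.einsum('td,de,te->t', h_l, G, h_m)

# Softmax turns the raw scores into attention weights a_t^{L->V}.
a = softmax(scores)

# Assumed form of Eq. (7): weight the visual states by the attention
# coefficients, then concatenate with the language states.
context = a[:, None] * h_m
h_hat = np.concatenate([h_l, context], axis=-1)
print(h_hat.shape)  # (5, 14)
```

Note that with this assumed form each $a_t$ is a scalar per time step rather than a full attention map; if the paper's Equation (5) instead computes a $T \times T$ alignment matrix, the softmax would run over the second axis and the context would be a matrix product, but the bilinear score with the learned $G^M$ in the middle is the same idea.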
(I feel this deepens my understanding of the multi-head attention formula: the softmax just renormalizes the projected vectors into weighting coefficients, and the initial score $a$ is also obtained by an inner product, only with two learnable parameter matrices inserted in between.)