MIA-Net: Multimodal Interactive Attention Network for Multimodal Sentiment Analysis

Summary: During multimodal fusion, the modalities are first divided into a main modality and auxiliary modalities. By constructing an interactive attention module, information helpful to the main modality is extracted from the auxiliary modalities for fusion.

(This is a feature-level, attention-based fusion method.)

Article information

Authors: Shuzhen Li, Tong Zhang

Affiliation: South China University of Technology

Journal: IEEE Transactions on Affective Computing

Title: MIA-Net: Multi-Modal Interactive Attention Network for Multi-Modal Affective Analysis

Year: 2023

Research purposes

Explore how to generalize bimodal models to trimodal or multimodal tasks under limited computation and parameter budgets, and how to account for the different contributions of different modalities to sentiment analysis.

Research content

A fusion model, MIA-Net (Multi-modal Interactive Attention Network), that supports three or more modalities is proposed; it fuses the main modality with the auxiliary modalities and is used for sentiment analysis.

Research methods

[Figure: overall framework of MIA-Net]

This framework is composed of unimodal feature extractors (Feature Extractor), N-1 MIA modules, a regressor, and a classifier.
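As a rough, hypothetical sketch (PyTorch; the class and argument names are illustrative, not the authors' code) of how these components might be wired together, with the residual connections described later omitted for brevity:

```python
import torch
import torch.nn as nn

class MIANet(nn.Module):
    """High-level sketch: pre-extracted unimodal features pass through N-1
    stacked MIA modules (one per auxiliary modality) and then a task head."""

    def __init__(self, mia_modules: nn.ModuleList, head: nn.Module):
        super().__init__()
        self.mia_modules = mia_modules   # one MIA module per auxiliary modality
        self.head = head                 # regressor or classifier

    def forward(self, main_feat, aux_feats):
        m = main_feat                    # main-modality features (e.g. text)
        for mia, aux in zip(self.mia_modules, aux_feats):
            m = mia(m, aux)              # enhance the main modality step by step
        return self.head(m)
```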

Feature Extractor

For text data: first, the GPT-2 tokenizer is used to tokenize each text description into a sequence of tokens. Then the RoBERTa model is used to extract text features.

For audio data: first, the VQ-Wav2Vec model is used to convert the variable-length audio into a discrete audio representation, and then the RoBERTa model is likewise used to extract acoustic features from this discrete representation.

For video data: first, the RetinaFace model is used to detect and segment the facial region in each video frame. Video frames are then sampled from each video according to the number of key frames. Finally, the FabNet model is used to extract visual features from the sampled frames.

Text feature dimension: 1024, acoustic feature dimension: 768, visual feature dimension: 256
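For concreteness, a minimal illustration (PyTorch assumed; the batch size and random values are placeholders) of the shapes the pre-extracted unimodal features would take:

```python
import torch

batch = 8  # illustrative batch size
text_feat  = torch.randn(batch, 1024)  # RoBERTa text features
audio_feat = torch.randn(batch, 768)   # acoustic features (RoBERTa over VQ-Wav2Vec tokens)
video_feat = torch.randn(batch, 256)   # FabNet visual features
```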

MIA module

[Figure: structure of the MIA module]

Symbol representation:

$M_{j}^{n}$ denotes the $j$-th main-modality feature vector of $M^n$, and $\mathbf{A}_i^n$ denotes the $i$-th auxiliary-modality feature vector of $A^n$.

$\mathcal{K}(\boldsymbol{A}_i^n,\boldsymbol{M}_j^n)$ denotes the General Linear Kernel, which is used to compute $S_{\boldsymbol{M}^n\boldsymbol{A}^n}$ (the affinity matrix).

$\boldsymbol{P}(\boldsymbol{S}_{\boldsymbol{M}^n\boldsymbol{A}^n})$ denotes the interactive attention weights, which are used to improve the main-modality representation.

$\boldsymbol{P}_{ji}(\boldsymbol{S}_{\boldsymbol{M}^n\boldsymbol{A}^n})$ denotes the importance of the $i$-th auxiliary-modality feature vector to the $j$-th main-modality feature vector.

$\hat{\boldsymbol{M}}_{m\gets a^n}^n$ denotes the updated main-modality features enhanced by the auxiliary modality.


Each MIA module is composed of three sub-modules: Extraction, Transform, and Fusion. The function of the MIA module is to select important features from the auxiliary modality to improve the main-modality features.

Extraction

First, a general linear kernel is used to capture the similarity between modality features, from which the affinity matrix $S_{\boldsymbol{M}^n\boldsymbol{A}^n}$ is obtained.

$$k(x,y)=x^{\mathrm{T}}Wy$$
For this formula to define a valid kernel, $W$ must be a positive semidefinite matrix, which is in particular a real symmetric matrix.

Every real symmetric matrix admits an orthogonal eigendecomposition (by the theorem $Q=\boldsymbol{P}^{\mathrm{T}}\boldsymbol{\Lambda}\boldsymbol{P}$), so the final form of the general linear kernel is:
$$
\begin{aligned}
\mathcal{K}(\boldsymbol{A}_i^n,\boldsymbol{M}_j^n)
&=\left(\boldsymbol{A}_i^n\right)^{\mathrm{T}}\boldsymbol{W}^n\boldsymbol{M}_j^n \\
&=\left(\boldsymbol{A}_i^n\right)^{\mathrm{T}}(\boldsymbol{P}^n)^{\mathrm{T}}\boldsymbol{\Lambda}\boldsymbol{P}^n\boldsymbol{M}_j^n \\
&=\left(\boldsymbol{P}^n\boldsymbol{A}_i^n\right)^{\mathrm{T}}\boldsymbol{\Lambda}(\boldsymbol{P}^n\boldsymbol{M}_j^n).
\end{aligned}
$$
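A minimal sketch of how the general linear kernel and the affinity matrix could be computed in PyTorch, assuming (purely for illustration) that the main and auxiliary feature vectors share a common dimension `d` and that $W^n$ is parameterized as $P^{\mathrm{T}}\Lambda P$ with non-negative $\Lambda$ so that it stays positive semidefinite:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralLinearKernel(nn.Module):
    """Sketch: S[j, i] = (P A_i)^T diag(lam) (P M_j), with lam >= 0 so that
    W = P^T diag(lam) P is positive semidefinite."""

    def __init__(self, d: int):
        super().__init__()
        self.P = nn.Parameter(torch.randn(d, d) * 0.02)
        self.raw_lam = nn.Parameter(torch.zeros(d))   # softplus keeps lam >= 0

    def forward(self, A: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
        # A: (d_a, d) auxiliary feature vectors, M: (d_m, d) main feature vectors
        lam = F.softplus(self.raw_lam)
        PA = A @ self.P.t()            # rows are P A_i, shape (d_a, d)
        PM = M @ self.P.t()            # rows are P M_j, shape (d_m, d)
        return (PM * lam) @ PA.t()     # affinity matrix S, shape (d_m, d_a)
```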
After obtaining the affinity matrix (the similarity between the two modalities), it is encoded into the interactive attention weights $\boldsymbol{P}(\boldsymbol{S}_{\boldsymbol{M}^n\boldsymbol{A}^n})$, which are then used to improve the main-modality features.
$$P(S_{\boldsymbol{M}^n\boldsymbol{A}^n})=\left[P_1(S_{\boldsymbol{M}^n\boldsymbol{A}^n}),\cdots,P_j(S_{\boldsymbol{M}^n\boldsymbol{A}^n}),\cdots,P_{d_m}(S_{\boldsymbol{M}^n\boldsymbol{A}^n})\right]$$

$$P_j(S_{\boldsymbol{M}^n\boldsymbol{A}^n})=\begin{bmatrix}P_{j1}(S_{\boldsymbol{M}^n\boldsymbol{A}^n})\\\vdots\\P_{ji}(S_{\boldsymbol{M}^n\boldsymbol{A}^n})\\\vdots\\P_{jd_{a^n}}(S_{\boldsymbol{M}^n\boldsymbol{A}^n})\end{bmatrix}=\begin{bmatrix}\frac{\exp(\mathcal{K}(\boldsymbol{A}_1^n,\boldsymbol{M}_j^n))}{\sum_{k}^{d_{a^n}}\exp(\mathcal{K}(\boldsymbol{A}_k^n,\boldsymbol{M}_j^n))}\\\vdots\\\frac{\exp(\mathcal{K}(\boldsymbol{A}_i^n,\boldsymbol{M}_j^n))}{\sum_{k}^{d_{a^n}}\exp(\mathcal{K}(\boldsymbol{A}_k^n,\boldsymbol{M}_j^n))}\\\vdots\\\frac{\exp(\mathcal{K}(\boldsymbol{A}_{d_{a^n}}^n,\boldsymbol{M}_j^n))}{\sum_{k}^{d_{a^n}}\exp(\mathcal{K}(\boldsymbol{A}_k^n,\boldsymbol{M}_j^n))}\end{bmatrix}$$
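Continuing the sketch above, the interactive attention weights are simply a row-wise softmax of the affinity matrix over the auxiliary feature vectors (illustrative, not the authors' code):

```python
import torch

def interactive_attention(S: torch.Tensor) -> torch.Tensor:
    """S: affinity matrix of shape (d_m, d_a). Row j becomes a softmax
    distribution over the d_a auxiliary feature vectors, i.e. P_j(S)."""
    return torch.softmax(S, dim=-1)
```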

Transform

The interactive attention weights $\boldsymbol{P}(\boldsymbol{S}_{\boldsymbol{M}^n\boldsymbol{A}^n})$ between the main modality and the auxiliary modality are obtained through Extraction. Then, in the Transform sub-module, the interactive attention weights and the auxiliary-modality features are matrix-multiplied to obtain the new main-modality feature $\hat{\boldsymbol{M}}_{m\gets a^n}^n$.

This new main-modality feature is then refined through the self-gating mechanism GA (which extracts more discriminative features and suppresses possible noisy features) to obtain the improved main-modality feature $M_{gated}^n$:
$$M_{gated}^n=\hat{M}_{m\gets a^n}^n\cdot\boldsymbol{G}$$
$$\boldsymbol{G}=\sigma(\boldsymbol{K}_1\odot\hat{\boldsymbol{M}}_{m\gets a^n}^n+\boldsymbol{b}_1)$$
$\odot$ denotes convolution, $\sigma$ denotes the logistic sigmoid activation function, and $\boldsymbol{G}$ denotes the gating coefficient. $K_1$ and $b_1$ denote the convolution kernel and the bias, respectively.
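A minimal sketch of the Transform step, combining the attention-weighted aggregation with a convolutional self-gate; treating the features as a matrix of $d_m$ vectors of common dimension `d` and using `nn.Conv1d` for the $K_1\odot(\cdot)+b_1$ operation are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TransformGate(nn.Module):
    """Sketch of the Transform step: attend over the auxiliary features,
    then apply a convolutional self-gate."""

    def __init__(self, d: int, kernel_size: int = 1):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size, padding=kernel_size // 2)

    def forward(self, P: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # P: attention weights (d_m, d_a); A: auxiliary features (d_a, d)
        m_hat = P @ A                                    # \hat{M}: (d_m, d)
        x = m_hat.t().unsqueeze(0)                       # (1, d, d_m) for Conv1d
        g = torch.sigmoid(self.conv(x)).squeeze(0).t()   # gate G: (d_m, d)
        return m_hat * g                                 # M_gated = \hat{M} · G
```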

Fusion

Finally, the improved main-modality feature $M_{gated}^n$ is fused with the initial main-modality feature $M^n$ to obtain the final fused modality feature:
$$\boldsymbol{M}^{n+1}=f_N(\sigma(\boldsymbol{K}_2\odot[\boldsymbol{M}_{gated}^n:\boldsymbol{M}^n]+\boldsymbol{b}_2))$$
$f_N(\cdot)$ denotes the batch normalization function, and $[:]$ denotes the concatenation operation.
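A matching sketch of the Fusion step, under the same illustrative assumptions (feature layout `(d_m, d)`, `nn.Conv1d` for $K_2\odot(\cdot)+b_2$, `nn.BatchNorm1d` for $f_N$):

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Sketch of the Fusion step: concatenate the gated and original main
    features, apply convolution + sigmoid, then batch normalization."""

    def __init__(self, d: int, kernel_size: int = 1):
        super().__init__()
        self.conv = nn.Conv1d(2 * d, d, kernel_size, padding=kernel_size // 2)
        self.bn = nn.BatchNorm1d(d)

    def forward(self, m_gated: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        # m_gated, m: (d_m, d); concatenate along the feature dimension
        x = torch.cat([m_gated, m], dim=-1)   # [M_gated : M^n], shape (d_m, 2d)
        x = x.t().unsqueeze(0)                # (1, 2d, d_m) for Conv1d
        x = torch.sigmoid(self.conv(x))       # sigma(K_2 ⊙ [...] + b_2)
        x = self.bn(x)                        # f_N: batch normalization
        return x.squeeze(0).t()               # M^{n+1}: (d_m, d)
```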


By stacking multiple MIA modules, the main-modality features enhanced by the N auxiliary modalities are finally obtained. Then, through two residual connections, the final main-modality feature $\check{\boldsymbol{M}}$ output by MIA-Net is obtained.

Residual connections

The first residual connection:
$$
\begin{aligned}
\boldsymbol{M}_{res}&=\boldsymbol{M}^{1}+\boldsymbol{M}_{MIAs} \\
&=\boldsymbol{M}^1+\boldsymbol{f}_{MIAs}(\boldsymbol{M}^1,\boldsymbol{A}^1,\cdots,\boldsymbol{A}^N)
\end{aligned}
$$
$M_{MIAs}$ denotes the main-modality features of $M^1$ improved by the stacked MIA modules.

$M_{res}$ denotes the residual main-modality features.

The second residual connection:
$$\check{M}=M_{res}+f_{FFN}(\boldsymbol{M}_{res})$$
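A small sketch of stacking the MIA modules with the two residual connections (function and argument names are hypothetical):

```python
def mia_trunk(m1, aux_list, mia_modules, ffn):
    """m1: initial main-modality features M^1; aux_list: auxiliary features
    A^1..A^N; mia_modules: stacked MIA modules; ffn: feed-forward network."""
    m = m1
    for mia, aux in zip(mia_modules, aux_list):
        m = mia(m, aux)            # M^{n+1} = MIA(M^n, A^n)
    m_res = m1 + m                 # first residual: M_res = M^1 + M_MIAs
    return m_res + ffn(m_res)      # second residual: M_check = M_res + FFN(M_res)
```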

The main-modality feature $\check{M}$ obtained after the residual connections is fed into the regressor or classifier for the final sentiment regression or sentiment classification.

Regression

The regression model consists of a fully connected layer and a joint loss.

Definition:
$$\mathcal{L}=(1-\gamma)\mathcal{L}_{MAE}+\gamma\mathcal{L}_{KRL}$$
$$\mathcal{L}_{MAE}=\frac{1}{N}\sum_{n=1}^N\left|y^{(n)}-z^{(n)}\right|$$
$$z=\boldsymbol{\theta}^{\mathrm{T}}\check{M}+\boldsymbol{b}_3$$
$\mathcal{L}_{MAE}$ denotes the mean absolute error loss. $\mathcal{L}_{KRL}$ denotes the kernel regularization loss, whose purpose is to guarantee that $W$ is a positive semidefinite matrix when computing the affinity matrix $S_{\boldsymbol{M}^n\boldsymbol{A}^n}$.
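As a hedged sketch, the joint regression loss could be written as follows; `gamma` and the `krl` term (whose exact form is not reproduced in these notes) are placeholders:

```python
import torch

def regression_joint_loss(z, y, krl, gamma=0.1):
    """Sketch of the joint regression loss: (1 - gamma) * MAE + gamma * KRL.
    z: predictions, y: targets, krl: kernel regularization term computed
    elsewhere; gamma is a hyperparameter (value illustrative)."""
    mae = torch.mean(torch.abs(y - z))        # L_MAE
    return (1.0 - gamma) * mae + gamma * krl  # joint loss L
```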

Classification

The classification model consists of a fully connected layer, a Softmax function and a joint loss.

Definition:
$$\mathcal{L}=(1-\gamma)\mathcal{L}_{CE}+\gamma\mathcal{L}_{KRL}$$
$$\mathcal{L}_{CE}=-\frac{1}{N}\sum_{n=1}^N y_n\log(P(\hat{y}_n))$$
$$P(\hat{y})=\mathrm{softmax}(\boldsymbol{\theta}^{\mathrm{T}}\check{\boldsymbol{M}}+\boldsymbol{b}_3)$$
$\mathcal{L}_{CE}$ denotes the standard cross-entropy loss.
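Analogously, a sketch of the joint classification loss, with the same caveats about `gamma` and `krl`:

```python
import torch
import torch.nn.functional as F

def classification_joint_loss(logits, labels, krl, gamma=0.1):
    """Sketch of the joint classification loss: (1 - gamma) * CE + gamma * KRL.
    logits: (N, num_classes) pre-softmax scores, labels: (N,) class indices;
    krl and gamma are placeholders, as in the regression sketch."""
    ce = F.cross_entropy(logits, labels)     # cross-entropy (softmax included)
    return (1.0 - gamma) * ce + gamma * krl  # joint loss L
```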

Conclusion and discussion

1. Comparative experiments between MIA-Net and several SOTA models on the CMU-MOSI and CMU-MOSEI datasets show that MIA-Net achieves the best results.
2. MIA-Net is effective when processing data from three or more modalities. (ablation study)
3. MIA-Net can be generalized to new datasets, tasks, and modalities. (generalization study)
4. The multimodal interactive attention module effectively achieves auxiliary fusion. (verification of the proposed module)
5. Study of the hyperparameters.
6. The joint loss is effective. (study of the loss function)
7. Comparison with different fusion methods. (demonstrates the practicality of this study)

Code and datasets

The code is not publicly available.

Datasets: CMU-MOSI (1.5 GB), CMU-MOSEI (25 GB), MELD (10 GB)

GPU configuration: not mentioned
