[IEEE ICME2022] NHFNET: Non-homogeneous fusion network for multi-modal sentiment analysis - CCF B

NHFNET: A Non-Homogeneous Fusion Network for Multimodal Sentiment Analysis

Video
Presented on Bilibili: https://www.bilibili.com/video/BV1XY411n7rx?vd_source=9dad485ab167164358578deecb64a255#reply126270526800

Abstract
Fusion technology is crucial for multimodal sentiment analysis. Recent attention-based fusion methods demonstrate high performance and strong robustness. However, these approaches ignore the difference in information density among the three modalities, i.e., visual and audio have low-level signal features whereas text has high-level semantic features. To this end, we propose a non-homogeneous fusion network (NHFNet) to achieve multimodal information interaction. Specifically, a fusion module with attention aggregation is designed to handle the fusion of the visual and audio modalities and enhance them to high-level semantic features. Then, cross-modal attention is used to achieve information reinforcement between the text modality and the audio-visual fusion. NHFNet compensates for the differences in information density across modalities, enabling their fair interaction. To verify the effectiveness of the proposed method, we set up aligned and unaligned experiments on the CMU-MOSEI dataset. The experimental results show that the proposed method outperforms the state-of-the-art. Codes are available at https://github.com/skeletonNN/NHFNet.
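Below is a minimal PyTorch sketch of the two-stage, non-homogeneous fusion idea described in the abstract: audio and visual tokens are first fused with a self-attention (attention-aggregation) stage, and the text modality then attends to that fused stream via cross-modal attention. Module names, dimensions, pooling, and the classifier head are illustrative assumptions, not the authors' exact implementation; see the linked repository for the real model.

```python
import torch
import torch.nn as nn

class NonHomogeneousFusionSketch(nn.Module):
    """Illustrative sketch: fuse audio+visual first, then let text attend to the fusion.

    All dimensions and layer choices here are assumptions for the sketch,
    not the configuration used in the NHFNet paper.
    """
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Stage 1: attention aggregation over concatenated audio/visual tokens,
        # intended to lift low-level signal features toward semantic features.
        self.av_fusion = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        # Stage 2: cross-modal attention between text and the fused audio-visual stream.
        self.cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)
        self.head = nn.Linear(dim, 1)  # e.g., a sentiment score

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq_len, dim), already projected to a shared dim
        av = self.av_fusion(torch.cat([audio, visual], dim=1))   # fused audio-visual tokens
        # Text queries attend to the audio-visual fusion (the reinforcement step).
        reinforced, _ = self.cross_attn(query=text, key=av, value=av)
        pooled = reinforced.mean(dim=1)                          # simple mean pooling for the sketch
        return self.head(pooled)

# Usage with random tensors (batch=2, 20 tokens per modality, dim=128)
model = NonHomogeneousFusionSketch()
t, a, v = (torch.randn(2, 20, 128) for _ in range(3))
print(model(t, a, v).shape)  # torch.Size([2, 1])
```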




Origin: blog.csdn.net/lsttoy/article/details/130502672