Overview of Multimodal Technology

When we consider the diversity of human perception, we realize that different kinds of sensory information are crucial to our cognition and understanding. For example, when we watch a movie, we not only follow the plot through vision but also obtain richer information through sound, the soundtrack, text, and other channels. Similarly, we can understand and perceive a picture or a piece of text from multiple angles. In the field of machine learning, this multi-faceted form of perception is known as multimodal learning.

Multimodal learning aims to combine many different forms of data for analysis and processing, such as images, sounds, text, etc. Multimodal deep learning is a deep learning-based multimodal learning method designed to process and analyze multimodal datasets through deep neural networks. Different from traditional deep learning methods, multimodal deep learning needs to solve many challenges, such as how to combine different forms of data, how to choose the appropriate network structure and loss function, etc.

In this article, we explore what multimodal deep learning is, how it works, challenges, and how deep learning models handle multimodal input. We hope that through the introduction of this article, readers can better understand the concept and application of multimodal deep learning, and inspire future research and development.

Multimodal datasets (with links) are provided at the end of the article.

What is Multimodal Deep Learning

Multimodal machine learning is the study of computer algorithms that learn and improve performance by using multimodal datasets.

Multimodal deep learning is a subfield of machine learning that aims to train artificial intelligence models to process and find relationships between different types of data (modalities), typically images, video, audio, and text. By combining different modalities, a deep learning model can understand its environment more completely, since some cues are only present in certain modalities.

Imagine the task of emotion recognition. It is not just about looking at faces (the visual modality). The pitch and tone of a person's voice (the audio modality) encode a wealth of information about their emotional state that may not be visible in their facial expressions, even though the two are often in sync.

Unimodal models, i.e. models that deal with only a single modality, have been studied extensively and have achieved remarkable results in cutting-edge fields such as computer vision and natural language processing. However, unimodal deep learning has limited capabilities, which is why multimodal models are needed.

The image below is an example of a unimodal model failing on certain tasks, such as identifying sarcasm or hate speech.
[Image: a sarcastic meme that combines a picture with text]
The meme combines an image and text to create sarcasm. A unimodal model cannot perceive the irony because each modality carries only half the information. In contrast, a multimodal model that processes both text and images can connect the two and uncover the deeper meaning.

Multimodal models typically rely on deep neural networks, although other machine learning models, such as hidden Markov models (HMMs), have been incorporated into earlier studies.

In multimodal deep learning, the most typical modalities are visual (image, video), textual and auditory (speech, sound, music). However, other less typical modalities include 3D vision data, depth sensor data, and LiDAR data (typical for autonomous vehicles). In clinical practice, imaging modalities include computed tomography (CT) and X-ray images, while non-imaging modalities include electroencephalogram (EEG) data. Sensor data such as thermal data or data from eye-tracking devices can also be included in the list.

Any combination of the above unimodal data will result in a multimodal dataset. For example, the following combinations:

  • Video + lidar + depth data creates an excellent dataset for self-driving car applications.
  • EEG + eye-tracking device data, creating a multimodal dataset linking eye movements to brain activity.

However, the most popular combinations involve the three most common modalities:

  • Image + Text
  • Image + Audio
  • Image + Text + Audio
  • Text + Audio

From the perspective of the development of artificial intelligence, deep learning imitates the neurons of the human brain in the hope of achieving human-like thinking ability. A unimodal model, however, can only target a single task and, strictly speaking, falls far short of a human. A multimodal model comes closer to imitating humans, because the information humans receive from the outside world comes through the five senses (although smell is currently difficult to model). Such a model can more reasonably be regarded as artificial intelligence.

In Digimon 3, if the protagonists want to defeat the demon, relying on a single Digimon is not enough; the Digimon must fuse with the protagonist himself in order to defeat Emperor Limo.


Branches of Multimodal Deep Learning


We can divide multimodal deep learning into three branches:

  • Multimodal joint learning
  • Cross-modal learning
  • Multimodal self-supervised learning

These branches all aim to improve the performance of deep learning by integrating multiple data sources, so as to better solve complex tasks.

Multimodal Joint Learning

In multimodal joint learning, a deep learning model combines information from multiple modalities (such as images, text, and audio) to achieve better performance. Specifically, this approach builds a richer and more comprehensive model by fusing representations from multiple modalities. Common multimodal joint learning models include Multimodal Compact Bilinear Pooling and cross-modal retrieval models.

Cross-Modal Learning

Cross-modal learning refers to transferring the knowledge a model has learned in one modality to another modality, so as to improve the model's performance in the new modality. The basic idea of this approach is to apply a model's knowledge from one modality to another by sharing parts of the model. Typical cross-modal learning models include Deep Cross-Modal Projection Learning and cross-modal transfer learning.

Multimodal Self-Supervised Learning

Multimodal self-supervised learning refers to using the relationships between multiple modalities to train a model without explicit label information. The core idea of this approach is to construct self-supervised tasks across multiple modalities in order to obtain a common representation. Typical multimodal self-supervised learning models include Joint Audio-Visual Self-Supervised Learning and SimCLR-MultiTask. All three branches can improve model performance and have a wide range of applications across different tasks.

Viewed from these technical branches, multimodal learning is concerned with three aspects: how the modalities work together, how to convert between modalities, and the multimodal datasets themselves.

Multimodal Learning Challenges

Multimodal deep learning aims to address five core challenges, which are active areas of research. Solutions or improvements to any of the following challenges will advance multimodal AI research and practice.

Multimodal Representation

Multimodal representation is the task of encoding data from multiple modalities in the form of vectors or tensors.

Capturing the semantic information of raw data and representing it well is very important for the success of machine learning models. However, it is very difficult to extract features from heterogeneous data to exploit the synergy between them. Furthermore, it is crucial to take full advantage of the complementarity of different modalities and not focus on redundant information.

Multimodal representations fall into two categories.

Joint Representation
The joint representation concatenates the feature vectors of the different modalities to form a single joint feature vector.

If there are $m$ modalities and the feature vector of the $i$-th modality is $\mathbf{x}^{(i)} \in \mathbb{R}^{d_i}$,

then the joint representation is:

$$\mathbf{x}^{(j_1, j_2, \ldots, j_m)} = [\mathbf{x}^{(1)}, \mathbf{x}^{(2)}, \ldots, \mathbf{x}^{(m)}] \in \mathbb{R}^{d_1 + d_2 + \cdots + d_m}$$

where $j_1, j_2, \ldots, j_m$ indicate the modalities being joined.

Coordinated Representation
A coordinated (synergistic) representation is formed here as a weighted sum of the feature vectors of the different modalities, where the weights can vary by modality (for the sum to be well defined, the feature vectors must share a common dimension, e.g. after projection). Let the feature vector of the $i$-th modality be $\mathbf{x}^{(i)} \in \mathbb{R}^{d_i}$ and the weight of the $i$-th modality be $w_i$; then the coordinated representation is:

$$\mathbf{x}^{(w)} = \sum_{i=1}^{m} w_i \mathbf{x}^{(i)}$$

where $w = [w_1, w_2, \ldots, w_m]$ and $\sum_{i=1}^{m} w_i = 1$, i.e. the weights sum to 1.
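As a rough illustration (not from the original article), the two kinds of representation can be written in a few lines of NumPy. The dimensions, the weights, and the random projection matrices that bring each modality to a common dimension for the weighted sum are all illustrative assumptions:

```python
import numpy as np

# Hypothetical unimodal feature vectors (dimensions chosen arbitrarily)
x_image = np.random.rand(512)   # visual modality,  d_1 = 512
x_text = np.random.rand(300)    # textual modality, d_2 = 300
x_audio = np.random.rand(128)   # audio modality,   d_3 = 128

# Joint representation: concatenate the feature vectors -> dimension d_1 + d_2 + d_3
x_joint = np.concatenate([x_image, x_text, x_audio])
print(x_joint.shape)            # (940,)

# Coordinated (weighted-sum) representation: the weights sum to 1.
# Random projections stand in for learned matrices that map each modality
# to a shared dimension so that the weighted sum is well defined.
d_common = 256
projections = [np.random.rand(d_common, x.shape[0]) for x in (x_image, x_text, x_audio)]
weights = [0.5, 0.3, 0.2]       # example weights w_i with sum(w_i) = 1
x_coord = sum(w * (P @ x) for w, P, x in zip(weights, projections, (x_image, x_text, x_audio)))
print(x_coord.shape)            # (256,)
```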

Multimodal Fusion

Multimodal fusion combines information from multiple modalities (such as image, text, and audio), possibly coming from different sensors or data sources, to improve the accuracy and efficiency of a task.

The following are some common multimodal fusion techniques:

  1. Fusion based on feature extraction: features extracted by modality-specific feature extractors (such as convolutional neural networks or recurrent neural networks) are fused, for example by weighted summation or concatenation.
  2. Mapping-based fusion: a mapping function is learned that projects the data of different modalities into the same feature space, where they are fused.
  3. Fusion based on graphical models: graphical models (such as conditional random fields or graph convolutional networks) are used to model and fuse the multimodal data.
  4. Fusion based on attention mechanisms: information from different modalities is weighted and fused by learning attention weights, amplifying the influence of important information.
  5. Fusion based on generative adversarial networks (GANs): by training a GAN, information from different modalities is fused in a generator whose output can be used for a specific task.
  6. Sensor fusion: the output information of multiple sensors is fused to improve accuracy and robustness.

The above techniques can be used alone or in combination to suit different tasks and data types.

Modal Alignment

Modal alignment refers to the task of identifying direct relationships between different modalities.

Current research on multimodal learning aims to create modality-invariant representations. This means that when different modalities refer to similar semantic concepts, their representations must be similar/close in the latent space.

For example, the phrase "she jumped into the pool", an image of a pool, and the audio signal of a splash should lie close together in the latent representation space.
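As a hedged sketch of how such alignment is usually encouraged in practice (this specific loss is not given in the article), a contrastive objective pulls paired embeddings together and pushes unpaired ones apart, in the spirit of CLIP-style training; the embeddings below are random stand-ins for real encoder outputs:

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_emb, image_emb, temperature=0.07):
    """InfoNCE-style loss: matching text/image pairs are pulled close in the
    shared latent space, non-matching pairs are pushed apart."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # pairwise cosine similarities
    targets = torch.arange(text_emb.size(0))         # the i-th text matches the i-th image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Stand-in embeddings for a batch of 8 paired examples, each 256-dimensional
text_emb = torch.randn(8, 256)
image_emb = torch.randn(8, 256)
print(contrastive_alignment_loss(text_emb, image_emb))
```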

Modal Translation

Translation is the act of mapping one modality to another. Its main idea is how to translate one modality (e.g. text modality) to another (e.g. visual modality) while preserving semantics. However, translation is open-ended and subjective, and no perfect answer exists, which increases the complexity of the task.

Part of the current research on multimodal learning is building generative models that translate between different modalities. The recent DALL-E and other text-to-image models are good examples of such generative models that transform a textual modality into a visual modality.

How Multimodal Learning Works

A multimodal neural network is usually a combination of multiple unimodal neural networks.

For example, an audiovisual model might consist of two unimodal networks, one for visual data and the other for audio data.

These unimodal neural networks typically process their inputs separately. This process is called encoding. After unimodal encoding, the information extracted from each modality must be fused together. Various fusion techniques have been proposed, ranging from simple concatenation to attention mechanisms. The multimodal data fusion process is one of the most important success factors. After fusion occurs, the final "decision" network receives the fused encoded information and is trained on the final task.

Simply put, a multimodal architecture typically consists of three parts:

  • Unimodal encoders that each encode a single modality; typically, one per input modality.
  • A fusion network that combines the features extracted from each input modality during the encoding stage.
  • A classifier that takes fused data and makes predictions.

We refer to the above modules as the encoding module (DL Module), the fusion module and the classification module, as shown in the figure:
[Figure: encoding modules (DL Modules), fusion module, and classification module]
Which model the module downstream of the fusion module should use depends on the specific multimodal task: for a generative task, a decoder-style module is needed, such as a Transformer (which can serve as either a decoder or an encoder).
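A minimal PyTorch sketch of this three-part layout follows. The layer sizes, the use of precomputed features as inputs, and concatenation as the fusion step are illustrative assumptions, not a prescription from the article:

```python
import torch
import torch.nn as nn

class SimpleMultimodalClassifier(nn.Module):
    """Unimodal encoders -> fusion module (concatenation) -> classification module."""
    def __init__(self, image_dim=2048, text_dim=768, hidden_dim=512, num_classes=10):
        super().__init__()
        # One encoder per input modality (here: small MLPs over precomputed features)
        self.image_encoder = nn.Sequential(nn.Linear(image_dim, hidden_dim), nn.ReLU())
        self.text_encoder = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion module: concatenate the encoded modalities and mix them
        self.fusion = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU())
        # Classification module on top of the fused representation
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, image_feats, text_feats):
        z_img = self.image_encoder(image_feats)
        z_txt = self.text_encoder(text_feats)
        z_fused = self.fusion(torch.cat([z_img, z_txt], dim=-1))
        return self.classifier(z_fused)  # raw logits; apply softmax for probabilities

model = SimpleMultimodalClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 10])
```

For a generative task, the final classifier would be replaced by a decoder-style module, as noted above.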

Once a multimodal model has converted information that we perceive from a human perspective into vectors from the computer's perspective, the model has, in a sense, grasped that information just as we have. The downstream task is then analogous to us acting on the knowledge we have learned.

Now let's take a deep dive into each component, using a classification model as an example.

Encoding

During encoding, we try to create meaningful representations.

Typically, each individual modality is handled by a different unimodal encoder. Often, however, the input is in the form of an embedding rather than the raw data. For example, word2vec embeddings can be used for text and COVAREP embeddings for audio. Multimodal embedding models such as data2vec, which converts video, text, and audio data into embeddings in a high-dimensional space, are among the latest practices and outperform modality-specific embeddings, achieving SOTA performance on many tasks.

The Transformer architecture, in particular, adapts well to multimodal data such as text, images, video, and audio.

Deciding whether a joint representation or a coordinated representation (explained under the representation challenge) is more appropriate is an important decision. In general, the joint representation approach works well when the modalities have similar characteristics, and it is the most commonly used approach.
In practice, when designing multimodal networks, encoders are chosen more based on what works well in each domain.

Since many multimodal research papers focus their contribution on the design of the fusion method, the encoders are often standard choices: ResNets for the visual modality and RoBERTa for text.
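For example, assuming torchvision and the Hugging Face transformers library are available, these two commonly chosen encoders can be loaded roughly as follows (a sketch that extracts one feature vector per modality, not a complete pipeline):

```python
import torch
import torchvision.models as models
from transformers import RobertaModel, RobertaTokenizer

# Visual encoder: a ResNet-50 with its classification head removed -> 2048-d features
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
visual_encoder = torch.nn.Sequential(*list(resnet.children())[:-1])

# Text encoder: RoBERTa; the hidden state at the first token serves as a 768-d sentence feature
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base")

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed image
tokens = tokenizer("she jumped into the pool", return_tensors="pt")

with torch.no_grad():
    img_feat = visual_encoder(image).flatten(1)                 # shape (1, 2048)
    txt_feat = text_encoder(**tokens).last_hidden_state[:, 0]   # shape (1, 768)

print(img_feat.shape, txt_feat.shape)
```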

Fusion

Multimodal fusion refers to the integration of information from different modalities (such as image, text, and audio) to improve the performance of the model.

The simplest approach is to use basic operations, such as concatenating or summing the different unimodal vector representations.

In multimodal fusion, the cross-attention mechanism is a commonly used technique for exchanging information between different modalities to obtain richer representations. Cross-attention establishes the interrelationship between modalities by computing attention scores across them. Taking two modalities (such as image and text) as an example, suppose their feature representations are $x$ and $y$, consisting of $n$ feature vectors $x_i$ and $m$ feature vectors $y_j$ respectively. The cross-attention mechanism computes the relationship between them with the following formulas:

$$e_{i,j} = f(x_i, y_j)$$
$$\alpha_{i,j} = \frac{\exp(e_{i,j})}{\sum_{k=1}^{m} \exp(e_{i,k})}$$

where $e_{i,j}$ is the similarity score between feature $x_i$ and feature $y_j$, $f$ is a function that computes similarity (such as a dot product or a bilinear function), and $\alpha_{i,j}$ is the attention score of $x_i$ on $y_j$, obtained by passing $e_{i,j}$ through the softmax function.

The computed attention scores $\alpha_{i,j}$ can then be used to weight and fuse the features of the different modalities, yielding a richer and more comprehensive representation. Specifically, for the feature representations of modality $x$ and modality $y$, the cross-attention fused representations are computed as:

$$\tilde{x}_i = \sum_{j=1}^{m} \alpha_{i,j} y_j$$
$$\tilde{y}_j = \sum_{i=1}^{n} \alpha_{i,j} x_i$$

(with the attention scores for the second direction normalized over $i$ analogously), where $\tilde{x}_i$ and $\tilde{y}_j$ are the representations of modality $x$ and modality $y$ after cross-attention fusion. This exchange of information between modalities improves the performance and generalization ability of the model.
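The computation above can be sketched directly in PyTorch. This minimal version uses the dot product as the similarity function $f$ and omits the learned query/key/value projections that a real cross-attention layer would include:

```python
import torch
import torch.nn.functional as F

def cross_attention_fuse(x, y):
    """x: (n, d) features of modality X; y: (m, d) features of modality Y.
    Returns x_tilde, where each x_i becomes an attention-weighted sum of the y_j."""
    e = x @ y.t()                 # similarity scores e_{i,j} (dot product as f)
    alpha = F.softmax(e, dim=-1)  # attention scores alpha_{i,j}; each row sums to 1
    return alpha @ y              # x_tilde_i = sum_j alpha_{i,j} * y_j

x = torch.randn(5, 256)  # e.g. 5 image-region features
y = torch.randn(7, 256)  # e.g. 7 text-token features
x_tilde = cross_attention_fuse(x, y)  # text-aware image features, shape (5, 256)
y_tilde = cross_attention_fuse(y, x)  # image-aware text features, shape (7, 256)
print(x_tilde.shape, y_tilde.shape)
```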

Classification

The cross-attention fused vectors are fed into the classification model, and the softmax function can be used to calculate the probability score of each category.

Suppose the fused vector fed into the classification model is $z$. Then the probability score $p_k$ for each category $k$ can be computed as:

$$p_k = \frac{\exp(w_k^{T} z + b_k)}{\sum_{j=1}^{K} \exp(w_j^{T} z + b_j)}$$

where $w_k$ and $b_k$ are parameters of the classification model and $K$ is the number of categories. The formula first computes a score for category $k$ and then normalizes the scores across all categories to obtain a probability for each one. In practice, the cross-entropy loss is used to train the classification model, minimizing the gap between the model's predictions and the actual labels. Specifically, the cross-entropy loss is:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{i,k} \log(p_{i,k})$$

where $N$ is the number of samples, $y_{i,k}$ is the $k$-th element of the label vector of sample $i$, and $p_{i,k}$ is the model's predicted probability for sample $i$ and category $k$. By minimizing the cross-entropy loss, a high-performance classification model can be trained to classify multimodal data.
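As a brief, hedged illustration of this final step (the shapes and class count below are arbitrary), `nn.CrossEntropyLoss` in PyTorch combines the softmax above with the cross-entropy loss:

```python
import torch
import torch.nn as nn

num_classes, fused_dim, batch_size = 5, 512, 16
classifier = nn.Linear(fused_dim, num_classes)  # parameters w_k, b_k
criterion = nn.CrossEntropyLoss()               # softmax + cross-entropy in one step

z = torch.randn(batch_size, fused_dim)                # fused multimodal vectors
labels = torch.randint(0, num_classes, (batch_size,))

logits = classifier(z)                 # w_k^T z + b_k for each category k
probs = torch.softmax(logits, dim=-1)  # probability scores p_k (each row sums to 1)
loss = criterion(logits, labels)       # cross-entropy loss L over the batch
loss.backward()                        # gradients used to train the classifier
print(probs.sum(dim=-1)[:3], loss.item())
```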

Multimodal Deep Learning Datasets

To advance the field, researchers and organizations have created and distributed multiple multimodal datasets. Here is a comprehensive list of the most popular datasets:

  • COCO-Captions Dataset : A multimodal dataset containing 330K images with short text descriptions. This dataset was released by Microsoft to advance the research of image captioning.
  • VQA : A visual question answering multimodal dataset containing 265K images (visual) with at least three questions (text) per image. These questions require an understanding of vision, language, and common sense to answer. Suitable for visual question answering and image captioning.
  • CMU-MOSEI : Multimodal Opinion Sentiment and Emotion Intensity (MOSEI) is a multimodal dataset for human emotion recognition and sentiment analysis. It contains 23,500 sentences, read by 1,000 YouTube speakers. This dataset combines video, audio and text modalities. A perfect dataset for training models on the three most popular data modalities.
  • Social-IQ : A perfect multimodal dataset for training deep learning models on visual reasoning, multimodal question answering, and social interaction understanding. Contains 1250 audio-videos, strictly annotated (at the action level) with questions and answers (text) related to the actions that occur in each scene.
  • Kinetics 400/600/700 : This audiovisual dataset is a collection of YouTube videos for human action recognition. It contains videos (visual modality) and sounds (audio modality) of people performing various actions such as playing music, hugging, exercising, etc. This dataset is suitable for action recognition, human pose estimation, or scene understanding.
  • RGB-D Object Dataset : A multimodal dataset combining vision and sensor modalities. One sensor is RGB, encoding the colors in the picture, while the other is a depth sensor, encoding the distance of the object from the camera. The dataset contains 300 videos of household objects and 22 scenes, equivalent to 250K images. It has been used for 3D object detection or depth estimation tasks.

Other multimodal datasets include IEMOCAP , CMU-MOSI , MPI-SINTEL , SCENE-FLOW , HOW2 , COIN , and MOUD .


Origin blog.csdn.net/weixin_42010722/article/details/129675633