Machine Learning Notes - What is Multimodal Deep Learning?

I. Overview

        Humans use the five senses to experience and interpret the world around them. Each sense captures information from a different source and in a different way. A modality refers to the way something happens, is experienced, or is captured.

        Artificial intelligence seeks to imitate the human brain, and in doing so it cannot escape the same constraint: it too must work with information arriving through multiple distinct channels.

        The human brain consists of neural networks that can process multiple modalities simultaneously. Imagine having a conversation: your brain processes multimodal input (audio, vision, text, smell). After subconsciously fusing these modalities, you can reason about what your interlocutor said, their emotional state, and your shared surroundings. This gives you a more holistic view and a deeper understanding of the situation.

        For AI to match human intelligence, it must learn to interpret, reason, and fuse multimodal information. One of the latest and most promising trends in deep learning research is multimodal deep learning. In this article, we demystify multimodal deep learning. We discuss multimodal fusion, multimodal datasets, multimodal applications, and explain how to build machine learning models that more fully perceive the world.
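The fusion idea above can be sketched in a few lines of code. The following is a minimal, illustrative example of late fusion (the specific dimensions, weights, and encoder functions are assumptions for illustration, not a model from this article): each modality is encoded into its own feature vector, the vectors are concatenated, and a shared linear layer produces a fused representation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, weights):
    """Toy per-modality encoder: a linear map followed by ReLU.
    (Hypothetical; a real system would use a CNN, transformer, etc.)"""
    return np.maximum(0.0, weights @ x)

# Assumed dimensions for illustration: a 64-d image feature and a 32-d
# text feature, each encoded to 16-d, then fused down to 8-d.
w_image = rng.standard_normal((16, 64))
w_text = rng.standard_normal((16, 32))
w_fuse = rng.standard_normal((8, 32))  # 32 = 16 + 16 after concatenation

image_feat = rng.standard_normal(64)
text_feat = rng.standard_normal(32)

# Late fusion: encode each modality separately, then combine.
fused = w_fuse @ np.concatenate([encode(image_feat, w_image),
                                 encode(text_feat, w_text)])
print(fused.shape)  # (8,)
```

Real multimodal models replace the toy encoders with modality-specific networks and often learn the fusion step jointly, but the concatenate-then-project pattern is the simplest starting point.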

II.


Source: blog.csdn.net/bashendixie5/article/details/132645917