Introduction
Title: Multimodal Deep Learning
URL: https://arxiv.org/abs/2301.04856
Included in: arxiv 2023
This work is better described as a "book" than a paper: the main text runs 239 pages, not counting the cover, table of contents, and references.
The book is the result of a workshop in which the authors reviewed multimodal approaches and set out to build a solid overview of the field, starting with the state-of-the-art approaches in the two deep learning subfields it draws on, NLP and CV. It then discusses modeling frameworks for transforming one modality into another, as well as models that leverage one modality to improve representation learning in another. The second part concludes with architectures that handle both modalities at once. Finally, the authors cover further modalities and generic multimodal models capable of handling different tasks across modalities within a unified architecture. The book closes with an interesting application: generative art.
The book introduces and summarizes in detail the datasets, models, and evaluation metrics for a range of tasks in multimodal learning, CV, and NLP. Its main focus is multimodal content, but it also elaborates considerably on CV and NLP individually. Overall, it is a very good survey: comprehensive and detailed.