Recommended: a comprehensive, recently published multimodal survey, Multimodal Deep Learning

Introduction

Title: Multimodal Deep Learning
URL: https://arxiv.org/abs/2301.04856
Published on: arXiv, 2023

  This is less a paper than a "book": the main text alone runs to 239 pages, not counting the cover, table of contents, references, etc.

  According to its abstract, the book is the result of a seminar in which the authors reviewed multimodal approaches and attempted to create a solid overview of the field. It starts with the state-of-the-art approaches in two subfields of deep learning, NLP and computer vision, each treated individually. It then discusses modeling frameworks that transform one modality into the other, as well as models that use one modality to enhance representation learning for the other. The second part concludes with architectures that handle both modalities at once. Finally, the authors cover further modalities and general-purpose multimodal models that can handle different tasks on different modalities within a single unified architecture. The booklet closes with an interesting application: Generative Art.

  The book introduces and summarizes in detail the datasets, models, and evaluation metrics for a range of tasks in multimodal learning, CV, and NLP. Its main focus is multimodal content, but it also covers CV and NLP at length. Overall, it is a very good survey, comprehensive and detailed.


Article structure


1 Introduction

  1.1 Introduction to Multimodal Deep Learning

  1.2 Outline of the Booklet

2 Introducing the modalities

  2.1 State-of-the-art in NLP

  2.2 State-of-the-art in Computer Vision

  2.3 Resources and Benchmarks for NLP, CV and multimodal tasks

3 Multimodal architectures

  3.1 Image2Text

  3.2 Text2Image

  3.3 Images supporting Language Models

  3.4 Text supporting Vision Models

  3.5 Models for both modalities

4 Further Topics

  4.1 Including Further Modalities

  4.2 Structured + Unstructured Data

  4.3 Multipurpose Models

  4.4 Generative Art

5 Conclusion

6 Epilogue

  6.1 New influential architectures

  6.2 Creating videos

7 Acknowledgements
