Foundational Models Defining a New Era in Vision: A Survey and Outlook

In this review, we provide a comprehensive overview of vision foundation models, including typical architectural designs that combine different modalities (visual, textual, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and common prompting modes.



Paper: Foundational Models Defining a New Era in Vision: A Survey and Outlook

Address: https://arxiv.org/pdf/2307.13721.pdf

Project: https://github.com/awaisrauf/Awesome-CV-Foundational-Models

The visual system's ability to observe and reason about the compositional properties of visual scenes is fundamental to understanding our world. The complex relationships among objects, their positions, ambiguities, and variations in real-world environments can be better described in human language, which is naturally governed by grammatical rules, and through other modalities such as audio and depth.


Models that learn to bridge the gaps between these modalities, combined with large-scale training data, facilitate contextual reasoning, generalization, and prompting at test time. Such models are called foundation models.


The output of such models can be modified through human-provided prompts without retraining, for example, by providing bounding boxes to segment specific objects, by asking questions about images or video scenes to engage in interactive dialogue, or by using language instructions to manipulate a robot's behavior.
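To make the box-prompt idea concrete, here is a toy sketch of a promptable-segmentation interface. This is not the survey's method or a real model: a real promptable segmenter (e.g. SAM) feeds the box prompt into a learned mask decoder, whereas this stand-in simply thresholds pixel intensities inside the prompted region. The function name and threshold are illustrative assumptions.

```python
import numpy as np

def segment_with_box_prompt(image, box, threshold=0.5):
    """Toy promptable segmenter: thresholds intensities inside a box prompt.

    Stand-in for a learned promptable model: the box prompt restricts
    where the mask may appear, and no retraining is involved -- only
    the prompt changes the output.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros(image.shape[:2], dtype=bool)
    region = image[y0:y1, x0:x1]
    # Mark only the bright pixels that fall inside the prompted box.
    mask[y0:y1, x0:x1] = region > threshold
    return mask
```

The key property it illustrates: changing only the `box` argument changes which object is segmented, with no weight updates.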


In this survey, we provide a comprehensive review of such emerging foundation models, including typical architectural designs combining different modalities (visual, textual, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and common prompting modes: textual, visual, and heterogeneous.
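Of the training objectives mentioned, the contrastive one is the most common in vision-language pre-training (e.g. CLIP). The sketch below, a minimal NumPy version under the assumption of a CLIP-style symmetric InfoNCE objective, shows the core idea: matched image/text pairs sit on the diagonal of a similarity matrix and are pulled together, while mismatched pairs are pushed apart. All names and the temperature value are illustrative, not taken from the survey.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # (B, B) similarity matrix
    labels = np.arange(len(logits))          # matched pairs lie on the diagonal

    def xent(l):
        # Cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions.
    return (xent(logits) + xent(logits.T)) / 2
```

When embeddings of matched pairs are identical, the loss is near zero; permuting the text batch so pairs no longer align drives it up.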


We discuss open challenges and research directions for foundation models in computer vision, including difficulties in evaluation and benchmarking, gaps in real-world understanding, limitations in contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues.


We review recent developments in the field, systematically and comprehensively covering a wide range of applications of foundation models.




Origin: blog.csdn.net/qq_27590277/article/details/132033577