VALSE 2023 | Towards AGI in Computer Vision: Inspiration from GPT and Large Models

Author | Xie Lingxi Editor | Jishi Platform

Original link: https://zhuanlan.zhihu.com/p/620631150


Guide

 

Is CV ready to achieve the unification of tasks and systems? What exactly can current vision foundation models (including the recent SAM) do, and what can they not do?

Introduction

It has been more than half a year since my last long article, and this period has been a thrilling one for the AI industry and even for the world at large. The most striking event was the release of ChatGPT and GPT-4. There is no doubt that GPT-4 is the most powerful AI program ever created. In a subsequent paper [1], scholars referred to GPT-4 as the spark of artificial general intelligence (AGI). It is true that everyone defines AGI differently, and GPT at this stage has not truly solved all the problems in the AI field; however, the technology based on large models has come close to unifying the NLP field, and one can even vaguely see the dawn of AGI. Perhaps, within 3-5 years, we will see an AGI computing architecture built on the von Neumann architecture; if so, large models will occupy a core position in it.

Beyond the GPT series, the field of computer vision has also been lively, with amazing progress in several cutting-edge directions. For the public, the most tangible one is AI painting. The emergence of technologies such as Stable Diffusion [2] and ControlNet [3] has greatly lowered the threshold for training and applying diffusion models. In communities such as Midjourney, AI painting has developed at a breathtaking pace, and many technical difficulties (for example, AI struggling to draw hands or to count) have seen initial improvements. Today, anyone with an entry-level GPU or a small subscription fee can produce their own AI creations. Even visual perception, which had been rather quiet for a while, has been stirred up by a method called SAM [4]: although SAM still has many defects (such as limited semantic recognition ability), it lets people see more possibilities for vision foundation models. According to Google Scholar, SAM received more than 200 citations in just two months, which shows both the high degree of attention and how crowded (involuted) the research has become.

Faced with such shocks, many researchers, myself included, feel a little lost. Clearly, following NLP's lead, the unification of tasks and systems will become the core topic of the entire CV field in the next 3-5 years. But is CV ready to achieve this ambitious goal? What exactly can current vision foundation models (including the recent SAM) do, and what can they not do? This article, written intermittently over two months, records my thinking on these questions.

Part of this article was also compiled into a mini-survey, which, together with the slides presented at VALSE, can be found at the following address:

https://github.com/198808xc/Vision-AGI-Survey

Figure 1: Screenshot of the homepage of the research report.

The arXiv link is as follows:

https://arxiv.org/abs/2306.08641

In this article, I will start from the definition of AGI. Then I will briefly review the transformation that the NLP field has undergone: the GPT series based on large language models has brought epoch-making changes to natural language processing and ignited the spark of AGI. Next, I will turn to CV. As the next important battlefield of AGI, the CV field is moving toward unified models, but great difficulties remain. I will review existing work, analyze the intrinsic difficulties, and propose a new research paradigm inspired by GPT. Finally, I will share some personal views.

Artificial Intelligence and General Artificial Intelligence

People today are no strangers to the term artificial intelligence (AI). AI in the modern sense was born at the Dartmouth Conference in 1956 and has since experienced decades of development, with several ups and downs. The fundamental goal of AI is to reproduce human intelligence using mathematical methods. In recent years, driven by deep learning, the field has made great progress and profoundly changed the way people live and work.

Artificial general intelligence (AGI) is the ultimate goal of AI development. There are many definitions of AGI; a popular one is that AGI is an algorithm that possesses the capabilities of any human or animal. Since the early Turing test (which predates the Dartmouth conference), the pursuit of and debate about AGI has never stopped. The emergence of deep learning has greatly accelerated progress toward AGI, and the recent GPT series is considered by some scholars to have ignited the spark of AGI [1]. Deep learning itself provides a general methodology: once the input and output forms are fixed, people can use statistical learning to construct a neural network (a hierarchical mathematical function) that approximates the mapping between them. As long as there is enough data, deep learning can be applied to many AI subfields such as CV, NLP, and reinforcement learning.

Regarding the formal definition of AGI, we can refer to the book "Artificial General Intelligence" published in 2007 [5]. Place an agent in an environment: as it observes a sequence of states, it chooses actions from some set and receives corresponding rewards. The goal of AGI is to learn a mapping that maximizes the cumulative reward obtained while acting in the environment. Although this definition is very simple, it is very difficult to realize. The main difficulties include, but are not limited to: real data is high-dimensional, human intelligence has complex characteristics, and the theory of neuroscience and cognitive science is lacking.
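
To make this abstract definition concrete, it can be written in standard reinforcement-learning notation (a sketch of the book's idea, not a formula taken from it):

```latex
% At step t the agent observes state s_t, chooses action a_t \in A according to a
% policy \pi, and receives reward r_t from the environment. Under this definition,
% AGI is a policy that maximizes the expected cumulative (optionally discounted)
% reward accumulated while acting in the environment:
\pi^{*} \;=\; \arg\max_{\pi}\;
\mathbb{E}_{a_t \sim \pi(\,\cdot \mid s_{1:t}\,)}\!\left[\sum_{t=1}^{T} \gamma^{\,t-1}\, r_t\right],
\qquad 0 < \gamma \le 1 .
```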

GPT: Igniting the Spark of AGI in NLP

Since its release, the GPT series has broken numerous records, including reaching 100 million users within two months. The importance of this record is that it shows, for the first time in history, that an AI algorithm can serve ordinary consumers (to-consumer, or 2C). To achieve 2C, an AI algorithm must have strong general capabilities and be able to meet most user requirements. Surprisingly, GPT does exactly this. GPT essentially solves the common problems of the NLP field, and on many problems (such as writing code) its ability even exceeds that of specially designed algorithms. In other words, GPT has realized the grand unification of NLP: tasks that previously seemed isolated can all be unified under multi-round dialogue. It is true that GPT is not perfect and will make mistakes or talk nonsense on many issues, but within the foreseeable future, the research paradigm of NLP will not undergo major changes. This protracted NLP war (nearly 70 years since the Dartmouth conference) has been won, and the next step is to clean up the battlefield: solving vertical-domain problems, improving logical reasoning, enhancing user experience, and so on.

Regarding GPT's capabilities, I will not go into detail here; you can refer to the vast amount of material on the Internet, or to the systematic and detailed analysis in the "Sparks of AGI" paper [1]. I just want to quote a sentence from OpenAI's official GPT-4 announcement:


As a result, our GPT-4 training run was (for us at least!) unprecedentedly stable, becoming our first large model whose training performance we were able to accurately predict ahead of time.

In other words, the essence of GPT-4 is still a neural network, a probabilistic model; but the behavior it exhibits (in both training and testing) is no longer as unpredictable as one expects from a probabilistic model. This is indeed a remarkable technological breakthrough!

Regarding the implementation principles of the GPT series, many excellent articles already exist, so I will only summarize briefly. GPT training is divided into two stages. The first stage is generative pre-training, performed mainly on unlabeled general-purpose corpora: by predicting the next word, the large language model fits the distribution of general text and acquires in-context learning ability, allowing it to adapt to new tasks from a small number of examples. The second stage is instruction fine-tuning, performed mainly on annotated dialogue data: in this process, the large language model aligns the general text distribution to question-answering data, significantly improving its ability to solve targeted problems. In addition, the model can learn a reward function from the feedback of human users, which further strengthens its ability to satisfy user preferences. If you are interested in a more detailed analysis, you can search for explanations of how ChatGPT is implemented.
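
As a rough illustration of these two stages, here is a deliberately simplified sketch (my own pseudocode around a generic autoregressive `model`; it is not OpenAI's actual training code, and it omits the RLHF reward-model step mentioned above):

```python
import torch
import torch.nn.functional as F

def pretraining_step(model, token_ids):
    """Stage 1 (generative pre-training): next-word prediction on unlabeled text."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    logits = model(inputs)                                   # (batch, seq_len, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def instruction_tuning_step(model, token_ids, answer_mask):
    """Stage 2 (instruction fine-tuning): the same next-word loss, but computed only
    on the answer part of annotated (instruction, answer) dialogue pairs."""
    inputs, targets = token_ids[:, :-1], token_ids[:, 1:]
    mask = answer_mask[:, 1:].float()                        # 1 where the target token belongs to the answer
    logits = model(inputs)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="none")
    return (loss * mask.reshape(-1)).sum() / mask.sum()
```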

CV: The next battleground for AGI

Since humans understand the world through multiple modalities, true AGI must combine CV and NLP. However, implementing AGI in CV is much more difficult than in NLP. According to the earlier definition, a real AGI should be able to solve general problems and interact with the environment, not just complete elementary tasks such as object recognition or multimodal dialogue. Yet, as shown in Figure 2 (example source: UberNet [6]), compared with GPT, which uses a unified dialogue system to solve all problems, the common methodology of CV is still rather primitive: it is mostly limited to using independent models, or even independent algorithms, to solve different problems, including image classification, object detection, instance segmentation, saliency detection, image captioning, parse graphs, and so on.


Figure 2: Current CV mainly uses independent algorithms to solve different problems.

The Difficult Road to CV Unification

To approach the level of GPT, CV must move toward a unified system, that is, use one system to solve various visual problems. There have been quite a few such attempts recently, which we group into five main directions. The first three mainly address the unification of task forms, the fourth the unification of visual task logic, and the last the unification of vision-language interaction. Below we briefly review representative work in each direction and analyze its strengths and weaknesses.

  • Open-domain visual recognition: the algorithm is required to recognize not only concepts that appeared in the training set, but also concepts that did not, referred to via natural language or other means. The main foundational work for open-domain recognition is CLIP [7], which aligns text and image features in a shared space, enabling people to refer to target semantics with natural language and thereby complete classification, detection, segmentation, grounding, on-demand recognition, and other tasks (a minimal sketch of this alignment idea is given after this list). Although natural language provides enough flexibility to make open-domain recognition possible, it is difficult for natural language to refer to fine-grained information in visual signals, which limits recognition ability to some extent.

  • Segment Anything task: by designing a unified prompting system and closing the data loop at the annotation level, SAM [4] can segment all basic units in an image and demonstrates generalization across a wide range of visual domains. Without retraining, SAM can provide basic semantic units and be applied to segmenting 3D objects, removing and in-filling objects, and segmenting medical images or hidden objects. The important idea conveyed by SAM is to reduce the difficulty of the visual task (here, segmentation without semantic labels) so as to unify the definition of the task form and strengthen the model's cross-domain transfer ability. In form, SAM looks like one part of a general visual recognition pipeline, but how to build reasonable upstream and downstream modules around it (to form a complete pipeline) remains an open problem.

  • Universal visual coding: a series of attempts to integrate multiple tasks through a unified encoding. Although they take different forms, they point to the same goal: by encoding data of different modalities and different tasks into a unified form, a single neural network model can complete as many tasks as possible. Representative methods fall into three categories. First, Gato [8] verifies that a single transformer model can complete CV, NLP, and reinforcement-learning tasks. Second, pix2seq [9] and OFA [10] verify that different visual tasks (such as detection, segmentation, and captioning) can be unified in the form of natural language, so that a single model can be trained on all of them. Finally, Painter [11] and SegGPT [12], borrowing the in-context learning idea from NLP, encode a series of vision tasks as dense image-prediction tasks of different forms and train a single vision-only model to solve them. Compared with the traditional visual recognition framework, these methods are closer to the goal of unification and also demonstrate that current neural network models, especially transformers, can adapt to a large class of cross-modal tasks. However, they pursue only formal unity; the boundary with multi-task learning is not clear, and the benefits of unification have not been fully demonstrated.

  • Visual understanding guided by a large language model: with the assistance of a language model, complex visual problems are decomposed into a unified logical chain and solved step by step. This kind of method is not new: as early as 2017 there were attempts to use an LSTM to decompose the question and call vision modules [13]. It is the emergence of large language models that has greatly enhanced the versatility of this methodology. A recent body of work shares the approach of using GPT to convert textual questions into step-by-step executable logic. This logic can be code, a call to a search engine, a flow chart, or natural language. When necessary, the program calls vision modules to provide basic capabilities such as detection, counting, OCR, and captioning. This enriches the logic of visual question answering and improves the interpretability of answers, but it relies heavily on the large language model and the basic vision modules. In many cases, the visual tasks themselves (detection being a representative example) also require complex logic to complete, and current methods clearly cannot drill down to that depth.

  • Multimodal dialogue: introducing images or videos as references in dialogue tasks, so that a unified form of visual understanding is accomplished through dialogue. On top of vision, language, and cross-modal pre-trained models, only a small number of parameters need to be fine-tuned to complete question answering [14]. Inspired by the GPT series, researchers feed visual annotations into language models and generate question-answer data with simple prompts [15]. After being fine-tuned on such data, multimodal dialogue models can answer complex questions, and the results are already comparable to the examples shown in the GPT-4 technical report [16]. However, most of the capability of current multimodal dialogue systems comes from the large language model. This means that, like open-domain recognition, multimodal dialogue has limited ability to refer to fine-grained visual information. With a complex image as reference, it is difficult to ask about a specific person or object, which limits the ability to solve concrete problems.
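
To illustrate the cross-modal alignment that the first direction (and several others) relies on, here is a minimal sketch of a CLIP-style contrastive objective. It is a simplification of CLIP [7] (the real model uses a learned temperature, very large batches, and prompt engineering), and the function names are my own:

```python
import torch
import torch.nn.functional as F

def clip_style_alignment_loss(image_features, text_features, temperature=0.07):
    """Symmetric contrastive loss that pulls matched image/text pairs together.
    `image_features` and `text_features` are (batch, dim) embeddings from two encoders."""
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)
    logits = image_features @ text_features.t() / temperature    # (batch, batch) similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # i-th image matches i-th text
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

# Zero-shot ("open-domain") classification then amounts to embedding candidate class
# names as text prompts and picking the class whose text embedding is most similar
# to the image embedding.
```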

Research in these directions has brought CV to a new level. Judging from current progress, CV algorithms can perform visual recognition under certain conditions and can carry out preliminary multimodal dialogue, but they are still far from a unified model and a general problem-solving ability, and the latter is exactly what AGI requires.

So we cannot help asking: why is it so difficult to achieve unification in CV? The answer to this question has to be found in NLP.

Inspiration from NLP to CV

Let us try to understand what GPT accomplishes from another perspective. Imagine that, like GPT, we live in a plain-text world. In such a world, multi-round dialogue is both sufficient and necessary: on the one hand, we can only communicate with other agents through text; on the other hand, we can complete all tasks through multi-round dialogue. In other words, in NLP the learning environment is complete: we train the algorithm through multi-round dialogue, and an algorithm that masters multi-round dialogue is an AGI that can complete all tasks. I call this property "what you learn is what you need", a term coined after "what you see is what you get".

From this point of view, the dialogue task defined by GPT is more important than GPT's implementation! This definition enables the AI algorithm to learn by interacting with the environment, which fits the definition of AGI: interact with the environment and maximize reward. In comparison, CV has not formed a clear route: there is no environment for pre-training, and the various algorithms cannot solve problems in an actual environment. This clearly deviates from the basic principles of CV and AGI. In fact, as early as the 1970s, David Marr, a pioneer of computer vision, proposed that vision algorithms must build models of the real world and learn from interaction [17]; other scholars subsequently pointed out the importance of interaction. Today, however, most vision algorithms study not how to interact with the environment, but how to improve accuracy on various tasks.

Why is this? Simply because constructing such an environment is too difficult!

Proxy tasks: a compromise between ideals and reality

To construct an environment (scene) for CV tasks, there are two main approaches:

  • Build a real environment: place a large number of agents in the real world and let them learn by interacting with other agents, including humans. The disadvantage of this approach is that the cost is too high and it is difficult to ensure safety.

  • Build a virtual environment: simulate or reconstruct a 3D environment with visual algorithms and train agents in the virtual world. The disadvantage is a lack of realism, both in scene modeling and in agent behavior, so it is difficult for the trained agents to transfer effectively to the real world.

In addition, simulating the behavior of other agents in the environment is also very important, because it determines the adaptability of CV algorithms in real application scenarios. If you want the environment itself to interact with the agent (for example, placing a real robot in the real world), the cost of collecting data increases significantly. On the other hand, the action patterns of agents in a simulated environment are often rather limited, making it difficult to mimic the rich, open-domain behavior of the real world.

In general, the environments constructed so far are not sufficient for large-scale training of CV algorithms. When the environment cannot be simulated, people settle for the next best thing: instead of interacting with the environment directly, they sample a large amount of data from the real environment and define the capabilities that may be required for interaction as a series of proxy tasks (that is, by completing these tasks, the algorithm is supposed to approach the final goal), such as object recognition, tracking, and so on. The hypothesis is that by improving accuracy on these proxy tasks, CV algorithms can be brought closer to AGI.

But the question is, is this assumption correct?

Figure 3 expresses our point of view. Before the advent of deep learning, CV algorithms were relatively weak and the accuracy of proxy tasks was low; at that time, pursuing proxy tasks largely promoted progress toward AGI. In the past decade, however, with the development of deep learning, the various proxy tasks have become highly saturated: on ImageNet-1K, top-1 classification accuracy has risen from below 50% in the pre-deep-learning era to above 90%. At this point, continuing to push the accuracy of proxy tasks may no longer bring us closer to AGI, and may even run counter to it. The emergence of GPT further confirms this view: once a model close to AGI appears, the originally isolated NLP proxy tasks, such as translation and named entity recognition, are no longer important.

Checkmate, proxy tasks!

Figure 3: The proxy tasks of CV are losing their meaning, and may even be moving us away from AGI.

The Future Paradigm: Learning from the Environment

The learning process we envision is shown in Figure 4 (image sources: Habitat [18] and ProcTHOR [19]) and is divided into the following stages:

  • Phase 0, environment construction: build virtual environments in various ways, making them as rich, realistic, and interactive as possible.

  • Phase 1, generative pre-training: let the agent explore the environment and, conditioned on its own actions, predict what it will see next. This corresponds to GPT's pre-training phase, where the task is to predict the next word. In this process, the CV algorithm memorizes the distribution of the real world and becomes well prepared to learn downstream tasks from a small number of samples (a schematic sketch of Phases 1 and 2 follows this list).

  • Phase 2, instruction fine-tuning: train the agent to perform specific tasks, such as finding particular objects, or even interacting with other agents. This corresponds to GPT's instruction fine-tuning, which likewise relies on rich task descriptions and manually collected instruction data. In this process, to complete the tasks, the CV algorithm must master various visual concepts and acquire the ability to process visual signals on demand.

  • Downstream stage (optional): the AGI model can be applied to traditional vision tasks in a prompt-based manner.
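
The following schematic sketch shows how Phases 1 and 2 could fit together; the `env` and `agent` interfaces are hypothetical placeholders of my own, not an existing library:

```python
def generative_pretraining(agent, env, num_steps):
    """Phase 1: explore the environment and, conditioned on past observations and the
    agent's own actions, predict the next observation (the visual analogue of
    next-word prediction)."""
    history = [env.reset()]
    for _ in range(num_steps):
        action = agent.explore(history)                     # free exploration, no task yet
        next_obs = env.step(action)
        loss = agent.prediction_loss(history, action, next_obs)
        agent.update(loss)                                  # memorize the world's distribution
        history.append(next_obs)

def instruction_finetuning(agent, env, tasks):
    """Phase 2: fine-tune the pre-trained agent on concrete instructions
    (e.g. "find the red cup"), using task rewards or human demonstrations."""
    for instruction, reward_fn in tasks:
        obs, done = env.reset(), False
        while not done:
            action = agent.act(obs, instruction)            # behavior now conditioned on the task
            obs = env.step(action)
            agent.update(-reward_fn(obs, action))           # maximize the task reward
            done = env.task_done(instruction)
```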

Figure 4: The envisioned future CV training process: explore the environment, complete tasks, and transfer to downstream perception tasks.

Note that in such a process, proxy tasks become abilities that the algorithm acquires incidentally after training on general tasks. Most current CV research, however, treats proxy tasks as the only pursuit, which is putting the cart before the horse.

There are many difficulties in realizing this process; we analyze them stage by stage.

  • More complex virtual environments. Currently there are two main ways to construct a virtual environment. The first is to build it from real data: collect actual scene data and model it as point clouds, meshes, or Neural Radiance Fields (NeRF), supporting high-speed, large-scale rendering. The cost of this approach is relatively high, and it is difficult to scale up environment production; currently available 3D datasets (such as Habitat [18]) are several orders of magnitude smaller than 2D datasets and are still limited to specific scenes (such as indoor or street scenes). The second is to construct the environment by simulation: directly sample virtual data and render 3D environments through 3D modeling and generative algorithms (including GANs and diffusion models). Although this approach can generate environments in batches (such as ProcTHOR [19]), it is not easy to reproduce the real-world data distribution: the generated images usually contain artifacts that affect learning (even when they are hard to see with the naked eye), making it difficult to guarantee that models trained on virtual data will transfer. Whichever method is used, the scale and realism of current virtual environments fall short of what is required, and it is difficult to let AI algorithms interact with other agents in the environment.

  • More complex data structures. The data structure of NLP is relatively simple: it naturally has basic, indivisible semantic units such as "words" [20], and architectures such as the transformer are naturally designed to process these discrete units. At the task level, NLP defines pre-training as contextual generation (commonly known as cloze or next-word prediction) and also models all downstream tasks as contextual generation; such a seamless framework keeps the gap between pre-training and downstream tasks very small. The data structure of CV is much more complicated: the complexity lies not only in the higher dimensionality of images, but also in the difficulty of defining the basic semantic unit of an image. In this situation, blindly "copying homework" by forcibly dividing an image into tokens to fit the transformer architecture is clearly not the best solution. At present, I increasingly tend to think that visual tokens are just an illusion, an expedient; the mathematical form that truly suits visual representation still awaits further work to reveal.

  • More complex practical tasks. Obviously, after introducing visual signals, the agent can complete more and more complex tasks by interacting with the environment. Compared with NLP's multi-round dialogue, these tasks are more complex in form, richer in data modality, and more diverse. It can be expected that instruction fine-tuning will require collecting more data, and even introducing the behavior patterns of real agents, which places higher demands on data volume and data complexity.

Recently, we have noticed some exciting work. One is PaLM-E [21], which uses a cross-modal foundation model to guide embodied vision algorithms and enhance their capabilities. Another, less famous than PaLM-E but even more exciting, is ENTL [22], which models both environment modeling and instruction learning as sequence prediction, realizing a prototype of the framework described above. These works light the way for learning in the environment; on this basis, with further system design and engineering optimization, we will see a bright future for CV unification.

Summary

In their proposal for the Dartmouth conference, the pioneers of AI wrote down a seemingly mundane but incredibly difficult problem: how can computers learn to use human language? After decades of hard work, researchers have finally seen the dawn of AGI in NLP, but CV is still far from this goal. The essential reason for CV's current dilemma is that the field has not established a paradigm of "learning from the environment"; it can only sample the environment and design proxy tasks, and cannot close the loop at the system level. In the future, to achieve the unification of CV, we must abandon the existing framework and design a new embodied paradigm, so that CV algorithms can strengthen their abilities and continuously evolve through interaction with the environment.

Some personal reflections

Recently I have seen many slightly impetuous arguments, the most common being that AI will revolutionize everything, even eliminate most AI practitioners, and eventually achieve universal unemployment (please strike out that last phrase). As a rational practitioner, I know that the abilities of CV algorithms are still rather limited and there are many hard nuts left to crack. But one thing is certain: large language models (LLMs) already possess strong intent understanding and preliminary logical reasoning, and therefore meet the conditions for becoming the "central system" through which AI communicates with humans. Once this is the case, the technical route will solidify. In the next 3-5 years, or even longer, the industry has only two things to do: keep strengthening the central system (enhancing the LLM or building its multimodal variants, improving its capabilities in a modular way), and replicating this paradigm in CV. Today it is meaningless to debate whether large models are the future; what we must do is pave the way and prepare for the real use of large models in CV.

At present, large models look likely to become a revolutionary technology comparable to deep learning itself, and we are probably living through a technological revolution. In the new era defined by large models, each of us is a beginner. The remnants of the old era, represented by proxy tasks, will soon lose their value; those who cannot bravely embrace the new methods will perish along with the proxy tasks.

Appendix

The following text supplements the viewpoints above; it is thinking that has not yet formed a complete system.

Talking about the Fundamental Difficulty of CV

In last year's article, I explained the three fundamental difficulties of CV, namely information sparsity, inter-domain differences, and infinite granularity, and pointed out that they are side effects of the "sampling + proxy tasks" paradigm. The link to that article is as follows:

https://zhuanlan.zhihu.com/p/558646681

Key passages are excerpted as follows:

Fundamentally, natural language is a carrier created by humans to store knowledge and exchange information, so it necessarily has high efficiency and high information density; images, by contrast, are optical signals captured through various sensors. They objectively reflect the real situation, but correspondingly they do not carry strong semantics, and their information density may be very low. From another perspective, the image space is much larger than the text space, and its structure is also much more complex. This means that if we want to sample the space and use the samples to represent the distribution of the whole space, the amount of image data required is many orders of magnitude larger than the amount of text data. Incidentally, this is also the essential reason why pre-trained language models work better in practice than pre-trained vision models; we will come back to this later. From the above analysis, the difference between CV and NLP gives us the first fundamental difficulty of CV, namely semantic sparsity. The other two difficulties, inter-domain differences and infinite granularity, are also related to this essential difference. Precisely because image sampling does not take semantics into account, when sampling different domains (i.e., different distributions, such as day and night, sunny and rainy days), the sampled results (i.e., image pixels) are strongly correlated with domain characteristics, which causes inter-domain differences. Meanwhile, since the basic semantic units of images are hard to define (whereas for text they are easy to define) and the information expressed by images is rich and diverse, humans can extract almost infinitely fine semantic information from images, far beyond what any current evaluation metric in CV can capture; this is infinite granularity [23].
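
A back-of-envelope comparison (my own illustration with arbitrary but typical sizes, and only a crude upper bound, since natural images occupy a tiny manifold of the raw pixel space) makes the scale gap concrete:

```latex
% Raw pixel space of a 224 x 224 RGB image with 8-bit channels versus
% the space of 1,000-token texts over a 50,000-word vocabulary:
\bigl|\mathcal{X}_{\text{image}}\bigr| = 256^{\,224 \times 224 \times 3} = 2^{\,1204224} \approx 10^{\,362000},
\qquad
\bigl|\mathcal{X}_{\text{text}}\bigr| = 50000^{\,1000} \approx 10^{\,4700}.
```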

Further analysis leads us to an interesting conclusion: the essential difficulty of CV lies in humans' limited understanding of visual signals. Humans have never truly grasped the structure of visual signals, nor defined a dedicated language for them; we can only refer to and express visual signals through natural language. There is plenty of evidence for this: most people, without training, find it difficult to draw realistic images (which shows that humans have not grasped the data distribution of images); at the same time, it is difficult for most people to accurately convey the meaning of an image to another person through language alone. Even when two people talk by voice while looking at the same picture on a computer, referring to detailed elements of the picture through pure verbal communication is not always easy.

If we re-examine the three fundamental difficulties of CV, we find that they can be unified: they all reflect the subjective and uncertain representation granularity of visual signals, or the contradiction between pursuing the objectivity of visual signals and pursuing the conciseness of semantic signals. When the representation granularity is large (i.e., we pursue concise semantics), humans can express visual information relatively compactly, so the visual signal is considered semantically sparse. When the representation granularity is small (i.e., we pursue the objectivity of visual signals), humans recognize the rich visual information in the image and consider the visual signal to have infinite granularity. When the representation granularity is uncertain, it is hard for humans to map the continuously varying visual signal onto a discrete semantic space, so within the range where the visual signal changes but the semantics remain unchanged, inter-domain differences arise [24].

In addition, it should be pointed out that the contradiction between information sparsity and infinite granularity manifests mainly in traditional proxy tasks: if representation efficiency is pursued (for example, using information compression as the metric), fine-grained and accurate recognition is hard to guarantee. To circumvent this contradiction, the only solution is to construct a realistic interactive environment that allows the agent to adjust the granularity of the visual signal according to the task.

Comparing CV and NLP again, we find that NLP avoids the problem of uncertain granularity rather well. Since the text signals that NLP processes are created by humans, their granularity is the granularity of the text itself. Although this granularity is variable (for example, an object or a scene can be described precisely or roughly), it is humans who determine the granularity and ensure that it matches actual needs.

Given that the granularity of NLP is relatively clear, can it help CV? We find that almost all previous CV methods use NLP to define granularity, with two typical examples: classification-based tasks and language-reference tasks. I analyzed the defects of these two approaches in the earlier article; the excerpts are as follows.

  • Classification-based methods: this includes classification, detection, and segmentation in the traditional sense. Their basic feature is to assign a category label to each basic semantic unit in the image (image, box, mask, keypoint, etc.). The fatal flaw of this approach is that as the granularity of recognition increases, the certainty of recognition inevitably decreases; that is, granularity and certainty are in conflict. For example, ImageNet contains the categories "furniture" and "electrical appliances"; obviously a "chair" is "furniture" and a "television" is an "appliance", but whether a "massage chair" is "furniture" or an "appliance" is hard to judge. This is the drop in certainty caused by increasing semantic granularity. If a photo contains a low-resolution "person" and we insist on labeling the person's "head" or even "eyes", different annotators may make different judgments; yet at this scale, a deviation of even one or two pixels can greatly affect metrics such as IoU (see the small numeric sketch after these two points). This is the drop in certainty caused by increasing spatial granularity.

  • Language-driven methods: this includes visual prompting methods driven by CLIP, as well as the longer-standing visual grounding problem. Their basic feature is to use language to refer to and recognize semantic information in images. The introduction of language indeed enhances the flexibility of recognition and brings a natural open-domain property. However, language itself has limited referring ability (imagine referring to a specific individual in a scene containing hundreds of people), which cannot meet the demands of infinitely fine-grained visual recognition. In the final analysis, in visual recognition language should play the role of assisting vision, and existing visual prompting methods are somewhat overreaching.
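
To make the spatial-granularity point concrete, here is a small numeric sketch (my own illustration) of how a two-pixel annotation shift hurts IoU far more for a tiny box than for a large one:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A 10x10 "eye" box shifted by 2 pixels vs. a 200x200 "person" box shifted by 2 pixels.
print(iou((0, 0, 10, 10), (2, 2, 12, 12)))        # ~0.47: two annotators may disagree badly
print(iou((0, 0, 200, 200), (2, 2, 202, 202)))    # ~0.96: almost unaffected
```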

Having said all this, I return to the fundamental crux of the problem: vision has not defined its own language. The methods currently visible all use NLP to define CV. They can solve the elementary problems of CV, but anyone who tries to push them much deeper will run into a wall!

The stage of development of CV

Obviously, the great success of the GPT paradigm in NLP has made CV researchers itch to follow. Looking at NLP's development path: NLP built large models in the GPT-1 stage, observed the emergence of capabilities in the GPT-3 stage, and then used instruction learning to solve specific tasks in the ChatGPT stage.

So a very important question is: what stage has the current CV research reached?

At the end of April, I took part in a panel at a VALSE online seminar, where one of the questions was: does SAM solve computer vision, or has it at least reached the level of GPT-3 (so that strong CV algorithms can be built on top of it)? My conclusion was very pessimistic: SAM has not reached the level of GPT-3, and is arguably still far from GPT-1. The most important reason is that CV has not built a suitable learning environment. As mentioned earlier, NLP built a dialogue environment and designed the learning paradigm of cloze-style pre-training and instruction fine-tuning for dialogue tasks. If CV wants to follow this path, it should likewise build interactive tasks and design the corresponding pre-training and fine-tuning tasks. Obviously the current CV learning paradigm does not do this, which is why the upstream and downstream tasks of CV always feel disconnected: even the best-performing masked image modeling (MIM) methods seem to have little to do with downstream tasks. Solving this problem likely requires starting from the source and building a real learning environment.

Next, let us discuss the emergence of capabilities. The community still seems puzzled about why large NLP models exhibit emergent capabilities. I have a bold hypothesis: the premise for emergence is that the pre-training data has covered a sufficient proportion of the real world. In that case, the pre-trained model need not worry about overfitting, because its task is to memorize the data distribution, which in a sense is overfitting. This hypothesis also explains why NLP can pursue ever larger models: when overfitting is not a concern, larger models fit better. Here the advantages of NLP, a small feature space and a simple data form, come into play; CV will need far more data and far more compute to reach a comparable state.

I have a loose analogy: NLP is like chess, and CV is like Go. In 1997, the supercomputer Deep Blue defeated the human world chess champion through heuristic search, but a similar method could not be reproduced in Go, because the state space of Go is far larger than that of chess. Later, with the help of deep learning, the heuristic functions for Go were improved non-trivially, finally making it possible to explore the much more complex state space. Without deep learning, it might have taken humanity decades of accumulating ever larger computation to achieve the same result; the advent of deep learning greatly accelerated the process.

Back to CV. It is true that, at the current pace of data collection and compute growth, CV might, given enough time, stumble its way to NLP's current level. However, I believe that before that happens there will be a technological breakthrough that accelerates CV's catching up with NLP. The mission of CV researchers is to find this technology, or at least to find the right direction.

Prospects for future research directions

After the above discussion, the future CV pipeline has taken shape: generative pre-training and instruction fine-tuning based on an interactive environment. This is not necessarily the only route, but it is the most likely route inspired by NLP. There are many difficulties in realizing it, but once the direction is identified, these difficulties correspond to the most promising research directions.

Taking a step back, if the above pipeline is difficult to achieve in the short term, then CV should absorb the capabilities of NLP as much as possible in order to improve its general abilities. Clearly, CV research based purely on image signals will become rarer and rarer, and cross-modal research that incorporates language will become the absolute mainstream: merely using CLIP or a similar multimodal foundation model for feature extraction already amounts to accepting cross-modal thinking. On this route, the most important research direction can be summarized as "finding the way images interact with natural language", or, going further, "finding the language of images themselves"; this is also crucial for interactive tasks.

Some important research directions include:

  • [Environment construction] New 3D representations, combining NeRF, point clouds, and other data structures, aiming to build large-scale, realistic, explorable, and interactive embodied environments.

  • [Environment construction] Agent behavior simulation, including evolving the behavior patterns of agents with evolutionary algorithms.

  • [Generative pre-training] New autoregressive pre-training methods, in which neural architecture design should pursue pre-training quality rather than proxy-task accuracy. To address the redundancy of visual signals, dynamic compression rate may be a good indicator.

  • [Generative pre-training] Image-text generation algorithms, which can not only assist environment construction but also serve as an evaluation metric for pre-training.

  • [Instruction fine-tuning] Unifying various visual tasks in the form of prompts, so that the same computational model can adapt to as many tasks as possible. Incidentally, SAM offers a decoupling idea: it shows that, once semantics are weakened, the basic units of segmentation are highly versatile. Within the traditional framework, I am more optimistic about decoupling complex tasks into basic units.

  • [Instruction fine-tuning] Defining new modes of human-computer interaction, and collecting enough instruction data through human demonstrations.

References

  1. Bubeck S, Chandrasekaran V, Eldan R, et al. Sparks of artificial general intelligence: Early experiments with GPT-4[J]. arXiv preprint arXiv:2303.12712, 2023.

  2. Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 10684-10695.

  3. Zhang L, Agrawala M. Adding conditional control to text-to-image diffusion models[J]. arXiv preprint arXiv:2302.05543, 2023.

  4. Kirillov A, Mintun E, Ravi N, et al. Segment anything[J]. arXiv preprint arXiv:2304.02643, 2023.

  5. Goertzel B. Artificial General Intelligence[M]. New York: Springer, 2007.

  6. Kokkinos I. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6129-6138.

  7. Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR, 2021: 8748-8763.

  8. Reed S, Zolna K, Parisotto E, et al. A Generalist Agent[J]. Transactions on Machine Learning Research, 2022.

  9. Chen T, Saxena S, Li L, et al. Pix2seq: A language modeling framework for object detection[J]. arXiv preprint arXiv:2109.10852, 2021.

  10. Wang P, Yang A, Men R, et al. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[C]//International Conference on Machine Learning. PMLR, 2022: 23318-23340.

  11. Wang X, Wang W, Cao Y, et al. Images speak in images: A generalist painter for in-context visual learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 6830-6839.

  12. Wang X, Zhang X, Cao Y, et al. SegGPT: Segmenting everything in context[J]. arXiv preprint arXiv:2304.03284, 2023.

  13. Johnson J, Hariharan B, Van Der Maaten L, et al. Inferring and executing programs for visual reasoning[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2989-2998.

  14. Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[J]. arXiv preprint arXiv:2301.12597, 2023.

  15. Liu H, Li C, Wu Q, et al. Visual instruction tuning[J]. arXiv preprint arXiv:2304.08485, 2023.

  16. Zhu D, Chen J, Shen X, et al. MiniGPT-4: Enhancing vision-language understanding with advanced large language models[J]. arXiv preprint arXiv:2304.10592, 2023.

  17. Marr D. Vision: A computational investigation into the human representation and processing of visual information[M]. MIT Press, 2010.

  18. Savva M, Kadian A, Maksymets O, et al. Habitat: A platform for embodied AI research[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 9339-9347.

  19. Deitke M, VanderBilt E, Herrasti A, et al. ProcTHOR: Large-scale embodied AI using procedural generation[J]. arXiv preprint arXiv:2206.06994, 2022.

  20. For convenience of processing, words are often further divided into sub-word tokens, but these are still indivisible basic units.

  21. Driess D, Xia F, Sajjadi M S M, et al. PaLM-E: An embodied multimodal language model[J]. arXiv preprint arXiv:2303.03378, 2023.

  22. Kotar K, Walsman A, Mottaghi R. ENTL: Embodied Navigation Trajectory Learner[J]. arXiv preprint arXiv:2304.02639, 2023.

  23. Tang C, Xie L, Zhang X, et al. Visual recognition by request[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 15265-15274.

  24. Imagine a block of ice gradually melting into water, or a black cube gradually turning white. In such processes there is often some (imprecise) quantitative boundary beyond which the semantics change. But semantics are discrete, while the change of the visual signal is continuous; within the range where the semantics remain unchanged, the varying visual signal manifests as inter-domain differences.
