RACV2023 opinion collection | The impact of large models & ChatGPT on computer vision

Source | CCF Computer Vision Committee

Introduction: With the ChatGPT craze, questions such as how large models can play an important role in computer vision, how to use large models to serve various visual tasks, and how to break through the upper bound of algorithm performance with the help of massive data have become hot topics of common concern in academia and industry. To this end, RACV2023 organized a special seminar on "The Impact of Large Models and ChatGPT on Computer Vision", inviting experts and scholars in related fields to exchange views and discuss the future development trends of large models and vision, as well as the scientific and technological problems that urgently need to be solved.

Topic organizers: Wang Jingdong, Wang Xinggang, Li Hongyang. Discussion time: July 24, 2023. Introductory speeches: Chen Xilin, Shen Chunhua, Dai Jifeng, Xie Lingxi. Participating panellists (in order of speaking): Lu Huchuan, Wang Xinggang, Shan Shiguang, Wang Hanzi, Wei Yunchao, Lin Liang, Huang Gao, Wang Tao, Hu Shimin, Xu Kai, Xie Lingxi, Cao Xiaochun, Wu Xiaojun, Chen Baoquan, Jin Lianwen, Zha Hongbin, Xiao Bin, Wu Lifeng, Liu Jing, Dai Jifeng, Tang Jin, Wu Baoyuan. Text editing: Li Hongyang, Wang Bangjun. Proofreading and publishing: Yang Jufeng.
This article is authorized to be published by the CCF-CV Special Committee (public account: CCF Computer Vision Special Committee)

Record of speeches by keynote speakers

1. Chen Xilin (Institute of Computing Technology, Chinese Academy of Sciences)

This is my second time at RACV. It reminds me of a song called "Stupid Kid", which is about people born in the 1960s, roughly our generation, so there are some things we cannot think through clearly because we are "stupid". Today I would like to take this opportunity to ask everyone for advice and pool our wisdom for discussion.

Teacher Zha said I was lazy with this topic. Why lazy? Because I just added "About" at the front and "Some Thoughts" at the end. Let's talk about computer vision first. What we have seen in these years is a mixed blessing. Why a mixed blessing?

On the one hand, we have seen a great many papers; on the other hand, looking back, what exactly has the field of computer vision been doing all these years? Research questions are increasingly fragmented. The advantage of our field is that it is open and open source, so the barrier to entry seems very low. Another issue is that research often revolves around leaderboard tasks. Why do you work on them? Sometimes you ask students, and they say it is because someone else has done it, because there is such a dataset, because there is such a benchmark. This is actually a bad trend.

But now we see some new changes, such as SAM. I wonder how many people have actually run experiments with it. If you have actually tested it, the results may not be what you expect. During the holidays, my students helped me run many experiments, and each time I saw somewhat different results. In fact, they were far from as good as we imagined; even some large generative models were not that good. But it does point to a possible direction.

Essentially, the "segment anything" capability provided by SAM is multi-scale segmentation, because segmentation has always been a task with logical rather than physical meaning. It provides an opportunity for fragmented tasks to converge. Of course, some scholars are now trying to unify visual tasks in a way similar to ChatGPT, but whether visual tasks can be unified is worth thinking about. Some people also ask: now that large models have arrived, is our research finished? Is the impact of SAM, CLIP, and the like on our field final? There is a line in "The Three-Body Problem" that "physics no longer exists." Does our field no longer exist?

Here are ten questions for everyone to think about. The first is about today's large models: how big counts as big? I recall something I once said about big data: what is big data? Data that exceeds the upper limit of what current equipment can process is big data. I strongly agree with this view. Is the same true for today's large models? One million parameters, one billion parameters, or ten billion parameters: how big is a large model? Is the so-called large model in today's vision field a true large model, or merely a "big model"? I personally think it is a "big model", because its pre-training is often aimed at a small number of tasks. What we now call a large model earns the name because it can complete multiple tasks at the same time, or because it is built around one core task, as in our sister field of speech recognition, where the tasks are relatively unified.

The second question is easy to confuse in essence: has the large model really changed anything? If it has not, how far can large models go? Or can large models solve the problem before the curse of dimensionality strikes? We know that a house built to unlimited height will collapse sooner or later; if you only need to build a 100-story house, will it collapse? The same goes for large models: can we complete our mission before the curse of dimensionality exceeds what we can withstand?

The third question is about the complexity of large models. Who will reach the boundary first, complexity or brute force? If you have reached the boundary of the problem by relying on brute force, complexity is not an issue. When we sort 100 numbers, no one will consider the complexity of the sorting algorithm, but if we sort massive data on the Internet, we must consider it.

So who will hit the boundary first, complexity or brute force? My personal view on model complexity is that it should be measured in two ways: first, by the complexity of the algorithm, and second, by the scale of the training data required. Training data dominates the performance of most large models. When people reproduce large models today, the algorithm itself is almost entirely public; what I think needs more consideration is the relationship between the amount of training data and performance.

The next question: how mature is the model, and when can it be used? I define a measure of model maturity: available computing power divided by the complexity of the model. Once maturity reaches a certain level, the model can move towards large-scale application. I divide it into four levels: thinking level, research level, industrial level, and individual-user level. When a model's maturity is very low, only researchers realize its value. When it reaches a certain level, the academic community can use it. At a higher level, like today's large models, industry can use it. If computing power increases by another one or two orders of magnitude, it may become easy to bring such models to personal devices.
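One way to write this maturity measure down is sketched below. This is my own formalization of the idea above, with hypothetical thresholds m1 < m2 < m3, not Prof. Chen's exact notation:

```latex
\[
  m \;=\; \frac{P_{\text{available compute}}}{C_{\text{model complexity}}},
  \qquad
  \text{level}(m) =
  \begin{cases}
    \text{thinking level},        & m < m_1 \\
    \text{research level},        & m_1 \le m < m_2 \\
    \text{industrial level},      & m_2 \le m < m_3 \\
    \text{individual-user level}, & m \ge m_3
  \end{cases}
\]
```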

I personally think vision models and language models are different, and we cannot simply equate them. Language is something we are taught long after birth, whereas vision is something we start learning the moment we open our eyes. In that sense, human vision is hardly taught by a teacher at all; it is essentially self-learned, while the language system is the other way around. So what exactly are visual tasks? Considering vision alone, I have always believed that vision exists to support animal survival.

Vision is for survival: only animals have vision; plants do not need it. Why survival? There are several basic needs: the first is to avoid being eaten by predators, the second is to find food, and the third is to avoid accidents. All of these require the support of visual ability.

The core of ChatGPT is chat, so what is visual "chat"? In my view, visual chat consists of some basic functions. These functions are activated according to the task, respond according to demand, and operate in a loop; most of the time they are idle. We look at many objects every day and turn a blind eye, but once something threatens survival or is driven by a need, we respond and deal with it. On the other hand, if we blindly apply the methods of language processing to computer vision problems, will that work? So I think visual "chat" may be a componentized demand-response model: an action taken by an agent because of a certain need, so each basic function is relatively simple. In addition, what we want to explain today is really basic chat. One point is the change ChatGPT has brought us: is the chat here ChatGPT? There are roughly several kinds of questions. One is that it lets us move from seeing to thinking; that is, with ChatGPT, much research can now be connected to ChatGPT. This morning Lin Liang talked about something important: how to form a loop. With chat, the demand-response model itself is a kind of loop, and ChatGPT is also a loop. Another question is what drives the visual functions. In the late 1980s and early 1990s, researchers proposed a similar concept, purposive vision. Another use of chat is to help us produce generalized data and annotations; with chat, we can do this very well.

The eighth issue concerns multimodality. Collaboration between modalities is important, but are there boundaries between modalities, and where are they? For example, purely from a visual perspective, visual tasks are sometimes expanded without limit. From the perspective of the brain or of intelligence, such expansion is correct, but from a visual perspective, where is the boundary? Where is the boundary with other modalities, and where is the synergy? I therefore think AI will have an architecture that serves as the interface of AI's basic capabilities, or the structural support of AGI. In the past, most AI researchers confined themselves to one specific intelligent capability; the intelligent agent we need to develop today is a composite, so there will be new problems and challenges. As for large models and ChatGPT, I think they are the light of AGI, but one thing should be noted: what is the biggest change brought about by large models? We used to compete over who could make an algorithm more sophisticated. One trend today is the shift from constructing increasingly complex algorithms to constructing algorithms that can accommodate increasingly complex capabilities; the general model itself will tend to be simplified, and only then can capabilities keep being amplified. The advent of ChatGPT was undoubtedly a huge shock to researchers. Some previous problems may no longer need to be studied, like land submerged by a rising sea level; or they may only appear to be solved, like a tsunami hitting, where the problems remain after the tide recedes. For us, is ChatGPT bringing a tsunami or a rise in sea level? This is something we need to think about. Thank you all.

2. Shen Chunhua (Zhejiang University)

Thank you, Jingdong, for the invitation. I am very happy to share with you some thoughts on the impact of large models and ChatGPT on vision. To be honest, I feel a bit pushed onto the stage here: I don't work on ChatGPT, and I have basically never trained a large model, although I really want to. Let me briefly share some of my understanding. I think that from the beginning of this year until now, ChatGPT has had a huge impact on vision, NLP, and speech recognition alike.

One impact is that people have begun to realize that training large models with big data is a feasible path toward artificial general intelligence (AGI). We have already seen some examples in visual tasks, and Teacher Chen also mentioned developments in this direction. If we can train large models on big data and achieve the best results on tasks such as image understanding, then we are no longer limited to designing ever more new algorithms, which may be a dead end. On the contrary, leveraging big data and simple algorithms to achieve the best performance may be the better path.

As we often say in the vision community, the development of vision in the past two or three decades has basically been driven by datasets, from the earliest MNIST to Pascal VOC to ImageNet, which dominated vision for many years. In the 2012 AlexNet work, Hinton and his students Alex Krizhevsky and Ilya Sutskever first trained the neural network on Pascal VOC; it did not get good results because the dataset was too small. They then turned to ImageNet, which led to the AlexNet paper. So fundamentally it was the dataset that brought the improvement.

Training large-scale models indeed requires sufficient computing power to process large-scale datasets. According to statistics on one slide, in the six years from 2012 to 2018 the compute used in training increased by about 300,000 times. If computing power continues to improve, we can foresee even larger datasets and larger models being used for training. Another slide shows that, over the same period, the computational cost required to train a model to a given level of performance has dropped significantly; compared with AlphaGo, many current models no longer need anything like the amount of compute that was previously required.

Now, looking back at ChatGPT: the earliest GPT model used the Transformer architecture for self-supervised learning, and its training objective was to predict the next word. In the early days, GPT training was limited by computing power, so it was difficult to keep scaling up; compared with LSTM, however, the Transformer architecture is far easier to parallelize. ChatGPT's success is partly due to using next-word prediction as the key pre-training objective.
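To make the next-word-prediction objective concrete, here is a minimal, self-contained sketch in PyTorch with toy sizes and random tokens; it only illustrates the causal-masking and cross-entropy setup, not GPT's actual code:

```python
import torch
import torch.nn as nn

# Toy sizes and random tokens; the point is the objective, not GPT's real architecture.
vocab_size, d_model, seq_len = 1000, 128, 32

embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)      # stands in for the Transformer stack
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (8, seq_len))        # a batch of token ids
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len - 1)

hidden = backbone(embed(tokens[:, :-1]), mask=causal_mask)  # causal self-attention over the prefix
logits = head(hidden)                                       # score every vocabulary entry
loss = nn.functional.cross_entropy(                         # "predict the next word" objective
    logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
loss.backward()
```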

Teacher Chen mentioned earlier that the biggest difference between text and the data obtained from images and other sensors is that text is, in a sense, already annotated, because it is a record produced by humans. Images and other sensor data carry no such built-in annotation, so directly transplanting the ChatGPT recipe to the image domain is not feasible: image data comes from sensors without inherent semantic labels, so we cannot simply copy ChatGPT.

One way to view this training is as knowledge distillation, in which the teacher represents human knowledge. Ideally we would fit the teacher's full probability distribution, but since we lack exact probability information, we approximate it with a delta function on the observed word, which turns the objective into predicting the next word.
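In symbols, this view can be sketched as follows (my own notation for the argument above, where q_theta is the model and w_t the observed next word):

```latex
\[
  \mathcal{L}
  \;=\; -\sum_{w} p_{\text{teacher}}(w \mid w_{<t}) \,\log q_\theta(w \mid w_{<t})
  \;\;\xrightarrow{\;p_{\text{teacher}}(\cdot \mid w_{<t}) \,\approx\, \delta_{w_t}\;}\;\;
  -\log q_\theta(w_t \mid w_{<t}).
\]
```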

The vision field, unlike NLP, has to solve a wide variety of tasks. Unlike GPT's broad applicability in NLP, there is no single unified model in vision that is capable of all tasks. The first type is a model like DINOv2, a self-supervised learning model that performs well on many downstream vision tasks: even if you simply freeze the DINOv2 model, use it directly as a feature extractor, and then build an encoder and decoder on top, it performs well on many downstream tasks. The second type, a model like SAM, can be called a task-specific foundation model, tied to a particular task. In vision, tasks like image calibration may currently be less studied, but more important are pixel-level predictions such as segmentation, as well as other tasks such as 3D reconstruction.
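A minimal sketch of the "frozen backbone plus light task head" recipe mentioned above, assuming the publicly released DINOv2 ViT-S/14 weights; the 1x1-conv head and the label count are hypothetical placeholders:

```python
import torch
import torch.nn as nn

# Freeze the self-supervised backbone and train only a small task head.
# torch.hub downloads the public DINOv2 ViT-S/14 weights (network access required).
backbone = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
for p in backbone.parameters():
    p.requires_grad_(False)                    # use the backbone purely as a feature extractor

num_classes = 21                               # hypothetical label set for a dense prediction task
head = nn.Conv2d(384, num_classes, kernel_size=1)   # 384 = ViT-S/14 feature dimension

images = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    feats = backbone.forward_features(images)["x_norm_patchtokens"]  # (B, 256, 384) patch tokens
B, N, C = feats.shape
h = w = int(N ** 0.5)                          # 224 / 14 = 16 patches per side
feat_map = feats.permute(0, 2, 1).reshape(B, C, h, w)
logits = head(feat_map)                        # coarse per-patch predictions; upsample and train only the head
```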

The third type of model sits somewhere in between: weakly supervised learning models. This refers to models such as CLIP and captioning models, which train with image-level supervision signals and complete the task through image descriptions. Although only image-level supervision is used, they actually learn pixel-level information; generative models in particular learn the pixel-level distribution. I won't go into the details of DINOv2, but what is clear is that it performs very well in experiments.

SAM is Facebook's research result from this year; they used about 10 million images for training. In 2021, during the year I worked at Amazon, I tried something similar: we collected about 2 million images and trained a SegFormer model. Unfortunately, we submitted that work to CVPR for two years in a row and it was rejected; I remember the scores were around 4, 5, 5. Even so, the results of that work were quite good, although it was not on the scale of SAM's roughly 10 million images, and the work at that time only involved semantic segmentation and did not do instance-level segmentation like SAM.

In academia, I would like to emphasize that simply training a large model does not guarantee the paper will be easy to publish. Although our results were not bad, it took us more than half a year and a lot of GPU resources to finish the work and put it on arXiv, and even with good results, getting the paper published can be difficult. Beyond that, there is another project I would like to mention: our research on 3D reconstruction, for which we collected approximately 10 million images. This project was carried out by a former student of mine, who is now at DJI, which has the GPU resources to conduct this kind of research. Although the project is technically demanding, it was rejected when submitted to CVPR in 2023; after improvements, it was accepted at ICCV. In any case, what I want to say is that the method itself is not particularly innovative; it mainly uses big data to train a larger model, but judging from the final results, the effect is very good. Moreover, the model we trained is not large: we only trained a ConvNeXt model and did not use a Transformer. This again underlines the importance of big data: even a relatively small model can perform very well with the help of big data. Finally, the third category of methods can be seen as sitting between unsupervised learning and pixel-level models like SAM; it uses image-level supervision signals to learn pixel-level information.

CLIP is a very good example and is also a research result from OpenAI. Like OpenAI's approach with ChatGPT, CLIP is also based on big data and puts the entire architecture into a Transformer-style framework. It is similar in spirit to GPT in NLP, using a comparably simple large-scale learning formulation. The formulation is therefore very simple, and the same goes for generative models such as Stable Diffusion; the whole approach is very straightforward.
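For reference, the CLIP-style image-text contrastive objective can be sketched roughly as below, paraphrasing the pseudocode in the CLIP paper; the random embeddings stand in for the two encoders' outputs:

```python
import torch
import torch.nn.functional as F

# Random embeddings stand in for the image and text encoders' outputs.
batch = 8
img_emb = F.normalize(torch.randn(batch, 512), dim=-1)   # image encoder output
txt_emb = F.normalize(torch.randn(batch, 512), dim=-1)   # text encoder output
temperature = 0.07

logits = img_emb @ txt_emb.t() / temperature              # pairwise image-text similarities
targets = torch.arange(batch)                             # matched pairs lie on the diagonal
loss = (F.cross_entropy(logits, targets) +                # image -> text direction
        F.cross_entropy(logits.t(), targets)) / 2         # text -> image direction
```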

Personally, I prefer the third approach: using image-level annotations to train the model in the hope of obtaining pixel-level information, because in theory this approach can indeed learn pixel-level information.

In 2021 we worked on dense self-supervised learning, which is essentially unsupervised learning aimed at obtaining good pixel-wise correspondence, because many vision tasks ultimately come down to finding correspondences. This year we did a project that uses a diffusion model to generate a large number of images and then uses algorithms to obtain their pixel-level annotations. In theory, once a large amount of such training data is obtained, it can be used to train downstream vision tasks. A key observation here is that diffusion models and other generative models, because they model the pixel-level distribution, can be made to yield pixel-level annotations.

Once a model can generate the pixels, you can in principle obtain better annotations along with them. I personally expect that large-scale self-supervised learning methods like DINOv2, which can train large models without labeled data, will prompt more research, because DINOv2 itself is very useful. Whether something like a "visual GPT" will materialize is still unclear; it may be possible, but it may not be necessary. Task-specific foundation models like SAM will probably see more follow-up work, including models for 3D reconstruction and monocular depth estimation, which may also be publicly released. Of course, I am also very interested in weakly supervised learning: training a very large model with image-level labels is challenging because the training data itself lacks pixel-level information, yet through training we hope to obtain richer pixel-level information.

3. Dai Jifeng (Tsinghua University)

Today I am very happy to give my report, titled "From Perception Foundation Models to Agent Foundation Models". Everyone has been talking about large models recently, but I think the concept should really be called foundation models. Wang Naiyan also raised this point in a recent post, and I attended a panel with him where we kept coming back to this topic. Large models are often just one embodiment; the more essential concept is the foundation model. A foundation model is one that can learn a great deal of useful knowledge from massive Internet data, and at the same time has strong versatility and generalization: it can be applied to many downstream tasks and can obtain reasonable results on unseen data or tasks.

This is a very important characteristic of foundation models, though I won't go into the details just yet. I think the definition given in the 2021 report "On the Opportunities and Risks of Foundation Models" is still quite forward-looking. For example, the earlier BERT performed very well on predefined NLP tasks, and in like-for-like comparisons the later GPT series may not beat BERT on those predefined tasks. But the power of the GPT series lies in its versatility: you can ask it anything. That is also an important reason why it has received such widespread attention.

In the field of natural language, this goal has been reached first, and it has brought enormous productivity changes. Training a large model is very expensive and requires thousands of GPUs; OpenAI, for example, spends a great deal of resources. Last year, dozens of elite researchers spent several months managing data annotation, which meant paying high salaries to researchers from prestigious universities. This is a very high cost. The benefit of such an investment, however, is that the marginal cost of applying the model is very low: when you ask ChatGPT a question, the backend no longer needs to spend heavily on labeling new data or training a new model; it simply runs inference on the server.

Next, we need to think about, if a model like ChatGPT has the potential to lead to AGI, how can we expand this capability and apply it to more industries? We can compare ChatGPT to a smart person locked in a dark room. Every time you hand him a note, he will reply with several notes. We hope to add eyes to him to see the world, and arms and legs to let him interact with the environment.

I can partially answer the questions Teacher Xilin raised earlier, but time is limited so I won't elaborate for now. We want to build embodied intelligent robots that can operate in the real world. Consider the analogy between vision and language: ChatGPT has closed the loop over essentially all human text; you can train it to answer any question and get a sensible answer, for example "What is a token?". In vision, however, the existing tasks are specific cases defined by humans, while the ultimate goal of vision is to enable machines to survive in the real world.

Teacher Chen just summarized this very well, so I won't say more. I think much of the supervision signal should come from interaction with the real world, but to run experiments efficiently we can carry out these interactions in a virtual environment. Currently, computer vision still relies on task-specific backpropagation, which is why the previous generation of AI companies kept losing money: their marginal costs are very high.

Previous vision models, no matter how many backbone networks were designed, could not in my view form a true foundation model, because applying them to any new downstream task requires task-specific fine-tuning and decoders. As a result, the number of model parameters grows with the number of tasks, and parameters cannot be directly shared across different tasks.

A foundation model for visual understanding has several requirements. One is a general decoder, so that it can cover a variety of vision tasks. You can train it on a set of predefined or combined vision tasks, but handling genuinely new tasks may still require some manual work. These methods predate ChatGPT, and we have made some attempts of our own; there are also other works along this line, such as the Perceiver and Uni-Perceiver series, which are all good attempts.

Another technical route is to transplant the prompt-engineering approach from natural language processing directly into vision. In NLP, prompting is a great approach because it removes the need for fine-tuning, and the idea is to replicate this directly in the visual domain. For example, after training a model, you give it a hint, such as an original natural image together with the segmentation result of a sheep, and then give it a new picture of a person; from that hint it should segment the person. This approach has appeared in several works, such as the 2022 work on visual prompting via image inpainting and Dr. Wang Xinlong's "Images Speak in Images". However, under current technical conditions this route is still at a fairly early stage, because our ability to model vision in this way is not yet mature. A more practical technical route at the moment is to couple large language models tightly with visual models. We can now use natural language directly to define vision tasks and to output the results we want. Specifically, we can teach a powerful visual backbone model to understand our task definitions and describe a task's inputs and outputs in natural language. We have done a lot of work in this direction; the method is called "VisionLLM", and we think the results are quite good.

In this way we can handle various visual tasks on top of a powerful visual backbone model: where we used to rely on dedicated object detectors or segmentation networks, we now define the task through natural language and output the desired results in natural language. This approach has already been used in several works.

You can use natural language to tell the agent what you want it to do. For example, you can ask it to detect what a child is eating and specify which coordinate system to use for the image, and it will output the result; it can also handle some complex reasoning problems. This has been demonstrated in experiments and is very interesting.
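To make this interface concrete, here is a purely hypothetical illustration of what such a language-defined detection query and its serialized answer might look like; the prompt wording, coordinates, and output format are invented for illustration and are not VisionLLM's exact specification:

```python
# All names, coordinates, and formats below are invented for illustration.
prompt = (
    "The image is 640x480 pixels with the origin at the top-left corner. "
    "Detect what the child is eating and answer with the object name followed by "
    "its bounding box as <x1, y1, x2, y2> in pixel coordinates."
)
# A model exposing this kind of interface would reply with ordinary text tokens, e.g.:
example_answer = "apple <212, 145, 288, 220>"
# Boxes, class names, and even reasoning steps are all serialized as language,
# so a single language head can cover detection, grounding, and VQA-style queries.
```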

Along this technical line, some people also try to use both verbal and visual prompts to define tasks, to make the task definition more precise. Conducting experiments in virtual environments can also greatly improve experimental efficiency.

Previous agents were based on reinforcement learning, such as AlphaGo and agents that play StarCraft or Dota. But when they face the open world, they run into huge generalization challenges. For example, take a simple game like Tetris: if after training you shift the whole screen up by a few pixels, the trained model has no idea how to play. This shows that reinforcement learning struggles with such tasks.

We try to use large language models to solve this problem. We chose "Minecraft", the world's best-selling and most open game, because in it players can create worlds, build cities, and even build CPUs and memory. Our basic idea is to place in this virtual environment a large language model, which has certain AGI-like capabilities because it has seen the world's text corpus, and then give it prompts and instructions to see whether it can perform well there.

In this direction, there are two related works. One is "Voyager", proposed by NVIDIA, which outputs instructions as code to control the agent; compared with previous agents based purely on reinforcement learning, it made huge progress. The other is our work "Ghost in the Minecraft", which outputs somewhat higher-level text commands. This requires more engineering but yields better performance: in our tests the method achieved a 3.7x performance improvement, whereas previous reinforcement-learning approaches required enormous computing resources and had limited capability.
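The overall control flow shared by these LLM-driven agents can be sketched as a simple loop like the one below; this is a schematic with hypothetical `llm` and `env` interfaces, not the actual APIs used by Voyager or Ghost in the Minecraft:

```python
# `llm` is any callable mapping a prompt string to a reply string; `env` is a
# hypothetical game wrapper with reset()/step(); neither matches the papers' real APIs.
def run_agent(llm, env, goal: str, max_steps: int = 50):
    memory = []                                  # past (observation, action, feedback) triples
    obs = env.reset()
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            f"Recent history: {memory[-5:]}\n"
            f"Current observation: {obs}\n"
            "Reply with the next command (text or code) for the agent."
        )
        action = llm(prompt)                     # the LLM plans the next sub-step
        obs, feedback, done = env.step(action)   # execute it in the game world
        memory.append((obs, action, feedback))   # feedback from the environment closes the loop
        if done:
            break
    return memory
```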

The last thing to mention is general-purpose embodied robots, for which Google's "PaLM-E" is perhaps the representative work. It converts images, actions, poses, and other content into tokens and fine-tunes a pre-trained large language model to control robots. This approach produced very interesting demos: you talk to the robot in natural language, for example asking it to place blocks of different colors in different locations, and after training the robot can perform such tasks and generalize. Once it has learned some tasks, you can ask it to do new ones and it will often perform quite well. That concludes my report today; thank you all.

4. Xie Lingxi (Huawei)

Hello everyone, my name is Xie Lingxi. I am very happy to share our recent experience with you here. I previously gave a report at VALSE, titled "Toward General Artificial Intelligence in Computer Vision." While the title hasn't changed, it fits today's topic very well, so I'll share it again.

My report today is divided into several parts. First, I will briefly review the progress NLP has made toward AGI. Next, I will look into why the CV field still seems far from AGI, what the main difficulties are, what the current attempted solutions are, and what essential difficulties lie hidden behind them. Finally, I will briefly share my personal thoughts on how these difficulties might be resolved.

Before discussing AGI, we need to clarify what it means. AGI is the highest ideal of artificial intelligence and can be defined as a system that can reach or exceed the capabilities of humans and animals. From a more formal perspective, we can adopt the definition from the 2007 book Artificial General Intelligence, which defines AGI as a system or algorithm that interacts with an environment and maximizes reward by taking actions. Under this definition, AGI is very general, because it does not presuppose what the specific environment is, what the environment contains, or what tasks need to be completed.

In recent years, with the development of deep learning, AGI has made great progress. Deep learning is a general methodology that can use a unified method to model the relationship between input and output as long as input and output are given. Therefore, driven by deep learning, we can use a unified approach to solve a series of problems such as CV, NLP and reinforcement learning. However, there is still a considerable distance to go before true AGI.

In the field of NLP, the emergence of ChatGPT and GPT-4 allows us to see the dawn of AGI. Some foreign researchers call it the "spark" of AGI because the GPT series can complete various general tasks and can even serve as a logical connector to connect different modules together.

To give an example from my own use: I used ChatGPT to write a reinforcement learning program. Although I have never studied reinforcement learning, it helped me write the program and answered my questions, such as why a module did not run correctly. After a few rounds of interaction, I got the program running and obtained the desired results. This shows that ChatGPT and GPT-4 have become general problem solvers, not just the kind of toy examples we often see in computer vision.

Regarding the training process of GPT, it is divided into two stages. First, unsupervised learning is performed on a large-scale unlabeled corpus to capture the data distribution. The model is then instruction-fine-tuned on labeled corpora to accomplish specific tasks. This two-step process is familiar to everyone, so I won't explain it in detail.

In the CV field, although NLP has seen the dawn of AGI, CV still faces challenges. CV problems are more complex and their solutions more diverse: each downstream task, such as detection, tracking, and segmentation, may require different methods and fine-tuning, which is far from the goal of AGI. The CV field therefore needs to move towards unification. At present, the community has proposed roughly five kinds of trends aimed at unifying CV. The first is to achieve formal unification through open-domain visual recognition, introducing other modalities to unify visual recognition. The second is exemplified by SAM, which serves as a base on which various downstream tasks are relatively unified. The third is unified visual coding, which encodes different visual tasks into similar or identical forms so that one model can handle many visual tasks. The fourth is unified logic, where a large language model provides logical support, works out the logic of a complex visual task, and calls basic modules to solve it. The fifth is a unified interaction mode, which makes CV systems more flexible and intelligent through a unified way of interacting. Although unification in CV faces challenges, these works and continued efforts are expected to push the field in a more unified and general direction.

From the perspective of tasks such as recognition and dialogue, computer vision already performs well at identifying fine-grained information and extracting content from images. But going back to the definition of AGI, achieving general intelligence by interacting with an environment and maximizing reward, we currently cannot do this. Why not? This touches on the essential difficulties computer vision faces. Although CV is developing rapidly, it still faces major bottlenecks, and we need to analyze why these methods cannot be fully general and what the essential difficulties are.

My conclusion from this analysis is that the biggest inspiration GPT offers us lies not in its pre-training and fine-tuning methods but in the dialogue task itself. The nature of the dialogue task is what pushes us to think about how to build more general methods in computer vision, so it is from the dialogue task that we should draw inspiration for resolving CV's essential difficulties.

The dialogue task is, in a sense, a complete task in NLP, because very complex behaviors can be achieved through dialogue. If we lived in a text-only world, conversation would be a complete task: through it you could learn anything and accomplish anything you wanted. For CV, we need to ask what kind of task could achieve a similar completeness.

Looking back at the definition of AGI, it requires interacting with an environment and maximizing reward in order to achieve general intelligence, and CV does not currently do this. Decades ago, David Marr and others proposed that CV must build a model of the world and interact with it in order to achieve broad recognition and versatility. So far, however, this idea of building a general environment has not been fully realized, mainly because building a general environment in CV is very difficult.

In CV, there are two possibilities for building environments: real environments and virtual environments. A real environment means placing an agent in the real world to interact with its surroundings, but the demands on scale, cost, and safety are very high. The alternative is to build a virtual environment, but its realism may fall short, especially in whether the behavior of other agents and objects follows real patterns. There is currently no virtual environment good enough to solve this problem. CV therefore faces real difficulty in building a general environment and achieving generality.

Currently we cannot fully simulate or reproduce a real, general environment, so in computer vision we can only obtain data by sampling the world: we use existing photos and videos as discrete sample points and train models on them. But because these sample points are discrete, we cannot truly interact with the environment, which makes it difficult to build a general environment.

To get around this, we imagine the capabilities an agent would need when truly interacting with the environment, and use this imagination to define proxy tasks such as detection and tracking. For the past two or three decades, or even longer, CV researchers have been building proxy tasks in the belief that better performance on them brings us closer to AGI. The problem now is that almost all proxy tasks are saturated, and we can no longer extract much benefit from them.

Before deep learning, pursuing proxy tasks did move us in the direction of AGI, but now we seem to have hit a bottleneck: continuing to push on proxy tasks may actually be taking us away from the goal of AGI. One of the biggest problems we face is that the research paradigm we have been familiar with for the past few decades may no longer apply, or may be hard to develop further. This is a very important issue.

What I propose is to build a sufficiently real and complex environment and let the agent learn in it. In this pipeline, we first need an environment with sufficient realism, covering not only visual realism but also the realistic behavior of other agents and objects. In this environment, pre-training can be carried out: just as GPT predicts the next word, the agent can predict what it will see in the next frame, which constitutes genuine pre-training. After pre-training, specific fine-tuning can make the agent accomplish the tasks we want, such as recognition or perception.
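A toy sketch of the "predict the next frame" pre-training idea described above, by analogy with next-word prediction; the tiny convolutional predictor and the random video tensor are placeholders, not a proposal for a real architecture:

```python
import torch
import torch.nn as nn

# A tiny convolutional predictor and a random "video" tensor stand in for a real model and environment.
predictor = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
video = torch.rand(4, 8, 3, 64, 64)                 # (batch, time, C, H, W) streamed from the environment
frames_in = video[:, :-1].reshape(-1, 3, 64, 64)    # frame t is the input
frames_out = video[:, 1:].reshape(-1, 3, 64, 64)    # frame t+1 is the "label", no human annotation needed
loss = nn.functional.mse_loss(predictor(frames_in), frames_out)
loss.backward()
```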

However, the CV community as a whole may currently be putting the cart before the horse: ignoring the importance of building realistic environments while focusing too much on specific proxy tasks. This limitation hinders progress toward general artificial intelligence. To realize the pipeline above, we need to focus on the related research, including environment construction and pre-training. To be clear, the pre-training methodology itself is not the problem; rather, current pre-training methods are constrained by the lack of a general environment.

Therefore, to achieve general artificial intelligence, we need to focus on building more realistic and complex environments and on exploring more general pre-training methods, so that agents can learn in such environments and acquire stronger perception and understanding. Such research will bring important breakthroughs toward AGI.

In the future, once a good environment exists, all pre-training methods will be refreshed, including the algorithms that deep reinforcement learning has produced for games over the past decade. We can also learn from embodied vision, which defines many tasks such as interaction and question answering; but the essential problem is that current embodied-vision settings are too simple to give us better performance in real environments. Recently, works like PaLM-E have brought more inspiration. Still, we can only do our best to approximate such an environment, and I think building it is a very big problem, perhaps as difficult as computer vision itself.

To sum up, the paradigm we have been familiar with for the past two or three decades may be unsustainable today. If so, computer vision may face greater difficulties than we imagine. The field has reached a critical point; perhaps it is not my place to say so, since I am still relatively junior, but it is something we all need to think hard about.

Now I want to share my personal thoughts. I think the emergence of large language models, including GPT, has posed serious challenges to computer vision, and we need to find ways to catch up with this trend. On the one hand, we need to explore new methodologies that no longer rely solely on datasets; this may take a while, but I believe it is the trend. On the other hand, as researchers we need the courage to embrace new methodologies rather than cling to old ones; otherwise we may be left behind by this era. Thank you all for listening.

The full version of this talk is available at https://github.com/198808xc/Vision-AGI-Survey/. If you are interested, please download it and take a look. Thank you all.

Discussion stage

Q1: What is a large model?

Lu Huchuan (Dalian University of Technology)

After listening to Teacher Chen's questions, I would like to share my thoughts. I think this issue should be viewed from the perspective of AGI. AGI requires a model with general capabilities, able to handle multimodal inputs and outputs and to solve a variety of tasks, just as a human can. This means changing the old research paradigm of one model for one task into one model handling diverse data and solving multiple tasks. To achieve this kind of general capability, we need large models and big data; only then can abilities emerge for solving various downstream tasks and problems in unknown domains. In this sense, a large model and a merely "big" model are completely different things. Of course, the amount of data also affects model performance.

One benefit of a large model is its powerful general feature representation, because it absorbs large amounts of data. In addition, large models can complete multiple tasks, as Jifeng and Lingxi also described. The past solution was to use multiple adapters or separate Transformers to handle each task, but that becomes cumbersome; we need a general solution, and related work is under way. We should therefore think about how the research paradigm changes around the new foundation models. In the past we assumed the computing-power requirements were prohibitive, but as computing power grows, the threshold for working with foundation models is no longer so high, which means the research direction has changed. As for generalist versus specialist abilities, I think they develop interactively: general abilities are very important for large models, but professional knowledge and skills remain necessary, just as every person has general abilities as well as their own expertise. In summary, large models and big data are the keys to realizing general artificial intelligence; their advantages are general feature representation and multi-task capability, and the change of research paradigm and the interactive development of generalist and specialist abilities are the issues we now need to consider.

Wang Xinggang (Huazhong University of Science and Technology)

The generality and representation ability of large models in computer vision is an important issue. Universal visual representations are the goal we are pursuing, but unifying the whole of vision in the short term may be difficult and will likely take longer. Today's large models, such as CLIP, have strong generality and representation ability for certain tasks, but may not suit other tasks. From a representation perspective, general visual representations are very important for solving a variety of tasks: although there are many different tasks, we hope to apply one unified feature extraction method to all of them without designing a specific feature extractor for each. This saves time and resources and improves the model's efficiency and generalization. In this respect, large models such as CLIP keep evolving; new versions keep appearing, and through different training mechanisms and more training data their capability across tasks keeps increasing.

Therefore, the number of parameters is not the only criterion; we should also consider the training mechanism, the amount of data, and performance across tasks. On the other hand, some companies and projects abuse the notion of "large model" and use it as a banner to achieve their own goals. From a scientific perspective, this misuse is wrong: large models should be applied in appropriate scenarios rather than blindly pursuing ever more parameters. Researchers should advance the work responsibly and use appropriate models according to the actual situation. Overall, large models play an important role in computer vision, but they must be used under scientific guidance to achieve better research and application results.

Shan Shiguang (Institute of Computing Technology, Chinese Academy of Sciences)

I prefer to call it a "foundation model" rather than distinguishing models simply by size. I think the foundation model could be very large, perhaps even larger than language models; we may not yet have found the right way to build such a very large-scale foundation model, but eventually it will get bigger. As for its role, the main goal of a foundation model is to capture knowledge that cannot be expressed in language or symbols. By analogy with human visual development: before babies learn language, they have already received a huge amount of information through vision and other senses and formed common-sense knowledge about the physical world. This common sense may not be fully expressible in language, yet it exists in our visual cortex and related brain areas. The foundation model should therefore be able to represent this kind of knowledge that can be grasped but not verbalized; it is a distributed knowledge representation completely different from the human symbolic knowledge system.

The greatest success of ChatGPT is that it marks important progress on the hardest knowledge representation problem in artificial intelligence, especially the representation of common-sense knowledge. Traditional symbolic methods, such as knowledge bases and knowledge graphs, have serious limitations in representing common sense. Distributed representations based on large models, such as ChatGPT, successfully break through these limitations and provide a new way of representing and using knowledge.

I believe foundation models are essential for expressing tacit common-sense knowledge. In addition, vision-language bimodal models such as CLIP, by aligning the semantics of vision and language at the representation level, provide the semantic knowledge needed to solve "understanding" tasks for a large number of visual understanding problems.

Wang Hanzi (Xiamen University)

As model scale keeps increasing, from a handful of parameters in the early days to hundreds of billions now, we face a series of new challenges. One is the validation and interpretation of large-model results. In the past we could easily verify and interpret simple models, such as a fitted straight line. Now the number of parameters in a huge model like ChatGPT has gone far beyond direct human comprehension, which also bears on the trustworthy AI discussed this morning: it is difficult to explain its internal workings and its outputs with traditional methods.

This raises the question of how to verify whether the results of large models are correct and credible. Because large models are like black boxes, we can hardly explain their inner workings directly, and it is therefore hard to determine the accuracy of their results. We need to explore new methods to validate large-model outputs and evaluate their accuracy. In addition, one challenge university researchers face is insufficient computing power. Unlike companies with strong compute, doing research on large models without sufficient computing power can be genuinely difficult. With limited compute, how to study large models in depth, and even creatively develop new methods rather than merely tuning parameters, is a question worth exploring.

Overall, the development of large models brings many new challenges, including verifying the credibility and interpretability of results, as well as the difficulty of conducting research with limited computing power. These problems require us to constantly seek new solutions and innovations in the field of artificial intelligence.

Wei Yunchao (Beijing Jiaotong University)

I am very happy to participate in this seminar. I personally think that the concept of "large model" is actually relative and depends on the limitations of our current hardware resources and computing technology. For example, a small model today may have been considered a large model 10 or 20 years ago. In the future, with the advancement of hardware equipment or quantum computing and other technologies, all current models may become small models.

Therefore, both "big" and "small" are relative to their time. In addition, we are currently pursuing a universal vision model, but vision tasks are inherently complex and the application scenarios are very diverse, including industrial vision, medical vision, and so on. If a model performs extremely well in a specific field and solves its problems excellently, then for that particular visual domain it can be regarded, under our current understanding, as a large model or foundation model.

Q2: What help does the large model bring?

Lin Liang (Sun Yat-sen University)

Large language models, especially models like GPT, have brought a lot of help to the vision field. They solve common-sense understanding and reasoning challenges often faced in the visual field. In the past, the visual field often faced the problem of integrating domain knowledge into visual models. Now, models like CLIP can truly integrate the understanding and common sense of the world into a large model, making it more versatile. Therefore, we should take full advantage of the capabilities these large language models bring to the vision domain for inference and cross-domain tasks.

In addition, the development of large multimodal models has also advanced the field of AI-generated content. They have given us much inspiration, prompting us to consider whether vision models can be designed as a general architecture that evolves gradually from low-level generality to upper-level task orientation. This line of thinking can drive research and development in vision and lead to better application to practical problems. Large language models bring many possibilities to computer vision and point to directions for the field's development; we should take full advantage of these opportunities to further advance research in computer vision and AI-generated content.

Huang Gao (Tsinghua University)

I will share two perspectives. The first concerns the qualitative change brought by large models. In NLP, the emergence of large models has produced qualitative leaps in capability, showing us greater potential and room for imagination. In the past, vision datasets would saturate after being pushed to a certain point, with no significant further gains. Now large models offer new inspiration for vision as well: when the number of parameters grows to a certain level, there may be a qualitative jump in capability. This makes it worthwhile to evaluate and probe large models in search of such critical points in performance.

The second perspective concerns the design of general models. Models like CLIP show us few-shot learning in open settings. But real-world visual challenges go beyond this and involve perceiving a three-dimensional, dynamic world. We have made some progress, but we are still at an early stage; truly integrating the perception of the three-dimensional, dynamic world into vision models is an important future goal. Just as vision brought great leaps in biological evolution, I believe that in future research the understanding of the three-dimensional, dynamic world will likewise bring important breakthroughs in computer vision. These perspectives offer useful inspiration for the field's development and research directions.

Wang Tao (Aerospace Grand Plan)

Today's discussions have been very valuable. I would like to raise a question about what ChatGPT's parameters actually store. Everyone says it has 175 billion parameters, but what is stored in them? On the one hand, I think they encode human language and reasoning abilities, such as symbolic reasoning and other forms of understanding; on the other hand, they also contain a great deal of human knowledge, which is why ChatGPT can answer all kinds of questions. ChatGPT has given us much to think about. We discussed masked learning before, but ChatGPT can answer questions, summarize, and translate between languages, which is remarkable; it has learned both Chinese and English and has cross-lingual ability. We need to think further about how it connects these capabilities. The basic modules Teacher Chen mentioned seem to be one key.

Another question is about video "chat": how should we design it? Everyone has mentioned computing power and data. How should we train, and how should we label data? How do we bridge semantics and images? Could we give the model a video clip with some content in the middle masked out, and would that be feasible? There are also innate basic modules in human vision, such as recognizing colors, points, lines, surfaces, and faces, but learning after birth is also very important. Can we combine these basic modules with ChatGPT-style modules to form a system that couples language reasoning with basic visual capabilities and obtains better results? This deserves in-depth discussion.

Hu Shimin (Tsinghua University)

What I want to talk about is this: what should we make of things after the launch of ChatGPT? First, large models are now in a state of multi-party competition, roughly divided into three camps. One is OpenAI together with Claude (from Anthropic, which grew out of OpenAI); they lead in technology and iterate very quickly. Another is the open-source camp represented by Meta's LLaMA; many groups fine-tune and improve on top of open-source large models. The third is the domestic camp: some Chinese universities and companies are doing their own research, such as Tang Jie at Tsinghua University and Qiu Xipeng at Fudan University, as well as companies such as Baidu and iFlytek. Large natural-language models now mainly use the Transformer architecture, but problems remain, because human knowledge representation and reasoning are not well integrated and gaps exist. As for large visual models, we have discussed many, such as Segment Anything and CLIP, but compared with a phenomenal application like ChatGPT there is still a big gap. ChatGPT is influential because it realizes a very good conversational application and has attracted enormous attention from the whole of society.

But for large visual models, there are more issues to explore, such as what kind of architecture to use, and whether an application scenario like ChatGPT's is needed to bring large visual models to fruition. I personally feel that we could try to create a scenario in which humans interact with robots, letting robots understand humans and integrating common computer vision functions such as detection and recognition; only such a scenario can drive the development of large visual models. To get there we face two challenges. The first is the need for a unified backbone network for visual content, analogous to the Transformer in natural language processing. The second is computing power: because the computing power required by large visual models far exceeds that of large natural language models, we need to develop computing-power networks and actively adopt domestic chips through the co-optimization of software and hardware.

Xu Kai (National University of Defense Technology)

What Teacher Hu just said is very enlightening. A recent OpenAI blog discussed an implementation path toward general AI that I very much agree with. It proposes dividing an intelligent system into three layers. The top layer is a large language model, used to interact with users, accept instructions, understand intent, and decompose a complex user-specified task into a series of relatively simple sub-tasks, similar to task decomposition based on Chain-of-Thought. Each sub-task may correspond to a piece of code, a tool, a piece of software, or a service, such as a calculator, a browser, or various apps, which carry out the specific work; this layer is called tool use. Between the large-language-model layer and the tool-use layer there is also a memory layer.

Why is there such a memory layer? We know that large models have a certain memory ability (note: the memory here refers to memory of specific users or situations, rather than the general knowledge memorized by large models during pre-training). Multi-turn dialogue itself reflects a degree of memory, but it is short-term memory. If an intelligent agent is to complete more complex tasks, for example staying with a user for a long time like a home robot, it needs to remember the user's habits, preferences, and temperament over the long term. Only on the basis of such a long-term memory mechanism can a customized, gradually "tamed" intelligent agent emerge and improve the user experience. The article proposes that long-term memory can be implemented on top of a vector database, because vector databases support efficient queries.
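A minimal sketch of what such a vector-database-backed long-term memory could look like, assuming a toy deterministic embedding in place of a real text encoder; the class and its methods are illustrative, not any particular product's API.

```python
# Toy long-term memory: store embedded facts, retrieve by cosine similarity.
import zlib
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Toy deterministic embedding; a real system would use a learned text
    encoder, so with this stand-in the similarity ranking is arbitrary."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

class MemoryStore:
    """Append-only long-term memory queried by nearest-neighbour search."""
    def __init__(self) -> None:
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str) -> None:
        self.texts.append(text)
        self.vectors.append(embed(text))

    def query(self, text: str, k: int = 2) -> list[str]:
        q = embed(text)
        sims = np.array([v @ q for v in self.vectors])
        top = np.argsort(-sims)[:k]
        return [self.texts[i] for i in top]

memory = MemoryStore()
memory.add("User prefers green tea in the morning.")
memory.add("The rag is kept in the kitchen drawer.")
memory.add("User dislikes loud vacuuming after 9 pm.")

# Before acting, the agent retrieves user- and scene-specific memories.
print(memory.query("a cup of water was knocked over; need something to wipe it up"))
```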

However, I think this long-term memory should not only be symbolic, knowledge-based memory; it also needs to include memory of 3D space, that is, efficient modeling and encoding of the 3D environment the agent is in and the 3D objects around it, so as to support the agent in completing complex interactive tasks in a 3D environment and thereby achieve embodied intelligence. This is necessary for a robot: for example, if the robot knocks over a cup while pouring water, GPT can tell the robot to fetch a rag, but to actually execute the action of "fetching a rag", further detailed action instructions must be produced. Without memory of the 3D space, the action instructions issued by GPT may simply not be executable.

So, how do we achieve efficient modeling and representation of 3D scenes to support this kind of spatial intelligence? There is no good answer yet. A teacher just mentioned the prospects of NeRF. NeRF is indeed very powerful, but is it appropriate here? I am not sure. NeRF was proposed for novel-view synthesis (rendering) and encodes geometry inside a neural network. This is a very compact representation, but it is not friendly to efficient queries: given a position in 3D space, the network can only tell you whether that point is inside an object (occupied) or outside it (free space), along with color information; it cannot tell you about the nearby geometry (for that you have to query point by point), let alone the overall topological and geometric structure of the scene, which is problematic for spatial intelligence. I think we may need a more structured yet differentiably learnable 3D scene representation that supports efficient association, query, and updating of geometric and semantic information. This direction will be very interesting.
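To make the query pattern concrete, here is a toy illustration in which a hand-written sphere stands in for a trained implicit field (a NeRF or occupancy MLP): a single point query is cheap, but any question about nearby or global geometry forces dense point-by-point probing. Everything here is an illustrative assumption, not NeRF itself.

```python
# Toy per-point implicit field: answers occupancy and colour at one point only.
import numpy as np

def field(p: np.ndarray) -> tuple[bool, np.ndarray]:
    """Per-point query: occupancy and colour of a unit sphere at the origin."""
    occupied = bool(np.linalg.norm(p) <= 1.0)
    colour = np.array([1.0, 0.0, 0.0]) if occupied else np.zeros(3)
    return occupied, colour

# Querying one point is cheap.
print(field(np.array([0.2, 0.1, -0.3])))

# "What does the geometry near this point look like?" has no direct answer;
# we can only probe a dense grid of points and post-process the samples.
xs = np.linspace(-1.5, 1.5, 30)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1).reshape(-1, 3)
occupancy = np.array([field(p)[0] for p in grid])
print(f"{occupancy.sum()} of {len(grid)} probed points are occupied")
```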

Xie Lingxi (Huawei)

I want to explain my definition of a foundation model. I think that for a foundation model, its size or the tasks it completes are not its essence; its essence lies in modeling the data distribution of the world. As long as a model captures a good data distribution, it can be called a foundation model. In that case, we need to respond to the question raised earlier about the number of parameters of visual models. I think the parameter count of visual models will be much larger than that of NLP models, because the visual world is more complex than the world of natural language. Therefore, if we want one model to capture the data distribution of the visual world, the number of parameters it requires will be at least as large. Given this, our key problem at the moment is not that we don't know what architecture is appropriate, because I think the current Transformer is powerful enough. What we really don't know is what the world is actually like; we do not have a good understanding of it. Is the Internet a faithful reflection of our entire world?

So I want to emphasize again that we need to define this model of the world ourselves. Whether we define a world model in the way Yann LeCun does, or define the world through embodied methods that achieve highly realistic simulation, this is what we most urgently need to do now. Once we have this definition, we can establish an alignment between the natural and the artificial, and then run various models on that basis, aligning the distribution to various tasks to accomplish complex tasks. As for the memory Teacher Xu just mentioned, I think NLP is a good example: language models can remember a great deal of knowledge. Why? Because we give them enough data, they really do memorize the data distribution and discover the correlations between data and tasks. If we can truly build a sufficiently realistic world, then a model trained in such an environment must also have a strong enough memory, otherwise it could not remember this distribution. So I think understanding this world, or building a simulation environment for it, is very important.

Cao Xiaochun (Sun Yat-sen University)

From the perspective of the unsupervised fill-in-the-blank task in NLP just mentioned: we commonly use about 6,000 Chinese characters, so filling one blank means choosing among roughly 6,000 candidates; for two characters it is at most 6,000 to the power of 2. Because language is created by humans, it has strong regularities, so the complexity of this fill-in-the-blank optimization is far lower than brute-force exponential search, and the problem should be relatively easy. For example, anyone who has learned Chinese characters can fill in the blanks; whether the filling is good or not is another matter, but at least it can be done.
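A short way to write down the counting argument above (an illustrative note, not from the speaker): the brute-force space for filling n character blanks is |V|^n, while a language model narrows each blank to roughly its perplexity.

```latex
\[
  \text{candidates for } n \text{ blanks} \;=\; |V|^{\,n},
  \qquad |V| \approx 6000,\qquad 6000^{2} = 3.6\times 10^{7};
\]
\[
  \text{a language model reduces the effective choices per blank to }
  \mathrm{PPL} \;=\; \exp\!\Big(-\tfrac{1}{n}\sum_{i=1}^{n}\log p\big(c_i \mid c_{<i}\big)\Big) \;\ll\; |V| .
\]
```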

The second point: why are large visual models relatively difficult? Teacher Lingxi just said the complexity is 2 to 3 orders of magnitude higher; I feel it may be even more than that, and we can do a simple calculation. Take NLP fill-in-the-blank versus visual completion as examples. An object may be adjacent to, or occluded by, an object of almost any geometric shape and color, so the possibilities are nearly unlimited: a person may be next to a cup, a wall, a broom, or even a dinosaur or a Thai elephant. In this situation we should not make too many assumptions, otherwise the model is easily affected by pre-training bias, so the complexity is much higher than in NLP. At present, the early large visual foundation models are not powerful enough: simply filling in blanks by methods analogous to NLP does not work well, or at least has not yet succeeded. So my intuition is that we may need to proceed step by step. In some specific domains, such as visual tasks on cartoons or Japanese comics, the problem space is greatly reduced thanks to strong priors and simplifications, so a large visual model for the cartoon domain may surpass the human visual system before a general-purpose large visual model does.

Finally, there is the unified backbone network for the large visual foundation model that Teacher Hu mentioned. Someone just brought up the Transformer framework, and I have been discussing with my students why it succeeds in NLP: because language has a grammatical structure designed by humans or established by convention, it has components that are easy to separate, such as vowels and consonants, subject, predicate, and object. Visual content, however, is not easy to decompose; its semantic components and the relationships between them are more complicated. Although Transformers have also made great progress in vision, it is still unclear whether they will become the backbone network of the large visual model that Teacher Hu is asking for.

Wu Xiaojun (Jiangnan University)

I think large models can be understood with the mathematical concept of a subspace. Every large model is actually a model of a subspace of the physical world. Existing models such as SAM and CLIP can be viewed as subspaces of the real world, possibly subspaces of specific industry verticals. The foundation model can therefore be regarded as an instance of a subspace, and the nature of a subspace is determined by its characteristic quantities. To find the large model for our data, we need to find these characteristic quantities of the data, because a large visual model is related to certain basic functional modules, and these basic functional modules are in fact the characteristic quantities of the subspace.

These subspaces may appear in different forms. For example, they may be features in Euclidean space, or they may appear as a composite space of the kind Teacher Zhang described earlier, which also corresponds to the real world. Using the concept of subspaces to understand ChatGPT may be a bit whimsical, but one can think of it as moving from linear algebra to functional analysis, from finite dimensions to infinite dimensions. Perhaps the current ChatGPT is a relatively large subspace, even something like a subspace of a function space.

The subspaces in our current visual models are small subspaces in specific domains, but through learning these subspaces may eventually converge into a large vision model. This viewpoint can explain the point made by Teacher Chen and similar points made by other experts. As for the internal and external memory proposed by Teacher Xu, I do not quite agree, because I think the memory mechanism in current models is shaped more by the macro computing architecture, whereas the information storage mechanism in our brains is in fact content-based.

Chen Baoquan (Peking University)

Problems in the visual domain are very complex because the space is so large; the current foundation models only meet part of the need, and only when more of it is covered will we see something like a video ChatGPT. AIMP has provided great inspiration for vision, and now everyone is studying large vision-language models. After the emergence of models like SAM, the next step may be to improve accuracy by increasing the amount of data and annotation, and that may be what surprises people. Just as ImageNet's large annotated dataset propelled the development of vision, the next step may be large vision-language models: similar to ChatGPT, many people could annotate visual tasks with language to build even larger datasets, promote the birth of large vision-language models, and produce an amazing video ChatGPT, and so on.

In addition, current large visual models allow us to move from closed sets to open sets. In the past, all tasks, including RTC tasks, were performed on closed datasets. Now that SAM has appeared, although it does not perform especially well on closed sets, it is recognized for its strong generalization ability and for a universal representation that is close to human cognition. As the technology advances, it may become possible to truly achieve open-set evaluation in the real world. Finally, these large models will be of great help to future robots and embedded devices, including mobile phones: because they can handle general tasks and interact with the environment, and, like ChatGPT, can keep interacting and learning to improve their capabilities, they will bring even greater potential.

Q3: Which problems in computer vision are no longer worth pursuing in the context of large models?

Jin Lianwen (South China University of Technology)

Based on my experience in the OCR field, I would like to share some observations. In OCR we need to deal with both visual and language tasks: for example, given an invoice, we need to recognize its content, extract key information, and generate structured output, and the same applies to other tasks involving structured documents. The GPT models perform very well on comprehension and language-related tasks, to the point that they feel like a cerebral cortex for language. We have run some experiments, for example using open-source large models and APIs (such as the V3 engine and the GPT-3.5 API) to complete understanding tasks such as extracting key text from videos. These models perform very well, especially on language-related tasks.

Next, I would like to add something. As researchers in the CV field, we hope to develop our own foundation models for CV, not just be influenced by the NLP field. The biggest inspiration the GPT models bring us is not the breakthrough in language understanding or text generation, but a simple recipe for integrating essentially all human knowledge into one model. It is not perfect yet, but we can already see its potential. Through pre-training and reinforcement learning techniques, a model can be made to learn the knowledge of all of humanity. Whether a similar model can be realized in the visual field requires further exploration and research.

Cao Xiaochun (Sun Yat-sen University)

Following Teacher Jin's answer, let me add something. Thanks to the technological innovations in OCR by Teacher Jin, Teacher Bai Xiang and others here, many tasks in other fields are no longer necessary. For example, in the past, if we wanted to extract text from PDF and PPT files, we had to study the PDF and PPT formats and constantly keep up with new versions. With OCR, we can simplify this process and directly and efficiently extract the required key text from screen captures of documents.

Xie Lingxi (Huawei)

Let me talk about two research directions that are currently hard to make progress in. The first is the design of visual network architectures, and the second is video pre-training. Of course, I am not saying these two directions must not be pursued, because both fields are enormous, but we feel that both are currently facing great challenges. In the past, for network architecture design, we could design a new architecture and evaluate it on a baseline; if classification improved, the architecture was better. The same was true for pre-training: we could design the pre-training structure and algorithm and test it on the corresponding dataset; higher accuracy meant a better algorithm.

But why does progress in these two areas seem slow right now? Because the existing datasets are already quite saturated, and even switching to a larger dataset makes it hard to obtain significant improvements. So it is not that there is no need to continue research in these two directions; rather, because they face these difficulties, we urgently need better evaluation metrics to guide the research. Such evaluation could take several forms: a larger and more difficult dataset, although I am not very optimistic about that, or a better interactive environment. My earlier talk focused on the environment issue. If we solve the evaluation or environment problem, I believe that better network architectures and better pre-training algorithms will both emerge.

A teacher just said that some current models are blind attempts, and I very much agree. But why is no one focusing on this aspect now? Because even if you design a very clever method, it may only improve accuracy by 0.2 points in practice, and the influence of such a paper may not be great. Therefore the evaluation pressure we face is even larger. I think that once the evaluation or environment problem is solved, many areas of vision will see new breakthroughs. That is what I think.

Shan Shiguang (Institute of Computing Technology, Chinese Academy of Sciences)

What Teacher Chen said makes sense. What I want to add is that there was a period when sparse coding and sparse representation were particularly popular, and the editor-in-chief of a journal on whose editorial board I served did send an email to all board members suggesting that submissions on such topics be accepted cautiously. Cautiously, not rejected outright, but the implication was that they were not encouraged, because it was difficult to judge the actual value of many such submissions. Going back to Jingdong's question, I find it difficult to enumerate all the problems that probably should no longer be pursued. But it is true that once we have larger models or stronger foundation models, the way many problems were defined in the past no longer seems so meaningful. Take few-shot learning, especially methods based on meta-learning: the benchmark setting assumes a base set of dozens of categories or even only a few categories, and then requires N-way K-shot learning on novel categories. I think the definition of this problem is wrong. By analogy with people: humans are not in this setting; human few-shot learning ability is built on a large amount of prior learning. Without having learned a huge number of "base classes" during evolution and development, people might not have any "few-shot learning ability" at all.
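To make the criticized setting concrete, here is a minimal sketch of sampling an N-way K-shot episode; the class names, image counts, and split are made-up illustrations rather than any specific benchmark.

```python
# Toy N-way K-shot episode sampler for the few-shot setting discussed above.
import random

random.seed(0)

# A toy label pool standing in for a small set of "novel" classes.
novel_classes = {f"class_{c}": [f"class_{c}_img_{i}" for i in range(20)]
                 for c in range(10)}

def sample_episode(class_pool, n_way=5, k_shot=1, n_query=5):
    """Draw one N-way K-shot episode: a tiny support set plus query images."""
    ways = random.sample(list(class_pool), n_way)
    support, query = {}, {}
    for c in ways:
        imgs = random.sample(class_pool[c], k_shot + n_query)
        support[c] = imgs[:k_shot]          # K labelled examples per class
        query[c] = imgs[k_shot:]            # images the learner must classify
    return support, query

support, query = sample_episode(novel_classes)
print({c: len(v) for c, v in support.items()})   # e.g. 5 classes x 1 shot each
```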

The AI problems we want to solve cannot break through the bottleneck of information theory. To solve an underdetermined problem with a large number of unknown parameters, we must either introduce more data to form more equations, or introduce more knowledge or information as constraints; otherwise there is no way to obtain a more determinate solution. Now that we have foundation models, I think they bring more implicit knowledge or contextual information, which greatly increases the chance of obtaining a satisfactory solution. Without this knowledge or information, it would be difficult to reach satisfactory solutions with previous methods. This, I think, is the essence of the foundation model: it provides a powerful cornerstone for solving downstream tasks with insufficient data, so solutions far from this cornerstone need not be considered, or one only needs to lay a few bricks and tiles on top of it. Therefore, in this sense, solving a large number of problems may require rethinking the benchmarks and resetting the test protocols from this perspective.
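One hedged way to formalize the "more equations or more constraints" argument (an analogy, not the speaker's notation): with more unknowns than observations the system is underdetermined, and prior knowledge, such as what a foundation model supplies implicitly, enters as a regularizer.

```latex
\[
  y = A x,\qquad A \in \mathbb{R}^{m \times n},\; m < n
  \;\;\Longrightarrow\;\; x \text{ is underdetermined};
\]
\[
  \hat{x} \;=\; \arg\min_{x}\; \|A x - y\|_2^{2} \;+\; \lambda\, R(x),
  \qquad R(\cdot) \text{ encodes the extra knowledge or constraints.}
\]
```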

Wang Tao (Aerospace Grand Plan)

There are still many problems in industry that large models have not solved. General large models, industry large models, and scene-level large models develop in stages. In industrial scenarios, scene-level large models are very difficult, especially because of small data volumes: examples include road and power-line defect identification in drone inspections, or the recognition of weak and small targets such as aircraft and ship types in sub-meter remote sensing images. The practical application of these scene-level large models faces great challenges. To solve scene-level large-model problems, sample collection, few-shot learning and transfer learning must all be considered.

In industrial scenarios, collecting samples is itself a challenge. For example, detecting accident-damaged vehicles requires a large number of photos of various types of accident vehicles. Transfer learning is also a problem: how do we transfer the knowledge of large models to specific industrial application scenarios? In text recognition, when encountering complex cases such as artistic fonts, the recognition difficulty increases dramatically. Large models have the basic capabilities of an ordinary person, but becoming a scene-level expert is just as difficult. Such scene-level expert models can be applied in specific industrial scenarios, while general large models can serve ordinary people's everyday scenarios.

Zha Hongbin (Peking University)

Everyone is discussing what to do and what not to do in the current era of large models. Large models such as ChatGPT do play an important role in text and speech processing and can effectively exploit the structural and abstract knowledge extracted from big data. However, large models are only part of the solution to real-world problems. Human intelligence develops from two parts, the innate and the acquired, and genes are an important carrier of innate information. One could say that genes are a large model of the knowledge accumulated over millions of years of human evolution, but genes alone cannot make a person intelligent. To become an intelligent individual, the human brain and body must also be flexible and plastic, that is, able to learn through interaction with the environment and through feedback, so that the system itself changes within the real environment. Large models therefore provide basic information similar to genes, but to make a system flexible and adaptive, we also need online learning and processing capabilities that change the model itself through interaction with the environment, so that it can better adapt to the real environment and its changes. So in addition to studying large models, we also need to consider how to endow systems with this flexibility and adaptability.

Jin Lianwen (South China University of Technology)

I have just listened to many teachers' explanations and found them very inspiring; let me say a bit more and hope everyone can think about it together. First, ChatGPT has shown us some surprising aspects of natural language processing, such as emergent abilities, scaling laws, and the chain-of-thought of language models. I have been thinking that, although there have been some related studies, relatively few involve the visual field. I wonder whether there are phenomena analogous to NLP's emergent abilities and scaling laws in vision. Studying this in the general vision field may be difficult because it requires enormous computing resources, as many teachers have said. But in some vertical domains, perhaps everyone can discuss this question.

Secondly, the inspiration ChatGPT brings us is not how good its performance is on specific tasks; what strikes us is its very strong general ability to solve practical problems. For example, one model can be deployed and used anywhere. In the past, when doing research or solving practical application problems, we might have had to re-collect and re-process data for each different manufacturer; with such a universal foundation model, this will be greatly reduced. So I want to ask: is it possible to find similar scaling laws, chains of thought, or emergent abilities in vertical visual domains? There are many such studies in NLP, but few related articles in vision.

Q4: How should scientific research be carried out in the context of large models?

Xiao Bin (Chongqing University of Posts and Telecommunications)

I think that under the current trend of large-model development, the CV field has been greatly impacted, and we should avoid being overly affected and instead change our thinking. Recently I have taken part in several surveys on NLP and large models and invited many experts in the field. A common problem is that the main challenge university researchers face in large-model research is computing power. Large models in NLP have fewer parameters than what large CV models would need, yet even the stronger NLP teams only have hundreds or at most thousands of GPU cards, and for universities that is already the limit. This creates a huge computing-power challenge. I think we can try to start from the following aspects. First, data sharing, so that each team does not repeatedly clean large amounts of data when building large models. Second, sharing and open-sourcing model structures, researching basic architectures suitable for computer vision and releasing them openly. Third, algorithm optimization, tailoring algorithms to specific tasks so they run well under limited computing power. Finally, model evaluation: establishing unified, objective evaluation standards to judge the performance of large models and avoid being misled by local indicators. In general, we currently need to focus on changing our thinking, solving the computing-power problem, promoting the sharing and optimization of data, model structures and algorithms, and establishing unified standards for evaluating model performance. This will help us make better progress in large-model research in computer vision.

Wu Lifang (Beijing University of Technology)

First, regarding the role of large models: to a large extent, large models are a summary and application of existing knowledge, and can be viewed as a knowledge-transfer tool built on massive data and experience. However, their adaptability is limited, and they may not handle problems in some specific scenarios. How to further improve the adaptability and flexibility of large models across different scenarios, so that they better understand and respond to different environments and situations, is a meaningful research direction. Second, regarding limited resources: while many of us may not have the computing resources to build large models, we can improve problem solving by bringing in external knowledge, background information, or domain expertise. This kind of knowledge injection can help us solve problems better in specific scenarios, even traditional tasks, thereby compensating for limited resources. It may involve technologies such as knowledge graphs, transfer learning, and domain-knowledge fusion, and is a direction worth further research.

Liu Jing (Institute of Automation, Chinese Academy of Sciences)

To put it concisely, I believe the evaluation of large-model capabilities is a complex issue and plays an important role in guiding the direction of research. To ensure our work reaches its best level, we need to establish effective evaluation methods. Take GPT-4 as an example: it performed well in comprehensive evaluations covering retrieval, visual question answering, and open-domain tasks, which shows that the demands on large models in open environments are increasingly prominent. Traditional research is biased toward closed-set tasks, and this needs to change. The capabilities of large models may not be well measured by simple single-sample evaluation; more attention should be paid to comprehensive performance in open environments. This raises a question worth further study: how to evaluate the comprehensive capabilities of large models, and how to compare the performance of large models, small models, and stronger models, rather than relying only on existing benchmarks. At the same time, the value of open tasks also deserves deeper exploration and thought.

Dai Jifeng (Tsinghua University)

I think this question can be considered from another angle, which may be more productive. The focus is actually not so much on the number of parameters or the scale itself, but rather on building models that achieve task generality or strong generalization; parameter count and scale can be seen as means to that end, or as the current stage of it. If we organize our thinking around the concept of the "Action Model", we will find there are many things worth exploring.

For example, in the field of ChatGPT and NLP, what matters is that they found a way to model the NLP world in an essential way: they directly model the distribution of every token and every word by predicting the next token. In the visual field, when processing static images we used to set aside the complexity of the physical world and focus only on the static pixel distribution. We need to think more fundamentally about how to model distribution, representation, and supervision in images more effectively. Existing image methods are effective in some respects, but they are not essential enough, because the real world is not only the appearance of an image at one moment; it also includes the passage of time, causality, and so on. These aspects are not essentially modeled by the current mainstream models in the image field, so I think there are still many issues worth exploring here. That is my opinion.
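As a toy illustration of the next-token objective mentioned above, the sketch below computes a softmax over a made-up vocabulary and the cross-entropy for one ground-truth next token; the vocabulary and logits are invented for illustration only.

```python
# Toy next-token modelling objective: softmax over the vocabulary + cross-entropy.
import numpy as np

vocab = ["the", "cat", "sat", "mat", "<eos>"]
logits = np.array([0.2, 2.5, 0.1, -1.0, 0.3])   # model scores for the next token

probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax over the vocabulary

target = vocab.index("cat")                      # ground-truth next token
loss = -np.log(probs[target])                    # per-token cross-entropy
print(f"p(next='cat') = {probs[target]:.3f}, loss = {loss:.3f}")
```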

Tang Jin (Anhui University)

Let me add something: many people are worried about this issue, especially about whether we should invest heavily in GPUs to train our own large models. Let me illustrate with an example: aviation schools do not manufacture their own aircraft; they focus on educating students. Likewise, in our field there used to be a gap between research and engineering systems, so we used to think we had to do everything ourselves; but once something has developed to the point of being a settled matter, we only need to make use of it to achieve our goals. We should also consider whether it is worth competing head-on with industry, because sometimes we are at a disadvantage before we even start. So I suggest everyone think carefully about the motivation behind what we are doing today. In addition, we need to consider historical development and possible future situations. After a tsunami or a big wave, will we stay where we are and not move forward? If the sea level rises, will we still be able to keep doing our current jobs? Perhaps we need to make some adjustments to adapt to the new situation.

Wu Baoyuan (The Chinese University of Hong Kong (Shenzhen))

I think about this issue from two angles. The first is studying the security of large models, because security problems in large models may lead to more serious consequences. One consideration is whether our previous research methods for small models still apply to large models; there have already been some explorations here, and new challenges such as efficiency have begun to surface. Another consideration is whether large models introduce entirely new security vulnerabilities, such as excessive memorization of training data and hallucination. The second angle is to use the large model as a tool, that is, as a supporting component rather than something that completes the task autonomously. For example, when we want to complete a specific complex task, the general ability of the large model may not be enough to solve the whole problem well, but we can decompose the problem into multiple simpler sub-problems and let the large model complete some of them, which speeds up the whole process. We have already done some exploration along these lines and have preliminarily verified the feasibility of this idea; its potential is worth further study.
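A minimal sketch of the "decompose, then delegate parts to the large model" idea described above. The `call_llm` function is a hypothetical placeholder, not a real API; in practice it would wrap whichever model endpoint is available, and the decomposition itself could also be produced by the model.

```python
# Toy task-decomposition loop that delegates sub-problems to a large model.
def call_llm(prompt: str) -> str:
    """Stand-in for a large-model call; here it just echoes a canned answer."""
    return f"[model answer to: {prompt}]"

def solve_complex_task(task: str) -> str:
    # Step 1: decompose the complex task into simpler sub-problems.
    # (Hard-coded here for illustration; one could also ask the model to do the split.)
    subtasks = [
        f"Summarise the requirements of: {task}",
        f"List candidate methods for: {task}",
        f"Draft an evaluation plan for: {task}",
    ]
    # Step 2: delegate each sub-problem to the large model.
    partial_answers = [call_llm(s) for s in subtasks]
    # Step 3: combine the partial answers outside the model.
    return "\n".join(partial_answers)

print(solve_complex_task("evaluate the robustness of a face-recognition system"))
```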
