Lessons learned from GPT and large language models


Paper address: https://arxiv.org/pdf/2306.08641.pdf


The AI community has long pursued algorithms known as artificial general intelligence (AGI), which can be applied to any type of real-world problem.


01

Overview

Recently, chat systems powered by large language models (LLMs) have emerged and quickly become a promising direction toward AGI in natural language processing (NLP), but the road to AGI in computer vision (CV) remains unclear. One might attribute this dilemma to the fact that visual signals are more complex than textual ones, but we are interested in identifying the specific reasons and drawing lessons from GPT and LLMs to address the problem.


In today's post, we start from the definition of AGI and briefly review how NLP solves a wide range of tasks through chat systems. This analysis suggests that unification is the next big goal for CV. However, despite various efforts in this direction, CV is still far from a system that, like GPT, naturally integrates all tasks. We argue that the essential weakness of CV lies in the lack of a paradigm for learning from environments, something NLP has already achieved in the text world. We then imagine a pipeline that places a CV algorithm in an interactive, world-scale environment, pre-trains it to predict the future frames that follow its actions, and then fine-tunes it with instructions to accomplish various tasks. We hope to push this idea forward and scale it up through substantial research and engineering efforts, and to this end we share our view of future research directions.


02

Background

The world is witnessing an epic journey toward artificial general intelligence (AGI), which we conventionally define as a computer algorithm that can replicate any intellectual task that humans or other animals can perform. Specifically, in natural language processing (NLP), computer algorithms have advanced to the point where they can solve a wide range of tasks by chatting with humans, and some researchers regard these systems as early sparks of AGI. Most of these systems are built on large language models (LLMs) and enhanced by instruction tuning. Equipped with external knowledge bases and specially designed modules, they can complete complex tasks such as solving mathematical problems and generating visual content, reflecting a strong ability to understand user intentions and carry out preliminary chains of thought. Despite known weaknesses in some areas (e.g., stating scientific facts or the relationships between named people), these pioneering studies show a clear trend toward unifying most NLP tasks in a single system, which reflects the pursuit of AGI.


Compared to the rapid progress of unification in NLP, the computer vision community is far from the goal of unifying all tasks. Conventional CV tasks, such as visual recognition, tracking, and generation, are mostly handled with different network architectures and/or specially designed pipelines. Researchers look forward to a GPT-like system that can handle a wide range of CV tasks through a unified prompting mechanism, but there is a trade-off between achieving strong performance on individual tasks and generalizing across a wide range of tasks. For example, to report high accuracy in object detection or semantic segmentation, the best strategy is to design a task-specific head on top of a strong image-classification backbone, and such designs usually do not transfer to other problems.

Thus, two questions arise: (1) Why is the unification of CV so difficult? (2) What can be learned from GPT and LLMs in order to achieve this goal?

To answer these questions, we revisit GPT and understand it as establishing an environment in the text world and letting algorithms learn from interacting with it. CV research lacks such an environment. As a result, algorithms cannot model the world; instead, they sample it and learn to perform well on so-called proxy tasks. After an epic decade of deep learning, proxy tasks are no longer meaningful indicators of the capabilities of CV algorithms, and it is increasingly clear that continuing to pursue high accuracy on them may lead us away from AGI.


03

Defining AGI

In short, AGI amounts to learning a generalized function a = π(s), which maps any state or query s to the desired action or answer a. Despite the simplicity of this form, old-fashioned AI algorithms struggled to handle all such problems with the same methodology, let alone the same model. Over the past decade, deep learning has provided an efficient and unified approach: one can train deep neural networks to approximate a = π(s) without knowing the actual relationship between s and a. The advent of powerful architectures such as the transformer has even enabled researchers to train a single model for different data modalities.
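To make this formulation concrete, here is a minimal PyTorch sketch (our own illustration, not code from the paper) that approximates a = π(s) with a small transformer mapping a tokenized state s to action logits a; the vocabulary size, model dimensions, and mean pooling are arbitrary assumptions.

```python
# Minimal sketch (not from the paper): approximate a = pi(s) with a transformer.
# All sizes and names below are illustrative assumptions.
import torch
import torch.nn as nn

class Policy(nn.Module):
    def __init__(self, vocab_size=1024, d_model=256, n_actions=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)           # tokenized state s
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, n_actions)                # action logits a

    def forward(self, state_tokens):                             # (batch, seq_len)
        h = self.encoder(self.embed(state_tokens))               # (batch, seq_len, d_model)
        return self.head(h.mean(dim=1))                          # pool, then predict a

policy = Policy()
s = torch.randint(0, 1024, (2, 16))       # two toy states, 16 tokens each
logits = policy(s)                        # approximate pi(s); train with a task-specific loss
```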

There are huge difficulties in realizing AGI, including but not limited to the following problems.

  • Data complexity. Real-world data is multifaceted and abundant. Some data modalities (e.g., images) can be quite high-dimensional, and the relationships between different modalities may be complex and latent.

  • The complexity of human intelligence. The goal of AGI is not only to solve problems, but also to plan, reason, react to different events, etc. Sometimes, the relationship between human actions and goals is fuzzy and difficult to express mathematically.

  • Lack of neural or cognitive theory. Humans do not yet understand how human intelligence is achieved. Currently, computer algorithms offer one pathway, but with future research in neurology and/or cognition, more possibilities may emerge.


04

GPT

The Spark of AGI in NLP

Over the past year, ChatGPT, GPT-4, and other AI chatbots such as Vicuna have made significant progress toward AGI. These are computer algorithms developed for natural language processing (NLP). By chatting with humans, they can understand human intent and complete a wide range of tasks, as long as those tasks can be presented in plain text. In particular, GPT-4 shows strong general problem-solving capabilities and is considered an early spark of AGI in the NLP field.


Although GPT-4's visual interface has not yet been opened to the public, the official technical report shows several striking examples of multimodal dialogue, i.e., chat that takes input images as references. This implies that GPT-4 can already combine language features with visual features and thus perform basic visual understanding tasks. As we will see later, the vision community has developed several alternatives for the same purpose; the key lies in using ChatGPT or GPT-4 to generate (or coach) the training data. Furthermore, with simple prompts, GPT-4 can also invoke external software for image generation (e.g., Midjourney) and external libraries (e.g., HuggingFace) for solving complex computer vision problems.


These AI chatbots are trained in two stages. In the first stage, a large language model (LLM), usually based on the transformer architecture, is pre-trained on large text corpora with self-supervised learning. In the second stage, the pre-trained LLM is fine-tuned under the supervision of human instructions to complete specific tasks. When necessary, human feedback is collected and reinforcement learning is applied to further fine-tune the LLM for better performance and data efficiency.
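As a rough illustration of this two-stage recipe (a simplification of ours, not the actual GPT training code), the sketch below contrasts the stage-1 next-token objective with stage-2 supervised fine-tuning on instruction data; here `lm` stands for a placeholder causal language model that returns per-position vocabulary logits.

```python
# Illustrative sketch of the two-stage training idea (not the actual GPT recipe).
import torch
import torch.nn.functional as F

def next_token_loss(lm, tokens):
    """Stage 1: self-supervised pre-training, predict each token from its prefix."""
    logits = lm(tokens[:, :-1])                        # (batch, seq-1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))

def instruction_loss(lm, prompt, response):
    """Stage 2: supervised fine-tuning, compute the loss only on the response tokens."""
    tokens = torch.cat([prompt, response], dim=1)
    logits = lm(tokens[:, :-1])
    targets = tokens[:, 1:].clone()
    targets[:, :prompt.size(1) - 1] = -100             # ignore the prompt part
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), ignore_index=-100)
```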


05

AGI's Next Battlefield


Humans perceive the world through multiple data modalities; it is commonly estimated that about 85% of what we learn comes through the visual system. Given that the NLP community has already shown the promise of AGI, it is natural to regard computer vision (CV), or multimodality (covering at least the vision and language domains), as the next battlefield of AGI.

Two additional comments complement the above statement. First, CV can be seen as a superset of NLP, since a human reading an article first recognizes the characters in the captured image and then understands the content. In other words, AGI in CV (or in multimodality) should cover all the capabilities of AGI in NLP. Second, language alone is not enough in many cases. For example, when one wants detailed information about an unknown object (e.g., an animal, a fashion item, etc.), the best approach is to capture an image and search with it; relying solely on a textual description can lead to ambiguity and inaccuracy. Another situation is that, as mentioned earlier, it is not always easy to refer to fine-grained semantics in a scene (for recognition or image editing), and it is often more efficient to work in a visually friendly way, for example, using points or boxes to locate the target, instead of saying "the person in the black jacket standing in front of the yellow car, talking to another person."

Ideal and Reality

Ideally, a CV algorithm should be able to solve general tasks by interacting with the environment. This requirement is not limited to recognizing everything in an image or holding a dialogue about images or video clips; it should be a holistic system that takes ordinary commands from humans and produces the desired results. However, the current state of CV is still very preliminary: different vision tasks are handled with different modules and even entirely different systems.


Unification Is the Trend

Below, we summarize recent research toward unification in CV into five categories.

  • Open-world Visual Recognition


For a long time, most CV algorithms could only recognize concepts that were present in the training data, resulting in a "closed world" of visual concepts. In contrast, "open world" refers to the ability of a CV algorithm to recognize or understand any concept, whether or not it has been seen before. Open-world capability is often introduced through natural language, because language is a natural way for humans to understand new concepts. This explains why language-related tasks such as image captioning and visual question answering led to the earliest open-world settings for visual recognition.
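As a concrete illustration of open-vocabulary recognition (our example, not taken from the surveyed work), a vision-language model such as CLIP can score an image against arbitrary text labels chosen at inference time; the sketch below uses the HuggingFace transformers CLIP API, with the image path and label list as placeholders.

```python
# Sketch of open-vocabulary classification with CLIP (labels are arbitrary text).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                       # any image on disk (placeholder path)
labels = ["a capybara", "a corgi", "a red panda"]     # concepts never fixed at training time

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)      # similarity of the image to each label
print(dict(zip(labels, probs[0].tolist())))
```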

  • The Segment Anything Task


The Segment Anything task is a recently introduced general module that clusters raw image pixels into groups, many of which correspond to basic visual units in the image. The task supports multiple types of prompts, including points, boxes, text, etc., and generates masks and confidence scores for each prompt or prompt combination. After training on a large-scale dataset of about 10 million images, the resulting model, SAM, can be transferred to a wide range of segmentation scenarios, including medical image analysis, camouflaged object segmentation, 3D object segmentation, object tracking, and image restoration. SAM can also be combined with state-of-the-art visual recognition algorithms, for example to refine bounding boxes produced by detection algorithms into masks, or to feed the segmented units into open-set classification algorithms for image labeling.
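For reference, a minimal point-prompted usage sketch with the official segment_anything package might look like the following; the checkpoint path, image path, and click coordinates are placeholders.

```python
# Sketch of point-prompted segmentation with SAM (paths and coordinates are placeholders).
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # downloaded weights
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

point = np.array([[500, 375]])            # a single foreground click (x, y)
label = np.array([1])                     # 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(point_coords=point, point_labels=label,
                                     multimask_output=True)
print(masks.shape, scores)                # several candidate masks with confidence scores
```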

  • Generalized Visual Encoding


Another way to unify CV tasks is to provide them with a common visual encoding, and there are several ways to achieve this. A key difficulty lies in the large differences between vision tasks: object detection requires a set of bounding boxes, semantic segmentation requires dense predictions over the entire image, and both are very different from the single label required for image classification. Natural language offers a unified form to represent almost everything, since everyone can understand it. An early study, pix2seq, showed that object detection results (i.e., bounding boxes and class labels) can be formulated as sequences of discrete tokens and produced directly as the output of a vision model. A later version, pix2seq-v2, generalized this representation to cover object detection, instance segmentation, keypoint detection, and image captioning. Similar ideas have been applied to other image recognition, video recognition, and multimodal understanding tasks.
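To illustrate the idea with a toy example (our simplification, not the exact pix2seq recipe), a bounding box can be encoded as a short token sequence by quantizing its coordinates into discrete bins and appending a class token, so that detection becomes ordinary sequence prediction:

```python
# Simplified pix2seq-style encoding: quantize box coordinates into discrete bins,
# then append a class token, so detection becomes sequence prediction.
NUM_BINS = 1000                                   # coordinate vocabulary size (assumption)
CLASS_OFFSET = NUM_BINS                           # class tokens live after coordinate tokens

def box_to_tokens(box, class_id, img_w, img_h, num_bins=NUM_BINS):
    """box = (xmin, ymin, xmax, ymax) in pixels -> [y1, x1, y2, x2, class] token ids."""
    xmin, ymin, xmax, ymax = box
    def q(v, size):                               # quantize a coordinate into [0, num_bins-1]
        return min(num_bins - 1, int(v / size * num_bins))
    return [q(ymin, img_h), q(xmin, img_w), q(ymax, img_h), q(xmax, img_w),
            CLASS_OFFSET + class_id]

# Example: a 640x480 image with one box of class id 3.
tokens = box_to_tokens((120, 80, 360, 300), class_id=3, img_w=640, img_h=480)
print(tokens)   # [166, 187, 625, 562, 1003]
```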

  • LLM-guided Visual Understanding


Visual recognition can be complex, especially when it involves compositional relationships between concepts and/or visual instances. End-to-end models (e.g., vision-language pre-trained models for visual question answering) struggle to produce answers that follow a program humans can easily understand. To alleviate this problem, a practical approach is to generate interpretable logic to aid visual recognition. The idea is not new: a few years ago, before the advent of the transformer architecture, researchers proposed using long short-term memory (LSTM) models to generate programs that invoke visual modules for complex question answering. At the time, the limited capability of LSTMs restricted the idea to relatively simple and templated questions.

More recently, the advent of large language models (especially the GPT family) has made it possible to handle arbitrary questions in this way. Specifically, GPT can participate in different ways: it can summarize basic recognition results into a final answer, or generate code or natural-language scripts that call basic vision modules. Vision problems can therefore be decomposed into basic modules. This works especially well for logical questions, such as those asking about spatial relationships between objects or those that depend on counting objects.
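The sketch below is a toy illustration in this spirit (the module names, the hard-coded detections, and the generated program are all hypothetical): an LLM would be prompted to emit a short program such as count_left_of, which composes basic vision modules to answer a spatial counting question.

```python
# Toy illustration of LLM-guided visual reasoning (all module names are hypothetical).

def detect_objects(image, category):
    """Placeholder for an open-vocabulary detector; returns (x, y, w, h) boxes.
    Hard-coded toy outputs stand in for a real model here."""
    toy = {"person": [(40, 60, 30, 80), (300, 50, 35, 85)], "car": [(180, 100, 120, 60)]}
    return toy.get(category, [])

def count_left_of(image, target="person", anchor="car"):
    """Answer 'how many <target>s are to the left of the <anchor>?' by composing modules."""
    targets = detect_objects(image, target)
    anchors = detect_objects(image, anchor)
    if not anchors:
        return 0
    anchor_x = min(x for (x, y, w, h) in anchors)     # left edge of the left-most anchor
    return sum(1 for (x, y, w, h) in targets if x + w < anchor_x)

print(count_left_of(image=None))   # -> 1 with the toy detections above
```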

  • Multimodal Dialog

Multimodal dialogue extends text-based dialogue into the visual realm. Early work took the form of visual question answering, for which various datasets with simple questions were constructed. With the rapid development of LLMs, multi-turn question answering can be achieved by jointly fine-tuning pre-trained vision and language models. Research also shows that a wide range of questions can be answered through multimodal in-context learning or by using GPT as a logic controller.


Recently, a paradigm developed in the GPT family, instruction tuning, has been adopted to improve the quality of multimodal dialogue. The idea is to provide reference data (e.g., objects and descriptions) from ground-truth annotations or recognition results and ask a GPT model to generate instruction data (i.e., rich question-answer pairs). By fine-tuning on these data (without the reference data), the underlying vision and language models can be connected through lightweight network modules (such as the Q-Former). Multimodal dialogue provides an initial interaction benchmark for computer vision, but as a language-guided task it shares the weaknesses analyzed for open-world visual recognition. We hope that richer query forms (e.g., generalized visual encodings) can push multimodal dialogue to a higher level.
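As a rough sketch of such a lightweight bridging module (our simplification in the spirit of the Q-Former, with all dimensions assumed), a small set of learnable queries cross-attends to frozen image features and is projected into the LLM's input space:

```python
# Simplified bridging module in the spirit of the Q-Former (all sizes are assumptions).
import torch
import torch.nn as nn

class Bridge(nn.Module):
    def __init__(self, vis_dim=1024, hidden=768, llm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden))
        self.vis_proj = nn.Linear(vis_dim, hidden)        # map frozen vision features
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden, llm_dim)          # tokens fed to the frozen LLM

    def forward(self, vis_feats):                         # (batch, num_patches, vis_dim)
        kv = self.vis_proj(vis_feats)
        q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)               # queries attend to image patches
        return self.to_llm(out)                           # (batch, num_queries, llm_dim)

bridge = Bridge()
fake_vis = torch.randn(2, 257, 1024)                      # e.g. ViT patch features (assumed shape)
print(bridge(fake_vis).shape)                             # torch.Size([2, 32, 4096])
```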


06

The Future

Learning from the Environment

An Imaginary Pipeline


The imagined pipeline consists of three stages: stage 0 builds the environment, stage 1 pre-trains the model (e.g., to predict the future frames that follow its actions), and stage 2 fine-tunes it with instructions to accomplish various tasks. When necessary, the fine-tuned model can be prompted to perform traditional visual recognition tasks.
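Read as pseudocode, the pipeline might look like the schematic sketch below (our interpretation of the proposal; every function here is a placeholder rather than an existing API):

```python
# Schematic sketch of the imagined pipeline (our reading; all functions are placeholders).

def stage1_pretrain(model, env, num_steps):
    """Pre-train by predicting the future frame that follows each action."""
    frame = env.reset()
    for _ in range(num_steps):
        action = model.explore(frame)                 # the agent chooses how to move or look
        next_frame = env.step(action)                 # the environment renders the result
        loss = model.predict_future_loss(frame, action, next_frame)
        loss.backward(); model.optimizer_step()
        frame = next_frame

def stage2_finetune(model, instruction_data):
    """Fine-tune with human instructions so that prompts map to desired behaviors."""
    for prompt, demonstration in instruction_data:
        loss = model.imitate_loss(prompt, demonstration)
        loss.backward(); model.optimizer_step()
```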

Comments on Research Directions

Finally, we look ahead to future research directions. As the primary goal shifts from performance on proxy tasks to learning from environments, many popular research directions may have to adjust their objectives. A disclaimer: all statements below are our personal opinions and could be wrong.

On Creating Environments

An obvious goal is to keep increasing the size, variety, and fidelity of virtual environments, and various techniques can help. For example, new 3D representations (e.g., neural radiance fields, NeRF) may offer a better trade-off between reconstruction quality and overhead. Another important direction is enriching the environments: defining new, complex tasks and unifying them into a prompt system is non-trivial. Furthermore, AI algorithms can benefit greatly from better modeling of other agents' behavior, as this increases the richness of the environment and thus improves the robustness of the trained algorithm.

On Generative Pre-training

Two main factors affect the pre-training stage: neural architecture design and proxy task design. The latter is clearly more important, and the former should follow from it. Existing pre-training tasks, including contrastive learning and masked image modeling, should be modified for efficient exploration in virtual environments. We expect newly designed proxy tasks to focus on data compression, since redundancy in visual data is much heavier than in linguistic data. The new proxy tasks in turn define the requirements for the neural architecture; for example, to balance data compression against visual recognition, the architecture should be able to extract visual features at different levels of granularity on request. Furthermore, cross-modal (e.g., text-to-image) generation will be a direct metric for measuring pre-training performance. Once a unified tokenization scheme is available, this can be formulated as a multimodal version of the reconstruction loss.
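As one possible adaptation (purely illustrative, not a method proposed in the paper), a masked-modeling objective could be applied to frames collected while exploring a virtual environment; encoder is a placeholder model that maps patch sequences to reconstructed patches, and the patch size and mask ratio are arbitrary choices.

```python
# Illustrative masked-image-modeling style objective on frames from a virtual environment.
import torch
import torch.nn.functional as F

def masked_frame_loss(encoder, frames, patch=16, mask_ratio=0.75):
    """Mask most patches of each frame and reconstruct them from the visible ones."""
    B, C, H, W = frames.shape
    patches = frames.unfold(2, patch, patch).unfold(3, patch, patch)   # split into patches
    patches = patches.contiguous().view(B, C, -1, patch, patch).transpose(1, 2)
    patches = patches.reshape(B, patches.size(1), -1)                  # (B, N, C*patch*patch)
    N = patches.size(1)
    mask = torch.rand(B, N, device=frames.device) < mask_ratio         # True = hidden patch
    pred = encoder(patches * (~mask).unsqueeze(-1))                    # reconstruct all patches
    return F.mse_loss(pred[mask], patches[mask])                       # loss on hidden patches only
```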

On Instruction Fine-tuning

We have not yet worked out how tasks should be defined in the new paradigm. Since real-world tasks can be very complex, we conjecture that a set of basic tasks should be defined and trained first, so that complex tasks can be decomposed into them. To this end, a unified prompt system should be designed and rich human instructions should be collected. As a rough guess, the amount of instruction data may be orders of magnitude larger than what was collected to train GPT and other chatbots. This is a whole new story for CV; the road ahead is full of unknown difficulties and uncertainties. We cannot see very far right now, but a clear path will emerge over time.


Origin blog.csdn.net/gzq0723/article/details/132094807