Extending large language models to solve visual tasks through in-context learning

Machine Heart editor: Xiaozhou

The capabilities of LLMs can also be extended to more subfields of machine learning.

Currently, large language models (LLMs) have set off a wave of change in the field of natural language processing (NLP). We have seen that LLMs exhibit strong emergent abilities and perform well on complex language understanding, generation, and even reasoning tasks. This has inspired people to further explore the potential of LLMs in another subfield of machine learning: computer vision (CV).

One of the great talents of LLMs is their ability to learn in context. In-context learning updates none of the LLM's parameters, yet it delivers impressive results on a wide range of NLP tasks. So, can GPT solve visual tasks through in-context learning?
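To ground the term, here is a minimal sketch of in-context (few-shot) learning: the "training" examples live entirely in the prompt, and a frozen model is expected to continue the pattern. The helper name and the placeholder completion call mentioned in the comments are illustrative only, not part of the paper.

```python
# Minimal sketch of in-context learning: demonstrations are placed in the
# prompt, and no model parameters are updated.

def build_few_shot_prompt(demonstrations, query):
    """Concatenate input/output demonstrations, then append the new query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in demonstrations]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Two antonym demonstrations followed by a new query.
demos = [("cold", "hot"), ("small", "large")]
prompt = build_few_shot_prompt(demos, "fast")

# Passing `prompt` to any frozen LLM completion call would be expected to
# continue the pattern with something like "slow" -- no fine-tuning involved.
print(prompt)
```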

A recent paper by researchers from Google and Carnegie Mellon University (CMU) shows that this seems to be feasible, as long as we can translate images (or other non-linguistic modalities) into language that LLMs can understand.


Paper address: https://arxiv.org/abs/2306.17842

The paper demonstrates the ability of PaLM or GPT to solve visual tasks through in-context learning, and proposes a new method, SPAE (Semantic Pyramid AutoEncoder). The approach enables LLMs to perform image generation tasks without any parameter updates, and it is the first successful approach that lets an LLM generate image content via in-context learning.

Let's first look at how well an LLM can generate image content through in-context learning.

For example, given 50 handwritten images as context, the paper asks PaLM 2 to answer a complex query that requires generating a digit image as output:

[Figure: digit images generated by PaLM 2 from the in-context examples]

It is also possible to generate photorealistic images given an image context input:

[Figure: photorealistic images generated from image context]

In addition to generating images, PaLM 2 can also perform image captioning through in-context learning:

[Figure: image captioning examples]

It can also answer visual questions about images:

[Figure: visual question answering examples]

It is even possible to denoise generated video:

[Animation: video denoising example]

Method overview

In fact, converting images into a language that LLMs can understand is a problem already studied in the Vision Transformer (ViT) work. In this paper, the researchers from Google and CMU take it a step further: they use actual words to represent images.

The approach is like building a tower of words that captures both the semantics and the details of an image. With this text-based representation, it is easy to generate image descriptions, and an LLM can answer questions about images and even reconstruct image pixels.

[Figure: an image represented as a pyramid of words]
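To make the "tower of words" idea concrete, the toy structure below represents one image as a pyramid of word tokens, with a single concept at the top and increasingly dense detail words below. The layer sizes and the words themselves are invented for illustration; they are not tokens produced by SPAE.

```python
# Toy "semantic pyramid" for one image (illustrative only): coarse layers hold
# global semantics, finer layers hold appearance details. Flattening it from
# top to bottom yields a word sequence that a frozen LLM can read.
pyramid = [
    ["dog"],                                   # 1x1 layer: global concept
    ["brown", "dog", "grass", "sunny"],        # 2x2 layer: coarse semantics
    ["ear", "fur", "eye", "tail",              # 4x4 layer, flattened row by
     "snout", "fur", "fur", "grass",           # row: local appearance details
     "tongue", "collar", "paw", "grass",
     "shadow", "grass", "grass", "grass"],
]

def flatten_pyramid(layers):
    """Serialize the pyramid coarse-to-fine into one token sequence."""
    return [token for layer in layers for token in layer]

print(" ".join(flatten_pyramid(pyramid)))
```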

Specifically, the study proposes to convert an image into a token space using a trained encoder and a CLIP model; the LLM is then used to generate suitable lexical tokens; and finally a trained decoder converts these tokens back to pixel space. This ingenious process turns images into a language that LLMs can understand, allowing us to exploit the generative power of LLMs for vision tasks.

[Figure: overview of the SPAE method]
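Putting the pieces together, here is a hedged end-to-end sketch of that encode / prompt / decode loop. The callables `encoder`, `decoder`, and `llm_complete`, as well as the vocabulary-embedding lookup, are placeholders introduced for illustration; they are not the authors' released API.

```python
import numpy as np

# Hedged sketch of an SPAE-style pipeline: the LLM stays frozen; only the
# encoder and decoder that translate between pixels and word tokens are trained.

def nearest_words(latents, vocab_embeddings):
    """Snap each latent vector to the closest word in the LLM vocabulary,
    measured by dot-product similarity in a CLIP-style embedding space."""
    words = list(vocab_embeddings.keys())
    matrix = np.stack([vocab_embeddings[w] for w in words])   # (V, d)
    ids = np.argmax(latents @ matrix.T, axis=-1)              # (N,)
    return [words[i] for i in ids]

def solve_vision_task(query_image, demonstrations,
                      encoder, decoder, llm_complete, vocab_embeddings):
    # 1. Translate every image (demonstrations and query) into word tokens.
    prompt = ""
    for img, answer in demonstrations:
        tokens = nearest_words(encoder(img), vocab_embeddings)
        prompt += f"Image: {' '.join(tokens)}\nAnswer: {answer}\n\n"
    query_tokens = nearest_words(encoder(query_image), vocab_embeddings)
    prompt += f"Image: {' '.join(query_tokens)}\nAnswer:"

    # 2. The frozen LLM continues the prompt purely via in-context learning.
    generated = llm_complete(prompt)

    # 3. If the expected answer is itself an image, decode the generated word
    #    tokens back to pixel space with the trained decoder.
    return decoder(generated.split())
```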

Experiments and Results

The study experimentally compared SPAE with the state-of-the-art methods Frozen and LQAE; the results are shown in Table 1 below. SPAE paired with GPT outperforms LQAE on all tasks while using only 2% of the tokens.

[Table 1: comparison of SPAE with Frozen and LQAE]

Overall, testing on the mini-ImageNet benchmark shows that the SPAE method improves performance by 25% over the previous SOTA method.

[Figure: results on the mini-ImageNet benchmark]

To verify the effectiveness of the SPAE design, the study carried out ablation experiments; the results are shown in Table 4 and Figure 10 below:

[Table 4 and Figure 10: ablation results]

Interested readers can refer to the original paper for more details on the research.

Source: blog.csdn.net/lgzlgz3102/article/details/131672061