Machine Heart report. Editor: Xiaozhou
The capabilities of LLMs can also be extended to more subfields of machine learning.
Large language models (LLMs) have set off a wave of change in natural language processing (NLP). LLMs show strong emergent abilities and perform well on complex language understanding, generation, and even reasoning tasks. This has inspired researchers to explore their potential in another subfield of machine learning: computer vision (CV).
One of the great talents of LLMs is in-context learning: without updating any of the model's parameters, it achieves impressive results on a wide range of NLP tasks simply by being shown examples in its prompt. So, can an LLM such as GPT solve visual tasks through in-context learning?
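To make the idea concrete, here is a minimal sketch of how an in-context (few-shot) prompt is assembled: the model's weights stay frozen, and the "learning" happens entirely through labeled demonstrations placed before the query. The task, labels, and prompt template below are illustrative, not taken from the paper.

```python
# Minimal sketch of an in-context (few-shot) prompt. No parameters are
# updated; the demonstrations in the prompt carry all task information.
def build_icl_prompt(examples, query):
    """Concatenate labeled demonstrations, then append the unlabeled query."""
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)

# Hypothetical sentiment-classification demonstrations.
demos = [
    ("the movie was wonderful", "positive"),
    ("a dull, plodding mess", "negative"),
]
prompt = build_icl_prompt(demos, "an instant classic")
print(prompt)
```

The resulting string would be sent to a frozen LLM, which is expected to continue the final `Output:` line; SPAE applies the same mechanism, but with images encoded as word tokens in place of plain text.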
A recent paper by researchers from Google and Carnegie Mellon University (CMU) shows that this seems to be feasible, as long as we can translate images (or other non-linguistic modalities) into language that LLMs can understand.
Paper address: https://arxiv.org/abs/2306.17842
This paper demonstrates the ability of PaLM or GPT to solve visual tasks through in-context learning, and proposes a new method, SPAE (Semantic Pyramid AutoEncoder). SPAE enables LLMs to perform image-generation tasks without any parameter updates, and it is the first method to successfully enable an LLM to generate image content via in-context learning.
Let's first look at how well an LLM generates image content through in-context learning.
For example, given 50 handwritten images in the context, the paper asks PaLM 2 to answer a complex query that requires generating a digit image as output:
It can also generate photorealistic images when given real images as context:
Beyond generating images, PaLM 2 can caption images through in-context learning:
It can answer visual questions about images:
It can even denoise videos:
Method overview
In fact, converting images into a representation that models can process goes back to the Vision Transformer (ViT) paper. In this work, researchers from Google and CMU take the idea a step further: they represent images with actual words.
This approach is like building a tower filled with words, capturing both the semantics and the details of an image. Such a text-based representation makes it easy to generate image descriptions, and it lets LLMs answer questions about images and even reconstruct image pixels.
Specifically, the study proposes converting an image into a token space using a trained encoder together with a CLIP model; the LLM then generates suitable lexical tokens; finally, a trained decoder converts those tokens back into pixel space. This process turns images into a language that LLMs can understand, allowing us to exploit the generative power of LLMs on vision tasks.
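The key step in this pipeline is quantizing image features against a vocabulary of real words (via CLIP-style text embeddings), so that the tokens an LLM sees are words rather than opaque codes. The sketch below illustrates only that quantization idea with toy three-dimensional embeddings and a three-word vocabulary; the actual method uses a trained encoder/decoder and real CLIP embeddings over a full vocabulary.

```python
# Hedged sketch of SPAE-style quantization: map each image feature vector
# to its nearest *word* in a vocabulary of text embeddings. All vectors
# and words here are toy values chosen for illustration.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy stand-in for CLIP text embeddings (assumption: 3-dim for clarity).
vocab = {
    "dog":   [0.9, 0.1, 0.0],
    "grass": [0.1, 0.9, 0.2],
    "sky":   [0.0, 0.2, 0.9],
}

def quantize_to_words(features):
    """Replace each feature vector with the most similar vocabulary word."""
    return [max(vocab, key=lambda w: cosine(vocab[w], f)) for f in features]

# Pretend an encoder produced two patch features for a photo of a dog on grass.
image_features = [[0.8, 0.2, 0.1], [0.2, 0.8, 0.1]]
print(quantize_to_words(image_features))  # -> ['dog', 'grass']
```

In SPAE these word tokens are arranged in a pyramid, with a few tokens at coarse levels carrying semantics and more tokens at fine levels carrying appearance detail, and the decoder inverts the mapping back to pixels.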
Experiment and Results
The study compared SPAE experimentally with the SOTA methods Frozen and LQAE; the results are shown in Table 1 below. SPAE with GPT outperforms LQAE on all tasks while using only 2% as many tokens.
Overall, tests on the mini-ImageNet benchmark show that SPAE improves performance by 25% over the previous SOTA method.
To verify the effectiveness of the SPAE design, the study also conducted ablation experiments; the results are shown in Table 4 and Figure 10 below:
Interested readers can refer to the original paper for more details of the research.