GILL: A Multimodal Model for Generation and Understanding, New Work from a Chinese Ph.D. Student at CMU



This article is from Xin Zhiyuan. Editor: Tao Zi

[Xin Zhiyuan Editor's Note] CMU's new multimodal model GILL can generate images, retrieve images, and conduct multimodal dialogue.

Recently, researchers from CMU proposed a new multimodal model GILL.


Paper address: https://arxiv.org/pdf/2305.17216.pdf

It can use text or images as prompts to complete multimodal conversations. Specifically, it can generate text, retrieve images, and generate new images.


GILL can even retrieve images from a pre-specified dataset and decide at inference time whether to retrieve or generate.

It is worth mentioning that, by learning a mapping between embedding spaces, the CMU team combined a frozen large language model with a pre-trained text-to-image generation model.

In this way, GILL supports a wide range of applications and outperforms generative models such as Stable Diffusion on several text-to-image tasks.

Let's take a look at a demo first.

Demo


GILL generalizes the capabilities of a frozen, pretrained LLM to many different tasks, including the following:


Demo address: https://huggingface.co/spaces/jykoh/gill

Multimodal Dialog Generation

You can prompt GILL with dialogue-like text, and it can perform image retrieval, image generation, and even multimodal dialogue.

For example, ask it how to make ramen more nutritious, and GILL suggests adding vegetables.

Say you want a tattoo, and GILL instantly generates a design that fits the request.

Ask how to advertise these cakes at a market, and GILL suggests a simple sign with the business name and a picture of a cupcake.


Generate images from visual stories

In addition, GILL can also generate more relevant images based on interleaved image and text inputs.


Multimodal Large Model GILL


GILL's full name is Generating Images with Large Language Models: using large language models to generate images.

It is capable of processing arbitrarily interleaved image and text inputs to generate text, retrieve images, and generate new images.


Overview of the GILL model architecture. The model is trained with a captioning loss to learn to process images (left), and with image retrieval and image generation losses to learn to produce images (right).
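For intuition, the image-retrieval part of this training can be written as a standard contrastive objective. The sketch below is illustrative rather than the authors' code: `img_token_emb` is assumed to be the projected [IMG]-token embedding produced by the LLM and `clip_img_emb` a frozen CLIP embedding of the paired image.

```python
# Hedged sketch of a contrastive retrieval loss of the kind the caption above
# refers to (illustrative, not the paper's exact implementation).
import torch
import torch.nn.functional as F

def retrieval_loss(img_token_emb, clip_img_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (text, image) pairs in a batch should score
    higher cosine similarity than all mismatched pairs."""
    t = F.normalize(img_token_emb, dim=-1)   # (B, d) projected [IMG] embeddings
    v = F.normalize(clip_img_emb, dim=-1)    # (B, d) frozen CLIP image features
    logits = t @ v.T / temperature           # (B, B) pairwise similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```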

The study shows that even though the two models use completely different text encoders, the output embedding space of the frozen text-only LLM can be effectively mapped into the embedding space of the frozen text-to-image generation model, namely Stable Diffusion.

In contrast to other methods that require interleaved image-text training data, the researchers achieve this by fine-tuning a small number of parameters on image-caption pairs.

This method is computationally efficient and does not require running the image generation model at training time.
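The image-generation side of this mapping can be summarized as a simple regression objective. The snippet below is a minimal sketch, not the authors' released code; it assumes `mapper` is the small trainable module, `img_hidden` holds the LLM's hidden states for the learned [IMG] tokens, and `sd_text_encoder` is a callable wrapping the frozen Stable Diffusion text encoder.

```python
# Minimal sketch of the embedding-space mapping objective (illustrative).
import torch
import torch.nn.functional as F

def generation_mapping_loss(mapper, img_hidden, captions, sd_text_encoder):
    """Regress the mapped LLM embeddings onto the frozen Stable Diffusion
    text-encoder embeddings of the same caption. Because the target is only a
    text embedding, the diffusion model itself never runs during training."""
    with torch.no_grad():
        target = sd_text_encoder(captions)   # (B, L_sd, d_sd), frozen target
    pred = mapper(img_hidden)                # (B, L_sd, d_sd), trainable mapper
    return F.mse_loss(pred, target)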


GILL's inference procedure. The model takes image and text inputs and produces text interleaved with image embeddings; after deciding whether to retrieve or generate for a given set of [IMG] tokens, it returns the appropriate image output.

During inference, the model receives arbitrarily interleaved image and text inputs and produces text interleaved with image embeddings. After deciding whether to retrieve or generate for a particular set of [IMG] tokens, it returns the corresponding image output (retrieved or generated).
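Conceptually, that decision step can be sketched as a small branch at inference time. Everything below is a hypothetical stand-in for the corresponding GILL components: `decision_head` is a small classifier over the [IMG] hidden states, `retriever` scores a fixed candidate set, `mapper` projects into the generator's conditioning space, and `sd_pipeline` is assumed to be a Stable Diffusion pipeline that accepts precomputed prompt embeddings.

```python
# Illustrative sketch of the retrieve-vs-generate branch (not the released API).
import torch

@torch.no_grad()
def produce_image(img_hidden, decision_head, retriever, mapper, sd_pipeline,
                  candidate_images):
    """Return either a retrieved image or a newly generated one for a single
    prompt, depending on which branch the decision module prefers."""
    p_generate = torch.sigmoid(decision_head(img_hidden.mean(dim=1)))  # (1, 1)
    if p_generate.item() > 0.5:
        cond = mapper(img_hidden)                  # map into the SD text space
        return sd_pipeline(prompt_embeds=cond).images[0]   # generate new image
    scores = retriever(img_hidden, candidate_images)       # score candidates
    return candidate_images[int(scores.argmax())]          # retrieve best match
```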

Experimental results


Context image generation

To test the model's capabilities against novel image generation baseline methods, the researchers conducted experiments on the VIST and VisDial datasets.

These datasets are the same ones used in previous studies to benchmark image retrieval in multimodal text and image context.

The GILL model combines multimodal information to produce relevant image and text output, outperforming baseline models limited to image retrieval.


Evaluation Metrics

The evaluation focuses on the ability of the generative model to handle complex language descriptions. Therefore, the researchers calculated metrics that measure the relevance of the content of the generated images.

Two metrics are used to evaluate the model (a code sketch of both follows the list):

1. CLIP similarity: the CLIP ViT-L image encoder is used to extract pooled representations of the generated image and the corresponding real image, and their cosine similarity is computed. A higher score indicates that the generated image is more similar to the real image.

2. Learned Perceptual Image Patch Similarity (LPIPS): LPIPS measures the distance between image patches and is computed between the real and generated images. Lower values indicate that the two images are closer in perceptual space, while higher values indicate they are less similar.
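As a rough illustration of both metrics, they can be computed with off-the-shelf libraries. The snippet below assumes the Hugging Face `transformers` CLIP ViT-L/14 checkpoint and the `lpips` package; it is not the authors' evaluation script.

```python
# Illustrative metric computation, assuming `transformers` and `lpips`.
import torch
import lpips
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
lpips_fn = lpips.LPIPS(net="alex")  # lower score = perceptually closer

@torch.no_grad()
def clip_similarity(generated_pil, real_pil):
    """Cosine similarity between pooled CLIP image features (higher is better)."""
    inputs = proc(images=[generated_pil, real_pil], return_tensors="pt")
    feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return (feats[0] @ feats[1]).item()

@torch.no_grad()
def lpips_distance(generated_t, real_t):
    """LPIPS between two (1, 3, H, W) tensors scaled to [-1, 1] (lower is better)."""
    return lpips_fn(generated_t, real_t).item()
```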

Generation from Visual Stories

VIST is a dataset for sequential vision-and-language tasks; each example contains a sequence of 5 images and accompanying text that together make up a story.

Evaluation results are shown, comparing GILL to text-to-image generation baselines.

When both models are fed a single story description, performance is comparable, with SD achieving a better CLIP similarity score and both models achieving similar LPIPS.

However, when all 5 story descriptions are provided as input, GILL outperforms SD, improving CLIP similarity from 0.598 to 0.612 and LPIPS from 0.704 to 0.6.

Interestingly, when further provided with the full multimodal context, GILL is significantly improved, achieving a CLIP similarity of 0.641 and an LPIPS of 0.3.


Generation from Visual Dialogue

The researchers also tested the model on the VisDial dataset.

Similar to VIST, the model is evaluated on its ability to accurately synthesize described images, given increasing context for question-answer dialogues as input.

The evaluation results show that SD outperforms GILL when the input length is short.

However, GILL gradually improves when the input context increases and can synthesize images that are more similar to real images.

When provided with the full 10 rounds of dialogue, GILL outperforms SD significantly, improving CLIP similarity from 0.622 to 0.645 and LPIPS from 0.723 to 0.714.

These results further highlight the effectiveness of GILL in handling long dialogue-like text inputs.


The researchers also introduced the GILLMapper module, which allows the model to be efficiently mapped to a Stable Diffusion image generation backbone, outperforming or matching SD in many examples from PartiPrompts.


The GILLMapper model architecture is conditioned on a hidden [IMG] representation and a learned sequence of query embedding vectors.
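To make the caption above concrete, here is a rough GILLMapper-style module in which learned query embeddings cross-attend to the [IMG] hidden states and are emitted as a Stable-Diffusion-sized conditioning sequence. Dimensions, depth, and head counts are illustrative guesses, not the paper's exact configuration.

```python
# Rough sketch of a GILLMapper-style module (illustrative hyperparameters).
import torch
import torch.nn as nn

class GILLMapperSketch(nn.Module):
    def __init__(self, llm_dim=4096, sd_dim=768, num_queries=77,
                 depth=4, num_heads=8):
        super().__init__()
        # Learned query embeddings that become the generator's conditioning.
        self.queries = nn.Parameter(torch.randn(num_queries, sd_dim))
        self.in_proj = nn.Linear(llm_dim, sd_dim)  # project [IMG] hidden states
        layer = nn.TransformerDecoderLayer(d_model=sd_dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=depth)

    def forward(self, img_hidden):                 # (B, k, llm_dim)
        memory = self.in_proj(img_hidden)          # (B, k, sd_dim)
        queries = self.queries.unsqueeze(0).expand(img_hidden.size(0), -1, -1)
        # Queries cross-attend to the [IMG] representations and are returned
        # as a sequence of conditioning embeddings for the image generator.
        return self.decoder(tgt=queries, memory=memory)  # (B, num_queries, sd_dim)
```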


Limitations

While GILL introduces many exciting features, it is an early research prototype with several limitations.

- Many of GILL's capabilities depend on the underlying LLM backbone, so it also inherits many of the problems typical of LLMs.

- GILL does not always produce an image when prompted, or when one would be useful for the dialogue.

- GILL's visual processing is limited: it currently uses only 4 visual vectors to represent each input image (due to computational constraints), which may not capture all the visual information required for downstream tasks.

- GILL inherits some undesirable behaviors of LLMs, such as hallucinating outputs that are incorrect or irrelevant to the input. It also sometimes generates repetitive text and does not always produce coherent dialogue.

About the Author


Jing Yu Koh

Jing Yu Koh is a second-year Ph.D. student in the Department of Machine Learning at CMU, supervised by Daniel Fried and Ruslan Salakhutdinov.

Currently, his main research focus is grounded language understanding, usually in the context of vision-and-language problems.

Before that, he was a research engineer at Google Research, where he worked on vision and language problems and generative models.


References:

https://www.cxs.cmu.edu/news/2023/gill

https://jykoh.com/gill


Origin blog.csdn.net/lgzlgz3102/article/details/132486273