VisorGPT: How to customize a controllable generative model based on GPT and AIGC models

Title: VisorGPT: Learning Visual Prior via Generative Pre-Training
Paper: https://arxiv.org/abs/2305.13777
Code: https://github.com/Sierkinhane/VisorGPT

Introduction

Controllable diffusion models such as ControlNet, T2I-Adapter, and GLIGEN can control the layout of the content in a generated image through additional spatial conditions such as human poses and object boxes. Using human poses and object boxes extracted from existing images, or annotations from existing datasets, as spatial conditions, these methods have achieved very good controllable image generation results.

So, how can spatial conditions be obtained in a more friendly and convenient way? In other words, how can the spatial conditions for controllable image generation be customized, e.g., the category, size, number, and representation (object boxes, keypoints, and instance masks) of the objects they contain?

In this paper, the shapes and positions of objects in the spatial conditions, and the relationships between them, are summarized as a visual prior (Visual Prior), and this prior is modeled with a Transformer Decoder in the style of Generative Pre-Training. We can then customize a Prompt to sample spatial conditions from the learned prior at multiple levels, such as representation (object boxes, keypoints, instance masks), object category, size, and number.

We envision that, as the generation ability of controllable diffusion models improves, images can be generated in a targeted way to supplement data for specific scenarios, such as human pose estimation and object detection in crowded scenes.

Motivation

An overview of the problem of visual prior (top) and VISORGPT (bottom)

First, look at the schematic diagram above:

(a): The concept of visual prior refers to elements such as the positions, shapes, and relationships of objects in a scene.

(b): Shows a failure case of image synthesis where the spatial conditions violate the prior: a "donut" should not be square in shape, nor should it float in the air rather than rest on the "dining table".

(c): Shows a successful case, where conditions sampled from VISORGPT lead to a more accurate synthesis result.

(d): Illustrates how VISORGPT learns visual priors by converting the visual world into a sequence corpus.

(e): Gives an example of how a user can customize the results sampled from VISORGPT.

Together, these panels clarify the research goal and method, as well as VISORGPT's learning of visual priors and its ability to support customized sampling.

Method

Table 1 Training data

To learn the visual prior in the style of Generative Pre-Training and to support customizable sequence outputs, this paper organizes and collects seven kinds of annotation data from currently public datasets, as shown in Table 1.

Here are the two Prompt templates proposed by the paper:
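As a rough, hypothetical illustration of how such a template might serialize an image's annotations into a single sequence (the field names and ordering below are illustrative, not the paper's exact format), consider this minimal Python sketch:

```python
# Hypothetical illustration only: serialize one image's box annotations into a
# prompt-style sequence. The actual template fields and ordering are defined in
# the VisorGPT paper and code and may differ from this sketch.
def annotations_to_sequence(task, size, instances):
    """instances: list of (category, (x0, y0, x1, y1)) tuples."""
    categories = ", ".join(cat for cat, _ in instances)
    coords = "; ".join(
        f"{cat} [{x0} {y0} {x1} {y1}]" for cat, (x0, y0, x1, y1) in instances
    )
    # Fields: annotation type, object size, instance count, categories, coordinates.
    return f"{task}; {size}; {len(instances)}; {categories}; {coords}"

seq = annotations_to_sequence(
    "object detection", "medium",
    [("person", (48, 32, 180, 410)), ("dog", (200, 260, 350, 420))],
)
print(seq)
```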

Using the above templates, the annotations of each image in the training data of Table 1 can be formatted into a sequence. During training, the BPE algorithm converts each sequence into tokens $\{u_1, u_2, \dots, u_n\}$, and the visual prior is learned by maximizing the likelihood as follows:
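Presumably this is the standard GPT-style autoregressive objective, written here in its usual notation (the paper's exact notation may differ slightly):

$$\mathcal{L} = \sum_{i} \log P\left(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta\right)$$

where $k$ is the size of the context window and $\Theta$ denotes the parameters of the Transformer Decoder.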

Finally, we can sample customized sequence outputs from the model trained in this way, as shown in the figure below:
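A minimal sketch of what such customized sampling could look like, assuming a GPT-2-compatible checkpoint and the Hugging Face transformers API (the checkpoint name and prompt below are placeholders; the official inference scripts in the linked repository differ in detail):

```python
# Sketch: prompt a GPT-2-style model with the fixed fields (task, size,
# instance count, categories) and let it complete the coordinates.
# Illustrative only; see the official VisorGPT repo for the real inference code.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")   # placeholder checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2")     # placeholder checkpoint

prompt = "object detection; medium; 3; person, person, dog;"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,   # sample rather than greedy decode to get diverse layouts
    top_k=50,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because decoding is sampled, repeated runs with the same prompt yield different layouts that still follow the learned prior, which is exactly what makes the sampled conditions useful for downstream controllable generation.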

Results

Summary

This article mainly introduced VISORGPT, a method for learning visual priors through generative pre-training. It uses sequential data and language-modeling techniques to learn prior knowledge about the relationships among object locations, shapes, and categories, and provides a way to sample customized outputs from the learned prior.


Source: blog.csdn.net/CVHub/article/details/131270614