InstructPix2Pix: Move Your Lips and Surpass PS

Abstract: InstructPix2Pix proposes a method for editing images with text: given an input image and a written instruction that tells the model what to do, the model follows the instruction to edit the image.

This article is shared from the HUAWEI CLOUD community post "InstructPix2Pix: Move your lips and surpass PS", author: Du Fu builds a house.

InstructPix2Pix: Learning to Follow Image Editing Instructions

arXiv | Code

InstructPix2Pix proposes a method for editing images with text: given an input image and an instruction telling the model what to do, the model follows the instruction to edit the image, for example:

We have released a notebook in ModelArts for everyone to try, and below we also briefly introduce how the model is implemented:

Method overview

In InstructPix2Pix, the authors take a supervised approach to text-guided image editing. The method consists of two main parts:

  1. Dataset generation: the authors combined the language model GPT-3 with the text-to-image model Stable Diffusion to generate a dataset for image editing;
  2. Model training: the authors used the generated dataset to train a conditional diffusion model for text-guided image editing:

At inference time, the image is edited in a single forward pass of the model, without any per-image fine-tuning, so inference is fast and the model is easy to play with.

Data preparation

To prepare the data, text editing instructions are first generated with GPT-3, and then Stable Diffusion together with Prompt-to-Prompt is used to generate the image pairs before and after editing.

Generating text editing instructions

The authors fine-tuned GPT-3 on 700 human-annotated text-editing triples, and then used the fine-tuned GPT-3 to generate triples at scale. As shown in the figure below, each triple contains (1) an input description; (2) an editing instruction; (3) an edited description.

When generating the dataset with GPT-3, only the input description needs to be provided; the editing instruction and the output description (highlighted in the figure) are both produced by GPT-3, which helps ensure the diversity of the instructions.
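As a rough illustration, this generation step can be sketched with the OpenAI completion API; the model id, prompt format, and parsing below are hypothetical stand-ins, not the authors' exact fine-tuning setup:

```python
# A minimal sketch of triple generation with a fine-tuned GPT-3 completion
# model. The model id and prompt/completion format are hypothetical.
import openai

def generate_edit_triple(input_caption: str) -> dict:
    # The fine-tuned model is prompted with an input caption and completes
    # it with an editing instruction followed by the edited caption.
    response = openai.Completion.create(
        model="YOUR_FINETUNED_GPT3_MODEL",  # hypothetical model id
        prompt=f"Caption: {input_caption}\nInstruction:",
        max_tokens=64,
        temperature=0.7,
        stop=["\n\n"],
    )
    text = response.choices[0].text
    instruction, _, edited_caption = text.partition("\nEdited caption:")
    return {
        "input": input_caption,            # (1) input description
        "edit": instruction.strip(),       # (2) editing instruction
        "output": edited_caption.strip(),  # (3) edited description
    }

# Example triple in the style of the paper:
# input:  "photograph of a girl riding a horse"
# edit:   "have her ride a dragon"
# output: "photograph of a girl riding a dragon"
```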

Generating image pairs

With the description pairs in hand, the authors use Stable Diffusion to turn the pre-edit and post-edit descriptions into image pairs. One challenge arises here: even when two prompts differ only slightly, the generated images are not guaranteed to have consistent content. For example, generating images from the prompts "photograph of a girl riding a horse" and "photograph of a girl riding a dragon" separately yields the following:

To solve this problem, the authors adopt the Prompt-to-Prompt method. During the diffusion process of the edited image, it injects the attention maps from the original image's generation, which keeps the content of the two generations consistent. Using this method with the same prompts yields the following:
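As a sketch of the pair-generation setup, assuming the Hugging Face diffusers StableDiffusionPipeline: the paper's pipeline additionally injects the original image's cross-attention maps (Prompt-to-Prompt) into the edited generation, while this simplified version only shares the initial noise, which by itself does not guarantee consistent content:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

before_prompt = "photograph of a girl riding a horse"
after_prompt = "photograph of a girl riding a dragon"

# Use the same initial latent noise for both generations.
generator = torch.Generator("cuda").manual_seed(0)
latents = torch.randn(
    (1, pipe.unet.config.in_channels, 64, 64),
    generator=generator, device="cuda", dtype=torch.float16,
)

image_before = pipe(before_prompt, latents=latents).images[0]
image_after = pipe(after_prompt, latents=latents.clone()).images[0]
```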

The authors generated a total of 454,445 samples, and the generated data can be downloaded via an open-source script.

Conditional diffusion model

InstructPix2Pix trains a conditional diffusion model to edit an image according to an input image and a text instruction. The model is initialized from pre-trained Stable Diffusion, and additional input channels are added to the first convolutional layer to take in the image condition. During training, the input image $x$ is encoded into a latent vector $z = \mathcal{E}(x)$, and noise is added to $z$ over the diffusion process to obtain a noisy latent $z_t$, where the noise level increases with the timestep $t$.
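A sketch of the channel expansion, assuming a diffusers UNet2DConditionModel (the authors' training code may differ in details); per the paper, weights on the new channels are zero-initialized so training starts from the pretrained text-to-image behavior:

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

old = unet.conv_in  # Conv2d(4, 320, ...): 4 noisy-latent channels
new = torch.nn.Conv2d(
    old.in_channels * 2,  # + 4 channels for the encoded input image
    old.out_channels,
    kernel_size=old.kernel_size,
    padding=old.padding,
)
with torch.no_grad():
    new.weight.zero_()                                   # new channels start at zero
    new.weight[:, : old.in_channels].copy_(old.weight)   # keep pretrained weights
    new.bias.copy_(old.bias)
unet.conv_in = new
unet.register_to_config(in_channels=new.in_channels)     # keep config consistent
```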

A network $\epsilon_\theta$ is trained to predict the noise added to $z_t$, given the image condition $c_I$ and the text instruction $c_T$, with the objective:

$$L = \mathbb{E}_{\mathcal{E}(x),\, \mathcal{E}(c_I),\, c_T,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_\theta\big(z_t, t, \mathcal{E}(c_I), c_T\big) \right\|_2^2 \right]$$
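In code, one training step of this objective might look like the following sketch; the names mirror the equation, while data loading, the VAE encoder, the text encoder, and the scheduler (e.g. a diffusers DDPMScheduler) are assumed to exist elsewhere:

```python
import torch
import torch.nn.functional as F

def training_step(unet, scheduler, z, c_img, text_emb):
    """One denoising training step of the objective above (sketch).

    z        : latents E(x) of the edited image
    c_img    : latents E(c_I) of the input (condition) image
    text_emb : encoded text instruction c_T
    """
    noise = torch.randn_like(z)                       # epsilon ~ N(0, 1)
    t = torch.randint(
        0, scheduler.config.num_train_timesteps, (z.shape[0],), device=z.device
    )
    z_t = scheduler.add_noise(z, noise, t)            # forward diffusion
    model_in = torch.cat([z_t, c_img], dim=1)         # extra condition channels
    noise_pred = unet(model_in, t, encoder_hidden_states=text_emb).sample
    return F.mse_loss(noise_pred, noise)              # ||eps - eps_theta||^2
```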

To balance the quality and diversity of the generated samples, the authors use classifier-free guidance in InstructPix2Pix. When generating an image from a condition, we want the output to correlate strongly with that condition; classifier-free guidance can be seen as scoring samples with an implicit classifier $p_\theta(c \mid z_t)$, where $z_t$ is the noisy latent of an encoded image and $c$ is the condition, which may be a text description, another image, or both. The classifier assigns each sample a score indicating how relevant it is to the given condition, and at sampling time the predicted noise is shifted toward data points the implicit classifier scores highly, improving the relevance of generated samples to the condition.

For the instruction-based editing task, the score network $\epsilon_\theta(z_t, c_I, c_T)$ has two conditions, the input image $c_I$ and the text instruction $c_T$, each with its own guidance scale, $s_I$ and $s_T$, which adjust how strongly the output follows the input image and the editing instruction respectively:

$$\tilde{\epsilon}_\theta(z_t, c_I, c_T) = \epsilon_\theta(z_t, \varnothing, \varnothing) + s_I \big( \epsilon_\theta(z_t, c_I, \varnothing) - \epsilon_\theta(z_t, \varnothing, \varnothing) \big) + s_T \big( \epsilon_\theta(z_t, c_I, c_T) - \epsilon_\theta(z_t, c_I, \varnothing) \big)$$
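A sketch of this two-scale combination at sampling time, assuming a UNet with the concatenated image-condition channels from above; the null conditions $\varnothing$ stand in for zeroed image latents and the empty-text embedding, and the three evaluations could be batched in practice:

```python
import torch

def guided_noise(unet, z_t, t, c_img, text_emb, null_img, null_text, s_I, s_T):
    # Unconditional, image-only, and image+text noise predictions.
    e_uncond = unet(torch.cat([z_t, null_img], dim=1), t,
                    encoder_hidden_states=null_text).sample
    e_img    = unet(torch.cat([z_t, c_img], dim=1), t,
                    encoder_hidden_states=null_text).sample
    e_full   = unet(torch.cat([z_t, c_img], dim=1), t,
                    encoder_hidden_states=text_emb).sample
    # Shift toward the image condition by s_I and the instruction by s_T.
    return e_uncond + s_I * (e_img - e_uncond) + s_T * (e_full - e_img)
```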

The authors show example results for different guidance scales:

More experimental results and a discussion of limitations can be viewed on the Project Page.

Case introduction

As mentioned above, to make it easier for everyone to try, we have released a one-click notebook in ModelArts. Besides removing the need for complicated environment setup, you can also use free GPU resources.

Since downloading the open-source model can be very slow, we have mirrored the resources to OBS for everyone:

In addition, the pre-trained model only supports English, so the notebook also bundles a ready-made translation model that turns Chinese instructions into English:

The case uses Gradio to build a small application. After the one-click setup, simply enter a picture and a Chinese editing instruction in the app to generate the result:
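The notebook's app is not reproduced here, but a minimal Gradio sketch of the same idea, assuming the diffusers StableDiffusionInstructPix2PixPipeline and a placeholder translate() helper in place of the actual translation model, looks like this:

```python
import gradio as gr
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

def translate(text_zh: str) -> str:
    # Placeholder for the Chinese-to-English translation model.
    return text_zh

def edit(image, instruction_zh):
    instruction_en = translate(instruction_zh)
    return pipe(instruction_en, image=image,
                num_inference_steps=20, image_guidance_scale=1.5).images[0]

gr.Interface(
    fn=edit,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Editing instruction (Chinese)")],
    outputs=gr.Image(),
).launch()
```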

Don't forget to stop the notebook instance after use:

