Peking University proposes DragonDiffusion, upgrading DragGAN with diffusion models for one-click drag-style image editing


 Xi Xiaoyao's Science and Technology Sharing
 Source | Qubit  
 Author | Ming Min

In the latest work from a Peking University team, a diffusion model can also be used for drag-style photo editing!

With a few clicks, you can make a snowy mountain grow taller.


Or let the sun rise.


This is DragonDiffusion, a joint effort by VILLA (Visual-Information Intelligent Learning LAB), the team led by Zhang Jian at Peking University, together with Tencent ARC Lab, relying on the Peking University Shenzhen Graduate School-Tuzhan Intelligent AIGC Joint Laboratory.

It can be understood as a variant of DragGAN.

DragGAN now has more than 30,000 GitHub Stars, and its underlying model is based on the GAN (generative adversarial network).


GANs, however, have long had shortcomings in generalization ability and image quality.

These happen to be the strengths of diffusion models.

So Zhang Jian's team extended the DragGAN paradigm to diffusion models.

When the work was released, it made Zhihu's trending list.


Some commenters said it solves the problem of locally incomplete regions in images generated by Stable Diffusion, and gives good control over redrawing.


Make a lion turn its head in a photo

DragonDiffusion's effects also include reshaping the front of a car.


Or gradually stretching a sofa longer.


Or manually slimming a face.


It can also replace objects in a photo, such as putting a donut into another image.


Or turning the lion's head.


The framework consists of two branches: a guidance branch and a generation branch.

First, the image to be edited is passed through the diffusion inverse process to find its representation in the diffusion latent space, which serves as the input to both branches.
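The "inverse process" here is typically DDIM inversion, which deterministically maps an image back to a noisy latent. A minimal generic sketch of one inversion step (not the authors' code; `eps_model` stands in for the UNet noise predictor and `alphas_cumprod` for the scheduler's cumulative alpha schedule):

```python
import torch

@torch.no_grad()
def ddim_inversion_step(x_t, t, t_next, eps_model, alphas_cumprod):
    """One deterministic DDIM step run in reverse (t_next > t), pushing the
    image's latent toward the noise level that would generate it."""
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    eps = eps_model(x_t, t)  # predicted noise at the current step
    # Estimate x_0 from x_t, then re-noise that estimate up to level t_next.
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
```

Iterating this step up to the chosen noise level yields the latent that both branches start from.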

The guidance branch reconstructs the original image and, during reconstruction, injects information from the original image into the generation branch.

The generation branch uses that guidance information to edit the original image while keeping the main content consistent with it.

Exploiting the strong correspondence among a diffusion model's intermediate features, DragonDiffusion maps the latent images of both branches into the feature domain through the same UNet denoiser at every diffusion iteration.
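Conceptually, both branches run through the same denoiser, and its intermediate decoder features are read out for comparison. A minimal sketch, assuming a hypothetical `unet` wrapper that can also return its decoder activations:

```python
def extract_features(unet, z_t, t, layer_ids=(2, 3)):
    """Run the shared UNet denoiser and collect intermediate decoder
    features; `unet` is assumed to return (noise_pred, {layer_id: feature})."""
    noise_pred, feats = unet(z_t, t, return_features=True)
    return noise_pred, [feats[i] for i in layer_ids]

# Both branches share the same weights, so their features live in the same
# space and can be compared directly:
#   _, feats_gud = extract_features(unet, z_gud_t, t)
#   _, feats_gen = extract_features(unet, z_gen_t, t)
```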

Two masks then mark the position of the dragged content in the original image and in the edited image, and the content is constrained to appear in the target region.

The paper measures the similarity between the two regions with cosine distance and normalizes it.

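The exact formulation is in the paper; a plausible reading in PyTorch (helper and variable names are mine, not the authors'): pool the features inside each mask, take the cosine similarity, and map it from [-1, 1] into [0, 1]:

```python
import torch
import torch.nn.functional as F

def masked_region_similarity(feat_gen, feat_gud, mask_gen, mask_gud):
    """Normalized cosine similarity between two masked feature regions.
    feat_*: (C, H, W) feature maps; mask_*: (H, W) binary masks."""
    v_gen = (feat_gen * mask_gen).sum(dim=(1, 2)) / mask_gen.sum().clamp(min=1)
    v_gud = (feat_gud * mask_gud).sum(dim=(1, 2)) / mask_gud.sum().clamp(min=1)
    cos = F.cosine_similarity(v_gen, v_gud, dim=0)
    return (cos + 1) / 2  # map [-1, 1] into [0, 1]
```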

Besides constraining how the edited content changes, the unedited regions must also stay consistent with the original image; this too is enforced through the similarity of corresponding regions, and the total loss function combines the two constraints.

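Continuing the same hypothetical sketch, the total guidance energy could combine an editing term (the content masked in the guidance branch should reappear in the target region of the generation branch) with a consistency term (the shared, unedited region should match across branches), balanced by two weights:

```python
def guidance_energy(feat_gen, feat_gud, m_src, m_tar, m_share,
                    w_edit=1.0, w_consist=1.0):
    """Total editing energy: low when the dragged content lands in the
    target region AND the unedited region stays close to the original."""
    edit = 1 - masked_region_similarity(feat_gen, feat_gud, m_tar, m_src)
    consist = 1 - masked_region_similarity(feat_gen, feat_gud, m_share, m_share)
    return w_edit * edit + w_consist * consist
```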

To inject the editing information, the paper treats the conditional diffusion process as a joint score function via score-based diffusion.


Based on the strong feature correspondence, the score function converts the editing signal into a gradient that updates the latent variables during the diffusion process.
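Mechanically, this is a classifier-guidance-style update: differentiate the energy with respect to the generation branch's latent and move the latent down the gradient before the next denoising step. A generic sketch (the step size and names are illustrative):

```python
import torch

def guided_update(z_gen_t, energy_fn, step_size=0.1):
    """Convert the editing energy into a gradient on the latent and take
    one descent step, as in classifier guidance."""
    z = z_gen_t.detach().requires_grad_(True)
    energy = energy_fn(z)  # recomputes UNet features and the loss from z
    grad, = torch.autograd.grad(energy, z)
    return (z - step_size * grad).detach()
```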

To account for both semantic and geometric alignment, the authors add a multi-scale guidance design on top of this strategy.

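In the same sketch, multi-scale guidance amounts to accumulating the energy over features from several decoder layers, so that coarse scales align semantics and finer scales align geometry (masks are assumed pre-resized to each feature resolution):

```python
def multiscale_energy(feats_gen, feats_gud, masks_per_scale,
                      weights=(0.5, 1.0)):
    """Sum the guidance energy over several UNet feature scales."""
    total = 0.0
    for w, fg, fd, masks in zip(weights, feats_gen, feats_gud, masks_per_scale):
        total = total + w * guidance_energy(fg, fd, *masks)
    return total
```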

In addition, to further ensure consistency between the editing result and the original image, DragonDiffusion includes a cross-branch self-attention mechanism.

Specifically, the Keys and Values from the guidance branch's self-attention modules replace the Keys and Values in the generation branch's self-attention modules, injecting reference information at the feature level.
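In code, the swap is simple: queries come from the generation branch as usual, while keys and values are computed from the guidance branch's activations at the same layer. A generic single-head sketch (not the authors' implementation):

```python
import torch

def cross_branch_attention(h_gen, h_gud, to_q, to_k, to_v):
    """Self-attention with Q from the generation branch and K, V taken
    from the guidance branch, injecting reference appearance features.
    h_*: (N, C) token sequences; to_q/to_k/to_v: linear projections."""
    q = to_q(h_gen)
    k, v = to_k(h_gud), to_v(h_gud)  # swapped in from the guidance branch
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v
```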

Ultimately, the proposed method, with its efficient design, provides multiple editing modes for both generated and real images.

These include moving objects within an image, resizing objects, replacing an object's appearance, and dragging image content.


In this approach, all content-editing and content-preservation signals come from the image itself, with no fine-tuning or training of additional modules, which simplifies the editing process.

In experiments, the researchers found that the network's first layer is too shallow to reconstruct images accurately, while reconstructing from the fourth layer goes too deep and also performs poorly; the second and third layers work best.


Compared with other methods, DragonDiffusion also delivers better removal results.


From Zhang Jian's team at Peking University and collaborators

The work was jointly produced by Zhang Jian's team at Peking University, Tencent ARC Lab, and the Peking University Shenzhen Graduate School-Tuzhan AIGC Joint Laboratory.

Zhang Jian's team previously led the development of T2I-Adapter, which precisely controls the content generated by diffusion models.

It has racked up more than 2k Stars on GitHub.


That technology has been officially adopted by Stability AI, the company behind Stable Diffusion, as the core control technology of the doodle-to-image tool Stable Doodle.


The AIGC joint laboratory, established by Tuzhan Intelligence and Peking University Shenzhen Graduate School, has recently achieved technological breakthroughs in image editing and generation, legal AI products, and other fields.

Just a few weeks ago, the Peking University-Tuzhan AIGC Joint Laboratory launched ChatLaw, a large language model product that topped Zhihu's trending list.


The joint laboratory will focus on CV-centered multimodal large models, continue developing ChatKnowledge, the large language model behind ChatLaw, and tackle hallucination resistance, private deployment, and data security in vertical fields such as law and finance.

Reportedly, the laboratory will also launch an original Stable Diffusion-style large model in the near future.

Paper address: https://arxiv.org/abs/2307.02421

Project homepage: https://mc-e.github.io/project/DragonDiffusion/

