Controllable text-to-image generation | Composable multi-concept customized generation, new research from Alibaba

This article is adapted from PaperWeekly.

Text-to-image generation has made significant progress over the past year. Customized generation in the style of DreamBooth has further demonstrated the potential of text-to-image models and attracted wide attention from the community. Compared with single-concept generation, customizing multiple concepts within one image is more interesting and has a wide range of application scenarios (AI photo studios, AI comic generation, and so on).

Despite the success of single-concept customization, existing multi-concept customization methods such as Cones (proposed by Alibaba) and Custom Diffusion (proposed by Adobe) still face two challenges:

  • First, they need to learn a separate model for each combination of concepts, which has two drawbacks: 1) existing models cannot be reused; for example, customizing a new combination of three concepts {A, B, C} cannot inherit knowledge from an already customized model for {A, B} and must be retrained from scratch; 2) as the number of concepts to customize grows, the consumption of computing resources grows exponentially.

  • Second, different customized concepts may interfere with each other, so that some concepts fail to appear in the final image or their attributes become confused. This is especially evident when the concepts are semantically similar (for example, customizing a cat and a dog at the same time may produce an image in which the customized cat takes on some of the dog's characteristics).

To address this, researchers from Alibaba and Ant Group proposed Cones 2, a composable multi-concept customization method that can customize more objects at the same time while significantly improving the quality of the generated images.

Paper homepage: Cones 2

https://arxiv.org/abs/2305.19327

Project homepage: Cones-page

https://cones-page.github.io

The team's previous work, Cones, was accepted as an ICML 2023 oral presentation and attracted considerable attention on Twitter.

The advantages of Cones 2 are mainly reflected in three aspects. (1) It uses a simple and effective representation for each concept that can be combined arbitrarily: already-trained single-concept representations can be reused, so multiple customized concepts can be generated together without any retraining on the combination. (2) It uses spatial layout as guidance, which is easy to obtain in practice: the user only needs to provide a bounding box to control where each concept appears, which also alleviates attribute confusion between concepts. (3) It achieves satisfactory performance even in challenging scenarios, such as generating semantically similar customized concepts (for example, two customized dogs whose glasses can be swapped) and composing as many as six concepts in one image.

Method

1. Text-guided image generation based on diffusion models

The diffusion model learns to restore real visual content by gradually denoising from normally distributed noise, which amounts to simulating a reversed Markov chain of length T = 1000. In the text-to-image task, the training objective of the conditional diffusion model can be reduced to a reconstruction loss:

$\mathcal{L} = \mathbb{E}_{z,\, c,\, \epsilon \sim \mathcal{N}(0,1),\, t}\big[\, \lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2 \,\big]$

where $z$ is the image latent, $c$ is the text condition, $t$ is the timestep, and $\epsilon_\theta$ is the denoising network.

Text embeddings are injected into the model via a cross-attention mechanism. At inference time, images are generated by iteratively denoising from random noise.
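
As a rough illustration (not the authors' code), this loss can be written in a few lines of PyTorch, assuming a diffusers-style U-Net and noise scheduler; all names here are placeholders.

```python
import torch
import torch.nn.functional as F

def diffusion_reconstruction_loss(unet, noise_scheduler, latents, text_embeddings):
    """Standard noise-prediction loss for a text-conditioned diffusion model.

    latents:         clean image latents z_0, shape (B, C, H, W)
    text_embeddings: per-token text-encoder outputs c, injected via cross-attention
    """
    # Sample Gaussian noise and a random timestep t for each example.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps,
        (latents.shape[0],), device=latents.device,
    )
    # Forward process: produce the noisy latent z_t.
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # The U-Net predicts the added noise, conditioned on the text embeddings.
    noise_pred = unet(noisy_latents, timesteps,
                      encoder_hidden_states=text_embeddings).sample

    # Reconstruction loss: mean squared error between true and predicted noise.
    return F.mse_loss(noise_pred, noise)
```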

2. Representing concepts with residual text embeddings

To customize the specific concepts a user needs, the model must first "remember" the characteristics of those concepts. Since changing the parameters of a pre-trained model often degrades its generalization, Cones 2 instead learns an appropriate editing direction for each specific concept. Applying this direction to the text embedding of the concept's base category yields the customized result. This direction is called the residual token embedding.

For example, when Stable Diffusion generates an image for "a dog sitting on the beach", the whole generation process is controlled by the text embedding produced by the text encoder, so it is only necessary to offset the embedding of "dog" appropriately to make the model generate the customized "dog". To obtain the residual token embedding, the text encoder is first fine-tuned on the given data. During training, Cones 2 introduces a text-embedding preservation loss that keeps the output of the fine-tuned text encoder as close as possible to the output of the original pre-trained text encoder.

Continuing with the example above, given "a dog sitting on the beach" as input, the embeddings produced by the two text encoders should differ only at the category word ("dog") corresponding to the customized concept, while staying as consistent as possible for the other words ("beach", and so on). Combined with the original generative model, the fine-tuned text encoder thus gains the ability to customize the specific concept. Because fine-tuning is constrained by the text-embedding preservation loss, this ability can be distilled into the required residual token embedding by averaging the difference between the fine-tuned and the original text-encoder outputs at the category word:

[Equation: the residual token embedding, computed as the average over training prompts of the difference between the fine-tuned and the original text-encoder outputs at the category-word token.]
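
A minimal sketch of how such a residual might be extracted, assuming a Hugging Face CLIP-style text encoder and a fine-tuned copy of it; the helper below and the way it locates the category token are illustrative, not the paper's actual implementation.

```python
import torch

@torch.no_grad()
def compute_residual_token_embedding(tokenizer, original_encoder, finetuned_encoder,
                                     prompts, category_word="dog"):
    """Average the difference between fine-tuned and original text-encoder outputs
    at the category-word position, over a set of prompts containing that word."""
    # Token id of the bare category word (first sub-token if it splits).
    category_id = tokenizer(category_word, add_special_tokens=False).input_ids[0]

    diffs = []
    for prompt in prompts:
        tokens = tokenizer(prompt, return_tensors="pt", padding="max_length",
                           truncation=True)
        # Position of the category word inside this prompt.
        position = (tokens.input_ids[0] == category_id).nonzero()[0].item()

        original = original_encoder(**tokens).last_hidden_state[0, position]
        finetuned = finetuned_encoder(**tokens).last_hidden_state[0, position]
        diffs.append(finetuned - original)

    # The residual token embedding is the average editing direction.
    return torch.stack(diffs).mean(dim=0)
```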

The residual representation obtained in this way is reusable and plug-and-play: for multi-concept customization, one only needs to add each concept's residual to the text embedding of its corresponding category word.
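
At inference time, the stored residuals could then be applied roughly as follows (again an illustrative sketch): each residual is simply added to the prompt embedding at its category-word position before the embeddings are passed to the U-Net.

```python
def apply_residuals(text_embeddings, token_ids, residuals):
    """Add each concept's residual to the embedding of its category-word token.

    text_embeddings: (1, seq_len, dim) output of the original text encoder
    token_ids:       (seq_len,) token ids of the prompt
    residuals:       dict mapping a category token id -> residual embedding (dim,)
    """
    edited = text_embeddings.clone()
    for category_id, residual in residuals.items():
        for pos in (token_ids == category_id).nonzero().flatten():
            edited[0, pos] += residual  # e.g. shift "dog" toward the customized dog
    return edited
```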

3. Guiding multi-concept generation with spatial layout

The cross-attention maps in the model's cross-attention layers directly determine the spatial layout of the generated image. One problem with multi-concept customized generation is that some concepts may not appear at all. To avoid this, Cones 2 strengthens the activation of each target concept in the region where it is expected to appear, which the user specifies with a bounding box. Another problem is attribute confusion between concepts: a concept in the generated image may take on features of other concepts.

To avoid this, it is desirable to weaken each concept's activation outside its user-specified region. Combining these two ideas, Cones 2 guides the generation process with a predefined layout. In practice, the layout is defined as a set of bounding boxes, one guidance map per concept: the map is positive in the region where the concept should appear and negative in regions irrelevant to the concept, and it is used to edit the cross-attention maps.
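
A minimal sketch of this kind of layout-guided attention editing, assuming a per-concept guidance map that is positive inside the user's bounding box and negative elsewhere; the exact weighting strategy in Cones 2 differs, so treat the code below as illustrative only.

```python
import torch

def make_guidance_map(height, width, box, pos=1.0, neg=-1.0):
    """Guidance map for one concept at the attention resolution.
    `box` = (top, left, bottom, right) in attention-map coordinates."""
    g = torch.full((height, width), neg)
    top, left, bottom, right = box
    g[top:bottom, left:right] = pos
    return g

def edit_cross_attention(attn_scores, guidance_maps, token_positions, strength=1.0):
    """Bias the raw cross-attention scores before the softmax.

    attn_scores:     (heads, H*W, seq_len) query-key scores
    guidance_maps:   dict concept_name -> (H, W) map from make_guidance_map
    token_positions: dict concept_name -> index of that concept's category token
    """
    edited = attn_scores.clone()
    for name, gmap in guidance_maps.items():
        bias = gmap.flatten().to(attn_scores)  # (H*W,)
        # Strengthen the concept inside its box, weaken it everywhere else.
        edited[:, :, token_positions[name]] += strength * bias
    return edited
```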

Experiments

Compared with existing methods, Cones 2 shows clear improvements in both the computational cost of training and the quality of the generated results.

It also performs better when generating a larger number of concepts and when handling semantically similar objects.

Application prospects

Beyond producing higher-quality and richer images, multi-concept customized generation has broad application prospects. The popular ControlNet mainly controls the structure of generated images, whereas multi-concept customization controls their content. This makes text-to-image generation more controllable and further increases the application value of text-to-image models. For example, a creator could generate multi-frame comics from text plus several customized character concepts, or combine several customized clothing concepts (tops, caps, etc.) to enable a variety of virtual try-on experiences.

Source: https://blog.csdn.net/lgzlgz3102/article/details/132222454