Cones 2

Text-to-image generation has made remarkable progress over the past year. Customized generation methods such as DreamBooth have further demonstrated its potential and attracted wide attention from the community. Compared with single-concept generation, customizing multiple concepts within one image is more interesting and has a broad range of application scenarios (AI photo studios, AI comic generation, ...).

Despite the success of single-concept customization, existing multi-concept customization methods, such as Cones proposed by Alibaba and Custom Diffusion proposed by Adobe, still face two challenges:

  • First, they need to learn a separate model for each combination of concepts, which causes two problems: 1) existing models cannot be reused, e.g., a new combination containing three concepts {A, B, C} cannot borrow knowledge from an already customized model for {A, B} and must be retrained from scratch; 2) as the number of concepts to be customized grows, the consumption of computing resources grows exponentially.

  • Second, different customized concepts may interfere with each other, so that some concepts fail to appear in the final image or their attributes become confused. This is especially evident when the concepts are semantically similar (for example, customizing a cat and a dog at the same time may produce an image in which the customized cat takes on some of the dog's characteristics).

To address this, a research team from Alibaba and Ant Group proposed a composable multi-concept customization method, Cones 2, which can customize more objects at the same time while significantly improving the quality of the generated images.

Paper homepage: Cones 2

https://arxiv.org/abs/2305.19327

Project homepage: Cones-page

https://cones-page.github.io

The team's previous work, Cones, was accepted as an ICML 2023 oral and attracted a lot of attention on Twitter.

The advantages of Cones 2 are mainly reflected in three aspects:

  • It uses a simple and effective way to represent concepts, so that trained single concepts can be reused and combined arbitrarily; multiple customized concepts can thus be generated together without any retraining for the combination.

  • It uses the spatial layout as guidance, which is very easy to obtain in practice: users only need to provide a bounding box for each concept to control its specific location, which also alleviates attribute confusion between concepts.

  • It achieves satisfactory performance even in challenging scenarios, such as generating semantically similar customized concepts (e.g., customizing two dogs at the same time, whose glasses can even be exchanged), and composing as many as six concepts in one image.

Approach

1. Text-Guided Image Generation Based on Diffusion Model
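As background for this step, the snippet below shows plain text-guided generation with Stable Diffusion through the Hugging Face diffusers library (the checkpoint name and prompt are illustrative); Cones 2 builds its customization on top of this text-conditioned generation process.

```python
# Plain text-guided generation with Stable Diffusion (no customization yet).
# Assumes the Hugging Face `diffusers` library and a Stable Diffusion checkpoint.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The prompt is encoded by the text encoder; the resulting embeddings condition
# the denoising U-Net at every step of the reverse diffusion process.
image = pipe("a dog sitting on the beach", num_inference_steps=50).images[0]
image.save("dog_on_beach.png")
```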

2. Representing Concepts with Residual Token Embeddings

To customize and generate the specific concepts a user needs, the model must first "remember" the characteristics of those concepts. Since changing the parameters of a pre-trained model often degrades its generalization, Cones 2 instead learns an appropriate editing direction for each specific concept. Applying this direction to the text embedding of the concept's base class yields the customized result. This direction is called the residual token embedding.
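As a rough sketch of this idea (the tensor names and shapes below are illustrative, not the authors' code), applying the residual simply means adding the learned vector to the text embedding at the position of the base-class token before the embeddings are passed to the diffusion model:

```python
import torch

def apply_residual(prompt_embeddings: torch.Tensor,
                   token_position: int,
                   residual: torch.Tensor) -> torch.Tensor:
    """Shift the embedding of the base-class token by its learned residual."""
    edited = prompt_embeddings.clone()                   # (seq_len, dim)
    edited[token_position] = edited[token_position] + residual
    return edited

# Toy example with random tensors standing in for real CLIP text embeddings.
seq_len, dim = 77, 768
prompt_embeddings = torch.randn(seq_len, dim)   # encoder output for "a dog sitting on the beach"
residual = torch.randn(dim) * 0.01              # learned editing direction for the user's dog
customized = apply_residual(prompt_embeddings, token_position=2, residual=residual)
```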

For example, when Stable Diffusion generates "a dog sitting on the beach", the whole generation process is controlled by the text embeddings produced by passing the prompt through the text encoder; shifting only the embedding that corresponds to "dog" is therefore enough to make the model generate the customized "dog". To obtain the residual token embedding, the text encoder is first fine-tuned on the given data. During training, Cones 2 introduces a text-embedding preservation loss that constrains the output of the fine-tuned text encoder to stay as close as possible to the output of the original pre-trained text encoder.
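A minimal sketch of such a preservation constraint is shown below; the squared-error form and the masking details are assumptions for illustration rather than the paper's exact formulation:

```python
import torch

def embedding_preservation_loss(finetuned_emb: torch.Tensor,
                                original_emb: torch.Tensor,
                                concept_mask: torch.Tensor) -> torch.Tensor:
    """Keep non-concept token embeddings close to the frozen encoder's output.

    finetuned_emb, original_emb: (seq_len, dim) outputs of the two text encoders
    concept_mask: (seq_len,) bool, True at the position(s) of the category word ("dog")
    """
    keep = ~concept_mask                        # tokens that must stay unchanged
    diff = finetuned_emb[keep] - original_emb[keep]
    return diff.pow(2).mean()

# Toy usage with random tensors standing in for real encoder outputs.
seq_len, dim = 77, 768
finetuned_emb = torch.randn(seq_len, dim, requires_grad=True)
original_emb = torch.randn(seq_len, dim)
concept_mask = torch.zeros(seq_len, dtype=torch.bool)
concept_mask[2] = True                          # position of "dog" in the prompt
loss = embedding_preservation_loss(finetuned_emb, original_emb, concept_mask)
loss.backward()
```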

Referring to the example above, given "a dog sitting on the beach" as input, the embeddings output by the two text encoders should differ only at the category word corresponding to the customized concept ("dog"), while staying as consistent as possible for the other words ("beach", etc.). Combined with the original generative model, the fine-tuned text encoder is thus able to customize the specific concept. Because fine-tuning is constrained by the text-embedding preservation loss, this ability can be distilled into the required residual token embedding by averaging the difference between the fine-tuned and the original text encoder at the category-word position. A residual representation obtained this way is reusable and plug-and-play: when customizing multiple concepts, one only needs to add the corresponding residual to the text embedding of each concept's category word.
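The sketch below illustrates both steps with illustrative names (the encoder call signature is an assumption): extracting the residual as the average difference between the two encoders at the category-word position, and composing several concepts at inference by adding each residual at its own category token.

```python
import torch

@torch.no_grad()
def extract_residual(finetuned_encoder, original_encoder, prompts, concept_positions):
    """Average difference at the category-word position over a set of prompts."""
    diffs = []
    for prompt, pos in zip(prompts, concept_positions):
        ft = finetuned_encoder(prompt)        # (seq_len, dim), illustrative call signature
        base = original_encoder(prompt)
        diffs.append(ft[pos] - base[pos])
    return torch.stack(diffs).mean(dim=0)     # the reusable residual token embedding

def compose_concepts(prompt_embeddings: torch.Tensor, residuals: dict) -> torch.Tensor:
    """Add each concept's residual at its own category-token position.

    residuals: maps a token position -> residual vector, one entry per concept.
    """
    edited = prompt_embeddings.clone()
    for pos, res in residuals.items():
        edited[pos] = edited[pos] + res        # plug-and-play, no retraining needed
    return edited
```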

3. Guiding Multi-Concept Combination Generation Through Spatial Layout
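As summarized earlier, users only supply a bounding box per concept, and generation is steered so that each concept is placed inside its box, which also reduces attribute confusion between concepts. Below is a minimal sketch of one plausible way to realize such layout guidance by re-weighting the cross-attention between image locations and each concept's text token; the scaling scheme and names are illustrative assumptions, not the paper's exact rule.

```python
import torch

def layout_guided_attention(attn_probs: torch.Tensor,
                            token_to_box: dict,
                            strengthen: float = 1.5,
                            weaken: float = 0.5) -> torch.Tensor:
    """Re-weight cross-attention so each concept attends mainly inside its box.

    attn_probs: (height, width, seq_len) cross-attention map from image locations
                to text tokens.
    token_to_box: maps a concept's token index -> (top, left, bottom, right) box
                  in attention-map coordinates.
    """
    edited = attn_probs.clone()
    for token_idx, (top, left, bottom, right) in token_to_box.items():
        mask = torch.zeros(attn_probs.shape[:2], dtype=torch.bool)
        mask[top:bottom, left:right] = True
        edited[..., token_idx] = torch.where(
            mask,
            edited[..., token_idx] * strengthen,   # boost the concept inside its box
            edited[..., token_idx] * weaken,       # suppress it elsewhere to avoid confusion
        )
    # Renormalize so attention over tokens still sums to one at each location.
    return edited / edited.sum(dim=-1, keepdim=True)
```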
Experiments

Compared with existing methods, Cones 2 achieves clear improvements in both the computational cost of training and the quality of the generated results, and it performs notably better when generating a larger number of concepts or semantically similar objects.

Application prospects

Beyond producing higher-quality and richer images, multi-concept customized generation has broad application prospects. The currently popular ControlNet mainly controls the structure of generated images, whereas multi-concept customized generation controls their content. It makes text-to-image generation more controllable and further increases the application value of text-to-image models. For example, a creator can generate multi-panel comics by providing text together with several customized character concepts, or customize several clothing concepts (caps, etc.) to offer a variety of virtual try-on experiences.

Origin blog.csdn.net/qq_29788741/article/details/132210785