AIGC Cartoonization: A Technology Practice

A quick plug for the FaceChain open-source AI portrait project:

       The latest FaceChain supports group photos and hundreds of single-portrait styles. Project information summary: ModelScope community.
       GitHub open-source repository (if you find it interesting, please give it a star): https://github.com/modelscope/facechain

Abstract: With the ongoing wave of AIGC, more and more AI-generation applications are being defined and proposed by enthusiasts. Image cartoonization (anime-style rendering) is especially popular thanks to its high fidelity to the original image and its rich variety of styles. A few years ago, with the rise of GANs, cartoonization was all the rage; now, with the rise and continuous development of AIGC technology, diffusion generative models offer even more creativity and generative possibilities for cartoon styles. This article introduces in detail the cartoonization technology practice of the Alibaba Open Vision Team.

GAN or large model?

      Most current cartoonization technologies are based on either generative adversarial networks (GANs) or diffusion models, and each has its own advantages and disadvantages for the cartoon generation task. The biggest advantages of GANs are speed, efficiency, and fidelity to the original image: millisecond-level rendering times can meet the needs of most on-device users, and the generation results are relatively stable and controllable. For a summary of related techniques, see https://developer.aliyun.com/article/1308610. The diffusion model, thanks to its huge parameter scale and amount of training data, has stronger generation and redrawing capabilities, higher image quality, richer detail, and better scalability across scenes and content. The rest of this article focuses on cartoonization practice in the large-model setting.

Use large amounts of data to fine-tune a cartoon base model

Diffusion generative models are usually trained on a broad range of image data, enabling them to generate images of many styles and types, such as realistic portraits and landscapes. We want the cartoonization base model to focus more specifically on generating various cartoon styles. Fine-tuning the diffusion model on carefully curated cartoon-domain data therefore not only improves how faithfully the cartoon effect matches the target style, but also yields more polished cartoon images. Through fine-tuning, the diffusion model learns a deeper understanding of the domain data and can better reproduce the details and characteristics of a cartoon style. For character cartoonization, training on the characteristics of cartoon characters in the domain data helps preserve and highlight each character's personality and expressions. For scene cartoonization, details such as colors, lines, and textures can be improved based on cartoon scene details in the domain data, making the cartoonization effect better match the target picture. After fine-tuning on a large amount of data, we obtain a base image diffusion model that generates more suitable cartoon images, laying the foundation for customizing exclusive styles and preserving the identity in the original image, and thus delivering a more convincing cartoon experience.
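As a concrete illustration, below is a minimal sketch of the standard fine-tuning loop for a latent diffusion model on cartoon-domain image-caption data, written with the Hugging Face diffusers library. The base checkpoint name and the local "cartoon_data" image folder (with a caption metadata file) are placeholders, not the actual data or model used by the team.

```python
# Minimal sketch of fine-tuning a diffusion base model on cartoon-domain data.
# Assumptions: "runwayml/stable-diffusion-v1-5" and the local "cartoon_data"
# folder (imagefolder layout with a metadata.jsonl "text" caption column) are
# placeholders for the real setup.
import torch
import torch.nn.functional as F
from datasets import load_dataset
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer
from torchvision import transforms

base = "runwayml/stable-diffusion-v1-5"          # placeholder base checkpoint
tokenizer = CLIPTokenizer.from_pretrained(base, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder").cuda()
vae = AutoencoderKL.from_pretrained(base, subfolder="vae").cuda()
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet").cuda()
noise_scheduler = DDPMScheduler.from_pretrained(base, subfolder="scheduler")

vae.requires_grad_(False)                        # only the UNet is updated here
text_encoder.requires_grad_(False)
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

preprocess = transforms.Compose([
    transforms.Resize(512), transforms.CenterCrop(512),
    transforms.ToTensor(), transforms.Normalize([0.5], [0.5]),
])
dataset = load_dataset("imagefolder", data_dir="cartoon_data", split="train")  # hypothetical dataset

for example in dataset:
    pixels = preprocess(example["image"].convert("RGB")).unsqueeze(0).cuda()
    ids = tokenizer(example["text"], padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length, return_tensors="pt").input_ids.cuda()

    # Encode to latents, add noise at a random timestep, and predict that noise.
    latents = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,), device=latents.device)
    noisy = noise_scheduler.add_noise(latents, noise, t)
    cond = text_encoder(ids)[0]

    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    loss = F.mse_loss(pred, noise)               # standard epsilon-prediction objective
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```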

Use a small amount of data to define your own cartoon style

In cartoonization practice, in order to highlight the distinctiveness of each style and the differences between styles, it is often necessary to customize a dedicated model for each style, yet fully training or fine-tuning a diffusion model typically requires tens of thousands of samples and substantial GPU resources. Training a LoRA model is therefore the natural choice [1]. LoRA training only needs dozens to hundreds of images in the same style as training data, and only the CLIP text encoder and the cross-attention layers in the UNet are fine-tuned, which greatly reduces the training cost.
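For reference, the snippet below is a minimal sketch of how LoRA adapters can be attached to exactly those modules with the peft integration in diffusers; the base checkpoint, rank, and target module names follow the Stable Diffusion 1.x layout and are illustrative assumptions rather than the project's exact configuration.

```python
# Minimal sketch: inject LoRA adapters into the UNet attention projections and the
# CLIP text encoder, leaving all original weights frozen. Model name and rank are
# illustrative assumptions.
from diffusers import UNet2DConditionModel
from transformers import CLIPTextModel
from peft import LoraConfig

base = "runwayml/stable-diffusion-v1-5"          # placeholder base checkpoint
unet = UNet2DConditionModel.from_pretrained(base, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(base, subfolder="text_encoder")

unet_lora = LoraConfig(
    r=8, lora_alpha=8, init_lora_weights="gaussian",
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],   # attention projections, incl. cross-attention
)
text_lora = LoraConfig(
    r=8, lora_alpha=8, init_lora_weights="gaussian",
    target_modules=["q_proj", "k_proj", "v_proj", "out_proj"],
)

unet.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.add_adapter(unet_lora)                      # only the LoRA matrices become trainable
text_encoder.add_adapter(text_lora)

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable UNet params: {trainable / total:.2%}")    # typically well under 1%
```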

In order to achieve the ideal fine-tuning effect, training the LoRA model also involves several noteworthy techniques:

  1. Besides sharing the same style, the training data also needs to retain as much semantic diversity as possible. For example, when training a LoRA for anime-style (2D) characters, the data should cover characters of different genders, age groups, angles, and poses as much as possible.

  2. The training data requires refined text annotation. Popular automated captioning models include BLIP, BLIP-2, DeepDanbooru, etc. Note that, in addition to the caption model's description of the image content, adding keywords for the target style or ID helps greatly in strengthening the style attributes.

  3. Training a LoRA model needs to be paired with a base model of a similar style. At the same time, the number of training iterations should not be too large; it is generally set to 10-100 epochs depending on the amount of training data.

  4. Mixing the weights of multiple LoRA models often produces surprisingly good results. For example, mixing a character LoRA with a background LoRA can blend the character seamlessly into the background, and mixing a character LoRA with a face LoRA can improve the beauty and refinement of the character's face (see the sketch below).
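As a small illustration of point 4, the following sketch blends a character LoRA and a background LoRA at inference time using the multi-adapter support in diffusers; the LoRA paths, adapter names, and blend weights are hypothetical.

```python
# Sketch of blending two LoRA styles at inference time; paths, adapter names,
# and weights below are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

pipe.load_lora_weights("path/to/character_lora", adapter_name="character")   # hypothetical LoRA paths
pipe.load_lora_weights("path/to/background_lora", adapter_name="background")
pipe.set_adapters(["character", "background"], adapter_weights=[0.8, 0.6])   # blend ratio is a tuning knob

image = pipe("cartoon girl in a flower garden, anime style",
             num_inference_steps=30).images[0]
image.save("blended_loras.png")
```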

Use controllable generation technology to retain character IDs

Controllable image generation technology has recently made great breakthroughs, and it is also indispensable for cartoonization. To retain certain details of the original image (such as the likeness of the person, the similarity of the scene, etc.) and make users feel that "this is a cartoonized photo of me" rather than just an unrelated cartoon image, we need controllable generation technology to ensure similarity between the images before and after cartoonization. This article uses Prompt-to-Prompt, ControlNet, T2I Adapter, and other techniques to further improve the controllability of generated images. The principles and applications of these technologies are described below, along with how they can be used to generate personalized, controllable images.

Prompt-to-Prompt

This approach exploits the power of the cross-attention layers in text-to-image diffusion models. The figure below shows that these high-dimensional layers have interpretable spatial mappings, which play a key role in linking prompt tokens to the spatial layout of the synthesized image [2]. Therefore, by adding a cartoon style description to the prompt while injecting the source image's attention maps, we can create images in a new cartoon style while retaining the structure and ID of the original image (a simplified code sketch follows the figures below).

The relationship between prompt text and attention map:

Cartoonize with Prompt-to-Prompt:
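The sketch below is a much-simplified, conceptual illustration of this idea rather than the official Prompt-to-Prompt implementation: a custom attention processor records the cross-attention maps while denoising with the source prompt and re-injects them while denoising with the cartoon-style prompt, so the spatial layout carries over. Token alignment, local blending, and injection scheduling from the paper are omitted, and the model name and prompts are placeholders.

```python
# Conceptual sketch of attention-map injection (not the official Prompt-to-Prompt code).
# Cross-attention maps recorded with the source prompt are re-used with the target
# prompt so that the layout/ID of the source image is preserved.
import torch
from diffusers import StableDiffusionPipeline

class InjectCrossAttnProcessor:
    """Stores cross-attention maps in 'store' mode and re-injects them in 'inject' mode."""
    def __init__(self, bank, name):
        self.bank, self.name = bank, name

    def __call__(self, attn, hidden_states, encoder_hidden_states=None,
                 attention_mask=None, **kwargs):
        context = encoder_hidden_states if encoder_hidden_states is not None else hidden_states
        query = attn.head_to_batch_dim(attn.to_q(hidden_states))
        key = attn.head_to_batch_dim(attn.to_k(context))
        value = attn.head_to_batch_dim(attn.to_v(context))
        probs = attn.get_attention_scores(query, key, attention_mask)
        if self.bank["mode"] == "store":
            self.bank.setdefault(self.name, []).append(probs.detach())
        elif self.bank.get(self.name):
            probs = self.bank[self.name].pop(0)      # re-inject the stored source map
        out = attn.batch_to_head_dim(torch.bmm(probs, value))
        out = attn.to_out[1](attn.to_out[0](out))    # output projection + dropout
        return out

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

# Install the processor only on the UNet cross-attention layers ("attn2").
bank = {"mode": "store"}
procs = {name: InjectCrossAttnProcessor(bank, name) if name.endswith("attn2.processor")
         else proc for name, proc in pipe.unet.attn_processors.items()}
pipe.unet.set_attn_processor(procs)

steps = 30
g = torch.Generator("cuda").manual_seed(42)
source = pipe("a photo of a young woman", generator=g, num_inference_steps=steps).images[0]

bank["mode"] = "inject"
g = torch.Generator("cuda").manual_seed(42)          # same seed -> same starting latent
cartoon = pipe("a cartoon illustration of a young woman, anime style",
               generator=g, num_inference_steps=steps).images[0]
```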

ControlNet

ControlNet can be used to control pre-trained large-scale diffusion models with additional input conditions. This allows us to take certain information from the original image (such as depth or edge information) as control conditions to guide the generation of the cartoon image. The structure of ControlNet is shown in the figure below. ControlNet clones the weights of a large diffusion model into a "trainable copy" and a "locked copy": the locked copy retains the network capabilities learned from billions of images, while the trainable copy is trained on task-specific datasets to learn the conditional control [3].

Example: Use depth map control in ControlNet to generate cartoonized images [3]:
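A minimal sketch of this depth-conditioned setup with the diffusers ControlNet pipeline is shown below; the depth estimator, checkpoint names, input file, and prompt are public placeholders rather than the exact models used in the article.

```python
# Sketch of depth-conditioned cartoonization with ControlNet; checkpoint names,
# the input photo, and the prompt are placeholders.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image
from transformers import pipeline as hf_pipeline

# 1. Estimate a depth map from the source photo.
depth_estimator = hf_pipeline("depth-estimation")
photo = load_image("portrait.jpg")                 # hypothetical input photo
depth = np.array(depth_estimator(photo)["depth"])[:, :, None].repeat(3, axis=2)
control_image = Image.fromarray(depth)

# 2. Generate a cartoon image whose layout follows the depth map.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

result = pipe(
    "cartoon portrait of a young woman, clean line art, vibrant colors",
    image=control_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=1.0,             # how strongly the depth map constrains layout
).images[0]
result.save("cartoon_depth.png")
```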

T2I Adapter

Similar to ControlNet, T2I Adapter is a simple and small model that can provide additional guidance for pre-trained text-to-image models without affecting the original network topology and generation capabilities. Its network structure is shown in the figure below [4].

Using the T2I Adapter for cartoonization practice [4]:
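The following is a minimal sketch of the same idea with the diffusers adapter pipeline, using a publicly released depth T2I-Adapter as a placeholder; the input condition image and prompt are assumptions.

```python
# Sketch of using a T2I-Adapter (depth variant) for cartoonization; checkpoint
# names, input condition image, and prompt are placeholders.
import torch
from diffusers import StableDiffusionAdapterPipeline, T2IAdapter
from diffusers.utils import load_image

adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_depth_sd15v2",
                                     torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", adapter=adapter,
    torch_dtype=torch.float16).to("cuda")

depth_map = load_image("portrait_depth.png")       # precomputed depth condition (hypothetical file)
image = pipe(
    "cartoon portrait, anime style, soft shading",
    image=depth_map,
    num_inference_steps=30,
    adapter_conditioning_scale=0.9,                # strength of the adapter guidance
).images[0]
image.save("cartoon_adapter.png")
```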

From technology research and development to application practice

Currently, thanks to large-model visual editing tools such as Stable Diffusion WebUI, anyone can complete a simple hands-on cartoonization experiment and, through repeated parameter tuning and experimentation, generate cartoon images they are satisfied with. However, there is still a long way to go from model inference to a standardized product:

  1. A standardized product cannot require users to do detailed prompt customization and parameter tuning. This requires building an effective parameter and prompt generation system inside the algorithm to cover different types of inputs (a hypothetical sketch of such a preset layer follows this list).

  2. A standardized product needs to guarantee a lower bound on output quality. Although Stable Diffusion can generate many stunning images, its results are inherently unstable, and defects easily appear in scenes with small faces, facial occlusions, fingers, and so on. The generation process therefore has to be constrained to improve the yield of usable images.

  3. Stable Diffusion's image generation latency is currently on the order of seconds per image. Improving inference efficiency and reducing cost are also extremely important parts of bringing the product to production.
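As a hypothetical illustration of point 1, a product-side service can map a small set of user-facing style choices onto internally maintained prompts and sampling parameters so that users never touch them. The preset names and values below are made up for the sketch.

```python
# Hypothetical illustration of a style-preset layer that hides prompts and
# sampling parameters from end users; all values are invented for the sketch.
from dataclasses import dataclass

@dataclass
class StylePreset:
    prompt_template: str        # internal prompt with a slot for the user's subject
    negative_prompt: str
    steps: int
    guidance_scale: float
    control_strength: float     # e.g. ControlNet / adapter conditioning scale

PRESETS = {
    "anime": StylePreset(
        "anime style portrait of {subject}, clean line art, vivid colors",
        "lowres, bad hands, extra fingers, blurry", 30, 7.5, 0.9),
    "3d_cartoon": StylePreset(
        "3d cartoon render of {subject}, soft lighting, stylized shading",
        "lowres, deformed face, bad anatomy", 30, 6.0, 1.0),
}

def build_request(style: str, subject: str) -> dict:
    """Turn a user's style choice into a full generation request."""
    p = PRESETS[style]
    return {
        "prompt": p.prompt_template.format(subject=subject),
        "negative_prompt": p.negative_prompt,
        "num_inference_steps": p.steps,
        "guidance_scale": p.guidance_scale,
        "controlnet_conditioning_scale": p.control_strength,
    }

print(build_request("anime", "a young woman wearing glasses"))
```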

Results showcase

More style trials and quick access

Visual Intelligence Open Platform—Generative Image Cartoonization

NEW: Portrait style redrawing API details

References

[1]. Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." arXiv preprint arXiv:2106.09685 (2021).

[2]. Hertz, Amir, et al. "Prompt-to-prompt image editing with cross attention control." arXiv preprint arXiv:2208.01626 (2022).

[3]. Zhang, Lvmin, and Maneesh Agrawala. "Adding conditional control to text-to-image diffusion models." arXiv preprint arXiv:2302.05543 (2023).

[4]. Mou, Chong, et al. "T2I-Adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models." arXiv preprint arXiv:2302.08453 (2023).


Source: https://blog.csdn.net/sunbaigui/article/details/133342883