[AIGC] [Image Generation] Introduction to ControlNet (Principle + Use)




Install

Download and installation: version 1.1 is recommended. The models are available at:
https://huggingface.co/lllyasviel/ControlNet-v1-1/tree/main
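
If you prefer scripting the download, the models can also be fetched with the huggingface_hub client. This is a minimal sketch: the filename is one real model from that repository, but the target directory assumes the default sd-webui-controlnet layout and may differ on your setup.

```python
# Hypothetical download script: fetch one ControlNet v1.1 model into the WebUI's model folder.
from huggingface_hub import hf_hub_download

for filename in [
    "control_v11p_sd15_canny.pth",   # pick the model(s) you need
    "control_v11p_sd15_canny.yaml",  # matching config file from the same repo
]:
    hf_hub_download(
        repo_id="lllyasviel/ControlNet-v1-1",
        filename=filename,
        local_dir="extensions/sd-webui-controlnet/models",  # assumed WebUI path; adjust to your install
    )
```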

1. ControlNet: AI painting

Painting a picture involves two things: composition on the one hand and style on the other. Today's AI painting is something of an alchemy: you have to control both the composition and the style of the image through all kinds of prompts.

Because prompts are specialized and complex, two kinds of models emerged: ControlNet is used to control composition, and LoRA is used to control style. There is also a style-transfer model, shuffle. On top of these, SD1.5 alone can already generate good images.

So how does ControlNet control composition? There are two ways:
1. You have already drawn a sketch by hand, and the AI does the subsequent refinement and beautification;
2. You use an existing picture to generate new pictures in different styles.
That is what ControlNet does: it controls AI drawing through images you already have instead of through incantation-like prompts.

Here we only introduce ControlNet; LoRA, the icing on the cake, will be covered separately.

1.1. The essence of ControlNet is text-to-image (txt2img)

ControlNet's original paper, "Adding Conditional Control to Text-to-Image Diffusion Models", and its code are, at their core, txt2img. On top of that there are extra inpaint capabilities, i.e. replacement, removal, and so on.
That is why ControlNet appears in the WEB UI under both txt2img and img2img.
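
Outside the WEB UI, the same txt2img-with-ControlNet idea can be sketched with the diffusers library. This is a minimal sketch only: the model IDs and the precomputed control map path are assumptions of this example, not something the original post specifies.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load a ControlNet (here: canny) and plug it into an SD1.5 txt2img pipeline.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

control = load_image("canny_control.png")  # a precomputed control map (hypothetical path)
image = pipe("an owl, best quality", image=control, num_inference_steps=20).images[0]
image.save("owl.png")
```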

1.2. Preprocessor & model selection

Make sure that the preprocessor and the model are of the same type.

Preprocessor: preprocesses the reference image into a control map, which is then fed to the matching model to steer the drawing. For example, if you choose a scribble-type preprocessor, it is best to pick the scribble model as well:
(image)
In the picture above, the first row shows the original image and its scribble-preprocessed version; the goal is to draw an owl-like image. The second row shows the AI's results under different configurations. Only when the preprocessor and the model match do we get a good result. If either the preprocessor or the model is left empty, ControlNet essentially does not take effect and you get an ordinary SD image, as in the two middle pictures of the second row. The WEB UI developers also noticed this preprocessor-vs-model correspondence problem, so from version 1.1.2XXX onward they restrict which models a preprocessor can be paired with, as shown in the figure below.

(image)

1.3. Parameter configuration

Since there are many adjustable parameters, use the defaults for your first attempt. If the result is not good, fine-tune the parameters from there.

2. ControlNet model classification

ControlNet version 1.1 released 14 models, which can be mainly divided into three categories:

2.1. Sketch category (6 items)

These mainly use a preprocessor to turn the image into a sketch, or take a hand-drawn draft directly as input.

There are the following types of preprocessors:
1) Canny: edge extraction, the most commonly used (a code sketch follows this list)
2) MLSD: specialized line detection (e.g. straight lines), suited to architectural and interior design
3) lineart: line-art extraction (called fake_scribble in early versions)
4) lineart_anime: needs to be used with the anything_v3-v5 large models (which require complex prompts)
5) SoftEdge: several variants to distinguish between; SoftEdge_PIDI gives the most balanced results
6) Scribble: graffiti, similar to stick-figure doodles
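
As a concrete example, the Canny control map mentioned above can be approximated outside the WebUI with OpenCV's edge detector. A minimal sketch; the WebUI's built-in canny preprocessor does this for you, and the thresholds and file paths here are illustrative.

```python
import cv2
import numpy as np
from PIL import Image

img = cv2.imread("reference.png")          # hypothetical reference image
edges = cv2.Canny(img, 100, 200)           # low/high thresholds control line density
control = np.stack([edges] * 3, axis=-1)   # single channel -> 3-channel control map
Image.fromarray(control).save("canny_control.png")
```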

Below are the effects and differences of the different preprocessors

(image)

Below is the effect of Scribble

(image)
A simple prompt is enough to generate very good renderings

2.2. Advanced feature classes (3)

1) depth: depth map
2) seg: semantic segmentation map; for example, pink for buildings, green for plants, and so on
3) normalbae: normal map (the normal encodes per-pixel surface orientation, reflecting texture)

Below are the result images; the last two columns are new images generated by the model

(image)

The corresponding preprocessors are as follows (any of them can be chosen, the effects are similar):

1) depth: supports depth_Midas, depth_Leres, depth_Zoe, etc., as well as depth maps generated by other software (see the sketch after this list)
2) seg: supports Seg_OFADE20K, Seg_OFCOCO, Seg_UFADE20K, as well as manually painted masks
3) normalbae: supports normal_bae, normal_midas
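
For the depth entry, a depth control map can also be produced outside the WebUI, for example with the transformers depth-estimation pipeline. A minimal sketch; the default checkpoint the pipeline downloads and the file paths are assumptions of this example.

```python
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation")         # downloads a default DPT-style depth model
result = depth_estimator(Image.open("reference.png"))  # hypothetical reference image
result["depth"].save("depth_control.png")              # PIL image usable as a depth control map
```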

2.3. Advanced category (5)

1) OpenPose: skeleton capture, very popular (a code sketch follows this list)
2) inpaint: partial retouching; some targets can be removed naturally
3) shuffle: style mixing, converting between different styles
4) ip2p: instruction-based retouching; recognizes a limited set of instructions
5) tile: image super-resolution; adds details that are not in the original image
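
For the OpenPose entry, the skeleton control map can be generated with the controlnet_aux annotators. A minimal sketch; the annotator repo ID is the one commonly used alongside diffusers and, like the file paths, is an assumption of this example.

```python
from PIL import Image
from controlnet_aux import OpenposeDetector

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")  # downloads the pose annotator
pose = detector(Image.open("person.png"))  # hypothetical input photo
pose.save("openpose_control.png")          # skeleton map to feed the openpose ControlNet
```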

Below is the effect of ip2p

(image)

Below is the effect of shuffle

(image)


3. Configuration parameters

Most parameters can be left at their defaults; only advanced use requires tuning. Besides the choice of preprocessor and model, the remaining parameters are introduced below.

(image)

Yellow box: parameters tied to the selected preprocessor and model, mainly things like line thickness and richness of detail.

Red box:

The first group: basic controls, relatively simple
1) enable: whether to enable ControlNet
2) lowVRAM: low precision, which reduces video memory consumption
3) Pixel Perfect: a newer feature, tied to the preprocessor resolution (the first slider in the yellow box); the algorithm works out the most appropriate resolution by itself. If the target image is not a 512*512-style square, it is recommended to check this option.
4) allow preview: preview the preprocessor's output in advance

The second group: weight control (the defaults are usually fine; a diffusers equivalent is sketched after this list)
1) control weight: the weight given to the control network
2) starting control step: when control kicks in (as a percentage of the sampling steps)
3) ending control step: when control is released (as a percentage of the sampling steps)
If you do not want ControlNet to exert too much control, you can let it kick in later.
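
For reference, the same three knobs exist in the diffusers ControlNet pipelines. This continues the earlier diffusers sketch (so `pipe` and `control` are the hypothetical objects defined there), and the values are illustrative only.

```python
# control weight / starting control step / ending control step, expressed in diffusers terms
image = pipe(
    "an owl, best quality",
    image=control,
    controlnet_conditioning_scale=1.0,  # control weight
    control_guidance_start=0.0,         # starting control step (fraction of sampling steps)
    control_guidance_end=0.8,           # ending control step: release control after 80% of steps
).images[0]
```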

The third group: control mode (the defaults are usually fine)
1) balanced
2) my prompt is important
3) ControlNet is important
These options allocate the relative weight between the prompt and the reference image.

The fourth group: cropping method (see the API sketch after this list).
This group applies when the size of the control image (the reference) does not match the size of the target image (the image to be generated in txt2img):
1) Just resize: change the aspect ratio of the control image to fit the target image (may cause distortion).
2) Crop and resize: crop the control image to fit the aspect ratio of the target image.
3) Resize and fill: scale the control image so that all of it fits inside the target image, then fill in the leftover blank areas.
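
If you drive the WEB UI through its API instead of the interface, these groups of parameters appear as fields of a ControlNet unit inside the txt2img payload. Treat the following as a sketch only: field names and accepted values vary between sd-webui-controlnet versions, so check your extension's API docs.

```python
import base64
import requests

with open("reference.png", "rb") as f:       # hypothetical reference image
    control_b64 = base64.b64encode(f.read()).decode()

payload = {
    "prompt": "an owl, best quality",
    "steps": 20,
    "alwayson_scripts": {
        "controlnet": {
            "args": [{
                "enabled": True,                   # first group: enable
                "lowvram": False,
                "pixel_perfect": True,
                "module": "canny",                 # preprocessor
                "model": "control_v11p_sd15_canny",
                "image": control_b64,              # some versions use "input_image" instead
                "weight": 1.0,                     # second group: control weight
                "guidance_start": 0.0,             # starting control step
                "guidance_end": 1.0,               # ending control step
                "control_mode": "Balanced",        # third group: control mode
                "resize_mode": "Crop and Resize",  # fourth group: cropping method
            }]
        }
    },
}
resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
resp.raise_for_status()
```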

4. Basic principle: controllable SD model

ControlNet: clone a copy of the Stable Diffusion model

ControlNet is a control technique for deep neural networks: it steers a network's behavior by manipulating the network's input conditions. Such networks are mainly built from "resnet" blocks, "conv-bn-relu" blocks, multi-head attention blocks, and so on. ControlNet achieves its goal by cloning the parameters of a neural network block and connecting the clone through zero-convolution layers.

The cloned block has the same inputs and outputs as the original block. In the initial stage of training, ControlNet has no effect on the network: the weights and biases of the zero convolutions are initialized to zero, so during optimization ControlNet does not disturb the gradients of the original weights and biases. ControlNet adjusts the network's behavior, and adapts it to different tasks and data, by controlling the input conditions of the network blocks. Each block is defined by a set of parameters that can be optimized during training.

(image)

ControlNet applied to an arbitrary neural network block: x and y are deep features inside the network; "+" denotes feature addition; c is the extra condition we want to add to the network; "zero convolution" is a 1 × 1 convolutional layer whose weights and biases are initialized to zero.

The ControlNet structure can be expressed as:

yc = F(x; α) + Z(F(x + Z(c; βz1); βc); βz2)

Here yc is the output of the block: F(x; α) is the locked original block, F(·; βc) is its trainable copy, and Z(·; βz1), Z(·; βz2) are the two zero convolutions. During the first training step, all inputs and outputs of both the trainable and the locked copy of the block are unaffected by ControlNet. The capabilities, functionality and result quality of the block are perfectly preserved, and any further optimization becomes as fast as fine-tuning (compared with training those layers from scratch).

In the first training step, since the weights and biases of the zero convolution layers are initialized to zero, we have:

Z(c; βz1) = 0
F(x + Z(c; βz1); βc) = F(x; βc) = F(x; α)
Z(F(x + Z(c; βz1); βc); βz2) = Z(F(x; βc); βz2) = 0

So when ControlNet is attached to some neural network blocks, it has no impact on the deep features before any optimization. The capabilities, functionality and result quality of those blocks (such as a pre-trained SD) are perfectly preserved, and any further optimization becomes as fast as fine-tuning. Training then applies the ControlNet operation repeatedly in an iterative process to optimize the chosen blocks: at each step all other blocks can be kept unchanged while only certain blocks are modified and adjusted.
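
The structure above can be written down almost literally in PyTorch. This is a minimal sketch of the idea for a single block, not the authors' implementation; block and channel names are hypothetical.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution Z(.) with weights and bias initialized to zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Implements yc = F(x; α) + Z(F(x + Z(c; βz1); βc); βz2) for an arbitrary block F."""
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.locked = block                  # original block, parameters α (frozen)
        self.locked.requires_grad_(False)
        self.copy = copy.deepcopy(block)     # trainable copy, parameters βc
        self.copy.requires_grad_(True)
        self.zero_in = zero_conv(channels)   # Z(.; βz1)
        self.zero_out = zero_conv(channels)  # Z(.; βz2)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # At initialization both zero convolutions output zero, so this returns exactly locked(x).
        return self.locked(x) + self.zero_out(self.copy(x + self.zero_in(c)))
```

Because both zero convolutions start at zero, the first forward pass reproduces the original block exactly, which is precisely the property derived above.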

The original paper builds on Stable Diffusion and uses ControlNet to control the large network: the encoder is copied and trained, while the decoder part is not copied.

(image)

The "Zero Convolution" in the figure is a 1×1 convolution with zero initialization weights and biases. Before the model starts training, all zero convolution outputs are zero, and the model is still the original Stable Diffusion Model . After adding your own training data, the final result will be fine-tuned, so the model will not cause major deviations .

From the overall model structure we can see that the author feeds the outputs of these zero-convolution layers into the decoder layers of the Stable Diffusion model, which is how the final model is aligned with the training data.

5. Visualization effects

Image segmentation with SD1.5:
(image)

Pose detection with SD1.5:
(image)

HED edge maps with SD1.5:
(image)

"Soul painter" scribbles with SD1.5:
(image)


Summary

Some pictures are reproduced from Zhihu user @BitByBit


Original post: blog.csdn.net/qq_45752541/article/details/132619474