ControlNet: Conditional Control for Text-to-Image Generation (Paper Reading Notes)


Paper: Adding Conditional Control to Text-to-Image Diffusion Models
GitHub: https://github.com/lllyasviel/ControlNet

Summary

ControlNet adds support for additional input conditions to a large pre-trained diffusion model. It learns task-specific conditions in an end-to-end manner, and the learning remains robust even when the training set is small (<50k samples). The authors train ControlNets on top of Stable Diffusion that accept edge maps, segmentation maps, key points, and other inputs as conditions; this enriches the methods available for controlling diffusion models and makes related applications easier to build.

Algorithm

ControlNet

ControlNet controls the output of a whole neural network by controlling the input conditions of its individual network blocks. As shown in Equation 1, a neural network block F with parameters Θ transforms a feature map x into another feature map y, which is the process in Figure 2a:

$$y = \mathcal{F}(x; \Theta) \tag{1}$$

[Figure 2: (a) an ordinary neural network block; (b) the same block with ControlNet applied]

The process of applying ControlNet to an arbitrary network block is shown in Figure 2b. Here c is the additional condition vector; Θ denotes the locked (frozen) original parameters, and Θ_c is a trainable clone of Θ. Training the clone rather than the original weights directly helps prevent overfitting when the task dataset is small and preserves the quality of the large pre-trained model. Θ_{z1} and Θ_{z2} are the parameters of the two zero convolutions Z in ControlNet, where a zero convolution is a 1×1 convolution whose weights and biases are initialized to zero. The overall process is given by Equation 2:

$$y_c = \mathcal{F}(x; \Theta) + \mathcal{Z}\big(\mathcal{F}(x + \mathcal{Z}(c; \Theta_{z1}); \Theta_c); \Theta_{z2}\big) \tag{2}$$
Because the parameters of both zero convolutions Z are initialized to zero, the conditional branch contributes nothing at the start of training, so y_c = y, as in Equation 3:

$$y_c = y \tag{3}$$

In other words, before training begins ControlNet leaves the output of the pre-trained block unchanged; the zero-convolution parameters are then optimized from scratch in a fully learnable way.
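To make Equations 2 and 3 concrete, below is a minimal PyTorch-style sketch of wrapping a frozen block with a trainable copy and two zero convolutions. This is an illustrative sketch of the idea rather than the authors' implementation; the toy block, channel sizes, and tensor shapes are assumptions.

```python
import copy

import torch
import torch.nn as nn


def zero_conv(channels: int) -> nn.Conv2d:
    """A 'zero convolution': 1x1 conv whose weight and bias start at zero."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv


class ControlledBlock(nn.Module):
    """Wraps a frozen block F with a trainable copy and two zero convolutions (Eq. 2)."""

    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.trainable_copy = copy.deepcopy(block)  # F(.; Θ_c), the trainable clone
        self.locked = block                         # F(.; Θ), kept frozen
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)          # Z(.; Θ_z1)
        self.zero_out = zero_conv(channels)         # Z(.; Θ_z2)

    def forward(self, x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        y = self.locked(x)                                                  # original output
        control = self.zero_out(self.trainable_copy(x + self.zero_in(c)))   # conditional branch
        return y + control                                                  # y_c in Eq. 2


# Toy usage with an illustrative block and shapes:
block = nn.Conv2d(64, 64, kernel_size=3, padding=1)
controlled = ControlledBlock(block, channels=64)
x = torch.randn(1, 64, 32, 32)
c = torch.randn(1, 64, 32, 32)
assert torch.allclose(controlled(x, c), block(x))  # Eq. 3: y_c = y before any training
```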

ControlNet in Image Diffusion Model

Stable Diffusion is a text-to-image generation model trained on billions of images. Its denoising network is essentially a U-net consisting of an encoder, a middle block, and a skip-connected decoder. The full model has 25 blocks, with 12 blocks each in the encoder and decoder; of all the blocks, 8 are up-sampling or down-sampling convolution layers and 17 are main blocks, each containing 4 ResNet layers and 2 ViTs. Text is encoded with the CLIP text encoder, and diffusion time steps are encoded with positional encoding.
Stable Diffusion uses a VQ-GAN-like preprocessing step to stabilize training: 512×512 images are mapped into a 64×64 latent space, so ControlNet must likewise convert its image-based conditions into 64×64 feature maps. This is done with a small encoder E consisting of four convolution layers with 4×4 kernels and 2×2 strides, as in Equation 9:

$$c_f = \mathcal{E}(c_i) \tag{9}$$
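A minimal PyTorch sketch of such a condition encoder E(·). The ReLU activations, channel widths, output width of 320 (chosen to match the first U-net block), and the exact kernel/stride layout are illustrative assumptions, not the released ControlNet configuration; the point is only to show a small stack of strided convolutions mapping a 512×512 condition image to the 64×64 latent resolution.

```python
import torch
import torch.nn as nn


class ConditionEncoder(nn.Module):
    """Tiny encoder E(.) for Eq. 9: maps an image-space condition c_i
    (e.g. a 512x512 edge map) to a feature map c_f at the 64x64 latent resolution."""

    def __init__(self, in_channels: int = 3, out_channels: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=4, stride=2, padding=1),   # 512 -> 256
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1),            # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),            # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(64, out_channels, kernel_size=3, stride=1, padding=1),  # project to c_f
        )

    def forward(self, c_i: torch.Tensor) -> torch.Tensor:
        return self.net(c_i)


c_i = torch.randn(1, 3, 512, 512)  # image-space condition (e.g. a Canny edge map)
c_f = ConditionEncoder()(c_i)
print(c_f.shape)                   # torch.Size([1, 320, 64, 64])
```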
As shown in Figure 3, ControlNet is attached to every level of the U-net; because the original SD weights are locked, training remains computationally efficient. ControlNet creates a trainable copy of SD's (Stable Diffusion's) 12 encoding blocks and 1 middle block; the 12 encoding blocks span 4 resolutions (64×64, 32×32, 16×16, 8×8), with 3 blocks at each resolution. The outputs of the copy are added to the U-net's 12 skip-connections and 1 middle block.

[Figure 3: ControlNet attached to the Stable Diffusion U-net]

Training

A diffusion model learns to gradually denoise an image in order to generate samples. Starting from an image z_0, the diffusion process progressively adds noise to produce a noisy image z_t, where t is the number of noising steps. Given the time step t, a text prompt c_t, and a task-specific condition c_f, the network ε_θ is trained to predict the noise that was added to z_t, as shown in Equation 10, where L is the overall loss function:

$$\mathcal{L} = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0,1)}\Big[\, \big\lVert \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \big\rVert_2^2 \,\Big] \tag{10}$$
During training, the authors randomly replace 50% of the text prompts c_t with the empty string, which forces the SD encoder to learn more semantic information from the input control map c_f instead of relying only on the prompt.
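A minimal sketch of one training step implementing Equation 10 together with the 50% prompt-dropping trick. The `denoiser(z_t, t, c_t, c_f)` callable, the noise schedule `alphas_cumprod`, and the handling of the raw prompt string are placeholders for illustration, not the paper's actual training code.

```python
import random

import torch
import torch.nn.functional as F


def training_step(denoiser, z_0, t, prompt, c_f, alphas_cumprod):
    """One ControlNet-style diffusion training step (Eq. 10), sketched.

    denoiser(z_t, t, c_t, c_f) -> predicted noise. Text encoding is assumed
    to happen inside the denoiser; everything else here is illustrative.
    """
    # Randomly drop the text prompt half of the time so the model also
    # learns semantics from the control map c_f.
    c_t = "" if random.random() < 0.5 else prompt

    # Forward diffusion: add noise to z_0 to obtain the noisy latent z_t.
    noise = torch.randn_like(z_0)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_t = a_t.sqrt() * z_0 + (1.0 - a_t).sqrt() * noise

    # Predict the added noise and regress it with an L2 loss.
    noise_pred = denoiser(z_t, t, c_t, c_f)
    return F.mse_loss(noise_pred, noise)
```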

Improved Training

Small-Scale Training: When compute resources are limited, the authors found that disconnecting the links between ControlNet and SD decoder blocks 1, 2, 3, and 4 speeds up training by about 1.6×. Once the model's outputs start to show a clear association with the input conditions, the links to decoder blocks 1, 2, 3, and 4 can be reconnected and training continued.

Large-Scale Training: When training resources are ample and the dataset is large, the risk of overfitting is low. In that case ControlNet can first be trained fully on its own; the SD weights can then be unlocked and the whole model trained jointly with ControlNet.
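Both regimes mostly amount to toggling which parameter groups and cross-connections are trained. A hedged sketch of how this could be wired up in PyTorch; `controlnet`, `sd_unet`, and the `connect_decoder_skips` flag are hypothetical names for illustration, not the repository's API.

```python
import itertools

import torch


def configure_small_scale(controlnet, sd_unet):
    """Limited compute: train only ControlNet, with its links to the SD decoder
    blocks disabled (they can be reconnected later in training)."""
    for p in sd_unet.parameters():
        p.requires_grad_(False)               # SD stays frozen
    controlnet.connect_decoder_skips = False  # hypothetical flag for the 12 skip links
    return torch.optim.AdamW(controlnet.parameters(), lr=1e-5)


def configure_large_scale(controlnet, sd_unet):
    """Ample data and compute: after ControlNet is trained, unlock the SD weights
    and optimize the whole model jointly."""
    for p in sd_unet.parameters():
        p.requires_grad_(True)
    controlnet.connect_decoder_skips = True
    params = itertools.chain(controlnet.parameters(), sd_unet.parameters())
    return torch.optim.AdamW(params, lr=1e-5)
```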

Experiments

Experimental settings: the authors evaluate four prompt settings:
1. No prompt: an empty string ""
2. Default prompt: a generic, meaningless placeholder prompt
3. Automatic prompt: a caption generated by BLIP (see the sketch after this list)
4. User prompt: a prompt supplied by the user
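For the automatic-prompt setting, a caption model produces the prompt from the source image. A minimal sketch using the Hugging Face `transformers` BLIP captioning checkpoint; the specific checkpoint and tooling are assumptions, since the text only states that BLIP is used.

```python
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("input.png").convert("RGB")          # image to caption
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
prompt = processor.decode(out[0], skip_special_tokens=True)
print(prompt)                                           # used as the "automatic prompt"
```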

Canny edges

[Figure: generation results conditioned on Canny edge maps]
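The condition map for this setting is a standard Canny edge image. One way to produce it, sketched with OpenCV's `cv2.Canny` (the thresholds and file names here are arbitrary examples):

```python
import cv2

image = cv2.imread("input.png")                       # source image (BGR)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)                     # low/high thresholds are illustrative
condition = cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)   # 3-channel map fed as the condition
cv2.imwrite("canny_condition.png", condition)
```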

Hough lines

[Figure: generation results conditioned on Hough lines]

Human scribbles

[Figure: generation results conditioned on human scribbles]

HED boundary map

[Figure: generation results conditioned on HED boundary maps]

Openpifpaf pose

[Figure: generation results conditioned on Openpifpaf pose]

Openpose

[Figures: generation results conditioned on Openpose keypoints]

ADE20K segmentation map

[Figure: generation results conditioned on ADE20K segmentation maps]

COCO-Stuff segmentation map

[Figure: generation results conditioned on COCO-Stuff segmentation maps]

DIODE normal map

[Figure: generation results conditioned on DIODE normal maps]

Depth-to-Image

[Figure: depth-to-image generation results]

Cartoon line drawings

[Figure: generation results conditioned on cartoon line drawings]

Limitations

As shown in Figure 28, when the input segmentation map is ambiguous, the model struggles to generate reasonable content.
[Figure 28: an ambiguous input segmentation map leads to implausible generations]

Conclusion

ControlNet's results are impressive: it builds directly on the SD model and goes a step further by supporting multiple kinds of conditions to control text-to-image generation.


Origin: blog.csdn.net/qq_41994006/article/details/129393279