Paper: "Adding Conditional Control to Text-to-Image Diffusion Models"
github: https://github.com/lllyasviel/ControlNet
Summary
ControlNet controls a large pre-trained diffusion model so that it supports additional input conditions. ControlNet learns task-specific conditions in an end-to-end manner, and learning remains relatively robust even when the training set is small (<50k). The authors train ControlNets based on Stable Diffusion that support edge maps, segmentation maps, and keypoints as conditional inputs; this enriches the methods for controlling diffusion models and makes related applications more convenient.
Algorithm
ControlNet
ControlNet controls the output of an entire neural network by controlling the input conditions of its constituent blocks. As in Equation 1, a neural network block F with parameters Θ transforms an input feature map x into an output feature map y (the process in Figure 2a):

y = F(x; Θ)    (1)

Applying ControlNet to an arbitrary neural network block is shown in Figure 2b. Here c is the additional condition vector, Θ is the locked copy of the parameters, and Θ_c is a trainable clone of Θ; training the clone rather than the original weights directly helps prevent overfitting. Θ_{z1} and Θ_{z2} are the parameters of ControlNet's two zero convolutions Z, where a zero convolution is a 1×1 convolution whose weights and bias are initialized to zero. The full computation is Equation 2:

y_c = F(x; Θ) + Z(F(x + Z(c; Θ_{z1}); Θ_c); Θ_{z2})    (2)

Because the parameters of both zero convolutions Z are initialized to zero, at the start of training y_c = y, as in Equation 3; the zero-convolution parameters are then optimized from scratch in a learnable way.
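The zero-convolution construction can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the block F, the channel count, and the shapes below are hypothetical stand-ins, chosen only to show that the ControlNet branch contributes nothing at initialization.

```python
import numpy as np

def zero_conv(x, W, b):
    # 1x1 convolution = a channel-mixing matrix multiply at every pixel
    # x: (C_in, H, W), W: (C_out, C_in), b: (C_out,)
    return np.einsum('oc,chw->ohw', W, x) + b[:, None, None]

def F(x, theta):
    # hypothetical stand-in for a network block F(x; Θ): a channel mix + tanh
    return np.tanh(np.einsum('oc,chw->ohw', theta, x))

rng = np.random.default_rng(0)
C, H, W_ = 4, 8, 8
x = rng.standard_normal((C, H, W_))       # input feature map
c = rng.standard_normal((C, H, W_))       # condition vector c
theta = rng.standard_normal((C, C))       # locked parameters Θ
theta_c = theta.copy()                    # trainable clone Θ_c of Θ

# zero convolutions Z(.; Θ_z1), Z(.; Θ_z2): weights and biases start at 0
Wz1, bz1 = np.zeros((C, C)), np.zeros(C)
Wz2, bz2 = np.zeros((C, C)), np.zeros(C)

y = F(x, theta)                                                         # Eq. 1
y_c = y + zero_conv(F(x + zero_conv(c, Wz1, bz1), theta_c), Wz2, bz2)   # Eq. 2

print(np.allclose(y_c, y))  # True: at init the branch adds nothing (Eq. 3)
```

Because both zero convolutions output exactly zero at initialization, the locked model's behavior is untouched when training begins, and the condition branch only grows in influence as Θ_{z1}, Θ_{z2}, and Θ_c are optimized.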
ControlNet in Image Diffusion Model
Stable Diffusion is a text-to-image generation model trained on billions of images. It is essentially a U-Net consisting of an encoder, a middle block, and a skip-connected decoder. The model has 25 blocks in total, with 12 blocks each in the encoder and decoder; of all the blocks, 8 are upsampling or downsampling convolution layers and 17 are main blocks, each containing 4 ResNet layers and 2 ViTs. Text is encoded by CLIP, and diffusion time steps are encoded with positional encoding.
Stable Diffusion, similar to VQ-GAN, stabilizes training by mapping 512×512 images into a 64×64 latent space, so ControlNet must convert image-based conditions into the 64×64 feature space. This is implemented with four convolution layers with 4×4 kernels and 2×2 strides, as in Equation 9.

As shown in Figure 3, ControlNet controls each level of the U-Net; because the original weights are frozen, computation is efficient. ControlNet uses the same 12 encoder blocks and 1 middle block as SD (Stable Diffusion). The 12 blocks span 4 resolutions (64×64, 32×32, 16×16, 8×8), with 3 blocks at each resolution. On the output side, 12 skip connections and 1 middle block are added to the U-Net.
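As a quick check on the resolution arithmetic, the standard convolution output-size formula shows how a 4×4-kernel, stride-2 convolution halves the spatial resolution (the padding of 1 here is an assumption; the exact layer configuration in the released code may differ):

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    # standard convolution output-size formula:
    # out = floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

size = 512
while size > 64:
    size = conv_out(size)  # each stride-2 layer halves the resolution
    print(size)
# prints 256, 128, 64: the 512x512 condition reaches the 64x64 latent resolution
```

Each stride-2 stage halves the side length, which is how the condition image is brought down to the same resolution as the latent feature map.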
Training
The diffusion model learns to gradually denoise images and generate samples. For an image z_0, the diffusion algorithm progressively adds noise to produce a noisy image z_t, where t is the number of noising steps. Given the step t, a text prompt c_t, and a task-specific condition c_f, the network ε_θ predicts the noise added to z_t, as in Equation 10, where L is the overall loss function:

L = E_{z_0, t, c_t, c_f, ε} [ ||ε − ε_θ(z_t, t, c_t, c_f)||² ]    (10)
During training, the authors randomly replace 50% of the text prompts c_t with empty strings, which encourages the SD encoder to learn more semantic information from the input control map c_f.
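The training objective and the 50% prompt-dropping can be sketched as follows. This is a NumPy sketch under stated assumptions: the noise schedule alpha_bar and the predictor eps_theta are illustrative dummies, not the actual SD + ControlNet network.

```python
import numpy as np

rng = np.random.default_rng(0)

def maybe_drop_prompt(prompt, p=0.5):
    # replace ~50% of text prompts c_t with the empty string during training
    return "" if rng.random() < p else prompt

def diffusion_loss(z0, t, c_t, c_f, eps_theta, alpha_bar):
    # Eq. 10 sketch: L = E || eps - eps_theta(z_t, t, c_t, c_f) ||^2
    eps = rng.standard_normal(z0.shape)  # the noise to be predicted
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1 - alpha_bar[t]) * eps
    return float(np.mean((eps - eps_theta(z_t, t, c_t, c_f)) ** 2))

# dummy predictor standing in for the SD + ControlNet network (illustrative)
eps_theta = lambda z_t, t, c_t, c_f: np.zeros_like(z_t)

alpha_bar = np.linspace(1.0, 0.01, 1000)   # toy noise schedule, not the real one
z0 = rng.standard_normal((4, 64, 64))      # latent image z_0
c_f = rng.standard_normal((4, 64, 64))     # task-specific condition c_f
loss = diffusion_loss(z0, t=500, c_t=maybe_drop_prompt("a cat"), c_f=c_f,
                      eps_theta=eps_theta, alpha_bar=alpha_bar)
print(loss >= 0.0)  # True: squared-error loss is nonnegative
```

Dropping the prompt forces the model to extract semantics from c_f alone, since the text branch is empty for about half of the training samples.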
Improved Training
Small-scale training: when resources are limited, the authors found that disconnecting ControlNet from SD Decoder Blocks 1, 2, 3, and 4 increases training speed by about 1.6×. Once the model's outputs become associated with the conditions, Decoder Blocks 1–4 can be reconnected for further training.
Large-scale training: when training resources are sufficient and data is plentiful, the risk of overfitting is low. ControlNet can first be fully trained, then the SD weights unlocked and the two trained jointly as a whole model.
Experiments
Experimental settings: the authors evaluate with four types of prompts:
1. No prompt: ""
2. Default prompt: a meaningless placeholder prompt
3. Automatic prompt: generated by BLIP
4. User prompt: entered by the user
Canny edges
Hough lines
Human scribbles
HED boundary map
Openpifpaf pose
Openpose
ADE20K segmentation map
COCO-Stuff segmentation map
DIODE normal map
Depth-to-Image
Cartoon line drawings
Limitations
As shown in Figure 28, when the input segmentation map is ambiguous, it is difficult for the model to generate reasonable content.
Conclusion
ControlNet's results are impressive: it goes a step beyond the base SD model and supports multiple conditions for controlling text-to-image generation.