AIGC Text-to-Image: Using ControlNet to Control Stable Diffusion

1 Introduction to ControlNet

1.1 What is ControlNet?

ControlNet, an extension of Stable Diffusion developed by Stanford researchers, lets creators easily control the content of AI-generated images and videos. It steers image generation with various conditions such as edge maps, sketches, or human poses.

Paper: Adding Conditional Control to Text-to-Image Diffusion Models

ControlNet is a neural network structure that controls Stable Diffusion by adding extra conditions. It enhances Stable Diffusion with conditional inputs such as scribbles, edge maps, segmentation maps, and pose keypoints during text-to-image generation. The generated image stays much closer to the input image, which is a significant improvement over traditional image-to-image generation methods.

ControlNet models can be trained on relatively small datasets and then attached to any pre-trained Stable Diffusion model, which is effectively fine-tuned with the new conditional control.

The initial version of ControlNet ships with pretrained weights for the following condition types:

  • Canny edge — a monochrome image with white edges on a black background.
  • Depth map — a grayscale image with black representing deep (far) areas and white representing shallow (near) areas.
  • Normal map — a normal-map image.
  • Semantic segmentation map — an ADE20K-style segmentation image.
  • HED edge — a monochrome image with soft white edges on a black background.
  • Scribbles — a hand-drawn monochrome scribble image with white outlines on a black background.
  • OpenPose (pose keypoints) — an OpenPose skeleton image.
  • M-LSD — a monochrome image consisting only of white straight lines on a black background.

1.2 Principle of ControlNet

ControlNet is a neural network structure that controls the diffusion model by adding additional conditions. The network is divided into two parts:

  • A trainable copy ("trainable")
  • A locked copy ("locked") that is not trained

The trainable part learns the conditional control, while the locked part retains the original weights of the Stable Diffusion model. As a result, only a small amount of guiding data is needed for the conditional constraints to be learned well, while the learning ability of the original diffusion model is preserved.

The "zero convolution" shown above is a 1×1 convolution with both weights and biases initialized to zero. Before training starts, every zero convolution outputs zero, so the model behaves exactly like the original Stable Diffusion model. Once training data is added, the ControlNet branch begins to influence the output; this influence is more like a fine-tuning of the final result, so it does not cause major deviations in the model. The overall model structure is as follows:

As the overall structure shows, ControlNet feeds the output of its trainable copy through these zero convolution layers into the decoder layers of the Stable Diffusion model, so that the final model remains consistent with the training data.
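
To make the principle concrete, below is a minimal PyTorch sketch of how a zero convolution can bridge a locked block and its trainable copy. This is an illustration only, not the repository's actual implementation; the class and variable names are made up for this example, and the condition tensor is assumed to have the same shape as the block input.

import copy
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose weights and bias start at zero
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    # Illustrative pairing of a frozen ("locked") block with its trainable copy
    def __init__(self, locked_block: nn.Module, channels: int):
        super().__init__()
        self.trainable = copy.deepcopy(locked_block)  # trainable copy, learns the control
        self.locked = locked_block                    # original SD weights, kept frozen
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)            # injects the control condition
        self.zero_out = zero_conv(channels)           # injects the trainable copy's output

    def forward(self, x, condition):
        # At initialization both zero convolutions output zeros, so this is exactly
        # locked(x): the unchanged Stable Diffusion behavior.
        return self.locked(x) + self.zero_out(self.trainable(x + self.zero_in(condition)))

During training, gradients flow only into the trainable copy and the zero convolutions, which is why a small dataset is enough to add a new condition without disturbing the original model.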

2 ControlNet deployment and model download

2.1 Setting up the runtime environment

git clone https://github.com/lllyasviel/ControlNet.git

cd ControlNet

conda env create -f environment.yaml

conda activate control
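
As an optional sanity check (assuming the environment installed a CUDA build of PyTorch), confirm that PyTorch can see the GPU before moving on:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"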

2.2 Model download

(1) SD model and detector model download

Model address: huggingface

After the download is complete, move the models to the following directories:

  • SD model: models
  • detector models: annotator/ckpts

Model address: ControlNetHED.pth

After the download is complete, move this model to the annotator/ckpts directory as well.

After the move is complete, list the directory to verify; the output should look like this:

[root@localhost ControlNet]# ll annotator/ckpts/
total 1125948
-rw-r--r-- 1 root root 209267595 Jul 14 14:19 body_pose_model.pth
-rw-r--r-- 1 root root        13 Jul 13 15:27 ckpts.txt
-rw-r--r-- 1 root root  29444406 Jul 14 16:52 ControlNetHED.pth
-rw-r--r-- 1 root root 492757791 Jul 14 14:20 dpt_hybrid-midas-501f0c75.pt
-rw-r--r-- 1 root root 147341049 Jul 14 14:20 hand_pose_model.pth
-rw-r--r-- 1 root root   6341481 Jul 14 14:20 mlsd_large_512_fp32.pth
-rw-r--r-- 1 root root   2613835 Jul 14 14:20 mlsd_tiny_512_fp32.pth
-rw-r--r-- 1 root root  58871680 Jul 14 14:20 network-bsds500.pth
-rw-r--r-- 1 root root 206313115 Jul 14 14:21 upernet_global_small.pth
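
Before running the demos, it can help to confirm that all of the detector checkpoints really landed in annotator/ckpts. A small sanity check, assuming it is run from the repository root:

import os

expected = ['body_pose_model.pth', 'ControlNetHED.pth', 'dpt_hybrid-midas-501f0c75.pt',
            'hand_pose_model.pth', 'mlsd_large_512_fp32.pth', 'mlsd_tiny_512_fp32.pth',
            'network-bsds500.pth', 'upernet_global_small.pth']
missing = [name for name in expected
           if not os.path.exists(os.path.join('annotator/ckpts', name))]
print('missing checkpoints:', missing or 'none')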

(2) clip-vit model download

Model address: clip-vit-large-patch14

After the download is complete, create a folder named clip-vit-large-patch14 under the models directory, move the downloaded files into it, and list the directory to verify; the output should look like this:

[root@localhost ControlNet]# ll models/clip-vit-large-patch14/
total 5015648
-rw-r--r-- 1 root root       4519 Jul 14 16:18 config.json
-rw-r--r-- 1 root root 1710486359 Jul 14 16:21 flax_model.msgpack
-rw-r--r-- 1 root root     524619 Jul 14 16:21 merges.txt
-rw-r--r-- 1 root root        316 Jul 14 16:21 preprocessor_config.json
-rw-r--r-- 1 root root 1710671599 Jul 14 16:23 pytorch_model.bin
-rw-r--r-- 1 root root       7947 Jul 14 16:23 README.md
-rw-r--r-- 1 root root        389 Jul 14 16:23 special_tokens_map.json
-rw-r--r-- 1 root root 1711114176 Jul 14 16:26 tf_model.h5
-rw-r--r-- 1 root root        905 Jul 14 16:26 tokenizer_config.json
-rw-r--r-- 1 root root    2224003 Jul 14 16:26 tokenizer.json
-rw-r--r-- 1 root root     961143 Jul 14 16:26 vocab.json

Next, modify the code so that the CLIP model is loaded from this local directory instead of being downloaded automatically over the network (the automatic download is slow and often fails):

vi ldm/modules/encoders/modules.py

    def __init__(self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77,
                 freeze=True, layer="last", layer_idx=None):  # clip-vit-base-patch32
        super().__init__()
        assert layer in self.LAYERS
        # Changed: load the tokenizer and text encoder from the local directory
        # instead of pulling "openai/clip-vit-large-patch14" from the Hugging Face hub
        self.tokenizer = CLIPTokenizer.from_pretrained('models/clip-vit-large-patch14')
        self.transformer = CLIPTextModel.from_pretrained('models/clip-vit-large-patch14')
        self.device = device
        self.max_length = max_length
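
To confirm that the local directory is actually picked up and nothing is fetched from the network, the folder can be loaded directly with transformers. A quick check (run from the ControlNet repository root; the prompt text is arbitrary):

from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained('models/clip-vit-large-patch14')
model = CLIPTextModel.from_pretrained('models/clip-vit-large-patch14')
tokens = tokenizer(["a photo of a cat"], truncation=True, max_length=77,
                   padding="max_length", return_tensors="pt")
print(model(**tokens).last_hidden_state.shape)  # expected: torch.Size([1, 77, 768])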

3 Running ControlNet and sample results

3.1 Running canny2image

python gradio_canny2image.py
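
The canny2image demo extracts a Canny edge map (white edges on a black background) from the uploaded image and conditions generation on it. As a rough standalone sketch of that preprocessing step with OpenCV (file names and thresholds here are illustrative; the Gradio UI exposes the thresholds as sliders):

import cv2
import numpy as np

img = cv2.imread('input.png')             # any test image (illustrative path)
edges = cv2.Canny(img, 100, 200)          # low/high thresholds, illustrative values
control = np.stack([edges] * 3, axis=2)   # 3-channel map: white edges on black
cv2.imwrite('canny_control.png', control)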

Show results:

3.2 Running hough2image

python gradio_hough2image.py

Show results:

3.3 Running hed2image

python gradio_hed2image.py

Show results:

3.4 Running scribble2image

python gradio_scribble2image.py
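
The scribble demos expect a monochrome scribble map with white strokes on a black background. A hand drawing with dark strokes on white paper can be converted with a simple threshold and inversion; this is only a sketch of the idea, not the demo's exact preprocessing, and the file names are illustrative:

import cv2

drawing = cv2.imread('scribble.png', cv2.IMREAD_GRAYSCALE)            # illustrative input
_, control = cv2.threshold(drawing, 127, 255, cv2.THRESH_BINARY_INV)  # dark strokes -> white on black
cv2.imwrite('scribble_control.png', control)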

Show results:

3.5 Running interactive scribble2image

python gradio_scribble2image_interactive.py

Show results:

3.6 Running fake scribble2image

python gradio_fake_scribble2image.py

Show results:

3.7 Running pose2image

python gradio_pose2image.py

Show results:

3.8 Running seg2image

python gradio_seg2image.py

Show results:

3.9 Running depth2image

python gradio_depth2image.py
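
The depth2image demo conditions generation on a MiDaS-style depth map (produced by the dpt_hybrid checkpoint downloaded earlier). As a rough standalone sketch of generating such a map with the public MiDaS model from torch.hub rather than the repository's own annotator code (this downloads the hub model on first use and requires the timm package):

import cv2
import torch

midas = torch.hub.load('intel-isl/MiDaS', 'DPT_Hybrid')
midas.eval()
transforms = torch.hub.load('intel-isl/MiDaS', 'transforms')

img = cv2.cvtColor(cv2.imread('input.png'), cv2.COLOR_BGR2RGB)  # illustrative path
batch = transforms.dpt_transform(img)
with torch.no_grad():
    pred = midas(batch)                       # inverse depth: larger values are closer
depth = pred.squeeze().cpu().numpy()
depth = (255 * (depth - depth.min()) / (depth.max() - depth.min())).astype('uint8')
cv2.imwrite('depth_control.png', depth)       # white = near, black = far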

Show results:

3.10 Running normal2image

python gradio_normal2image.py

Show results:

 

4 Troubleshooting

4.1 Fixing the "No module 'xformers'. Proceeding without it" warning

Error output:

[root@localhost ControlNet]# python gradio_normal2image.py
logging improved.
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_normal.pth]
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.

Solution:

pip install xformers==0.0.20

5 Summary

ControlNet is a very powerful neural network architecture that controls the diffusion model by adding extra conditions. Multi-ControlNet is not yet supported here, though the open-source community reports that it is under active development. That feature would make it possible to combine several ControlNets and use their outputs together during image generation, giving even finer control over the whole image.
