1 Introduction to ControlNet
1.1 What is ControlNet?
ControlNet, an extension of Stable Diffusion developed by Stanford researchers, lets creators control the content and composition of AI-generated images. It steers image generation with conditioning inputs such as edge maps, sketches, or human poses.
Paper: Adding Conditional Control to Text-to-Image Diffusion Models
ControlNet is a neural network structure that controls Stable Diffusion by adding extra conditions. It enhances text-to-image generation with conditional inputs such as scribbles, edge maps, segmentation maps, and pose keypoints, so the generated image follows the input condition far more closely than traditional image-to-image methods.
A ControlNet can be trained on a relatively small dataset and then attached to any pre-trained Stable Diffusion model, effectively fine-tuning it for the new condition.
The initial version of ControlNet comes with the following pretrained weights (a short sketch after this list shows how a typical Canny control image is prepared):
- Canny edge — a monochrome image with white edges on a black background.
- Depth map — a grayscale image with black representing deep (far) areas and white representing shallow (near) areas.
- Normal map — Normal map image.
- Semantic segmentation map — an ADE20K-style segmentation image.
- HED edge — a monochrome image with white soft edges on a black background.
- Scribbles — a hand-drawn monochrome scribble image with white outlines on a black background.
- OpenPose (pose keypoints) — an OpenPose skeleton image.
- M-LSD — a monochrome image consisting only of white straight lines on a black background.
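As referenced above, a Canny control image can be prepared from an ordinary photo with OpenCV. This is only a minimal sketch; the file names and the 100/200 thresholds are illustrative placeholders, not values mandated by ControlNet:

# Minimal sketch: turn a photo into a Canny control image
# (white edges on a black background). File names and thresholds are placeholders.
import cv2

img = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)  # any photo, read as grayscale
edges = cv2.Canny(img, 100, 200)                      # single-channel edge map
cv2.imwrite("canny_control.png", edges)               # monochrome control image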
1.2 Principle of ControlNet
ControlNet is a neural network structure that controls the diffusion model by adding additional conditions. Its weights are divided into two copies:
- a trainable copy ("trainable")
- a frozen copy ("locked")
The trainable copy learns the new conditional control, while the locked copy preserves the original Stable Diffusion weights. Because the original model stays untouched, a small amount of conditioning data is enough to learn the new control without degrading the generative ability of the original diffusion model.
The "Zero Convolution" above is a 1×1 convolution with zero-initialized weights and biases. Before the start of your own model training, all zero convolution outputs are zero, and the model is still the original Stable Diffusion Model. After adding your own training data, it will have an impact on the final data. The impact here is more of a fine-tuning of the final result, so it will not cause major deviations in the model. The overall model structure is as follows:
As the overall structure shows, ControlNet feeds the outputs of these zero-convolution layers into the decoder blocks of the Stable Diffusion U-Net, which is how the final model is brought into agreement with the training data.
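The idea is easy to see in code. The following is a minimal, self-contained PyTorch sketch of a zero convolution and of a trainable copy wired next to its locked original; it only illustrates the principle and is not the repository's actual implementation:

import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution whose weights and biases start at zero
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    # A locked (frozen) block plus a trainable copy whose contribution
    # enters the output through a zero convolution.
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.trainable = copy.deepcopy(block)   # trainable copy of the block
        self.locked = block
        for p in self.locked.parameters():
            p.requires_grad = False             # keep the original SD weights frozen
        self.zero = zero_conv(channels)

    def forward(self, x, control):
        # At initialization the zero conv outputs all zeros, so the block
        # behaves exactly like the original (locked) block.
        return self.locked(x) + self.zero(self.trainable(x + control))

# Before any training, the wrapped block reproduces the original output exactly.
block = nn.Conv2d(64, 64, kernel_size=3, padding=1)
wrapped = ControlledBlock(block, channels=64)
x, control = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
assert torch.allclose(wrapped(x, control), block(x))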
2 ControlNet deployment and model download
2.1 Construction of the operating environment
git clone https://github.com/lllyasviel/ControlNet.git
cd ControlNet
conda env create -f environment.yaml
conda activate control
2.2 Model download
(1) SD model and detector models download
Model address: huggingface
After the download is complete, move the models to the following directories:
- SD model: models
- detector models: annotator/ckpts
Model address: ControlNetHED.pth
After the download is complete, move the model to the annotator/ckpts directory.
Once the files are in place, list the directory to verify; the output should look like this:
[root@localhost ControlNet]# ll annotator/ckpts/
total 1125948
-rw-r--r-- 1 root root 209267595 Jul 14 14:19 body_pose_model.pth
-rw-r--r-- 1 root root 13 Jul 13 15:27 ckpts.txt
-rw-r--r-- 1 root root 29444406 Jul 14 16:52 ControlNetHED.pth
-rw-r--r-- 1 root root 492757791 Jul 14 14:20 dpt_hybrid-midas-501f0c75.pt
-rw-r--r-- 1 root root 147341049 Jul 14 14:20 hand_pose_model.pth
-rw-r--r-- 1 root root 6341481 Jul 14 14:20 mlsd_large_512_fp32.pth
-rw-r--r-- 1 root root 2613835 Jul 14 14:20 mlsd_tiny_512_fp32.pth
-rw-r--r-- 1 root root 58871680 Jul 14 14:20 network-bsds500.pth
-rw-r--r-- 1 root root 206313115 Jul 14 14:21 upernet_global_small.pth
(2) clip-vit model download
Model address: clip-vit-large-patch14
After the download is complete, create a clip-vit-large-patch14 folder under the models directory, move the model files into it, then list the directory to verify; the output should look like this:
[root@localhost ControlNet]# ll models/clip-vit-large-patch14/
total 5015648
-rw-r--r-- 1 root root 4519 Jul 14 16:18 config.json
-rw-r--r-- 1 root root 1710486359 Jul 14 16:21 flax_model.msgpack
-rw-r--r-- 1 root root 524619 Jul 14 16:21 merges.txt
-rw-r--r-- 1 root root 316 Jul 14 16:21 preprocessor_config.json
-rw-r--r-- 1 root root 1710671599 Jul 14 16:23 pytorch_model.bin
-rw-r--r-- 1 root root 7947 Jul 14 16:23 README.md
-rw-r--r-- 1 root root 389 Jul 14 16:23 special_tokens_map.json
-rw-r--r-- 1 root root 1711114176 Jul 14 16:26 tf_model.h5
-rw-r--r-- 1 root root 905 Jul 14 16:26 tokenizer_config.json
-rw-r--r-- 1 root root 2224003 Jul 14 16:26 tokenizer.json
-rw-r--r-- 1 root root 961143 Jul 14 16:26 vocab.json
Modify the code so that the model is loaded locally instead of being downloaded over the network (online downloads are slow and often fail):
vi ldm/modules/encoders/modules.py
def __init__(self, version="openai/clip-vit-large-patch14", device="cuda", max_length=77,
             freeze=True, layer="last", layer_idx=None):  # clip-vit-base-patch32
    super().__init__()
    assert layer in self.LAYERS
    # Load the tokenizer and the text encoder from the local directory
    # instead of downloading them from the Hugging Face hub.
    self.tokenizer = CLIPTokenizer.from_pretrained('models/clip-vit-large-patch14')
    self.transformer = CLIPTextModel.from_pretrained('models/clip-vit-large-patch14')
    self.device = device
    self.max_length = max_length
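To confirm that the local copy is picked up without any network access, a quick check like the one below can be run from the ControlNet directory (the expected shape assumes the standard clip-vit-large-patch14 text encoder):

# Sanity check: load the tokenizer and text encoder from the local folder.
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("models/clip-vit-large-patch14")
text_model = CLIPTextModel.from_pretrained("models/clip-vit-large-patch14")

tokens = tokenizer(["a photo of a cat"], padding="max_length",
                   max_length=77, return_tensors="pt")
out = text_model(**tokens)
print(out.last_hidden_state.shape)  # expected: torch.Size([1, 77, 768])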
3 ControlNet operation and effect display
3.1 Run canny2image
python gradio_canny2image.py
Show results:
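Each gradio_*2image.py script builds its model in roughly the same way. The sketch below is based on the repository's cldm.model helpers; the control_sd15_canny.pth filename is an assumption matching the checkpoints downloaded in section 2.2:

# Rough sketch of how the gradio scripts load the ControlNet model
# (the checkpoint path is assumed to follow the layout from section 2.2).
from cldm.model import create_model, load_state_dict

model = create_model('./models/cldm_v15.yaml').cpu()
model.load_state_dict(load_state_dict('./models/control_sd15_canny.pth', location='cuda'))
model = model.cuda()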
3.2 Run hough2image
python gradio_hough2image.py
Show results:
3.3 Run hed2image
python gradio_hed2image.py
Show results:
3.4 Run scribble2image
python gradio_scribble2image.py
Show results:
3.5 Run interactive scribble2image
python gradio_scribble2image_interactive.py
Show results:
3.6 Run fake scribble2image
python gradio_fake_scribble2image.py
Show results:
3.7 Run pose2image
python gradio_pose2image.py
Show results:
3.8 Run seg2image
python gradio_seg2image.py
Show results:
3.9 Run depth2image
python gradio_depth2image.py
Show results:
3.10 Run normal2image
python gradio_normal2image.py
Show results:
4 Problem solving
4.1 "No module 'xformers'. Proceeding without it" problem solving
Error output:
[root@localhost ControlNet]# python gradio_normal2image.py
logging improved.
No module 'xformers'. Proceeding without it.
ControlLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./models/cldm_v15.yaml]
Loaded state_dict from [./models/control_sd15_normal.pth]
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Solution:
pip install xformers==0.0.20
5 Summary
ControlNet is a very powerful neural network architecture that controls the diffusion model by adding additional conditions. Multi-ControlNet is not yet supported here, though the open-source community reports that it is under active development. That feature would make it possible to run several control networks at once and combine their outputs during image generation, giving finer control over the whole image.