Interpretation of ControlNet, the Core Plug-in of Stable Diffusion

Table of contents

1. Introduction

2. How to use

3. ControlNet structure

1. Overall structure

2. ControlLDM

3. Timestep Embedding

4. HintBlock

5. ResBlock

6. SpatialTransformer

7. SD Encoder Block

8. SD Decoder Block

9. ControlNet Encoder Block

10. Stable Diffusion

4. Training

1. Prepare the dataset

2. Generate the ControlNet model

3. Execute training

5. Others

1. Loss function

2. Random replacement tips

3. Support for low-resource devices


1. Introduction

        Paper address: https://arxiv.org/abs/2302.05543

        Code address: https://github.com/lllyasviel/ControlNet (lllyasviel/ControlNet: Let us control diffusion models!)

        The core idea of the diffusion model (Diffusion Model) is to generate images by denoising. During training, at each time step a different "concentration" of noise is added to the original image; the time step (timestep) and the noised image are then fed to the model, which is responsible for predicting the noise, and subtracting the predicted noise from the input image recovers the original image. As Michelangelo said: the statue is already in the stone, I just remove the unnecessary parts. This is also why, when using Stable Diffusion, a larger Sampling steps value is not always better: the value has to match the time step of the current noised image.
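
        To make that training step concrete, here is a minimal sketch of the usual DDPM-style noising and noise-prediction objective (my own illustration, not code from the ControlNet repository; model and alphas_cumprod are assumed placeholders):

import torch
import torch.nn.functional as F

def train_step(model, x0, alphas_cumprod):
    # pick a random timestep and Gaussian noise for each image in the batch
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    # "add a certain concentration of noise": x_t = sqrt(a_bar) * x0 + sqrt(1 - a_bar) * noise
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise
    # the model sees the noised image and the timestep, and predicts the noise
    pred_noise = model(x_t, t)
    return F.mse_loss(pred_noise, noise)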

        ControlNet adds extra input conditions, such as edge maps, segmentation maps, and key points, on top of a large pre-trained diffusion model (Stable Diffusion), and combines them with a text prompt to generate new images; it is also an important plug-in for stable-diffusion-webui. Because ControlNet keeps the Stable Diffusion weights frozen and uses zero convolutions, the fine-tuning quality does not degrade even on a small dataset and a personal computer, so it achieves the goal of learning task-specific conditions in an end-to-end manner.

ControlNet has two main innovations:

1. Freeze the parameters of Stable Diffusion and, at the same time, make a trainable copy of its encoder (SD Encoder). There are two advantages to doing this:

        a. Training such a copy instead of the original weights avoids overfitting when the dataset is small, while preserving the quality of the large model learned from billions of images.

        b. Since the original weights are locked, no gradients need to be computed for the original encoder during training. This speeds up training and saves GPU memory, because gradients for the parameters of the original model are never calculated.

2. Zero convolution: a convolution whose weights and biases are initialized to zero. A zero convolution is added after each layer of the copy to connect it to the corresponding layer of the original network. In the first training step, the inputs and outputs of the trainable copy and of the locked network blocks are therefore identical, as if ControlNet did not exist. In other words, before any optimization ControlNet has no effect on the deep features, any further optimization can only improve the model, and training converges quickly.
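
        A zero convolution is nothing exotic; here is a minimal sketch (my own illustration; the repository's zero_module helper achieves the same effect by zeroing the parameters of a standard conv):

import torch
import torch.nn as nn

def zero_conv(channels):
    # a 1x1 convolution whose weights and bias start at exactly zero
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

x = torch.randn(1, 320, 64, 64)
print(zero_conv(320)(x).abs().max().item())  # 0.0 -> at initialization the branch contributes nothing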

2. How to use

        The project provides many functions, such as generating images from line (canny) maps, from segmentation maps, from poses, and so on. They are all used in a similar way; let's take canny-to-image generation as an example.

        Download the pre-trained model from lllyasviel/ControlNet at main (Hugging Face), choose the control_sd15_canny.pth model, and put it in the models directory.

        Execute the following command to start the project:

python gradio_canny2image.py

        After startup, visit http://127.0.0.1:7860 and the page opens as follows:

         Upload the picture in the first red box and fill in the prompt in the second red box (only English is supported). After a while, two pictures are generated on the right: one is the canny map extracted from the original picture, the other is generated from that canny map plus the prompt. In the result you can see that the model understood the prompt: the girl's hair is purple.

        Click Advanced options and additional options appear. Here is a brief explanation of each one:

Images: how many pictures to generate; if the value is too large, be careful not to run out of GPU memory.

Image Resolution: The generated image resolution.

Control Strength: as described below, ControlNet consists of two parts, Stable Diffusion and ControlNet; this parameter is the weight of the ControlNet part. When Guess Mode (below) is not selected, every one of the 13 ControlNet layers uses this value as its weight; when Guess Mode is selected, the layer weights grow layer by layer from strength * 0.825^12 (about 0.1 * strength) up to strength. The scaling code is as follows, and the comment is quite amusing:

# Location: gradio_canny2image.py
# Magic number. IDK why. Perhaps because 0.825**12<0.01 but 0.826**12>0.01
model.control_scales = [strength * (0.825 ** float(12 - i)) for i in range(13)] if guess_mode else ([strength] * 13)  

Guess Mode: when it is not selected, both the Stable Diffusion and the ControlNet branches are active while the Negative Prompt is processed; when it is selected, only the Stable Diffusion branch is used for the Negative Prompt and the ControlNet branch is skipped. The relevant code is in two places:


# Location: gradio_canny2image.py
...
un_cond = {"c_concat": None if guess_mode else [control], "c_crossattn": [model.get_learned_conditioning([n_prompt] * num_samples)]}
...

# Location: cldm/cldm.py
if cond['c_concat'] is None:
    eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=None, only_mid_control=self.only_mid_control)
else:
    # ControlNet()
    control = self.control_model(x=x_noisy, hint=torch.cat(cond['c_concat'], 1), timesteps=t, context=cond_txt)
    control = [c * scale for c, scale in zip(control, self.control_scales)]
    # ControlledUnetModel()
    eps = diffusion_model(x=x_noisy, timesteps=t, context=cond_txt, control=control, only_mid_control=self.only_mid_control)

Canny low threshold: a Canny parameter; edge pixels below the low threshold are suppressed.

Canny high threshold: a Canny parameter; edge pixels above the high threshold are marked as strong edge pixels.

Steps: how many "denoising" steps to perform.

Guidance Scale: the weight of the positive prompt. In the code below, unconditional_guidance_scale is this parameter; model_t is the output conditioned on Prompt + Added Prompt, and model_uncond is the output conditioned on the Negative Prompt:


# Location: cldm/ddim_hacked.py
model_output = model_uncond + unconditional_guidance_scale * (model_t - model_uncond)

Seed: the random seed used to generate the noise map. If this value and all other settings stay the same, the generated result stays the same too.

eta (DDIM): The eta value in DDIM sampling.

Added Prompt: Additional positive prompts, such as best quality, extremely detailed

Negative Prompt: additional negative prompts; if the generated image is unsatisfactory, describe the unsatisfactory parts here, e.g. longbody, lowres, bad anatomy.

3. ControlNet structure

        The official structure diagram given by ControlNet is as follows:

        This figure broadly summarizes the structure of ControlNet, but many details are not shown, so after reading the code I will give a more detailed description of the model structure. The training input used in the project is 512x512; to tell width and height apart, I use a 1024x512 input below. Again we take canny2image as an example, with batch_size=1.

1. Overall structure

       The overall structure of the model is as follows:

        The model inputs include the canny map (Map Input), the Prompt, the additional prompt (Added Prompt), the negative prompt (Negative Prompt) and the random noise map (Random Input).

        Prompt and Added Prompt are concatenated into a single string and turned into a text embedding by the CLIP text encoder (two FrozenCLIPEmbedder instances that share parameters). Together with Map Input and Random Input, this is fed into ControlNet's core module ControlLDM (Latent Diffusion), which is run in a loop 20 times (corresponding to the page parameter Steps), with a different timestep at each iteration. Taking Steps=20 as an example, the timesteps are [1, 51, 101, 151, 201, 251, 301, 351, 401, 451, 501, 551, 601, 651, 701, 751, 801, 851, 901, 951].
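
        Those timesteps come from a uniform DDIM schedule over the 1000 training steps; a minimal sketch of how such a schedule can be computed (my own illustration of the uniform discretization):

import numpy as np

ddpm_steps, ddim_steps = 1000, 20
c = ddpm_steps // ddim_steps
ddim_timesteps = np.asarray(list(range(0, ddpm_steps, c))) + 1
print(ddim_timesteps)  # [  1  51 101 ... 901 951]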

        The Negative Prompt goes through the same pipeline, and the outputs of the Prompt branch and the Negative Prompt branch are then combined with a weighted sum. The formula is as follows, where Guidance Scale is the page parameter of the same name (default 9):

out = Guidance Scale * PromptOut - (Guidance Scale - 1) * NegativePromptOut

        Finally, Decode First Stage restores the result to the original image size.

2. ControlLDM

       ControlLDM is the core module of ControlNet; its structure diagram is as follows:

        The overall structure of ControlLDM is fairly clear; the main data flow is as follows:

a. The timesteps are converted into feature vectors by the embedding module and sent to both Stable Diffusion and ControlNet;

b. The random noise is sent to Stable Diffusion;

c. The image Map passes through HintBlock, is added to the random noise, and is sent into ControlNet;

d. The Prompt embedding is sent to both Stable Diffusion and ControlNet;

e. All parameters of Stable Diffusion are frozen and do not participate in training. Stable Diffusion consists of three SDEncoderBlocks, two SDEncoders, one SDMiddleBlock, two SDDecoders and three SDDecoderBlocks;

f. The structure of ControlNet mirrors the encoder part of Stable Diffusion, except that a zero convolution is added after each layer;

g. The ResBlocks in Stable Diffusion and ControlNet take the output of the previous layer and the timestep embedding as input;

h. The SpatialTransformers in Stable Diffusion and ControlNet take the output of the previous layer and the Prompt embedding as input.

        There are still some modules in the figure that need to be mentioned separately.

3. Timestep Embedding

        The timestep is an important input of the model and directly affects the denoising result. The timestep enters the model as a single number and, after Timestep Embedding, becomes an embedding vector of length 1280.

The code is as follows:

# Location: ldm/modules/diffusionmodules/util.py
def timestep_embedding(timesteps, dim, max_period=10000, repeat_only=False):
    """
    Create sinusoidal timestep embeddings.
    :param timesteps: a 1-D Tensor of N indices, one per batch element.
                      These may be fractional.
    :param dim: the dimension of the output.
    :param max_period: controls the minimum frequency of the embeddings.
    :return: an [N x dim] Tensor of positional embeddings.
    """
    if not repeat_only:
        half = dim // 2
        freqs = torch.exp(
            -math.log(max_period) * torch.arange(start=0, end=half, dtype=torch.float32) / half
        ).to(device=timesteps.device)
        args = timesteps[:, None].float() * freqs[None]
        embedding = torch.cat([torch.cos(args), torch.sin(args)], dim=-1)
        if dim % 2:
            embedding = torch.cat([embedding, torch.zeros_like(embedding[:, :1])], dim=-1)
    else:
        embedding = repeat(timesteps, 'b -> b d', d=dim)
    return embedding
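
        A quick shape check (my own illustration): the sinusoidal embedding has length model_channels = 320, and in the model it is then passed through the time_embed MLP, which projects it to length 1280. The MLP below is an assumed stand-in for that module:

import torch
import torch.nn as nn

t = torch.tensor([951])               # one timestep per batch element
emb = timestep_embedding(t, dim=320)  # sinusoidal embedding from the function above
print(emb.shape)                      # torch.Size([1, 320])

# assumed stand-in for the model's time_embed MLP (320 -> 1280)
time_embed = nn.Sequential(nn.Linear(320, 1280), nn.SiLU(), nn.Linear(1280, 1280))
print(time_embed(emb).shape)          # torch.Size([1, 1280])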

4. HintBlock

        The main function of HintBlock is to extract features from the input image Map before it is fused with the other features, which is a common pattern. HintBlock stacks several convolution layers (three of them with stride 2, so the Map is downsampled by a factor of 8), increases the number of channels, and ends with a zero convolution.

 Code:

# Location: cldm/cldm.py
self.input_hint_block = TimestepEmbedSequential(
            conv_nd(dims, hint_channels, 16, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 16, 16, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 16, 32, 3, padding=1, stride=2),
            nn.SiLU(),
            conv_nd(dims, 32, 32, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 32, 96, 3, padding=1, stride=2),
            nn.SiLU(),
            conv_nd(dims, 96, 96, 3, padding=1),
            nn.SiLU(),
            conv_nd(dims, 96, 256, 3, padding=1, stride=2),
            nn.SiLU(),
            zero_module(conv_nd(dims, 256, model_channels, 3, padding=1))
        )
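
        As a rough shape check, here is a stand-in module of my own (not the repository's input_hint_block; it only mirrors the channel widths and the three stride-2 convolutions): with a 3-channel 512x1024 canny map and model_channels = 320, the hint is downsampled by a factor of 8 so that it matches the latent resolution:

import torch
import torch.nn as nn

hint_block = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, padding=1, stride=2), nn.SiLU(),
    nn.Conv2d(32, 96, 3, padding=1, stride=2), nn.SiLU(),
    nn.Conv2d(96, 256, 3, padding=1, stride=2), nn.SiLU(),
    nn.Conv2d(256, 320, 3, padding=1),  # in the model this last conv is a zero convolution
)
hint = torch.randn(1, 3, 512, 1024)     # RGB canny map at full resolution
print(hint_block(hint).shape)           # torch.Size([1, 320, 64, 128])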

5. ResBlock

        ResBlock is mainly responsible for fusing the timestep embedding with the output of the previous layer. The embedding branch uses a fully connected layer, which increases the parameter count considerably; GroupNorm is used, which saves some computation. The block gets its name from its residual (skip) connection. The structure is as follows:

The code is as follows:

# Location: ldm/modules/diffusionmodules/openaimodel.py
class ResBlock(TimestepBlock):
    """
    A residual block that can optionally change the number of channels.
    :param channels: the number of input channels.
    :param emb_channels: the number of timestep embedding channels.
    :param dropout: the rate of dropout.
    :param out_channels: if specified, the number of out channels.
    :param use_conv: if True and out_channels is specified, use a spatial
        convolution instead of a smaller 1x1 convolution to change the
        channels in the skip connection.
    :param dims: determines if the signal is 1D, 2D, or 3D.
    :param use_checkpoint: if True, use gradient checkpointing on this module.
    :param up: if True, use this block for upsampling.
    :param down: if True, use this block for downsampling.
    """

    def __init__(
        self,
        channels,
        emb_channels,
        dropout,
        out_channels=None,
        use_conv=False,
        use_scale_shift_norm=False,
        dims=2,
        use_checkpoint=False,
        up=False,
        down=False,
    ):
        super().__init__()
        self.channels = channels
        self.emb_channels = emb_channels
        self.dropout = dropout
        self.out_channels = out_channels or channels
        self.use_conv = use_conv
        self.use_checkpoint = use_checkpoint
        self.use_scale_shift_norm = use_scale_shift_norm

        self.in_layers = nn.Sequential(
            normalization(channels),
            nn.SiLU(),
            conv_nd(dims, channels, self.out_channels, 3, padding=1),
        )

        self.updown = up or down

        if up:
            self.h_upd = Upsample(channels, False, dims)
            self.x_upd = Upsample(channels, False, dims)
        elif down:
            self.h_upd = Downsample(channels, False, dims)
            self.x_upd = Downsample(channels, False, dims)
        else:
            self.h_upd = self.x_upd = nn.Identity()

        self.emb_layers = nn.Sequential(
            nn.SiLU(),
            linear(
                emb_channels,
                2 * self.out_channels if use_scale_shift_norm else self.out_channels,
            ),
        )
        self.out_layers = nn.Sequential(
            normalization(self.out_channels),
            nn.SiLU(),
            nn.Dropout(p=dropout),
            zero_module(
                conv_nd(dims, self.out_channels, self.out_channels, 3, padding=1)
            ),
        )

        if self.out_channels == channels:
            self.skip_connection = nn.Identity()
        elif use_conv:
            self.skip_connection = conv_nd(
                dims, channels, self.out_channels, 3, padding=1
            )
        else:
            self.skip_connection = conv_nd(dims, channels, self.out_channels, 1)

    def forward(self, x, emb):
        """
        Apply the block to a Tensor, conditioned on a timestep embedding.
        :param x: an [N x C x ...] Tensor of features.
        :param emb: an [N x emb_channels] Tensor of timestep embeddings.
        :return: an [N x C x ...] Tensor of outputs.
        """
        return checkpoint(
            self._forward, (x, emb), self.parameters(), self.use_checkpoint
        )


    def _forward(self, x, emb):
        if self.updown:
            in_rest, in_conv = self.in_layers[:-1], self.in_layers[-1]
            h = in_rest(x)
            h = self.h_upd(h)
            x = self.x_upd(x)
            h = in_conv(h)
        else:
            h = self.in_layers(x)
        emb_out = self.emb_layers(emb).type(h.dtype)
        while len(emb_out.shape) < len(h.shape):
            emb_out = emb_out[..., None]
        if self.use_scale_shift_norm:
            out_norm, out_rest = self.out_layers[0], self.out_layers[1:]
            scale, shift = th.chunk(emb_out, 2, dim=1)
            h = out_norm(h) * (1 + scale) + shift
            h = out_rest(h)
        else:
            h = h + emb_out
            h = self.out_layers(h)
        return self.skip_connection(x) + h
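
        A minimal usage sketch under assumed sizes (320-channel features at 64x128 and a 1280-dimensional timestep embedding; this assumes the repository's ldm package is importable):

import torch
from ldm.modules.diffusionmodules.openaimodel import ResBlock

block = ResBlock(channels=320, emb_channels=1280, dropout=0.0)
x = torch.randn(1, 320, 64, 128)   # output of the previous layer
emb = torch.randn(1, 1280)         # timestep embedding
print(block(x, emb).shape)         # torch.Size([1, 320, 64, 128])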

6. SpatialTransformer

        SpatialTransformer is mainly responsible for fusing the Prompt embedding with the output of the previous layer. The structure is as follows:

        As shown in the figure above, SpatialTransformer mainly consists of two CrossAttention modules and one FeedForward module.

        CrossAttention1 takes the output of the previous layer as its only input and produces Q, K and V from it through three fully connected layers. Q is multiplied by K and passed through Softmax to obtain an attention map, which is then multiplied by V. This is the standard Attention structure; since Q, K and V all come from the same input, it is effectively a Self-Attention.

        CrossAttention2 has the same general structure as CrossAttention1, except that K and V are generated from the Prompt embedding. After the two CrossAttention modules, the image features and the Prompt embedding have been fused together.

        The FeedForward module uses GEGLU, with a fully connected layer at its head and tail, to further process the fused features.

Code:

# Location: ldm/modules/attention.py
class BasicTransformerBlock(nn.Module):
    ATTENTION_MODES = {
        "softmax": CrossAttention,  # vanilla attention
        "softmax-xformers": MemoryEfficientCrossAttention
    }
    def __init__(self, dim, n_heads, d_head, dropout=0., context_dim=None, gated_ff=True, checkpoint=True,
                 disable_self_attn=False):
        super().__init__()
        attn_mode = "softmax-xformers" if XFORMERS_IS_AVAILBLE else "softmax"
        assert attn_mode in self.ATTENTION_MODES
        attn_cls = self.ATTENTION_MODES[attn_mode]
        self.disable_self_attn = disable_self_attn
        self.attn1 = attn_cls(query_dim=dim, heads=n_heads, dim_head=d_head, dropout=dropout,
                              context_dim=context_dim if self.disable_self_attn else None)  # is a self-attention if not self.disable_self_attn
        self.ff = FeedForward(dim, dropout=dropout, glu=gated_ff)
        self.attn2 = attn_cls(query_dim=dim, context_dim=context_dim,
                              heads=n_heads, dim_head=d_head, dropout=dropout)  # is self-attn if context is none
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.checkpoint = checkpoint

    def forward(self, x, context=None):
        return checkpoint(self._forward, (x, context), self.parameters(), self.checkpoint)

    def _forward(self, x, context=None):
        x = self.attn1(self.norm1(x), context=context if self.disable_self_attn else None) + x
        x = self.attn2(self.norm2(x), context=context) + x
        x = self.ff(self.norm3(x)) + x
        return x


class SpatialTransformer(nn.Module):
    """
    Transformer block for image-like data.
    First, project the input (aka embedding)
    and reshape to b, t, d.
    Then apply standard transformer action.
    Finally, reshape to image
    NEW: use_linear for more efficiency instead of the 1x1 convs
    """
    def __init__(self, in_channels, n_heads, d_head,
                 depth=1, dropout=0., context_dim=None,
                 disable_self_attn=False, use_linear=False,
                 use_checkpoint=True):
        super().__init__()
        if exists(context_dim) and not isinstance(context_dim, list):
            context_dim = [context_dim]
        self.in_channels = in_channels
        inner_dim = n_heads * d_head
        self.norm = Normalize(in_channels)
        if not use_linear:
            self.proj_in = nn.Conv2d(in_channels,
                                     inner_dim,
                                     kernel_size=1,
                                     stride=1,
                                     padding=0)
        else:
            self.proj_in = nn.Linear(in_channels, inner_dim)

        self.transformer_blocks = nn.ModuleList(
            [BasicTransformerBlock(inner_dim, n_heads, d_head, dropout=dropout, context_dim=context_dim[d],
                                   disable_self_attn=disable_self_attn, checkpoint=use_checkpoint)
                for d in range(depth)]
        )
        if not use_linear:
            self.proj_out = zero_module(nn.Conv2d(inner_dim,
                                                  in_channels,
                                                  kernel_size=1,
                                                  stride=1,
                                                  padding=0))
        else:
            self.proj_out = zero_module(nn.Linear(in_channels, inner_dim))
        self.use_linear = use_linear

    def forward(self, x, context=None):
        # note: if no context is given, cross-attention defaults to self-attention
        if not isinstance(context, list):
            context = [context]
        b, c, h, w = x.shape
        x_in = x
        x = self.norm(x)
        if not self.use_linear:
            x = self.proj_in(x)
        x = rearrange(x, 'b c h w -> b (h w) c').contiguous()
        if self.use_linear:
            x = self.proj_in(x)
        for i, block in enumerate(self.transformer_blocks):
            x = block(x, context=context[i])
        if self.use_linear:
            x = self.proj_out(x)
        x = rearrange(x, 'b (h w) c -> b c h w', h=h, w=w).contiguous()
        if not self.use_linear:
            x = self.proj_out(x)
        return x + x_in
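
        A minimal usage sketch under assumed SD 1.5 sizes (320 channels, 8 heads of dimension 40, a CLIP text context of 77 tokens x 768 dimensions; again this assumes the repository's ldm package is importable):

import torch
from ldm.modules.attention import SpatialTransformer

st = SpatialTransformer(in_channels=320, n_heads=8, d_head=40, depth=1, context_dim=768)
x = torch.randn(1, 320, 64, 128)     # output of the previous layer
context = torch.randn(1, 77, 768)    # Prompt embedding from the CLIP text encoder
print(st(x, context=context).shape)  # torch.Size([1, 320, 64, 128])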

7. SD Encoder Block

        The SD Encoder Block is a building block of the Stable Diffusion encoding stage. It is mainly a stack of a ResBlock and a SpatialTransformer, which fuses the timestep embedding and the Prompt embedding into the image features, while downsampling the feature map and increasing its number of channels. It is worth noting that this part of the code is frozen. The structure diagram is as follows:

8. SD Decoder Block

        The SD Decoder Block is the corresponding building block of the Stable Diffusion decoding stage. It is likewise mainly a stack of a ResBlock and a SpatialTransformer, which fuses the timestep embedding and the Prompt embedding into the image features, while upsampling the feature map and reducing its number of channels; in the decoder the ControlNet outputs are also added to the skip connections (see the code below). This part of the code is also frozen. The structure diagram is as follows:

SD Encoder Block + SD Decoder Block code implementation:

# Location: cldm/cldm.py
class ControlledUnetModel(UNetModel):
    def forward(self, x, timesteps=None, context=None, control=None, only_mid_control=False, **kwargs):
        hs = []
        with torch.no_grad():
            t_emb = timestep_embedding(timesteps, self.model_channels, repeat_only=False)
            emb = self.time_embed(t_emb)
            h = x.type(self.dtype)
            for module in self.input_blocks:
                h = module(h, emb, context)
                hs.append(h)
            h = self.middle_block(h, emb, context)

        if control is not None:
            h += control.pop()

        for i, module in enumerate(self.output_blocks):
            if only_mid_control or control is None:
                h = torch.cat([h, hs.pop()], dim=1)
            else:
                h = torch.cat([h, hs.pop() + control.pop()], dim=1)
            h = module(h, emb, context)

        h = h.type(x.dtype)
        return self.out(h)

9. ControlNet Encoder Block

        The ControlNet Encoder Block is cloned from the SD Encoder Block; the only difference is the added zero convolution, and its parameters are trainable. The structure diagram is as follows:
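
        For reference, here is a simplified paraphrase (not verbatim code from cldm/cldm.py) of what the trainable copy's forward pass roughly does: every encoder block output goes through its zero convolution and is collected, and those collected tensors are what the frozen UNet above adds to its middle block and skip connections:

def controlnet_forward(input_blocks, zero_convs, middle_block, middle_block_out,
                       x, guided_hint, emb, context):
    # simplified sketch of the ControlNet branch, for illustration only
    outs, h = [], x
    for module, zero_conv in zip(input_blocks, zero_convs):
        h = module(h, emb, context)
        if guided_hint is not None:          # the HintBlock output is added once, after the first block
            h = h + guided_hint
            guided_hint = None
        outs.append(zero_conv(h, emb, context))      # one zero-convolution output per encoder block
    h = middle_block(h, emb, context)
    outs.append(middle_block_out(h, emb, context))   # 13 control tensors in total
    return outs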

10. Stable Diffusion

        The parameters of the entire Stable Diffusion part are frozen and not trainable; only the ControlNet parameters are handed to the optimizer, as the following code shows:

# Location: cldm/cldm.py
def configure_optimizers(self):
    lr = self.learning_rate
    params = list(self.control_model.parameters())
    if not self.sd_locked:
        params += list(self.model.diffusion_model.output_blocks.parameters())
        params += list(self.model.diffusion_model.out.parameters())
    opt = torch.optim.AdamW(params, lr=lr)
    return opt

4. Training

        Training ControlNet is not complicated; the main work is preparing the dataset. Again we take canny2image as an example.

1. Prepare the dataset

        The training data consists of three things: the original images, the canny Map images, and the corresponding prompts. If you just want to run through the training process, you can use the fill50k dataset. If you want to use your own dataset, you must prepare images in the style you need. Below I describe how to obtain the canny Map and the corresponding Prompt.

a. Generate cannyMap

        The project already ships a page for generating canny Maps; run the following command:

python gradio_annotator.py

        Open the address printed on the console, usually http://127.0.0.1:7860/.

         Upload a picture in the red box above and click Run to generate the canny Map. If your dataset is not large, this method is enough; if you have a lot of data, you can write a simple batch script around the method below (a sketch of such a script follows the snippet):

# Location: gradio_annotator.py
def canny(img, res, l, h):
    img = resize_image(HWC3(img), res)
    global model_canny
    if model_canny is None:
        from annotator.canny import CannyDetector
        model_canny = CannyDetector()
    result = model_canny(img, l, h)
    return [result]
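
        A minimal batch script could look like this (my own sketch using OpenCV's cv2.Canny directly instead of the project's CannyDetector; the directories and thresholds are assumed values):

import os
import cv2

input_dir, output_dir = "dataset/target", "dataset/source"  # assumed paths
low, high = 100, 200                                        # assumed Canny thresholds
os.makedirs(output_dir, exist_ok=True)

for name in os.listdir(input_dir):
    img = cv2.imread(os.path.join(input_dir, name))
    if img is None:
        continue
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)                      # single-channel edge map
    cv2.imwrite(os.path.join(output_dir, name), edges)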

b. Generate Prompt

        The easiest way to generate prompts is to use stable-diffusion-webui (see its installation tutorial) and let the deepbooru interrogator generate them for us; just follow the red boxes below.

         The result is generated under the directory of the fourth red box in the above figure, and the directory structure looks like this:

         The content in txt looks like this:

1girl, asian, bangs, black_eyes, blunt_bangs, closed_mouth, lips, long_hair, looking_at_viewer, realistic, shirt, smile, solo, white_shirt

c. Prepare the prompt.json file

        The prompt.json file follows the fill50k convention: one JSON object per line with source (conditioning Map), target (original image) and prompt keys, so the meaning of each key is clear at a glance.
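
        A minimal sketch for building such a file (my own illustration; the file names and prompts are placeholders):

import json

entries = [
    {"source": "source/0.png", "target": "target/0.png", "prompt": "1girl, asian, bangs, black_eyes"},
    {"source": "source/1.png", "target": "target/1.png", "prompt": "a dog running on the beach"},
]
with open("prompt.json", "w") as f:
    for e in entries:
        f.write(json.dumps(e) + "\n")   # one JSON object per line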

        The final dataset directory structure is as follows:

d. Modify the prompt.json path

        Modify the path of the prompt.json file used by the training code in tutorial_train.py.
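
        For reference, here is a minimal sketch of a dataset class that reads such a prompt.json (my own illustration, loosely modeled on the tutorial's dataset; the root path is an assumption):

import json
import cv2
import numpy as np
from torch.utils.data import Dataset

class PromptJsonDataset(Dataset):
    def __init__(self, root="./training/mydata"):   # assumed dataset root
        self.root = root
        with open(f"{root}/prompt.json", "rt") as f:
            self.items = [json.loads(line) for line in f]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        source = cv2.cvtColor(cv2.imread(f"{self.root}/{item['source']}"), cv2.COLOR_BGR2RGB)
        target = cv2.cvtColor(cv2.imread(f"{self.root}/{item['target']}"), cv2.COLOR_BGR2RGB)
        source = source.astype(np.float32) / 255.0           # conditioning map in [0, 1]
        target = (target.astype(np.float32) / 127.5) - 1.0   # image in [-1, 1]
        return dict(jpg=target, txt=item["prompt"], hint=source)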

2. Generate the ControlNet model

        Download the Stable Diffusion pre-trained model from here and put it in the models directory, then generate the initial ControlNet model with the following command. This step essentially copies the structure and parameters of the Stable Diffusion encoder into the ControlNet branch:

python tool_add_control.py  ./models/v1-5-pruned.ckpt ./models/control_sd15_ini.ckpt

3. Execute training

        We've finally come to the most exciting part: training!

python tutorial_train.py

5. Others

1. Loss function

        The ControlNet paper mentions the use of L2 loss:
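
        The objective (reconstructed here in the paper's notation, since the equation image is not reproduced) is the standard noise-prediction loss, conditioned on the text prompt c_t and the task-specific condition c_f:

L = \mathbb{E}_{z_0,\, t,\, c_t,\, c_f,\, \epsilon \sim \mathcal{N}(0,1)} \Big[ \big\| \epsilon - \epsilon_\theta(z_t, t, c_t, c_f) \big\|_2^2 \Big]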

         In fact, you can also choose L1 loss in the code:

# Location: ldm/models/diffusion/dpm_solver/ddpm.py
def get_loss(self, pred, target, mean=True):
    if self.loss_type == 'l1':
        loss = (target - pred).abs()
        if mean:
            loss = loss.mean()
    elif self.loss_type == 'l2':
        if mean:
            loss = torch.nn.functional.mse_loss(target, pred)
        else:
            loss = torch.nn.functional.mse_loss(target, pred, reduction='none')
    else:
        raise NotImplementedError(f"unknown loss type '{self.loss_type}'")

    return loss

2. Random replacement tips

       During training, 50% of the text prompts are randomly replaced with empty strings. This strengthens ControlNet's ability to recognize semantic content in the input conditioning Map: when the prompt is not visible to SD, the encoder has to learn more semantics from the Map input as a substitute for the prompt.
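
        A minimal sketch of that trick inside a dataset's __getitem__ (my own illustration, not code from the repository):

import random

def maybe_drop_prompt(prompt: str, p_drop: float = 0.5) -> str:
    # with probability p_drop, hide the text prompt so the model must rely on the conditioning Map
    return "" if random.random() < p_drop else prompt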

3. Support for low-resource devices

        If your device is very low-end, you can train only the middle part of ControlNet by adjusting the code like this:

# Location: tutorial_train.py
sd_locked = True
only_mid_control = True

        If your hardware is average, you can use the standard training procedure, i.e. freeze Stable Diffusion and train ControlNet. This is also the default configuration. The code is as follows:

# Location: tutorial_train.py
sd_locked = True
only_mid_control = False

        If your hardware is really good, you can train at full capacity (this also unlocks the SD decoder, see configure_optimizers above):

# Location: tutorial_train.py
sd_locked = False
only_mid_control = False

        These are the key parts of ControlNet. I will keep updating content related to Stable Diffusion, so stay tuned.


Original article: https://blog.csdn.net/xian0710830114/article/details/129194419