Stable Diffusion: training LoRA models with the lora-scripts WebUI

If you find the command-line scripts difficult to use, you can train LoRA models directly through the WebUI instead. Whether you are training characters, scenes, styles, or clothing, the workflow is the same general procedure; only the datasets differ, and with them the results.

Installing the lora-scripts WebUI

Use git clone --recurse-submodules https://github.com/Akegarasu/lora-scripts to download the source files; this can be done in the extensions directory of your SD installation. The code is based on https://github.com/kohya-ss/sd-scripts .

If the error "Failed to clone 'sd-scripts' a second time, aborting" appears during installation, clone it manually with git clone --recurse-submodules https://github.com/kohya-ss/sd-scripts.git.

(screenshot)

If your network connection is unstable, try a few more times. After the script has been cloned, the lora-scripts folder looks like this:

(screenshot)

Next, run install.ps1 in the lora-scripts folder to install and configure the development environment: right-click the script and run it with PowerShell, then answer Y to every prompt to accept the defaults. The script creates a virtual environment and configures CUDA. When "Install completed" is displayed, the installation succeeded and the window will close automatically.

(screenshot)

Alternatively, run git clone --recurse-submodules https://github.com/kohya-ss/sd-scripts.git in the current directory to install sd-scripts there, or simply drag in the sd-scripts folder you downloaded earlier.

Run run_gui.ps1 to open the WebUI. If the window closes immediately, the installation is incomplete and you need to re-run install-cn.ps1 to finish installing the third-party dependencies.

(screenshot)
The screenshot above shows that the installation is complete. Running run_gui.ps1 will open http://127.0.0.1:28000/lora/basic.html in the browser.

(screenshot)

LoRA training process

Here we use the standing portraits of the Wu characters from "Dynasty Warriors 8" as the training material.

(screenshot)

Splitting the material in SD

Open SD, go to the image preprocessing module, and enter the directory of the folder where the images are stored; make sure the path contains no Chinese characters. Configure it as shown in the screenshot below. The resolution must be a multiple of 64 and between 512 and 1024.

Remember to select only one of the two keyword (captioning) options.
(screenshot)

Clicking Preprocess generates the cropped, subject-focused images and their keyword captions in the data2 folder.

(screenshot)

Novice and expert modes

Whether you use expert mode or novice mode, the basic configuration is the same. Novice mode is used here; if you want expert mode, you can configure it yourself using the parameter descriptions below.

training model

Base model path: select the path of the model you want to train against. The default is a relative path, so you need to place your base model in the corresponding sd-models folder.
(screenshot)
Dataset settings

For the training dataset path, create a train folder under the lora-scripts folder. Then move the folder we just preprocessed into it, rename it to something like DynastyWarriors_8_wu, and prefix the name with a number and an underscore, as I did; the number before the underscore sets how many times each image is repeated per epoch. Do not use Chinese characters, spaces, or other special characters in the folder name.
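As a quick illustration of this naming convention, here is a minimal sketch; the folder name "10_DynastyWarriors_8_wu", the repeat value of 10, and the image count are assumed examples, not values from the original post.

```python
# Minimal sketch of the "<repeats>_<name>" folder convention (example values are assumptions).
def parse_dataset_folder(folder_name: str) -> tuple[int, str]:
    repeats, _, concept = folder_name.partition("_")
    return int(repeats), concept

repeats, concept = parse_dataset_folder("10_DynastyWarriors_8_wu")
num_images = 25  # assumed number of preprocessed images, for illustration only
print(repeats, concept, num_images * repeats)
# 10 DynastyWarriors_8_wu 250 -> each epoch sees every image 10 times, 250 image passes in total
```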

(screenshot)

Regularization images are not used by default, and the resolution can be set to 512x512, matching the images we just preprocessed.

Save Settings

Model save name: the file name used when saving the model; using the dataset folder name makes management easier. The model save folder defaults to the output folder under the current directory. The save interval controls how many epochs pass between saved checkpoints; if you have high requirements for the model, it is recommended to set it to 1 and then pick the best checkpoint afterwards based on the results.
(screenshot)
Training related parameters

Max train epochs sets how many epochs to train for, and corresponds to the epoch count used in the save settings above. Generally speaking, 20-30 is a good choice.

The batch size should be set according to your GPU's VRAM. On my RTX 4090 I usually choose 5; pick a value that suits your own hardware.

(screenshot)
Learning rate and optimizer settings

If you don't understand these settings, just keep the defaults. The parameters are explained later, so you can come back and adjust them.

Below is my configuration for reference.
(screenshot)
Training preview settings

This can be enabled or left off. When enabled, it only produces a sample image at each preview interval, using the default parameters, and the previews can be viewed afterwards (they are saved under output/sample, as shown later). If your machine is not powerful, leave it off and just look at the final result.
(screenshot)
network settings

Continuing training from a previous run is supported here; you have to fill in the path of the previously trained network model yourself. Since this is my first training run, I leave it empty. The other settings, dim and alpha, can be left at their defaults.

(screenshot)

captions option

These options control whether the caption tags are shuffled and how many leading tokens (keywords) are kept fixed. The defaults are recommended; if the first keyword of every image needs to stay fixed and consistent, set the keep-tokens value you need.
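To make the shuffle/keep-tokens behaviour concrete, here is a small conceptual sketch; it is my own illustration of the idea, not the trainer's actual code, and the example caption is made up.

```python
import random

# Conceptual sketch: the first `keep_tokens` comma-separated tags stay fixed, the rest are shuffled.
def shuffle_caption(caption: str, keep_tokens: int = 1) -> str:
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    head, tail = tags[:keep_tokens], tags[keep_tokens:]
    random.shuffle(tail)
    return ", ".join(head + tail)

print(shuffle_caption("DynastyWarriors_8_wu, 1girl, red dress, sword", keep_tokens=1))
# "DynastyWarriors_8_wu" stays first; the remaining tags appear in a random order each call.
```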

(screenshot)
After the parameters are set, click Start Training and wait. The first run automatically downloads some configuration files.
(screenshot)
The training process then starts running through the epochs.
(screenshot)
According to the settings, a model file is saved automatically to the output folder at each save interval.
(screenshot)
The preview images are in the sample folder under output. Watching the previews, the character's costume looks a bit like it is evolving through the outfits of the "Dynasty Warriors" series, becoming more refined as training progresses.
(screenshot)

Model Selection and Use

By default you can simply use the last saved model, or enable logging in expert mode to print the log, where a loss value appears; other things being equal, a checkpoint with a lower loss value is preferred.
(screenshot)
Put the trained LoRA model into the models/Lora folder of your SD directory and it will show up in the UI.
(screenshot)

Saving and loading configurations

To avoid re-entering all the settings every time, you can save the configuration under a name and load it again later.
(screenshot)

Basic parameter description

A key factor in Stable Diffusion model training is the correct selection and configuration of the model and the dataset. In this section I'll outline how to make good choices and explain some key concepts. Whether you use expert mode or novice mode, the parameters are basically the same; the detailed parameter page is simply explained here in plain language.

Model and Dataset

Choose base model

When choosing the base model, it is recommended to pick an "ancestor-level" model whenever possible, because a LoRA (Low-Rank Adaptation) trained on such a model will be more general. Ancestor-level models include SD 1.5, SD 2.0, and the original leaked NovelAI model, all of which are non-merged models. Merged models, such as the Anything series and the OrangeMix series, blend many different elements; training on them may produce good-looking images, but it often costs the LoRA its generality, so it performs poorly in other contexts. Choose according to your needs.

training resolution

During training you can set the training resolution, i.e. the width and height of the images. It does not have to be square, but it must be a multiple of 64. Values between 512x512 and 1024x1024 are generally recommended. The aspect ratio should match the characteristics of the training set; generally speaking, a square resolution is compatible with a wide range of image shapes.

If the training set consists mainly of portrait-oriented images, use a resolution such as 512x768; conversely, if it contains more landscape images, use something like 768x512.

ARB bucket

ARB (aspect ratio) bucketing is a training technique that allows training on images with varying aspect ratios, so no manual cropping is required. However, it increases training time somewhat, and the bucket resolution must be larger than the training resolution, so it uses more VRAM. If you have less than 12 GB of VRAM, it is not recommended to enable ARB bucketing; just train the LoRA in novice mode.

Learning Rate and Optimizer

In deep learning, effective learning rate setting and optimizer selection are key.

learning rate setting

The learning rates of the UNet and the text encoder (TE) are usually different because they differ in how hard they are to train; in a normal setup the UNet learning rate is higher than the TE learning rate. If the UNet is undertrained, the generated images may not be accurate enough, while overtraining can lead to distorted faces or large color patches. Conversely, an undertrained TE makes the images follow the prompt poorly, while an overtrained TE may generate redundant content.

In order to accurately calculate the number of learning steps, the calculation formula is as follows:

Training steps = (number of images × repeats × epochs) / batch size

Generally speaking, good starting values are 1e-4 for the UNet and 5e-5 for the TE.

Taking a UNet learning rate of 1e-4 as an example, training a character takes at least 1,000 steps, a painting style at least 2,500 steps, and a concept at least 3,000 steps. These are minimums; with more images, more steps are required.
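A quick worked example of the formula above, with assumed numbers (25 images, 10 repeats, 20 epochs, batch size 5; substitute your own values):

```python
# Worked example of: steps = (images * repeats * epochs) / batch_size  (all values assumed)
num_images = 25
repeats = 10      # the number prefixed to the dataset folder name
epochs = 20
batch_size = 5

steps = num_images * repeats * epochs // batch_size
print(steps)  # 1000 -> just reaches the suggested minimum for character training
```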

The best way to determine the learning rate and number of steps is through iterative experiments, training first and then testing.

Learning rate adjustment strategy (lr_scheduler)

For the learning rate schedule, cosine annealing is recommended. If warm-up is enabled, the warm-up steps should account for 5%-10% of the total steps. If you use cosine annealing with restarts, the number of restarts should not exceed 4.
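For intuition, here is a minimal sketch of what a linear warm-up followed by cosine annealing looks like; it is a generic illustration with assumed values (base LR 1e-4, 5% warm-up, 1000 steps), not lora-scripts' exact scheduler code.

```python
import math

# Generic warm-up + cosine-annealing illustration (assumed values, not the trainer's exact code).
def lr_at(step: int, total_steps: int, base_lr: float = 1e-4, warmup_frac: float = 0.05) -> float:
    warmup_steps = max(1, int(total_steps * warmup_frac))       # warm-up = 5% of total steps here
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps              # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay towards 0

for s in (0, 50, 500, 999):
    print(s, f"{lr_at(s, 1000):.2e}")
```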

Batch size (batch_size)

The larger the batch size, the more stable the gradients, so a larger learning rate can be used to speed up convergence; but it also consumes more machine resources and VRAM. As a rough rule, with batch size = 2 you can use about twice the UNet learning rate.

optimizer selection

The three most commonly used optimizers are:

  1. AdamW8bit: This is an AdamW optimizer with 8-bit optimization enabled; it is usually the default choice.
  2. Lion: A newer optimizer published by Google Brain; it is reported to outperform AdamW while using less VRAM, but it needs a larger batch size to keep gradient updates stable.
  3. D-Adaptation: An adaptive-learning-rate optimizer published by Facebook; it is easy to use because the learning rate does not need to be tuned manually, but it uses a lot of VRAM (usually more than 8 GB). When using it, set the learning rate to 1 and use the constant learning-rate schedule, and add --optimizer_args decouple=True so that the UNet and TE learning rates are handled separately (see the sketch after this list).
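For reference, here is a hedged sketch of how the D-Adaptation settings above could be passed when calling kohya's train_network.py directly from the command line. The WebUI normally assembles this for you; the paths and dataset folder below are placeholders, and option names can differ between script versions.

```python
# Hypothetical invocation sketch; paths and the dataset folder are placeholders, not the author's values.
import subprocess

cmd = [
    "python", "./sd-scripts/train_network.py",
    "--pretrained_model_name_or_path", "./sd-models/base_model.safetensors",  # placeholder path
    "--train_data_dir", "./train",                                            # placeholder path
    "--output_dir", "./output",
    "--network_module", "networks.lora",
    "--optimizer_type", "DAdaptation",
    "--optimizer_args", "decouple=True",   # separate UNet / TE learning rates
    "--learning_rate", "1",                # D-Adaptation expects the learning rate set to 1
    "--lr_scheduler", "constant",
]
subprocess.run(cmd, check=True)
```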

Network Settings and Network Structure

The network structure is the foundation of any model you build. The available structures include LoRA, LyCORIS, LoCon, LoHa, LoKr, and IA3; choose an appropriate network size according to the actual number of training images and the network structure used.

network structure

Different network structures correspond to different matrix low-rank decomposition methods. The following are several structural descriptions:

  • LoRA is a network structure that mainly adapts the linear layers and 1x1 convolution layers of the model. It is the basic structure, and many later network structures are improvements built on top of LoRA.
  • LyCORIS is an improvement to LoRA that incorporates several different algorithms, including LoCon, LoHa, LoKr, and IA3.
  • LoCon: This algorithm adds control over convolutional layers (Conv).
  • LoHa and LoKr: These two methods use the Hadamard product and the Kronecker product respectively.

In theory, LyCORIS will have a stronger fine-tuning effect than LoRA, but it is also easier to overfit.

network size

The choice of network size should be based on the actual number of training images and the network structure used. The commonly recommended values are not optimal for every dataset; experiment to find the best setting for your situation. In addition, for the convolution layers (Conv), it is best not to exceed 8.

Network Alpha

Network Alpha (network_alpha) is a parameter that scales the network weights during training. The smaller the alpha, the slower the learning; the relationship is roughly linear. It is generally set to dim/2 or dim/4. If you set it to 1, you need to raise the learning rate or use the D-Adaptation optimizer.
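The usual description of this scaling is that the LoRA branch is multiplied by network_alpha / network_dim, which is why a smaller alpha slows learning roughly linearly. The snippet below is a simplified illustration of that description, not the library's exact code.

```python
# Simplified illustration of the commonly described alpha/dim scaling (not the library's exact code).
def lora_scale(network_dim: int, network_alpha: float) -> float:
    return network_alpha / network_dim

for alpha in (128, 64, 32, 1):
    print(f"dim=128, alpha={alpha:>3} -> scale={lora_scale(128, alpha):.4f}")
# alpha=128 -> 1.0000 (no down-scaling)
# alpha= 64 -> 0.5000 (dim/2)
# alpha= 32 -> 0.2500 (dim/4)
# alpha=  1 -> 0.0078 (needs a much higher learning rate or D-Adaptation)
```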

Expert Advanced Settings

When exploring the in-depth application of Stable Diffusion, we will encounter some advanced settings, which may cause some confusion for beginners.

Caption Dropout

There are relatively few documents and online information on the Caption Dropout parameters. Even in the author's documentation, these parameters are not described in detail. However, Caption Dropout can improve the performance of the model in some cases.

  • caption_dropout_rate: the probability of dropping the entire caption for an image, meaning the image may be trained without its caption or class token.
  • caption_dropout_every_n_epochs: drops all captions once every N epochs.
  • caption_tag_dropout_rate: the probability of randomly dropping individual comma-separated tags. If you train a painting style with the DB+tag method, this parameter is recommended because it effectively prevents tag overfitting; a value between 0.2 and 0.5 is typical. For character training it is usually unnecessary. (A sketch of how tag dropout works follows this list.)
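Conceptually, per-tag dropout works like the following sketch; this is my own illustration of the idea with a made-up caption, not the trainer's code.

```python
import random

# Conceptual per-tag dropout: keep each tag with probability 1 - caption_tag_dropout_rate.
def drop_tags(caption: str, tag_dropout_rate: float = 0.3) -> str:
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for t in tags if random.random() >= tag_dropout_rate]
    return ", ".join(kept)

print(drop_tags("1girl, red dress, long hair, sword, dynasty warriors"))
# e.g. "1girl, long hair, dynasty warriors" (output differs on every call)
```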

Token

Token warm-up involves two related parameters: token_warmup_min and token_warmup_step.

  • token_warmup_min : This is the minimum number of tokens to learn.
  • token_warmup_step : This is the number of steps after which the maximum number of tokens is reached.

Token warm-up can be understood as another form of caption dropout. However, if the tokens are not randomly shuffled, the model will only learn the first N tokens.

Noise-related settings

There are two noise-related parameters, namely noise offset (noise_offset) and multi-resolution/pyramid noise (multires_noise_iterations, multires_noise_discount).

  • Noise offset (noise_offset): adds global noise during training to widen the brightness range of the generated images, so the model can produce darker or brighter pictures. If you enable this option, the recommended value is 0.1, and you should increase the number of training steps to compensate for the slower convergence.
  • Multi-resolution/pyramid noise: controlled by multires_noise_iterations and multires_noise_discount. For iterations, 6-8 is recommended; higher values do not add much. For discount, 0.3-0.8 is recommended, and smaller values need more steps. (A sketch of the noise-offset idea follows this list.)
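As background, noise offset is usually described as adding a small random constant per image to the noise so the model learns to shift overall brightness. The PyTorch-flavoured sketch below illustrates that idea under this assumption; it is not lora-scripts' exact implementation.

```python
import torch

# Illustration of the noise-offset idea: one extra random offset per image/channel,
# broadcast over the spatial dimensions (not the trainer's exact implementation).
def offset_noise(latents: torch.Tensor, noise_offset: float = 0.1) -> torch.Tensor:
    noise = torch.randn_like(latents)
    offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1, device=latents.device)
    return noise + noise_offset * offset

latents = torch.zeros(2, 4, 64, 64)   # dummy latent batch
print(offset_noise(latents).shape)    # torch.Size([2, 4, 64, 64])
```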

Other parameters

  • CLIP_SKIP uses the output of the CLIP text encoder taken N layers from the end. It needs to match the model being used: for NAI-based anime models, use 2; for realistic models such as SD 1.5, use 1.
  • Min-SNR-γ is used to speed up the convergence of the diffusion model. Because different sample batches vary in learning difficulty, the gradient directions are inconsistent and convergence is slower; setting this parameter to 5 is reported to work best. It does not apply when the optimizer is D-Adaptation.
  • Data augmentation transforms images on the fly during training to prevent overfitting. Available methods include color_aug, flip_aug, face_crop_aug_range, and random_crop. It is generally not recommended, because it can create mismatches between the captions and the images.
  • max_grad_norm rarely needs changing. It limits the size of the gradient used to update the model, improving numerical stability: if the gradient norm exceeds this value, the gradient is scaled down to it.
  • gradient_accumulation_steps sets the number of gradient-accumulation steps, used to simulate a large batch size on a GPU with limited VRAM (see the sketch after this list). If your VRAM allows a batch size of 4 or more, for example a 3090 or 4090 with 24 GB, there is no need to enable it.
  • log_with, wandb_api_key are used to select the logger type, you can choose tensorboard or wandb.
  • prior_loss_weight rarely needs changing; the default of 1 is fine. It controls the weight of the prior (regularization) images in DreamBooth-style training and can be used to adjust the regularization strength.
  • debug_dataset is used to check that our settings are correct.
  • Setting vae_batch_size to 2-4 can slightly speed up latent caching. Because the VAE encoder itself has relatively few parameters, an 8 GB card can usually use 4 on Linux; on Windows, where the system reserves more VRAM, leave it at the default if you have less than 10 GB.
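To illustrate the gradient-accumulation point from the list above, here is a minimal, self-contained PyTorch-style sketch; it is a generic illustration of the idea, not lora-scripts' actual training loop.

```python
import torch

# Generic gradient-accumulation sketch: gradients from several small batches are summed
# before a single optimizer step, behaving roughly like a larger batch size in less VRAM.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # behaves roughly like batch_size * 4

data = [(torch.randn(1, 4), torch.randn(1, 1)) for _ in range(8)]
for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.mse_loss(model(x), y) / accum_steps  # scale so summed grads match one big batch
    loss.backward()                       # gradients accumulate across iterations
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```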


Originally published at blog.csdn.net/qq_20288327/article/details/131561342