If you have difficulty using the code directly, you can use the WebUI version of the LoRA training scripts instead. Whether you are training characters, scenes, styles, or clothing, the same general workflow applies; only the results differ depending on the dataset used.
Installation of lora-scripts WebUI
Use
git clone --recurse-submodules https://github.com/Akegarasu/lora-scripts
to download the source files. This can be done in the extension directory of your SD installation. The code is derived from https://github.com/kohya-ss/sd-scripts.
If the error
Failed to clone 'sd-scripts' a second time, aborting
occurs during installation, clone the submodule manually:
git clone --recurse-submodules https://github.com/kohya-ss/sd-scripts.git
If the network is unstable, try a few more times. After the clone finishes, the lora-scripts folder looks like this.
Next, run install.ps1 to install and configure the development environment: right-click the file and run it with PowerShell. Answer Y to every prompt to install with the defaults, which creates a virtual environment and configures CUDA. When the window shows Install completed, the installation has succeeded and the window closes automatically.
Run
git clone --recurse-submodules https://github.com/kohya-ss/sd-scripts.git
in the current directory to install sd-scripts, or simply drag in the copy of sd-scripts downloaded earlier.
Run run_gui.ps1 to open the WebUI. If the window flashes and closes immediately, the installation did not complete, and you need to run install-cn.ps1 again to finish installing the third-party dependencies.
Once the installation is complete, running run_gui.ps1 opens http://127.0.0.1:28000/lora/basic.html in the browser and the page appears.
LoRA training process
Here we use the character portraits of the Wu kingdom from "Dynasty Warriors 8" as training material.
Preprocessing the material in SD
Open SD and click the image preprocessing module, then enter the directory of the folder where the pictures are stored. Remember that this path must not contain Chinese characters. Set the options according to the figure below. The resolution must be a multiple of 64 and lie between 512 and 1024.
Remember to choose one of the two keyword options.
Clicking Preprocess generates the focal-point-cropped images and the corresponding keyword prompt files under the data2 folder.
Novice and Expert modes
Whether it is expert mode or novice mode, the basic configuration is the same. Here is the novice mode. If you want to use the expert mode, you can configure the settings yourself according to the basic parameter description below.
Training model
Base model path: select the path of the base model you want to train on. The default is a relative path, so you need to put the base model into the corresponding sd-models folder.
Dataset settings
The training dataset path requires a train folder created under the lora-scripts folder.
Then rename the folder we just preprocessed to DynastyWarriors_8_wu. Inside this new folder, the image subfolder needs a name that starts with a number and an underscore, as in my setup; in sd-scripts this number is the repeat count applied to each image. Do not use Chinese characters, spaces, or other special characters in the folder names.
By default, regularization images are not used, and the resolution can be set to 512x512, matching the size the images were just cropped to.
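The folder layout described above can be checked before training with a small helper. This is a minimal sketch, not part of lora-scripts: the function names are hypothetical, the subfolder suffix is assumed to repeat the concept name, and the repeat count of 6 is only an illustration.

```python
from pathlib import Path


def is_safe_name(name: str) -> bool:
    """True if the name is ASCII-only and contains no whitespace."""
    return name.isascii() and not any(ch.isspace() for ch in name)


def dataset_subfolder(train_root: str, concept: str, repeats: int) -> Path:
    """Build train_root/concept/<repeats>_<concept> without touching the disk."""
    if repeats < 1:
        raise ValueError("repeats must be a positive integer")
    if not is_safe_name(concept):
        raise ValueError(f"folder name {concept!r} has spaces or non-ASCII characters")
    return Path(train_root) / concept / f"{repeats}_{concept}"


# Hypothetical values: 6 repeats is an illustration, not a recommendation.
folder = dataset_subfolder("train", "DynastyWarriors_8_wu", 6)
print(folder.as_posix())
```

Running the name check first catches the path problems mentioned above (Chinese characters, spaces) before the trainer does.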
Save Settings
Model save name: this is the file name under which the model is saved; using the folder name makes it easier to manage. The model save folder defaults to the output folder under the current folder. The save interval (save every N epochs) controls how often a model is saved during training. If you have high requirements for the model, select 1 so that a model is saved after every epoch, and then pick the best trained model according to its metrics.
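To see how the save interval interacts with the number of epochs, the sketch below lists which epochs would produce a saved model. It assumes a model is saved whenever the epoch number is divisible by the interval and that the final model is always saved; the function name is hypothetical and the exact behavior of the trainer may differ.

```python
from typing import List


def saved_epochs(max_epochs: int, save_every_n: int) -> List[int]:
    """Epoch numbers (1-based) at which a model file would be written."""
    checkpoints = [e for e in range(1, max_epochs + 1) if e % save_every_n == 0]
    if max_epochs not in checkpoints:
        checkpoints.append(max_epochs)  # assumption: the final model is always saved
    return checkpoints


print(saved_epochs(20, 1))  # every epoch is kept for later comparison
print(saved_epochs(20, 8))
```

With an interval of 1 there are as many candidate models as epochs, which is what makes the later model selection step possible.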
Training related parameters
The maximum number of training epochs is how many passes over the data to train; it is the same epoch unit used in the save settings above. Generally, 20-30 is a reasonable choice.
The batch size depends on the video memory of your machine. On my RTX 4090 I usually choose 5; adjust it to your own hardware.
Learning rate and optimizer settings
If you don’t understand this, you can choose the default. There will be an introduction to the parameter explanation later, so you can go back and adjust it.
Below is my configuration for reference.
Training preview settings
Preview can be turned on or left off. During training it only produces a sample image at each step, using the default parameters. The preview images can be viewed under the toml folder. If your machine is not powerful, leave it off and just look at the final result.
Network settings
Continuing training is supported here: you need to set the path of the previously trained network model yourself. Since this is my first training run, I don't have one. The other options, dim and alpha, can be left at their defaults.
Caption options
These options control whether to shuffle the caption tokens (the keywords) and how many leading tokens to keep fixed. The defaults are recommended. If certain keywords must stay fixed and unified for every picture, set the number of kept tokens to the value you need.
After the parameters are set, click Start Training and wait. The first launch automatically downloads some configuration files.
Training then starts running through the epochs.
According to the settings, each training stage automatically saves a model file to the output folder.
The preview images are in the sample folder under output. Looking through the previews is a bit like watching the costumes of the characters evolve across the "Dynasty Warriors" series: the longer the training runs, the more refined the clothing becomes.
Model Selection and Use
You can simply pick the last model by default, or turn on expert mode to print the log, which shows a loss value. As a rough guide to model quality, a lower loss is preferred.
Put the trained LoRA model into the lora folder of models in your SD directory and you will see it.
Configuration save and read
In order to avoid the trouble of setting every time, you can save the configuration and set the name, which is convenient for reading and calling in the future.
Basic parameter description
In Stable Diffusion model training, a key factor is the correct selection and configuration of models and datasets. In this blog, I'll outline how to make good decisions and explain some key concepts. Whether you use expert mode or novice mode, the parameters are basically the same. What follows is a detailed parameter reference, explained in plain language.
Model and Dataset
Choose base model
When choosing the base model, it is recommended to choose an "ancestor-level" model as far as possible, because a LoRA (Low-Rank Adaptation) trained on such a model is more general. Ancestor-level models include SD 1.5, SD 2.0, and the original leaked NovelAI model, which are non-merged models. Merged models, such as the Anything series and the OrangeMix series, blend many different models. Training on them may give good results when generating images with that same model, but the resulting LoRA often generalizes poorly and performs badly in other contexts. Choose according to your needs.
Training resolution
During training you can set the training resolution, that is, the width and height of the images. It does not have to be square, but each side must be a multiple of 64. Values between 512x512 and 1024x1024 are generally recommended. The aspect ratio should follow the characteristics of the training set. In general, a square resolution is compatible with the widest range of image shapes.
If the training set consists mainly of portrait (vertical) images, use a resolution such as 512x768. Conversely, if it contains more landscape (horizontal) images, use 768x512.
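The resolution rules above can be written down as two small checks. This is a sketch under stated assumptions: both function names are hypothetical, the 512-1024 range is treated as inclusive, and "mainly" is interpreted as more than half of the images.

```python
from typing import List, Tuple


def resolution_problems(width: int, height: int) -> List[str]:
    """Return a list of rule violations; an empty list means the size is acceptable."""
    problems = []
    for name, side in (("width", width), ("height", height)):
        if side % 64 != 0:
            problems.append(f"{name} {side} is not a multiple of 64")
        if not 512 <= side <= 1024:
            problems.append(f"{name} {side} is outside 512-1024")
    return problems


def suggest_resolution(image_sizes: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Pick (width, height) from the dominant orientation of the training images."""
    portrait = sum(1 for w, h in image_sizes if h > w)
    landscape = sum(1 for w, h in image_sizes if w > h)
    if portrait > len(image_sizes) / 2:
        return (512, 768)
    if landscape > len(image_sizes) / 2:
        return (768, 512)
    return (512, 512)  # square is the general-purpose fallback


print(resolution_problems(500, 768))
print(suggest_resolution([(600, 900), (640, 960), (900, 600)]))
```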
ARB bucket
ARB bucketing is a training technique that allows training on images with varying aspect ratios, so no manual cropping is required. However, it increases the training time to some extent, and its resolution must be greater than the training resolution, so it takes up more video memory. If the video memory is less than 12G, enabling ARB bucketing is not recommended; use novice mode to train the LoRA instead.
Learning Rate and Optimizer
In deep learning, effective learning rate setting and optimizer selection are key.
Learning rate setting
The learning rates of the UNet and the text encoder (TE) are usually different because they differ in learning difficulty. In a typical setup the UNet learning rate is higher than the TE learning rate. If the UNet is undertrained, the generated images may not be accurate enough, while overtraining can lead to distorted faces or large color patches. An undertrained TE makes the image follow the prompt poorly, while an overtrained TE may generate redundant content.
The number of learning steps is calculated as follows:
$$\text{learning steps} = \frac{\text{number of pictures} \times \text{number of repetitions} \times \text{epochs}}{\text{batch size}}$$
Generally speaking, the initial value with better effect is 1e-4 for UNet and 5e-5 for TE.
Taking the UNet learning rate as 1e-4 as an example, it takes at least 1,000 steps to train characters, at least 2,500 steps to train painting styles, and at least 3,000 steps to train concepts. This is the minimum number of steps, if there are more pictures, more steps are required.
The best way to determine the learning rate and number of steps is through iterative experiments, training first and then testing.
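As a worked example of the step formula and the minimum step counts above, the sketch below computes the total steps for a run and the number of epochs needed to reach a target. The dataset numbers (40 pictures, 6 repetitions, batch size 5) are hypothetical, and rounding up to whole steps is an assumption.

```python
def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """Steps = pictures * repetitions * epochs / batch size, rounded up."""
    samples = num_images * repeats * epochs
    return -(-samples // batch_size)  # integer ceiling division


def epochs_to_reach(target_steps: int, num_images: int, repeats: int, batch_size: int) -> int:
    """Smallest epoch count whose total steps meet the target."""
    per_epoch_samples = num_images * repeats
    return -(-target_steps * batch_size // per_epoch_samples)


# Hypothetical character dataset: 40 pictures, 6 repetitions, batch size 5.
print(total_steps(40, 6, 20, 5))       # steps for 20 epochs
print(epochs_to_reach(1000, 40, 6, 5)) # epochs needed for the 1,000-step minimum
```

In this example 20 epochs fall slightly short of the 1,000-step minimum for characters, so one more epoch would be needed.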
Learning rate adjustment strategy (lr_scheduler)
For the learning rate schedule, cosine annealing is recommended. If warm-up is enabled, the warm-up steps should account for 5%-10% of the total steps. If you choose cosine annealing with restarts, the number of restarts should not exceed 4.
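To make the shape of this schedule concrete, here is a minimal sketch of cosine annealing with linear warm-up in plain Python. It is an illustration of the general technique, not the scheduler code used by the trainer; the function name and the choice to start the warm-up from zero are assumptions.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float, warmup_ratio: float = 0.05) -> float:
    """Learning rate at a given step: linear warm-up, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / warmup_steps  # ramp up from 0 to base_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run: 1,000 steps, UNet learning rate 1e-4, 10% warm-up.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, 1000, 1e-4, warmup_ratio=0.1))
```

The rate climbs during the first 10% of the steps, peaks at the base value, and then falls smoothly toward zero by the final step.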
Batch size (Batch Size)
The larger the batch size, the more stable the gradient, and the larger the learning rate that can be used to speed up convergence. It also consumes more machine resources and takes up more video memory. As a general rule, with batch size=2 the UNet learning rate is set to twice its base value.
Optimizer selection
The three most commonly used optimizers are:
- AdamW8bit : This is an AdamW optimizer that enables int8 optimization, usually the default option.
- Lion : A newer optimizer published by Google Brain, reported to outperform AdamW on many tasks while taking up less video memory, but it requires a larger batch size to keep the gradient updates stable.
- D-Adaptation : The adaptive learning rate optimizer published by Facebook. It is easy to tune and does not need manual control of the learning rate, but it takes up a lot of video memory (usually more than 8G). When using it, set the learning rate to 1 and use the constant learning rate schedule. You also need to add --optimizer_args decouple=True, which is usually described as separating the UNet and TE learning rates (in the D-Adaptation library itself, decouple enables AdamW-style decoupled weight decay).
Network Settings and Network Structure
The network structure is the basis for building any model. The available structures include LoRA, LyCORIS, LoCon, LoHa, LoKr, and IA3; choose the network size according to the actual number of pictures in the training set and the network structure used.
Network structure
Different network structures correspond to different matrix low-rank decomposition methods. The following are several structural descriptions:
- LoRA is a network structure that mainly controls the linear layer and 1x1 convolutional layer in the model. This is a basic structure, and many subsequent network structures are improved on the basis of LoRA.
- LyCORIS is an improvement to LoRA that incorporates several different algorithms, including LoCon, LoHa, LoKr, and IA3.
- LoCon: This algorithm adds control over convolutional layers (Conv).
- LoHa and LoKr: These two methods use the Hadamard product and the Kronecker product respectively.
In theory, LyCORIS will have a stronger fine-tuning effect than LoRA, but it is also easier to overfit.
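The low-rank decomposition behind plain LoRA can be illustrated with a few lines of Python. This is a sketch of the general idea only: the weight update of a layer is stored as the product of two thin matrices, so far fewer values are trained. The 768x768 layer size and rank 32 are hypothetical, and the helper names are not from any library.

```python
from typing import List

Matrix = List[List[float]]


def matmul(b: Matrix, a: Matrix) -> Matrix:
    """Plain matrix product b @ a for small list-of-lists matrices."""
    return [[sum(row[k] * a[k][j] for k in range(len(a))) for j in range(len(a[0]))]
            for row in b]


def full_params(d_in: int, d_out: int) -> int:
    """Values needed to store a full d_out x d_in weight update."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Values needed for the two factors: (d_out x rank) and (rank x d_in)."""
    return rank * (d_in + d_out)


# A rank-1 update: a 2x1 factor times a 1x3 factor gives a full 2x3 matrix.
delta_w = matmul([[1.0], [2.0]], [[3.0, 4.0, 5.0]])
print(delta_w)

# Hypothetical 768x768 linear layer with rank (dim) 32.
print(full_params(768, 768), lora_params(768, 768, 32))
```

The same counting explains why a larger dim gives the network more capacity and also more room to overfit.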
Network size
The network size should be chosen according to the actual number of pictures in the training set and the network structure used. Recommended values are not optimal for every dataset, so experiments are needed to find the best setting for your case. For the convolutional layers (Conv), it is best not to exceed a dim of 8.
Network Alpha
Network Alpha (network_alpha) is a parameter used to scale the weights of the network during training. The smaller the alpha, the slower the learning; the LoRA update is scaled by alpha/dim, so the relationship can be treated as roughly linear. It is generally set to dim/2 or dim/4. If you choose 1, you need to increase the learning rate or use the D-Adaptation optimizer.
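The effect of alpha is easiest to see as a scale factor. The sketch below assumes the common LoRA convention in which the update is multiplied by alpha divided by dim; the function name is hypothetical and dim 32 is only an example.

```python
def lora_scale(network_alpha: float, network_dim: int) -> float:
    """Multiplier applied to the LoRA update, assuming scale = alpha / dim."""
    return network_alpha / network_dim


# Hypothetical dim of 32 with the alpha choices discussed above.
for alpha in (16, 8, 1):
    print(alpha, lora_scale(alpha, 32))
```

With alpha set to 1 the update is scaled down to about 3% of its raw size at dim 32, which is why a higher learning rate or an adaptive optimizer is needed in that case.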
Expert Advanced Settings
When exploring the in-depth application of Stable Diffusion, we will encounter some advanced settings, which may cause some confusion for beginners.
Caption Dropout
There are relatively few documents and online information on the Caption Dropout parameters. Even in the author's documentation, these parameters are not described in detail. However, Caption Dropout can improve the performance of the model in some cases.
- caption_dropout_rate : This is the probability of dropping all labels, which means that the picture may not use caption or class token.
- caption_dropout_every_n_epochs : This parameter drops all labels once every N epochs.
- caption_tag_dropout_rate : This is the probability of randomly dropping tags by comma separated tags. If you use the DB+tag training method to train the painting style, it is recommended to use this parameter because it can effectively prevent tag overfitting. Generally, we will choose a value between 0.2-0.5. For character training, we usually don't need to turn on this option.
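The two probability-based options in the list above can be sketched as follows. This is a simplified illustration, not the trainer's implementation: the function name is hypothetical, captions are assumed to be comma-separated tags, and the example caption is made up.

```python
import random


def apply_caption_dropout(caption: str, rng: random.Random,
                          caption_dropout_rate: float = 0.0,
                          caption_tag_dropout_rate: float = 0.0) -> str:
    """Drop the whole caption, or individual comma-separated tags, at random."""
    if rng.random() < caption_dropout_rate:
        return ""  # the picture is trained without any caption
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    kept = [t for t in tags if rng.random() >= caption_tag_dropout_rate]
    return ", ".join(kept)


rng = random.Random(0)
caption = "1girl, red armor, long hair, sword"  # hypothetical caption
for _ in range(3):
    print(apply_caption_dropout(caption, rng, caption_tag_dropout_rate=0.3))
```

Because a different subset of tags survives on each pass, the model is less likely to tie the style to any single tag.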
Token warm-up
Token warm-up is controlled by two related parameters, token_warmup_min and token_warmup_step.
- token_warmup_min : This is the minimum number of tokens to learn.
- token_warmup_step : This is the number of steps after which the maximum number of tokens is reached.
Token warm-up can be understood as another form of caption dropout. However, if the tokens are not randomly shuffled, the model will only learn the first N tokens.
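One way to picture token warm-up is as a counter that grows from token_warmup_min to the full caption length over token_warmup_step steps. The sketch below assumes linear growth and comma-separated tags; the function names are hypothetical and the trainer's exact rounding may differ.

```python
from typing import List


def tokens_to_use(step: int, total_tokens: int,
                  token_warmup_min: int, token_warmup_step: int) -> int:
    """Number of leading tokens used at this step, growing linearly to the full count."""
    if token_warmup_step <= 0 or step >= token_warmup_step:
        return total_tokens
    grown = token_warmup_min + (total_tokens - token_warmup_min) * step // token_warmup_step
    return min(total_tokens, max(token_warmup_min, grown))


def warmed_up_caption(tags: List[str], step: int,
                      token_warmup_min: int, token_warmup_step: int) -> List[str]:
    """Keep only the first N tags; without shuffling these are always the same tags."""
    n = tokens_to_use(step, len(tags), token_warmup_min, token_warmup_step)
    return tags[:n]


tags = ["1girl", "red armor", "long hair", "sword", "smile"]  # hypothetical tags
for step in (0, 50, 100):
    print(step, warmed_up_caption(tags, step, token_warmup_min=1, token_warmup_step=100))
```

The slice at the end shows why shuffling matters: with a fixed tag order, the tags at the end of the caption are not seen until the warm-up is nearly finished.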
Noise-related parameters
There are two groups of noise-related parameters: noise offset (noise_offset) and multi-resolution/pyramid noise (multires_noise_iterations, multires_noise_discount).
- Noise offset (noise_offset) : Adds a global offset to the noise during training, which widens the brightness range of generated pictures so that the model can produce darker or brighter images. If you enable this option, the recommended value is 0.1, and the number of learning steps should be increased to compensate for slower convergence.
- Multi-resolution/pyramid noise : The related parameters are multires_noise_iterations and multires_noise_discount. For iterations, 6-8 is recommended; higher values bring little improvement. For discount, 0.3-0.8 is recommended; a smaller value requires more steps.
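The noise offset idea can be sketched without a tensor library. The version below assumes the common formulation in which one random value per channel, scaled by noise_offset, is added to every pixel of that channel; the function name and the nested-list layout (channels x rows x columns) are choices made for this illustration only.

```python
import random
from typing import List

Channel = List[List[float]]


def add_noise_offset(noise: List[Channel], noise_offset: float,
                     rng: random.Random) -> List[Channel]:
    """Shift each channel of the noise by a single random amount."""
    shifted = []
    for channel in noise:
        shift = noise_offset * rng.gauss(0.0, 1.0)  # one value shared by the whole channel
        shifted.append([[value + shift for value in row] for row in channel])
    return shifted


rng = random.Random(0)
# Hypothetical 4-channel, 2x2 noise sample.
noise = [[[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(2)] for _ in range(4)]
print(add_noise_offset(noise, 0.1, rng)[0])
```

Because the shift is constant across a channel, it changes the overall level of the sample rather than its fine detail, which is what lets the model learn very dark or very bright images.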
Other parameters
- CLIP_SKIP uses the output of the N-th layer from the end of the CLIP model. It needs to match the base model: for a NAI-based anime (2D) model, use 2; for a realistic model such as SD1.5, use 1.
- Min-SNR-γ is used to speed up the convergence of the diffusion model. Because different timesteps differ in learning difficulty, their gradient directions are inconsistent, which slows convergence. Min-SNR-γ addresses this by reweighting the loss, and it works best when the parameter is set to 5. This approach does not apply when the optimizer is D-Adaptation.
- Data augmentation transforms images on the fly during training to prevent overfitting. The available methods include color_aug, flip_aug, face_crop_aug_range, and random_crop. Using them is not recommended, because they can cause mismatches between the keywords and the images.
- max_grad_norm rarely needs changing. It limits the magnitude of the gradient used for model updates, which improves numerical stability: if the norm of the gradient exceeds this value, the gradient is scaled down to it.
- gradient_accumulation_steps is the number of steps over which gradients are accumulated, which simulates the effect of a large batch size on a card with little video memory. If the video memory is sufficient for a batch size above 4, as on a 3090 or 4090 with 24G of video memory, there is no need to enable this parameter.
- log_with selects the logger type (tensorboard or wandb); wandb_api_key supplies the API key when wandb is used.
- prior_loss_weight rarely needs changing, and the default of 1 is fine. It controls the weight of the prior part in DB (DreamBooth) training, that is, the strength of the regularization images.
- debug_dataset is used to check that our settings are correct.
- Setting vae_batch_size to 2-4 can slightly speed up latent caching. Because the VAE encoder itself is relatively small, on a Linux machine even a graphics card with 8GB of video memory can use 4. On Windows, where the system itself occupies a lot of video memory, do not enable this parameter if the video memory is less than 10GB.
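Of the parameters in this list, Min-SNR-γ is the one whose effect is easiest to show with numbers. The sketch below assumes the noise-prediction form of the weighting, where each timestep's loss is multiplied by min(SNR, γ) / SNR; the function name and the sample SNR values are hypothetical.

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Loss weight for one timestep, assuming weight = min(SNR, gamma) / SNR."""
    if snr <= 0:
        raise ValueError("snr must be positive")
    return min(snr, gamma) / snr


# Hypothetical signal-to-noise ratios, from very noisy to almost clean.
for snr in (0.5, 5.0, 50.0):
    print(snr, min_snr_weight(snr))
```

Noisy timesteps keep their full weight, while nearly clean timesteps, whose SNR is far above γ, are down-weighted so that they no longer dominate the gradient.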