Training a Stable Diffusion LoRA using Kohya_ss

Stable Diffusion model fine-tuning methods

There are four main methods for fine-tuning Stable Diffusion: Dreambooth, LoRA, Textual Inversion, and Hypernetworks.

Textual Inversion (also called an Embedding) does not actually modify the original diffusion model. Instead, it uses deep learning to find a text embedding (a new keyword) that corresponds to the images you provide; the image's feature parameters are stored in this small file. This means that if a concept is missing from the base model's training data, it is difficult to "learn" it through an embedding: it cannot teach the diffusion model to render content it has never seen.

Dreambooth adjusts the weights of all layers of the entire network, training the input images directly into the Stable Diffusion model. In essence it copies the source model first and then fine-tunes it into an independent new model, so it can do almost anything. The drawback is that training it requires a lot of VRAM; with current optimizations, training can be completed with 16GB of video memory.

LoRA (Low-Rank Adaptation of Large Language Models) also uses a small set of images, but instead of modifying the original parameters or copying the whole model, it trains the weights of small, separate network layers that are inserted into the original model. The number of parameters in these inserted layers is also kept low, making LoRA a very lightweight tuning method: the resulting model is small and trains quickly. Inference requires the LoRA model plus the base model; the LoRA weights patch specific network layers of the base model, so the result depends on the base model.
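To make the idea concrete, here is a minimal conceptual sketch of a LoRA-style linear layer in PyTorch. It illustrates only the low-rank update described above; it is not kohya_ss's actual implementation, and all names are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B(A(x))."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # the original weights stay frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)  # A: d_in -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)   # B: r -> d_out
        nn.init.zeros_(self.up.weight)         # the update starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))
```

Only the small down/up matrices are trained and saved, which is why a LoRA file is tiny compared with the base model checkpoint.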

The training principle of Hypernetworks is similar to LoRA. Unlike LoRA, however, a Hypernetwork is a separate neural network whose output is inserted into intermediate layers of the original diffusion model. Through training we therefore obtain a new network that can insert appropriate intermediate layers and parameters into the original model, creating a correlation between the output image and the input prompt.

Hardware Configuration

Graphics card recommendation: a GPU with 10GB of VRAM or more, i.e. an RTX 3060 or better.

Prepare training data

Picture collection

  • Prepare at least 10 images for training.
  • Use images of moderate resolution; do not collect extremely small ones.
  • The dataset needs a unified subject and consistent style, and the images should not contain complex backgrounds or other irrelevant characters.
  • The character should appear from as many angles, and with as many expressions and poses, as possible.
  • Images that highlight the face should make up a slightly larger share of the set, and full-body shots a slightly smaller share.

Image preprocessing

(1) Crop pictures

After collecting the images, crop the training images to 512x512 pixels. You can crop automatically using the SD WebUI or manually.

  1. Place the images to be cropped in the same directory.
  2. Open SD WebUI and go to the Train → Preprocess images page.
  3. In the first field, Source directory, enter the path to the original images.
  4. In the second field, Destination directory, enter the output path.
  5. Set Width and Height to 512.
  6. Click Preprocess and the images will be cropped automatically. The originals can then be deleted, leaving only the cropped images.
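If you prefer to crop outside the WebUI, the same step can be scripted. A minimal sketch using Pillow (assumed installed via pip install pillow; the directory names are placeholders), which scales the short side to 512 and then center-crops:

```python
from pathlib import Path
from PIL import Image

SRC = Path("raw_images")      # placeholder: folder with the original images
DST = Path("cropped_images")  # placeholder: output folder
DST.mkdir(exist_ok=True)

for path in SRC.iterdir():
    if path.suffix.lower() not in {".png", ".jpg", ".jpeg", ".webp"}:
        continue
    img = Image.open(path).convert("RGB")
    # Scale the short side to 512, then center-crop to 512x512.
    scale = 512 / min(img.size)
    img = img.resize((round(img.width * scale), round(img.height * scale)))
    left = (img.width - 512) // 2
    top = (img.height - 512) // 2
    img.crop((left, top, left + 512, top + 512)).save(DST / f"{path.stem}.png")
```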
(2) Pre-label the images with prompt words (image tagging)

Next, the images must be pre-labeled with prompt words so that the AI knows which features to learn.

  1. Start SD WebUI and enter the Train page.
  2. Go to the Preprocess page, enter the path to the cropped images in Source, and the output path for the processed images in Destination.
  3. Check Create Flipped Copies to create mirrored copies and increase the amount of training data.
  4. When training on real photographs, check Use BLIP for caption; when training anime characters, check Use DeepBooru for caption.
  5. Click Preprocess; the process completes in a few minutes. The output directory will contain a prompt-word txt file corresponding to each image.
  6. After annotation is complete, a txt file with the same name as each image appears in the image folder. Open each txt file and delete all irrelevant or redundant tags.
(3) Optimization of prompt word tags
  • Method 1: Keep all tags

No tags are deleted; the caption files are used for training as-is. Use this when training a painting style, or when you want a quick, low-effort character model. The trade-offs are that you need to enter more tags at generation time to invoke the result, and more training epochs are needed, so training takes longer.

  • Method 2: Delete some tags

For example, when training a specific character, if you want a certain feature to be baked in as part of the character itself, you must delete the tag bound to that feature.

Tags that need to be deleted:

Character-intrinsic features, such as long hair and blue eyes.

Tags that don’t need to be deleted:

Character actions such as stand and run; expressions such as smile and open mouth; backgrounds such as simple background and white background; and framing such as full body, upper body, and close-up.
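For example, if long hair and blue eyes are meant to be intrinsic features of the character, a caption file could be edited like this (hypothetical tags):

```
before: 1girl, long hair, blue eyes, smile, standing, simple background, upper body
after:  1girl, smile, standing, simple background, upper body
```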

Add trigger words (optional):

Organize the tags of each image and add the trigger word you want to train as the first tag of each image's caption. For example, to invoke the result with qibaishi, open each tag (txt) file and add the keyword qibaishi at the front.
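Both the tag cleanup and the trigger word can be applied in one pass with a short script. A minimal sketch (the caption directory, trigger word, and tag list are placeholder assumptions):

```python
from pathlib import Path

CAPTION_DIR = Path("cropped_images")     # placeholder: folder containing the .txt captions
TRIGGER = "qibaishi"                     # the trigger word from the example above
BOUND_TAGS = {"long hair", "blue eyes"}  # placeholder: bound tags to delete

for txt in CAPTION_DIR.glob("*.txt"):
    tags = [t.strip() for t in txt.read_text(encoding="utf-8").split(",")]
    # Drop bound tags (and any earlier copy of the trigger), then prepend the trigger word.
    tags = [t for t in tags if t and t != TRIGGER and t not in BOUND_TAGS]
    txt.write_text(", ".join([TRIGGER] + tags), encoding="utf-8")
```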

Install kohya_ss

Environmental preparation

Install Python 3.10 and git.

Pull code

git clone https://github.com/bmaltais/kohya_ss

Enter the kohya_ss directory

cd kohya_ss

Run the setup script

.\setup.bat

Start GUI

gui.bat

Allow remote access

gui.bat --listen 0.0.0.0 --server_port 7860 --headless

Configuration path

The following three directories need to be configured:

  • image folder: stores the training set
  • logging folder: stores log files
  • output folder: stores the trained model

First, create a new folder named 100_xxxx inside the image folder; the 100 prefix means each image is trained (repeated) 100 times per epoch. Then put all of the previously labeled training data (images and their txt caption files) into this folder.
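For example, the folder layout might look like this (names are illustrative; qibaishi is the trigger word from earlier):

```
lora_train/
├── image/
│   └── 100_qibaishi/    # "100" = repeats per image
│       ├── 001.png
│       ├── 001.txt      # caption file with the same name
│       ├── 002.png
│       └── 002.txt
├── logging/
└── output/
```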

Configure training parameters:

kohya_ss provides many adjustable parameters, such as batch size, learning rate, and optimizer. Configure them according to your actual situation.

Parameter Description:

  • train_batch_size: training batch size, the number of images trained simultaneously. The default is 1; larger values shorten training time but consume more VRAM.
  • Number of CPU threads per core: the number of threads per CPU core during training. Higher is generally more efficient, but adjust it to your hardware.
  • epoch: training period. Suppose you learn from 50 images, reading each 10 times; then 1 epoch is 50 x 10 = 500 training steps. With 2 epochs this is repeated twice, for 500 x 2 = 1000 steps. For LoRA, 2-3 epochs of learning are usually enough (see the step-count sketch after the scheduler list below).
  • Save every N epochs: save a checkpoint every N epochs. If you do not need intermediate LoRA files, set this to the same value as epoch.
  • Mixed precision: the mixed-precision type used for weight data during training. Weights start out as 32-bit values, but mixing in 16-bit data saves a lot of memory and speeds up training. fp16 is a half-precision format; bf16 is designed to handle the same numeric range as 32-bit data. fp16 usually gives sufficient accuracy.
  • Save precision: the precision of the weight data saved in the LoRA file. float is 32-bit; fp16 and bf16 are 16-bit. The default is fp16.
  • Learning rate: how much the weights are changed at each step to incorporate the given images. The default is 0.0001.
  • LR Scheduler: how the learning rate is changed over the course of training. The default is cosine.

LR Scheduler value description:

  • adafactor: automatically adjusts the learning rate during training according to the situation, saving VRAM
  • constant: the learning rate does not change from beginning to end
  • constant_with_warmup: starts from a learning rate of 0, ramps up to the set value during warm-up, then uses the set value for the main training
  • cosine: gradually reduces the learning rate to 0 along a cosine curve
  • cosine_with_restarts: repeats the cosine schedule multiple times
  • linear: starts from the set learning rate and decreases linearly to 0
  • polynomial: same behavior as linear, but with a somewhat more complex decay curve
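As a sanity check on the epoch arithmetic described above, here is a small sketch of how the total number of training steps comes together (the numbers are assumptions; kohya_ss computes the real value from your dataset and settings):

```python
num_images = 50        # images in the training folder
repeats = 100          # the "100" prefix of the 100_xxxx folder name
epochs = 2
train_batch_size = 1

steps_per_epoch = num_images * repeats // train_batch_size
total_steps = steps_per_epoch * epochs
print(f"{steps_per_epoch} steps per epoch, {total_steps} steps in total")
# -> 5000 steps per epoch, 10000 steps in total
```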

Complete training parameter reference:

LoRA training parameters (bmaltais/kohya_ss Wiki): https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-parameters

Compatible with Intel graphics card

The latest version of kohya_ss adds Intel ARC GPU support, with IPEX support on Linux/WSL. Use the following settings:

  • Mixed precision: select BF16
  • Optimizer: select AdamW (or any other non-8-bit optimizer)
  • CrossAttention: select SDPA

Run setup.sh:

./setup.sh --use-ipex

Run gui.sh:

./gui.sh --use-ipex
