Step-by-step alchemy in a cloud environment: a hand-holding tutorial for training Stable Diffusion LoRA models

Many students want an exclusive AI model of their own, but training a large model from scratch is time-consuming, laborious, and out of reach for ordinary users. AI developers are aware of this problem, so they created fine-tuning methods, and LoRA is one of them. In the field of AI painting, only a small number of images are needed to train a LoRA model with a unique style, such as someone's face, a certain pose, a particular painting style, a specific object, and so on.

Training models is often jokingly called "alchemy". The term carries both high expectations and a hint of how difficult model creation is. Just as alchemy requires careful tending, training an AI model requires patience and attention to detail. Even with all the hard work, the result may not turn out as expected, and everyone needs to be mentally prepared for that.

There are plenty of introductions to the principles of LoRA online, so I won't repeat them. This article focuses on how to train a LoRA model in a cloud environment. "Cloud environment" simply means renting a cloud server instead of training on your local computer, which is especially suitable for students who want to try this out but do not have a good graphics card.

For the cloud environment I use AutoDL, which I rent often: https://www.autodl.com. This article only gives a brief introduction to AutoDL as it relates to training the LoRA model; students who want to know more can read another article I wrote: Teach you step by step how to deploy Stable Diffusion WebUI in a cloud environment.

This article uses the open-source kohya_ss project to train the LoRA model. Let's get started.

Cloud environment

You need to top up your AutoDL account before you can rent a server; 2 yuan is enough to complete this training.

Renting a server

In short: set the billing method to "Pay-As-You-Go", the region to "Inner Mongolia Area A", the GPU model to "RTX A5000", the number of GPUs to "1", and then pick a host that has an idle GPU.

For the image, select "Community Image", type "yinghuoai-kohya", and choose the image I published from the pop-up list.

Then click "Create Now" and that's it.

Wait for the server instance to start up in the AutoDL console. Once it has started successfully, several shortcut tools appear; click "JupyterLab" among them.

In JupyterLab, click the double-arrow button above the notebook. It runs some initialization steps and starts kohya_ss. When you see the "Running on local URL" message, the startup has succeeded.

Then return to the AutoDL console and click "Custom Service" in the shortcut tools to open the kohya_ss web interface.

Training directories

To manage model training effectively, I created several directories in the image, which you can browse through JupyterLab. They all live under /root/autodl-tmp; autodl-tmp is mounted on the AutoDL data disk, which saves valuable system-disk space.

  • /root/autodl-tmp/models: directory for SD large (base) models. LoRA training must be built on top of one of these.
  • /root/autodl-tmp/train: directory for training data, including input images, training parameters, the output LoRA models, and so on. We will create a separate project directory here for each training run.

The actual effect is shown in the figure below:

Quick experience

I have built a training dataset and a training configuration into the image so that everyone can quickly experience LoRA alchemy.

After opening the page through AutoDL's "Custom Service", click "LoRA" -> "Training".

  • In "Configuration file", enter the path of the training configuration file I prepared in advance: /root/autodl-tmp/train/dudu/config.json;
  • Then click "Load" to load the training parameters;
  • Finally click "Start training" to start training.

You need to check the training progress in JupyterLab. It takes about 8 minutes; when the step counter reaches 100%, training is complete and the model has been saved to the directory /root/autodl-tmp/train/dudu/model.

During training, several sample images are generated and saved in /root/autodl-tmp/train/dudu/model/sample. Open them to see how the training is going:

To actually try the model, download the model file to your local computer first and then upload it to Stable Diffusion WebUI, where you can generate images with txt2img (text-to-image). Reference parameters (a scripted alternative using the WebUI API is sketched after the list):

  • Large model: realisticVisionV51_v51VAE; other photorealistic models can also be tried.
  • Prompt: masterpiece, best quality, 1dog, solo, sitting, looking at viewer, outdoor, the background is egyptian pyramids,tall pyramids <lora:dudu_log:0.8>
  • Negative prompt: low quality, worst quality, harness, tree, bad anatomy, bad composition, poor, low effort
  • Image size: 768*768
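If you prefer to script this test instead of clicking through the WebUI, the txt2img API can be called directly. This is only a minimal sketch: it assumes the WebUI was launched with the --api flag, is reachable at 127.0.0.1:7860, and already has realisticVisionV51 selected as the checkpoint; the step count is just an example value.

```python
import base64
import requests

# Minimal txt2img call against the AUTOMATIC1111 WebUI API (requires the --api launch flag).
payload = {
    "prompt": ("masterpiece, best quality, 1dog, solo, sitting, looking at viewer, outdoor, "
               "the background is egyptian pyramids, tall pyramids <lora:dudu_log:0.8>"),
    "negative_prompt": ("low quality, worst quality, harness, tree, bad anatomy, "
                        "bad composition, poor, low effort"),
    "width": 768,
    "height": 768,
    "steps": 25,  # example value, adjust as you like
}

resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
resp.raise_for_status()

# The API returns base64-encoded PNGs; save the first one for inspection.
with open("dudu_test.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))
```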

Note: If you want to retrain this project, you need to delete the contents of the model directory first and then restart training.


A quick experience only gives you a brief taste of the charm of alchemy. To refine a perfect elixir, you need to prepare enough image material, understand the tool's parameter settings, and keep testing and optimizing the model. Next, I will explain in detail how to build your own LoRA model step by step.

Preparation

The main tasks are preparing the images to be trained on and generating prompt (caption) files for them. Then they can be sent into the alchemy furnace for refining.

Picking images

How many images are needed to train a LoRA model? I didn't find a hard requirement, but I recommend at least 10, and the images should be sharp and well detailed. If the LoRA is for a specific subject, shoot it from a variety of angles so that a better model can be trained.

If you don't have suitable images on hand, you can take them yourself, or search Baidu Images and other image sites for large, high-definition pictures.

Most tutorials on the Internet train on pretty faces, and I suspect everyone is tired of seeing them, so I chose photos of a dog instead. His name is Dudu, and he looks like the picture below. I prepared 20 images, which are already built into the AutoDL image I released.

Cropping images

The images need to be processed to the same size, for example 512*512, 512*768, or 512*640; the dimensions should be multiples of 64. Use 512 if your GPU memory is small and 768 if it is large. Here is a website for cropping pictures: BIRME - Bulk Image Resizing Made Easy 2.0 (Online & Free). The steps are shown in the figure below:

  • Select the local picture to be cropped on the left.
  • On the right are the cropping settings, where you can set the cropping size, etc.

Note that the RENAME option keeps the output file names orderly, which makes them easier for the training program to handle. "xxx" stands for a three-digit number, and the starting number below it sets where the numbering begins.
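If you prefer not to use a website, the same batch crop-and-resize can be done locally with a few lines of Python. This is only a sketch: it assumes Pillow is installed, and the input and output folder names are placeholders you should replace with your own.

```python
from pathlib import Path
from PIL import Image, ImageOps

src_dir = Path("raw_photos")    # placeholder: folder with the original photos
dst_dir = Path("cropped_768")   # placeholder: output folder
dst_dir.mkdir(exist_ok=True)

size = (768, 768)               # use (512, 512) if GPU memory is tight

for i, path in enumerate(sorted(src_dir.glob("*.jpg")), start=1):
    img = Image.open(path).convert("RGB")
    # ImageOps.fit scales and center-crops the image to exactly the target size.
    img = ImageOps.fit(img, size)
    # Three-digit sequential names, matching the BIRME "xxx" rename pattern.
    img.save(dst_dir / f"{i:03d}.jpg", quality=95)
```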

Placing the images

After the images are processed, they need to be placed in a specific directory.

First create a project directory under /root/autodl-tmp/train; I will use the dog's name, dudu. Inside it, create an img directory for the processed images. However, the images cannot be placed directly under img; you also need to create a subdirectory such as 100_dudu. The name of this directory matters: the leading 100 means each image will be learned 100 times per epoch, and the dudu after the underscore is the subject (concept) name.

There is no fixed standard for the number of repeats. A common recommendation is 50-100 for photographic images and 15-30 for anime-style images. If the trained model ends up over-fitting, for example the prompt says blue eyes but every generated image has black eyes, try reducing the repeat count.
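To avoid typos, the whole project skeleton can be created in one go from a JupyterLab terminal or notebook cell. A minimal sketch, following the dudu example above (the model and log folders are used later as the output and logging directories):

```python
import os

project = "/root/autodl-tmp/train/dudu"

# img/100_dudu holds the training images (100 repeats, concept name "dudu");
# model and log will be used later as the output and logging folders.
for sub in ("img/100_dudu", "model", "log"):
    os.makedirs(os.path.join(project, sub), exist_ok=True)
```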

Upload the cropped pictures to the directory /root/autodl-tmp/train/dudu/img/100_dudu, as shown below:

Captioning the images

Captioning (tagging) means writing prompt words for each image. Usually an image-to-prompt (interrogator) tool is used to generate the captions, which are then edited by hand as needed.

After kohya_ss has started, open the web page and go to "Utilities" -> "Captioning" -> "BLIP Captioning".

In "Image folder to caption", enter the directory of images to caption. Here it is /root/autodl-tmp/train/dudu/img/100_dudu.

"Caption file extension" is the file extension of the generated caption files.

"Prefix to add to BLIP caption" is a fixed prefix added to every generated caption. If such a prefix is present during training, it can later serve as a convenient trigger word for the LoRA when generating images, although in my experience triggering is not guaranteed. There is also a "Postfix to add BLIP caption" parameter, which is a fixed suffix; the prefix takes priority when the images are processed.

Just use the default settings for other parameters. Those who are interested can research it, but I won’t go into details here.

Finally click on “Caption images”.

Note that this tool shows no progress on the web page; you need to watch the shell or console output. When you see a 100% progress bar and a "captioning done" message, captioning is complete.
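If the built-in tool gives you trouble, captions can also be produced outside kohya_ss with the BLIP model from Hugging Face transformers. A rough sketch, assuming the transformers and Pillow packages are installed; the checkpoint named below is the public BLIP base model, not necessarily the exact one kohya_ss uses, and the "a photo of dudu, " prefix is just an example trigger prefix.

```python
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

img_dir = Path("/root/autodl-tmp/train/dudu/img/100_dudu")

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

for path in sorted(img_dir.glob("*.jpg")):
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=50)
    caption = processor.decode(out[0], skip_special_tokens=True)
    # One .txt caption file per image with the same base name, mirroring kohya's output;
    # the leading prefix is an optional trigger word, like "Prefix to add to BLIP caption".
    path.with_suffix(".txt").write_text("a photo of dudu, " + caption + "\n")
    print(path.name, "->", caption)
```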

If we now open the training image directory, we can see that a corresponding caption file has been generated for each image.

We can double-click these txt files to view their contents, and edit any that are not well written.

The generated captions usually have some issues:

  • For a painting-style LoRA, all generated tags can be kept, but training may then need a few more epochs and take longer.
  • For a LoRA of a specific character, if you want to bake a certain feature into the character itself, delete the corresponding tag, such as long hair; long hair will then be stored in the LoRA as part of the character. The downside is possible over-fitting at generation time: prompts stop taking effect, for example entering short hair still produces long hair, and in severe cases the image may come out broken or blurry. (A small script for this kind of batch tag editing is sketched after this list.)
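As mentioned in the list above, you sometimes want to strip a tag such as "long hair" from every caption file at once. A minimal sketch, assuming the captions are comma-separated tags (the tag and directory are just examples):

```python
from pathlib import Path

caption_dir = Path("/root/autodl-tmp/train/dudu/img/100_dudu")
tag_to_drop = "long hair"   # example: the feature you want baked into the LoRA

for txt in caption_dir.glob("*.txt"):
    # Split into tags, drop the unwanted one, and write the file back.
    tags = [t.strip() for t in txt.read_text().split(",")]
    kept = [t for t in tags if t and t != tag_to_drop]
    txt.write_text(", ".join(kept) + "\n")
```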

Editing the captions is a big job; to keep this demonstration quick, we will not modify them here.

The tool also offers three other captioning methods besides BLIP, but I have not managed to run them successfully. If you are interested, feel free to dig into them.

It is also completely fine not to use the captioning tool built into kohya_ss. For example, the "Train" tab of SD WebUI can also crop images and interrogate prompts; you just need to upload the resulting caption files to the training directory here yourself. Here is another caption-editing tool worth trying: https://github.com/starik222/BooruDatasetTagManager.

Training

Parameter settings

There are many training parameters, and they touch on deep-learning concepts. Students who have never been exposed to them may find this a headache, but don't worry; I will try to explain the main parameters clearly.

After kohya_ss starts, enter "LoRA" -> "Training" in sequence.

Source Model

This sets the Stable Diffusion large model used as the base for training. "Model Quick Pick" lets you choose some standard SD base models, which are downloaded from HuggingFace when training starts. In my own runs I do it differently: I select custom here and upload a model myself. Because the training images are photos of a real dog, I use realisticVisionV51, a photorealistic model (students using my AutoDL image do not need to upload it; it is already built in).

Folders

Set the input and output directories when training the model.

  • Image folder: the training dataset directory, i.e. the directory holding the original images. Note that it should point to the img level, not to the subdirectory that directly contains the images. The full path here is /root/autodl-tmp/train/dudu/img.
  • Output folder: the directory where the trained LoRA models are saved; the sample images produced during training are also saved here. Put it under the same parent directory as the Image folder. The full path here is /root/autodl-tmp/train/dudu/model.
  • Logging folder: as the name suggests, the training log directory, also under the same parent as the Image folder. The full path here is /root/autodl-tmp/train/dudu/log.
  • Model output name: the file-name prefix of the trained LoRA model.

Parameters

Now we get to the real parameter settings; everything before this was just an appetizer.

Let's look at the basic parameters first (a consolidated sketch of the values I use follows the full list):

  • Train batch size: the number of sample images trained at the same time. The default is 1; with 12 GB of video memory or more it can be set to 2-6. Set it according to your actual memory usage; the larger the value, the faster training goes.
  • Epoch: the number of training rounds; one round means training over all sample images once. Usually several rounds are needed, and the best round's model is then chosen based on actual image generation. More rounds means more training time.
  • Save every N epochs: save the model every N rounds. We want to test the output of every round, so enter 1 here.
  • Caption Extension: the extension of the caption files paired with the sample images. Captioning earlier produced .txt files, so enter .txt here.
  • Mixed precision and Save precision: control the floating-point precision used in computation; fp16 is fine and saves memory. bf16 has slightly lower precision but a larger representable range and easier type conversion, and it requires a graphics card that supports it.
  • Number of CPU threads per core: the number of threads per CPU core, i.e. whether one core can do two things at once. It is usually 2, and the server I rented is also 2; you can check with a command such as lscpu. If you are not sure, set it to 1.
  • Seed: the random seed used for training; any number works. If you later want to continue improving a previously trained model (by filling it into "LoRA network weights"), use the same seed.
  • Cache latents: check it to make training faster.
  • Learning rate: can be understood as the step size of each learning update. The smaller the value, the slower the training; the larger the value, the bigger each step, which makes it hard to capture the pattern and hard for the model to converge. Convergence is the process of the model steadily improving through training; difficulty converging means the model cannot be optimized and its outputs deviate too much from the sample images.

The next few settings are all about the learning rate, that is, how to make the model converge quickly and well. They are all algorithmic choices; start with my defaults and change them if they don't work for you.

    • LR Scheduler: Learning rate scheduler, which automatically adjusts the learning rate. I usually use constant.
    • LR warmup (% of steps): the percentage of warm-up steps; only meaningful when "LR Scheduler" is "constant_with_warmup". It controls over how many steps the learning rate gradually ramps up at the start of training.
    • Optimizer: updates the model's weights, biases, and other parameters to better fit the data; some optimizers also adjust the learning rate directly. Try AdamW8bit first.
  • Max resolution: the maximum resolution for training; just set it to the resolution of the sample images.
  • Enable buckets: when enabled, sample images of different resolutions are supported and the program sorts and crops them automatically. Since I have already cropped everything to 768, it doesn't matter whether this is checked.
  • Text Encoder learning rate: the learning rate of the text encoder; starting from 0.00005 is recommended. The Unet learning rate below it should be larger; set it to 0.0001. When these are set, the general Learning rate above is ignored, so keep them consistent.
  • Network Rank: the dimension of the LoRA network; the default is 8, and 32, 64, or 128 is recommended. The larger the value, the more detail the model can capture, and the larger the resulting model file. Keep Network Alpha at the same value.
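As promised above, here is a consolidated sketch of the values just discussed, written as a Python dictionary and dumped to JSON so you can keep it alongside the project. Treat it as a checklist rather than a drop-in file: the key names are my own shorthand and the checkpoint file name is assumed, not necessarily the exact fields or path kohya_ss writes into its config.json.

```python
import json

# Illustrative only: my own key names, summarizing the settings described above.
train_settings = {
    "pretrained_model": "/root/autodl-tmp/models/realisticVisionV51_v51VAE.safetensors",  # assumed file name
    "train_data_dir": "/root/autodl-tmp/train/dudu/img",
    "output_dir": "/root/autodl-tmp/train/dudu/model",
    "logging_dir": "/root/autodl-tmp/train/dudu/log",
    "output_name": "dudu_log",
    "train_batch_size": 1,
    "epoch": 3,
    "save_every_n_epochs": 1,
    "caption_extension": ".txt",
    "mixed_precision": "fp16",
    "save_precision": "fp16",
    "seed": 1234,              # any number; reuse it when refining an existing LoRA
    "cache_latents": True,
    "learning_rate": 0.0001,
    "lr_scheduler": "constant",
    "optimizer": "AdamW8bit",
    "max_resolution": "768,768",
    "text_encoder_lr": 0.00005,
    "unet_lr": 0.0001,
    "network_dim": 32,         # Network Rank
    "network_alpha": 32,       # keep equal to the rank
    "clip_skip": 1,
    "cross_attention": "xformers",
}

with open("/root/autodl-tmp/train/dudu/my_settings.json", "w") as f:
    json.dump(train_settings, f, indent=2)
```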

In the advanced parameters (Advanced), let’s take a look at these:

  • Clip skip: the default is 1. Setting it greater than 1 skips the last layers of the text encoder during training, which can help avoid over-fitting and improve generalization, but if it is too large, features will be lost. It is best to use the same value as the base model was trained with.
  • Memory efficient attention: optimizes graphics memory usage but slows down training.
  • CrossAttention: an acceleration method for the attention that links images and prompts. xformers is the usual choice; it speeds things up and reduces video memory usage, but it only works on NVIDIA cards (support for other cards may come later).

Finally, there are the sampling parameters (Samples), which can be used to monitor the training effect:

Sample every n steps: Generate a picture every N steps of learning.

Sample every n epochs: generate a picture every N training rounds. In my tests, when this is set it overrides Sample every n steps.

Sample sampler: sampler, the same as the default sampler in SD WebUI.

Sample prompts: the sampling prompt, which can also include the negative prompt, image size, sampling steps, and so on (an example line is shown below).
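For reference, a sample prompt packs everything into one line using "--" options (negative prompt, width, height, steps, seed). The syntax below follows the kohya sd-scripts documentation as I remember it, so double-check it against the official docs before relying on it:

```
masterpiece, best quality, 1dog, solo, sitting, looking at viewer --n low quality, worst quality --w 768 --h 768 --s 25 --d 1
```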

Running the training

After clicking "Start training", go back to the console to watch the progress.

Because each image is repeated 100 times and the batch size is 1, one epoch over 25 images takes 2,500 steps; with 3 epochs, the total is 7,500 steps.
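The same arithmetic in a couple of lines, which you can reuse to estimate the length of your own runs:

```python
num_images, repeats, epochs, batch_size = 25, 100, 3, 1

steps_per_epoch = num_images * repeats // batch_size   # 2500
total_steps = steps_per_epoch * epochs                  # 7500
print(steps_per_epoch, total_steps)
```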

After the training is completed, you can see a 100% prompt and the model has been saved to the corresponding directory.

Testing

Once the model is trained, how do you know whether it works well? By actually using it to generate images, of course, and seeing what cards you draw.

The simple way is to test one combination at a time: different weights, prompts, base models, and so on.

Here is a quicker comparison method using the X/Y/Z plot script.

Add variables to the prompt and reference the LoRA model as shown below:

Note here: <lora:dudu_log-NUM:WEIGHT>, NUM and WEIGHT are two variables.

  • NUM: because I trained for several rounds and obtained several LoRA models, I want to test the models from different rounds. Their names follow a pattern: dudu_log-000001, dudu_log-000002, and so on; each additional round increments the serial number by 1. NUM is the variable that will be replaced by 000001, 000002, and so on.
  • WEIGHT: the weight applied to the LoRA model; we want to test how the model performs at different weights.

The X/Y/Z plot script is at the bottom of the txt2img and img2img pages:

Script: select "X/Y/Z plot":

X type: Prompt S/R, X values: NUM,000001,000002

Y type: Prompt S/R, Y values: WEIGHT,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1

Grid margins: 2, to visually separate the generated images.

Then generate; the resulting comparison grid looks like this:

Now we can compare the results of different models and weights. Here, model 000002 is closer to the real photos; weights between 0.5 and 0.9 look better, while 1.0 is a bit over-fitted.

Many other dimensions can be tested with the X/Y/Z plot, such as base models, sampling steps, samplers, CFG scale (prompt guidance), and so on. Try more of them if you are interested.

Optimization

Here I share some of my experience and understanding; some points were mentioned above, and this is a summary.

  • The training images must be high-definition and sharp, not blurry, and should show the subject from as many angles as possible; the quality of the source images makes a huge difference to the result. If you don't have high-definition images, you can redraw blurry ones with img2img in SD WebUI, or use other software to upscale them.
  • Training captions: remove from the captions the features you want the character to keep when generating images, and add the features you are willing to have replaced. For example, if you want the character to keep long hair, delete long hair from the captions, and the generated images will then most likely all have long hair; if you want the eye color to be replaceable, write black eyes into the training captions, and at generation time you can then use a blue eyes prompt to change the eye color.
  • Multiple rounds of training: a single round may not give good results. If cost allows, train for a few more rounds, then compare the models from different rounds and pick the most suitable one.
  • Number of repeats: how many times should each image be learned? Too few and not enough features are extracted; too many and the model over-fits easily. 15-30 repeats are recommended for anime-style images and 50-100 for others. If you have fewer training images, give each one more repeats; if you have more images, each one can have fewer.

Download

The models, plug-ins, and generated images used in this article have all been uploaded to my collection of SD painting resources, which will keep being updated. If you need them, follow the public account Yinghuo Walk AI (yinghuo6ai) and send the message "SD" to get the download link.


That's the main content of this article. If you have any questions, please leave a comment.
