Fine-tuning scheme for Stable Diffusion:

There are many fine-tuning solutions for Stable Diffusion: DreamBooth, textual inversion, LoRA, hypernetworks, and aesthetic embeddings.

This article mainly explains DreamBooth; understanding it makes the other solutions easier to follow.

Figure 1: Fine-tuning process for DreamBooth (Nataniel Ruiz et al., 2023)

In the upper part of Figure 1, the following things are done:

1. Prepare a description sentence ("a chow chow dog") and 3-5 pictures corresponding to the sentence (3-5 chow chow dogs.jpg)

2. The sentence (text) and the picture (the chow chow image after adding noise) are fed into the encoder of the SD model, and the result is finally decoded back into a picture by the decoder.

3. The generated image is compared with the original image, the generator is rewarded or penalized through the loss function, and the weights are updated.

The lower part of Figure 1 is almost the same as the upper part, except that the subject feature (chow chow) is removed from the sentence, leaving just "a dog", and corresponding generic pictures (dog.jpg) are provided for training. The paper describes this part of the training as preserving the prior knowledge of the diffusion model. If we only provide 3-5 chow chow pictures for fine-tuning, the generated results easily overfit and the original base model loses its diversity. Simply put, after fine-tuning on the chow chow, the model's ability to generate other dogs is weakened (which, of course, may be exactly what we want).
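To make the two halves of Figure 1 concrete, here is a minimal data-preparation sketch in Python (using PyTorch/torchvision and PIL, my own choice rather than anything prescribed by the paper): a handful of subject photos paired with the instance prompt, and a larger folder of generic class images paired with the class prompt. The folder names and prompts are illustrative assumptions.

```python
from pathlib import Path
from PIL import Image
from torchvision import transforms

# Hypothetical local folders: 3-5 subject photos, plus generic class images for prior preservation.
INSTANCE_DIR = Path("instance_images")   # e.g. 3-5 chow chow photos
CLASS_DIR = Path("class_images")         # generic dog images
INSTANCE_PROMPT = "a chow chow dog"
CLASS_PROMPT = "a dog"

preprocess = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),   # map pixels to [-1, 1] as the VAE expects
])

def load_folder(folder, prompt):
    """Return a list of (pixel_tensor, prompt) pairs for one concept."""
    examples = []
    for path in sorted(folder.glob("*.jpg")):
        image = Image.open(path).convert("RGB")
        examples.append((preprocess(image), prompt))
    return examples

instance_examples = load_folder(INSTANCE_DIR, INSTANCE_PROMPT)   # the "chow chow" half
class_examples = load_folder(CLASS_DIR, CLASS_PROMPT)            # the "a dog" half
```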

The following is the overall loss function of DreamBooth; feel free to skip it if the notation is unfamiliar.

DreamBooth overall loss function:

$$
\mathbb{E}_{x, c, \epsilon, \epsilon', t}\Big[\, w_t \,\lVert \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon,\, c) - x \rVert_2^2 \;+\; \lambda\, w_{t'} \,\lVert \hat{x}_\theta(\alpha_{t'} x_{\mathrm{pr}} + \sigma_{t'} \epsilon',\, c_{\mathrm{pr}}) - x_{\mathrm{pr}} \rVert_2^2 \,\Big]
$$

The first half is the loss on the chow chow images: $x$ is the original image, $\alpha_t x + \sigma_t \epsilon$ is the image after adding noise, and $\hat{x}_\theta$ is the denoising model of the diffusion process, which receives the noisy image and the text condition $c$ and generates a denoised image. The difference between $x$ and $\hat{x}_\theta(\alpha_t x + \sigma_t \epsilon, c)$ is measured with an L2 pixel loss.

The second half is the loss on the generic dog images $x_{\mathrm{pr}}$ with their prompt $c_{\mathrm{pr}}$: the structure is the same as the first term, and its influence is controlled by the weight $\lambda$. The paper reports that with $\lambda = 1$, about 1000 training iterations on 3-5 pictures still give good generalization results.
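As a rough illustration of the formula, here is a small PyTorch sketch of the combined objective. The tensor names and the `prior_loss_weight` argument (playing the role of $\lambda$) are my own, not from the paper.

```python
import torch.nn.functional as F

def dreambooth_loss(pred_instance, target_instance,
                    pred_prior, target_prior,
                    prior_loss_weight=1.0):
    """Combined DreamBooth objective: instance term + lambda * prior-preservation term.

    pred_* are the model's denoising predictions, target_* the corresponding
    ground-truth targets (the clean image, or the added noise, depending on the
    chosen parameterization). prior_loss_weight is the lambda in the formula above.
    """
    instance_loss = F.mse_loss(pred_instance, target_instance)   # chow chow term
    prior_loss = F.mse_loss(pred_prior, target_prior)            # generic dog term
    return instance_loss + prior_loss_weight * prior_loss
```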

The vase is integrated into different scenes and is not limited to the background of the input image


The whole input (text + image) to output (image) process relies directly on the capabilities of the diffusion model. Let's look at what happens here in more detail; you can also learn more about the diffusion model here.

Schematic diagram of the fine-tuning process of DreamBooth https://www.youtube.com/watch?v=dVjMiJsuR5o
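Before walking through the steps, the following sketch loads the pieces of Stable Diffusion that the steps refer to, using the Hugging Face diffusers and transformers libraries (an assumption on my part; the original post does not tie itself to a particular code base). The checkpoint name is just an example.

```python
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# During DreamBooth fine-tuning only the UNet (and optionally the text encoder)
# is updated; the VAE stays frozen.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
```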

1. The sentence is split word by word (tokenized) and flattened into an array of tokens.

2. The picture is noised for $n$ steps and $n-1$ steps respectively, producing two noise maps $x_n$ and $x_{n-1}$. $x_n$ is fed into the diffusion model with the text as the conditional guide, and a picture $\hat{x}_{n-1}$ is generated through the decoder (VAE). $\hat{x}_{n-1}$ is compared with $x_{n-1}$ through the L2 loss function, and the weights are adjusted so that the two become more similar at the next iteration.

3. This process is repeated continuously; the value of the loss function fluctuates but keeps getting smaller, and the final generated picture is very similar to the original. At the same time, the sentence used for conditional guidance is also learned by the model (this method is very similar to a conditional GAN, with the text as the condition). The denoised output becomes the image that the keyword refers to from then on.
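Below is a minimal training-step sketch corresponding to the three steps above, reusing the components loaded earlier. One simplification common in practice (for example in public DreamBooth implementations): instead of decoding an image $\hat{x}_{n-1}$ and comparing it with $x_{n-1}$, the UNet is trained to predict the added noise directly and the L2 loss is taken against that noise; the idea is the same, only the parameterization differs.

```python
import torch
import torch.nn.functional as F

# Assumes tokenizer, text_encoder, vae, unet, noise_scheduler from the earlier
# sketch, plus one (pixel_values, prompt) pair from the dataset sketch above.
optimizer = torch.optim.AdamW(unet.parameters(), lr=5e-6)

def training_step(pixel_values, prompt):
    # 1. Split the sentence into tokens and encode it into conditioning vectors.
    token_ids = tokenizer(prompt, padding="max_length",
                          max_length=tokenizer.model_max_length,
                          truncation=True, return_tensors="pt").input_ids
    encoder_hidden_states = text_encoder(token_ids)[0]

    # 2. Encode the image into the VAE latent space and add noise at a random step t.
    latents = vae.encode(pixel_values.unsqueeze(0)).latent_dist.sample()
    latents = latents * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (1,)).long()
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)

    # 3. The UNet predicts the noise; the L2 loss pulls the prediction toward the truth.
    noise_pred = unet(noisy_latents, t, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```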


Finally, let's take a look at the training settings for DreamBooth in sd-webui. Now you can understand what the data referred to by "dataset" and "classification" are. In practice we still need to experiment ourselves, for example whether or not to add classification data. Combining the principles with ablation experiments gives a better sense of the cause and effect between input and output. After all, machine learning is somewhat of a dark art. (Which, in a way, makes this article redundant: you still have to try it yourself.)

Concepts setting in web-ui
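For orientation only, the concept settings roughly map onto the quantities discussed above. The field names below are illustrative (loosely following the diffusers DreamBooth training script), not necessarily the exact labels used by the web-ui extension.

```python
# Illustrative only: the exact field names in the sd-webui DreamBooth extension may
# differ; this just mirrors the ideas above (dataset = instance images,
# classification = class/prior images).
concept = {
    "instance_prompt": "a chow chow dog",     # prompt tied to your 3-5 photos
    "instance_data_dir": "instance_images",   # the dataset
    "class_prompt": "a dog",                  # generic prompt for prior preservation
    "class_data_dir": "class_images",         # the classification data
    "num_class_images": 200,                  # how many prior images to use/generate
    "prior_loss_weight": 1.0,                 # lambda in the loss above
}
```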


Origin blog.csdn.net/sinat_37574187/article/details/132518059