Further observations on effective parameter tuning in diffusion models

Summary:

Large-scale diffusion models such as Stable Diffusion [31] are very powerful and find a variety of real-world applications, but customizing such models through full fine-tuning is inefficient in memory and time. Motivated by recent advances in natural language processing, we study parameter-efficient tuning in large diffusion models by inserting small learnable modules (called adapters). Specifically, we decompose the adapter's design space into orthogonal factors (the input position, the output position, and the functional form) and perform an analysis of variance (ANOVA), a classic statistical method for analyzing the correlation between discrete variables (design options) and continuous variables (evaluation metrics). Our analysis shows that the input position of the adapter is the critical factor affecting downstream-task performance. We then carefully study the choice of input position and find that placing the input position after the cross-attention block yields the best performance, which is verified by additional visual analysis. Finally, we provide a method for parameter-efficient tuning in diffusion models that matches, if not outperforms, fully fine-tuned baselines (such as DreamBooth) on various customization tasks with only 0.75% additional parameters. Our code can be found at https://github.com/Xiang-cd/unet-finetune

1. Introduction:

Diffusion models have recently become popular due to their ability to generate high-quality and diverse images. They perform particularly well in conditional generation tasks by interacting with conditional information during the iterative generation process, which has inspired applications in downstream tasks such as text-to-image generation, image-to-image translation, and image restoration.

Armed with knowledge gained from massive data, large-scale diffusion models demonstrate strong priors in downstream tasks. For example, DreamBooth adjusts all parameters of a large-scale diffusion model to generate the specific objects a user wants. However, fine-tuning the entire model is inefficient in terms of computation, memory, and storage. An alternative is parameter-efficient transfer learning (the focus of this article), which originated in the field of natural language processing (NLP). These methods insert small trainable modules (called adapters) into the model and freeze the original model. However, parameter-efficient transfer learning has not been deeply studied for diffusion models. Compared with the transformer-based language models used in NLP, the U-Net architecture widely used in diffusion models contains more components, such as residual blocks with down/up-sampling operators, self-attention, and cross-attention. This leads to a larger design space for parameter-efficient transfer learning than in transformer-based language models.

This paper provides the first systematic study of the design space for parameter-efficient tuning in large-scale diffusion models. We use Stable Diffusion as the concrete case because it is currently the only open-source large-scale diffusion model. In particular, we decompose the adapter's design space into orthogonal factors: the input position, the output position, and the functional form. By analyzing the between-group differences of these factors with ANOVA in an experimental study, we find that the input position is the key factor affecting downstream-task performance. We then carefully study the choice of input position and find that placing the input position after the cross-attention block best encourages the network to perceive changes in the input prompt (see Figure 11), resulting in the best performance.

Based on our research, our optimal setting achieves results comparable to the fully fine-tuned approach (i.e., on par with DreamBooth).

2. Background:

2.1 Diffusion model:

2.2 The architecture of Stable Diffusion:

Figure 3. Background. The upper left shows the overall architecture of the U-Net-based diffusion model. The upper right shows how the diffusion model removes noise from noisy data over T-1 steps. The lower part of the figure shows the structure of the residual block and the transformer block. Adapters (red blocks in the figure) are modules with few parameters inserted into the model for parameter-efficient transfer learning.

Currently, the most popular diffusion-model architecture is based on U-Net. Specifically, the U-Net-based architecture of Stable Diffusion is shown in Figure 3. The U-Net consists of stacked basic blocks, each containing a transformer block and a residual block. The transformer block has three sub-layers: a self-attention layer, a cross-attention layer, and a fully connected feed-forward network. The attention layer operates on queries Q ∈ R^{n×d_k} and key-value pairs K ∈ R^{m×d_k}, V ∈ R^{m×d_v}:

Attn(Q, K, V) = softmax(QK^T / √d_k) V

where n is the number of queries, m is the number of key-value pairs, d_k is the dimension of the keys, and d_v is the dimension of the values. In the self-attention layer, x ∈ R^{n×d_x} is the only input. In the cross-attention layer of a conditional diffusion model, there are two inputs, x ∈ R^{n×d_x} and c ∈ R^{m×d_c}, where x is the output of the previous block and c represents the condition information. The fully connected feed-forward network consists of two linear transformations with a ReLU activation function:

FFN(x) = ReLU(xW_1 + b_1)W_2 + b_2

where W_1 ∈ R^{d×d_m} and W_2 ∈ R^{d_m×d} are learnable weights, and b_1 ∈ R^{d_m} and b_2 ∈ R^{d} are learnable biases. The residual block consists of a series of convolutional layers and activations, and the time embedding is injected into the residual block through an additive operation.
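To make the notation above concrete, here is a minimal PyTorch sketch of the attention and feed-forward sub-layers, following the formulas just given. The class names, dimensions, and example shapes are illustrative assumptions, not the actual Stable Diffusion implementation.

```python
import math
import torch
import torch.nn as nn


class Attention(nn.Module):
    """Attn(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, with Q from x and K, V from c.

    For self-attention, construct with d_c = d_x and pass c = x.
    """
    def __init__(self, d_x, d_c, d_k, d_v):
        super().__init__()
        self.to_q = nn.Linear(d_x, d_k)   # queries come from x
        self.to_k = nn.Linear(d_c, d_k)   # keys and values come from the condition c
        self.to_v = nn.Linear(d_c, d_v)
        self.d_k = d_k

    def forward(self, x, c):
        q, k, v = self.to_q(x), self.to_k(c), self.to_v(c)
        weights = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return weights @ v                # shape (n, d_v)


class FeedForward(nn.Module):
    """FFN(x) = ReLU(x W_1 + b_1) W_2 + b_2 with hidden width d_m."""
    def __init__(self, d, d_m):
        super().__init__()
        self.linear1 = nn.Linear(d, d_m)
        self.linear2 = nn.Linear(d_m, d)

    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))


# Example shapes: n = 16 image tokens, m = 77 text tokens (d_c is the text-embedding width).
x = torch.randn(16, 320)
c = torch.randn(77, 768)
out = Attention(d_x=320, d_c=768, d_k=64, d_v=320)(x, c)   # (16, 320)
```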

2.3 Parameter efficient transfer learning:

Transfer learning is a technique that uses knowledge learned from one task to improve performance on related tasks. The paradigm of pre-training and then transferring to downstream tasks is widely used. However, traditional transfer learning methods update a large number of parameters, which is computationally expensive and memory-intensive.

Parameter-efficient transfer learning was first proposed in the field of natural language processing. Its key idea is to reduce the number of updated parameters, which can be achieved by updating only part of the model or by adding small extra modules. Some parameter-efficient transfer learning methods (e.g., Adapter [16], LoRA [17]) add small extra modules, called adapters, to the model. In contrast, other methods (prefix tuning [22], prompt tuning [21]) prepend learnable vectors to activations or inputs. A large body of work has shown that, in natural language processing, such parameter-efficient fine-tuning can achieve competitive results with far fewer trainable parameters.
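The following sketch illustrates the key idea in PyTorch: freeze every pretrained parameter and leave only the inserted adapter modules trainable. The function name and its arguments are illustrative, not code from the paper.

```python
import torch.nn as nn


def freeze_and_count(model: nn.Module, adapters: nn.Module) -> float:
    """Freeze the pretrained weights; only the adapter parameters stay trainable.

    `model` and `adapters` are placeholders for the pretrained network and the
    inserted modules. Returns the fraction of parameters that remain trainable.
    """
    for p in model.parameters():
        p.requires_grad_(False)
    for p in adapters.parameters():
        p.requires_grad_(True)
    trainable = sum(p.numel() for p in adapters.parameters())
    total = sum(p.numel() for p in model.parameters()) + trainable
    return trainable / total
```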

3. Design space for parameter-efficient learning in diffusion models

Despite the success of parameter-efficient transfer learning in natural language processing, this technique is not yet well understood for diffusion models, due to the presence of components such as residual blocks and cross-attention. Before analyzing parameter-efficient tuning in diffusion models, we decompose the adapter's design space into three orthogonal factors: the input position, the output position, and the functional form. Stable Diffusion [31] is considered in this work because it is currently the only open-source large-scale diffusion model (its U-Net-based architecture is shown in Figure 3).

Below we detail the input positions, output positions, and functional forms based on the Stable Diffusion architecture.

3.1 Input position and output position:

The input position is where the adapter's input comes from, and the output position is where the adapter's output is placed. For ease of understanding, as shown in Figure 4, positions are named after their adjacent layers. For example, SAin denotes the position of the input of the self-attention layer, Transout denotes the output of the transformer block, and CAc denotes the conditional input of the cross-attention layer.

Figure 4. Description of activation positions. The main name of an activation position is an alias for a specific block in the model, and its subscript describes the relationship between the activation and that block.

In our framework, the input position can be any of the activation positions described in Figure 4, so there are 10 different choices of input position in total. As for the output, since addition is commutative, some positions are equivalent; for example, adding the output at SAout is equivalent to adding it at CAin. The choices of output position therefore reduce to 7 in total. Another constraint is that the output position must come after the input position.
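The bookkeeping behind these counts can be sketched as follows. Only the position names mentioned in the text (SAin, CAc, CAout, Transout) are taken from the paper; the remaining names, their ordering, and which output positions get merged as equivalent are placeholders for illustration, not the authors' exact enumeration.

```python
from itertools import product

# Activation positions in forward order through a transformer block (illustrative).
positions = ["SA_in", "SA_out", "CA_in", "CA_c", "CA_out",
             "FFN_in", "FFN_out", "Trans_out"]
order = {p: i for i, p in enumerate(positions)}

input_options = positions
# Positions whose additive outputs coincide are merged, e.g. adding at SA_out is
# the same as adding at CA_in, so only one representative is kept (illustrative).
output_options = [p for p in positions if p not in ("SA_out", "FFN_out", "CA_c")]

valid_pairs = [(i, o) for i, o in product(input_options, output_options)
               if order[o] > order[i]]   # the output must come after the input
print(len(valid_pairs), "candidate (input, output) pairs")
```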

3.2 Adapter model architecture:

The functional form describes how the adapter transforms input into output. We give the functional forms of the adapter in the transformer block and the residual block respectively (see Figure 5), where both contain a downsampling operator, an activation function, an upsampling operator and a scaling factor. The downsampling operator reduces the dimensionality of the input, and the upsampling operator increases the dimensionality of the input to ensure that the output has the same dimensionality as the input. The output is further multiplied by a scaling factor s to control the strength of its influence on the original network.

Specifically, the transformer block adapter uses low-rank matrices Wdown and Wup as its down- and up-sampling operators, while the residual block adapter uses 3×3 convolution layers Convdown and Convup. Note that these convolutional layers only change the number of channels, not the spatial size. In addition, the residual block adapter applies group normalization [38] to its input.

Our design choices also include different activation functions and scaling factors: the activation function can be ReLU, Sigmoid, SiLU, or Identity, and the scale factor can be 0.5, 1.0, 2.0, or 4.0.
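To make the two functional forms concrete, here is a minimal PyTorch sketch of the adapters described above (and shown in Figure 5). The default widths, rank, and bottleneck size are placeholders rather than the paper's settings; in use, the adapter's output is added to the activation at the chosen output position.

```python
import torch
import torch.nn as nn


class TransformerBlockAdapter(nn.Module):
    """Low-rank bottleneck: W_down, activation, W_up, then multiplication by s."""
    def __init__(self, dim=320, rank=8, act=nn.Identity(), scale=1.0):
        super().__init__()
        self.down = nn.Linear(dim, rank)   # W_down: reduce the dimensionality
        self.up = nn.Linear(rank, dim)     # W_up: restore the dimensionality
        self.act, self.scale = act, scale

    def forward(self, x):
        return self.scale * self.up(self.act(self.down(x)))


class ResidualBlockAdapter(nn.Module):
    """GroupNorm, 3x3 conv down, activation, 3x3 conv up, then multiplication by s."""
    def __init__(self, channels=320, bottleneck=32, groups=32, act=nn.SiLU(), scale=1.0):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=3, padding=1)
        self.act, self.scale = act, scale

    def forward(self, h):                  # h: (B, C, H, W); the spatial size is preserved
        return self.scale * self.up(self.act(self.down(self.norm(h))))
```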

Figure 5. Model architecture of the adapters in transformer blocks and residual blocks.

4. Discovering the key factor with analysis of variance (ANOVA)

As mentioned before, finding the optimal solution in such a large discrete search space is challenging. To discover which factor in the design space has the greatest impact on performance, we quantify the correlation between model performance and each factor using one-way analysis of variance (ANOVA), a method widely used in many fields, including psychology, education, biology, and economics.

The main idea behind ANOVA is to split the total variation in the data into two parts: within-group variation (MSE) and between-group variation (MSB). MSB measures the differences in means between groups, while MSE measures the differences between individual observations and the mean of their respective group. The statistical test used in ANOVA is based on the F-distribution and compares the ratio of between-group variation to within-group variation (the f-statistic). If the f-statistic is large enough, the group means differ significantly, indicating a strong correlation.
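As a concrete illustration, a one-way ANOVA over runs grouped by a single design factor can be computed with scipy. The scores below are made-up numbers, not results from the paper.

```python
from scipy.stats import f_oneway

# Hypothetical CLIP-similarity scores of runs, grouped by adapter input position.
scores_by_input_position = {
    "CA_out":    [0.82, 0.80, 0.79, 0.81],
    "CA_c":      [0.80, 0.78, 0.81, 0.79],
    "Trans_out": [0.68, 0.65, 0.70, 0.66],
}

# One-way ANOVA across the groups: a large f-statistic means the grouping factor matters.
f_stat, p_value = f_oneway(*scores_by_input_position.values())
print(f"f = {f_stat:.2f}, p = {p_value:.4f}")
```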


Figure 6. The relationship between the performance of the adapter (i.e., CLIP similarity↑) and the input and output positions in the DreamBooth task.

 Figure 7. The relationship between performance (i.e. FID↓) and the input and output positions of the adapter in the fine-tuning task (my focus).

5. Experiment

 We first introduce our experimental setup in Section 5.1. We then analyze which factor in the design space is the most critical in Section 5.2. After discovering the importance of input location, we conduct a detailed ablation study on it in Section 5.3. Finally, we present a comprehensive comparison between our optimal setup and DreamBooth (i.e., fine-tuning all parameters) in Section 5.4.

5.1 Settings

Tasks and datasets. For diffusion models, we consider two transfer-learning tasks.

The DreamBooth task. The first task is to personalize a diffusion model using fewer than 10 input images, as proposed in DreamBooth [32]. For simplicity, we call it the DreamBooth task. The DreamBooth training set consists of two parts: personalized data and regularization data (i.e., fine-tuning with the DreamBooth method?). The personalized data are images of a specific object (for example, a white dog) provided by the user. The regularization data are images of general objects similar to the personalized data (e.g., dogs of different colors). There are fewer than 10 personalized images, while the regularization data can be collected or generated by the model. DreamBooth uses a rare token [V] and a class word Cclass to distinguish the regularization data from the personalized data. Specifically, for regularization data the prompt is "a photo of Cclass"; for personalized data the prompt is "a photo of [V] Cclass", where Cclass is a word describing the general category of the data (such as dog). We collect personalized data from the Internet and from our own photos, as well as from DreamBooth (33 objects in total). We use Stable Diffusion itself to generate the corresponding regularization data, conditioned on the prompt "a photo of Cclass".
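For illustration, the two prompt templates can be paired with the data roughly as follows. The directory paths and the choice of rare token ("sks") are assumptions for the sketch, not taken from the paper.

```python
from pathlib import Path

rare_token, class_word = "sks", "dog"   # "[V]" stands for a rare token; "sks" is a placeholder choice

personalized = [(p, f"a photo of {rare_token} {class_word}")
                for p in Path("data/personalized").glob("*.jpg")]
regularized = [(p, f"a photo of {class_word}")
               for p in Path("data/regularized").glob("*.jpg")]

training_pairs = personalized + regularized   # each entry: (image path, prompt)
```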

The fine-tuning task. The other task is to fine-tune on a small set of text-image pairs. For simplicity, we call it the fine-tuning task. Following [39], we consider fine-tuning on the flower dataset [27] with 8,189 images and use the same settings. We caption each image with the prompt "a photo of Fname", where Fname is the flower name of the image's class.

We use the AdamW [23] optimizer. For the DreamBooth task, we set the learning rate to 1e-4, which allows both DreamBooth and our method to converge in around 1k steps; the adapter size is fixed to 1.5M parameters (0.17% of the U-Net) and trained for 2.5k steps. For the task of fine-tuning on a small set of text-image pairs, we set the learning rate to 1e-5, fix the adapter size to 6.4M parameters (0.72% of the U-Net), and train for 60k steps.

To improve sampling efficiency, we choose DPM-Solver [24] as the sampling algorithm with 25 sampling steps and a classifier-free guidance (cfg) [15] scale of 7.0. In some cases we use a cfg scale of 5.0 for better image quality.
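As a rough sketch, this sampling configuration maps onto the diffusers library as follows; the checkpoint name and prompt are placeholders, and this is not the paper's code.

```python
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

# Load a Stable Diffusion checkpoint (name assumed) and swap in a DPM-Solver scheduler.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

image = pipe(
    "a photo of sks dog",
    num_inference_steps=25,   # 25 DPM-Solver sampling steps
    guidance_scale=7.0,       # classifier-free guidance (cfg) scale
).images[0]
image.save("sample.png")
```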

For the DreamBooth task, we use the image distance in CLIP space proposed in [10] to evaluate fidelity. Specifically, for each personalization target, we generate 32 images using the prompt "a photo of [V] Cclass". The metric is the average pairwise cosine similarity in CLIP space (CLIP similarity) between the generated images and the images of the personalized training set.
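A minimal sketch of this metric, assuming the CLIP model from the transformers library; the model name and helper functions are illustrative, not the paper's evaluation code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_embed(paths):
    """Embed a list of image files and L2-normalize the features."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)


def clip_similarity(generated_paths, reference_paths):
    """Average pairwise cosine similarity between generated and reference images."""
    g, r = clip_embed(generated_paths), clip_embed(reference_paths)
    return (g @ r.T).mean().item()
```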

For the task of fine-tuning on a small set of text-image pairs, we use the FID score [13] to evaluate the similarity between the training images and the generated images. We randomly sample 5k prompts from the training set, use them to generate images, and then compute the FID between the generated images and the training images.
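A minimal sketch of the FID computation using torchmetrics, assuming the images are already available as uint8 tensors; the random tensors below are placeholders for real training and generated images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Placeholders: uint8 image tensors of shape (N, 3, H, W) with values in [0, 255].
training_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)
generated_images = torch.randint(0, 256, (16, 3, 299, 299), dtype=torch.uint8)

fid.update(training_images, real=True)
fid.update(generated_images, real=False)
print("FID:", fid.compute().item())
```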

5.2 Analysis of variance (ANOVA) in the design space

Recall that we decomposed the design space into the factors input position, output position, and functional form. We apply ANOVA to these design dimensions (see Section 4 for details). For efficiency, we consider the DreamBooth task, as it requires fewer training steps. As shown in Figure 8, the f-statistic is large when runs are grouped by input position, indicating that the input position is a key factor affecting model performance. When grouped by output position, the correlation is weaker. When grouped by functional form (activation function and scaling factor), the f-statistic is around 1, meaning the between-group variability is similar to the within-group variability, i.e., there is no significant difference between group means. We further visualize performance under different input and output positions: Figure 6 shows the results for the DreamBooth task, and Figure 7 shows the FID results for the fine-tuning task.

Figure 8. f-statistics of the ANOVA when grouping by input position, output position, activation function, and scaling factor. The f-statistic is much larger when grouping by input position, indicating a significant correlation with the input position.

As discussed above, we conclude that the input position of the adapter is the key factor affecting the performance of parameter-efficient transfer learning.

5.3 Ablation study of the input position

As shown in Figures 6 and 7, adapters with input position CAc or CAout perform well on both tasks. In Figure 9, we present samples generated by personalized diffusion models with different adapter input positions. Adapters with inputs at CAc or CAout generate personalized images comparable to fine-tuning all parameters, while adapters with inputs elsewhere do not.

Figure 9. Samples generated by personalized diffusion models with different adapter input positions. All samples are conditioned on "a photo of [V] Cclass". Note that successful methods generate correct images, while failed methods tend to generate images similar to the regularization data.

We further compute the difference between the noise predictions given the prompts "a photo of [V] Cclass" and "a photo of Cclass". The pipeline is shown in Figure 10: we first add noise to an image from the regularization data, use the U-Net to predict the noise given each of the two prompts, and visualize the difference between the two predicted noise values. As shown in Figure 11, adapters with input position CAc or CAout show a clear difference between the two noise predictions.
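A hedged sketch of this pipeline, assuming a diffusers-style pipeline object with the (possibly adapter-tuned) U-Net already loaded; obtaining the VAE latent of the regularization image is omitted, and the helper names are illustrative.

```python
import torch


@torch.no_grad()
def noise_prediction_difference(pipe, latent, t, prompt_a, prompt_b):
    """Noise a latent at step t, predict the noise under two prompts,
    and return the absolute difference map between the predictions."""
    def embed(prompt):
        ids = pipe.tokenizer(prompt, padding="max_length",
                             max_length=pipe.tokenizer.model_max_length,
                             return_tensors="pt").input_ids.to(pipe.device)
        return pipe.text_encoder(ids)[0]          # last hidden state as condition

    timestep = torch.tensor([t], device=pipe.device)
    noise = torch.randn_like(latent)
    noisy = pipe.scheduler.add_noise(latent, noise, timestep)

    eps_a = pipe.unet(noisy, timestep, encoder_hidden_states=embed(prompt_a)).sample
    eps_b = pipe.unet(noisy, timestep, encoder_hidden_states=embed(prompt_b)).sample
    return (eps_a - eps_b).abs()   # large values mean the prompt change is "noticed"
```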

Figure 10. The experimental pipeline for visualizing noise-prediction differences.

Figure 11. Differences in noise prediction under different settings. The "untuned" method uses the original Stable Diffusion model without any fine-tuning. All adapter methods are denoted in input-output format. We find that adapters with input positions CAout and CAc respond more strongly to prompt changes.

5.4 Comparison with DreamBooth

We show the results for each case in the DreamBooth task in Figure 12, which shows that our method is better in most cases.

We also compare our best setting with the fully fine-tuned method on the fine-tuning task on the flower dataset. Our method achieves an FID of 24.49, better than the 28.15 of the fully fine-tuned method.

6. Related work

Personalization. Large-scale text-to-image diffusion models trained on web data can generate high-resolution and diverse images whose content is controlled by the input text, but they often lack the ability to personalize generation for specific objects desired by the user.

Recent work such as Textual Inversion [10] and DreamBooth [32] aims to address this problem by fine-tuning diffusion models on a small set of images of an object. Textual Inversion only tunes the embedding of a single word. To obtain stronger performance, DreamBooth tunes all parameters with a regularization loss to prevent overfitting.

Parameter-efficient transfer learning. Parameter-efficient transfer learning originated in the field of NLP, with methods such as adapters [16], prefix tuning [22], prompt tuning [21], and LoRA [17]. Specifically, the adapter [16] inserts a small low-rank multilayer perceptron (MLP) with a nonlinear activation function f(·) between transformer blocks; prefix tuning [22] prepends tunable prefix vectors to the keys and values of each attention layer; prompt tuning [21] simplifies prefix tuning by adding tunable input word embeddings; LoRA [17] injects tunable low-rank matrices into the query and value projection matrices of the transformer block.
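For illustration, the LoRA idea can be sketched as a wrapper around a frozen linear projection (e.g., a query or value projection); the class and parameter names are illustrative, not the original implementation.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen base projection plus a trainable low-rank update: W x + (alpha / r) * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                   # the pretrained projection stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```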

While these parameter-efficient transfer learning methods vary in form and motivation, recent work [12] proposes a unified view of them by specifying a set of factors that describe the design space of parameter-efficient transfer learning in pure transformers [37]. These factors include the modified representation, the insertion form, the functional form, and composition functions. In contrast, our approach focuses on the U-Net, which has more components than a pure transformer and therefore a larger design space. Furthermore, we use a simpler decomposition of the design space into orthogonal factors, namely input positions, output positions, and functional forms.

Figure 12. Performance compared to DreamBooth. Our method performs better in most cases. 

Transfer learning for diffusion models. Some methods adapt diffusion models to specific objects, or tune the entire model for semantic editing [19, 32]. Previous work [39] attempted to convert a large diffusion model into an image-to-image model on a small dataset, but the number of tuned parameters was almost half of the original model. Other work conditions the diffusion model on new inputs and introduces more parameters than our method. Concurrent work [1] also performs parameter-efficient transfer learning on Stable Diffusion; their method achieves results comparable to the fully fine-tuned method on the DreamBooth [32] task, but it adds adapters at multiple locations simultaneously, resulting in a more complex design space.

7. Conclusion 

This paper conducts a systematic study of the design space of parameter-efficient transfer learning that inserts adapters into diffusion models. We decompose the adapter's design space into three orthogonal factors: the input position, the output position, and the functional form. Through analysis of variance (ANOVA), we find that the input position of the adapter is the key factor affecting downstream-task performance. We then carefully study the choice of input position and find that placing the input position after the cross-attention block yields the best performance, which is verified by additional visual analysis. Finally, we provide a method for parameter-efficient tuning in diffusion models that matches, if not outperforms, fully fine-tuned baselines (such as DreamBooth) on various customization tasks with only 0.75% additional parameters.


Source: https://blog.csdn.net/zcyzcyjava/article/details/133099844