Text-to-image prompts no longer need to be long and convoluted! With an LLM-enhanced diffusion model, simple sentences can generate high-quality images


Xi Xiaoyao Technology Sharing
Source | Xinzhiyuan

The parameter-efficient fine-tuning method SUR-adapter enhances a text-to-image diffusion model's ability to understand and reason over simple narrative prompts.

Diffusion models have become the mainstream approach to text-to-image generation, producing high-quality, content-rich images under the guidance of text prompts.

However, when the input prompts are too concise, existing models show limitations in both semantic understanding and commonsense reasoning, which leads to a significant drop in the quality of the generated images.

To improve the model's ability to understand narrative prompts, Liang Lin's team at the HCP Lab of Sun Yat-sen University proposed SUR-adapter (Semantic Understanding and Reasoning adapter), a simple and effective parameter-efficient fine-tuning method that can be applied to pre-trained diffusion models.


Paper address:
https://arxiv.org/abs/2305.05189

Open source address:
https://github.com/Qrange-group/SUR-adapter


To achieve this goal, the researchers first collected and annotated a dataset, SURD, containing more than 57,000 semantically corrected multimodal samples; each sample consists of a simple narrative prompt, a complex keyword-based prompt, and a high-quality image.

The researchers then align the semantic representations of narrative prompts with those of complex prompts, and transfer knowledge from a large language model (LLM) to the SUR-adapter through knowledge distillation, so that the adapter acquires strong semantic understanding and reasoning capabilities and can build high-quality textual semantic representations for text-to-image generation.


Experiments integrating multiple LLMs and pre-trained diffusion models show that the method effectively enables diffusion models to understand and reason about concise natural-language descriptions without degrading image quality.

This approach makes text-to-image diffusion models easier to use and improves the user experience, helping to advance user-friendly text-to-image generation by bridging the semantic gap between simple narrative prompts and complex keyword-based prompts.

Background

At present, pre-trained text-to-image diffusion models, represented by Stable Diffusion, have become one of the most important foundation models in the AIGC field and play a major role in tasks such as image editing, video generation, and 3D object generation.

However, the semantic capability of current pre-trained diffusion models depends mainly on text encoders such as CLIP, and the encoder's semantic understanding directly affects the generation quality of the diffusion model.

The paper first constructs text prompts for question categories commonly used in visual question answering (VQA), such as counting, color, and action, and manually checks the image-text matching accuracy of Stable Diffusion on them.

The following table gives examples of the various prompts constructed.

[Table: examples of the constructed prompts for each question category]
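To make the evaluation setup concrete, here is a minimal sketch of how such category-specific test prompts could be templated; the template strings, object list, and function name are illustrative assumptions, not the paper's actual prompts.

```python
# Illustrative templates for category-specific test prompts (hypothetical,
# not the paper's actual prompt list).
TEMPLATES = {
    "counting": ["one {obj}", "two {obj}s", "three {obj}s"],
    "color":    ["a red {obj}", "a green {obj}", "a blue {obj}"],
    "action":   ["a {obj} running", "a {obj} sleeping", "a {obj} jumping"],
}

def build_prompts(objects):
    """Expand each template with every object to get prompts per category."""
    return {
        category: [p.format(obj=o) for p in patterns for o in objects]
        for category, patterns in TEMPLATES.items()
    }

# Example: build_prompts(["dog", "apple"]) yields prompts such as "two apples"
# and "a dog running"; each prompt is rendered with Stable Diffusion and the
# resulting image is checked by hand for image-text agreement.
```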

The results are shown in the table below. They reveal that current pre-trained text-to-image diffusion models have serious semantic understanding problems: for many question categories the image-text matching accuracy is below 50%, and for some it is even 0%.

[Table: image-text matching accuracy of Stable Diffusion across question categories]

Therefore, a way is needed to enhance the semantic capability of the text encoder in the pre-trained diffusion model so that the generated images actually match the text.

Method overview

1. Data preparation

First, a large number of image-text pairs were collected from popular diffusion-model websites (lexica.art, civitai.com, and stablediffusionweb), then cleaned and filtered to obtain more than 57,000 high-quality (complex prompt, simple prompt, image) triplets, which constitute the SURD dataset.

[Figure: example (complex prompt, simple prompt, image) triplets from SURD]

As shown in the figure, a complex prompt is the text-prompt condition the diffusion model needs to generate an image; such prompts generally have complex formats and descriptions. A simple prompt is a textual description of the image generated by BLIP, phrased the way a human would naturally describe it.

Generally speaking, a simple prompt written in ordinary human language makes it hard for the diffusion model to generate semantically faithful images, whereas a complex prompt (which users jokingly call the diffusion model's "spell") achieves satisfactory results.
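As a concrete illustration, below is a minimal PyTorch sketch of how such (complex prompt, simple prompt, image) triplets could be loaded for training; the CSV layout, column names, and class name are assumptions for illustration, not the released SURD format.

```python
# Minimal sketch of loading SURD-style triplets, assuming the data has been
# exported to a CSV with columns "complex_prompt", "simple_prompt",
# "image_path" (a hypothetical layout, not the released format).
import csv
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class SURDTriplets(Dataset):
    def __init__(self, csv_path, image_size=512):
        with open(csv_path, newline="", encoding="utf-8") as f:
            self.rows = list(csv.DictReader(f))
        self.transform = transforms.Compose([
            transforms.Resize(image_size),
            transforms.CenterCrop(image_size),
            transforms.ToTensor(),
            transforms.Normalize([0.5], [0.5]),  # scale pixels to [-1, 1] for the diffusion model
        ])

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        image = Image.open(row["image_path"]).convert("RGB")
        return {
            "complex_prompt": row["complex_prompt"],
            "simple_prompt": row["simple_prompt"],
            "pixel_values": self.transform(image),
        }
```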

2. Large language model semantic distillation

The paper introduces a transformer-structured Adapter that distills the semantic features of a specific hidden layer of the large language model, and linearly combines the Adapter-guided LLM information with the semantic features output by the original text encoder to obtain the final semantic features.

LLaMA models of different sizes are used as the large language model, and the parameters of the diffusion model's UNet are frozen throughout training.
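The sketch below illustrates the general idea of a transformer-structured adapter whose output is linearly combined with the frozen CLIP text features. The layer sizes, the zero-initialized blending scale, and the class name are assumptions rather than the authors' exact architecture; the LLM knowledge enters through the distillation loss during training, sketched in the next subsection.

```python
# Sketch of a transformer-structured adapter whose output is linearly combined
# with the frozen CLIP text features (layer sizes and the zero-initialized
# blending scale are assumptions, not the authors' exact design).
import torch
import torch.nn as nn

class SemanticAdapter(nn.Module):
    def __init__(self, dim=768, num_layers=2, num_heads=8):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=num_layers)
        self.scale = nn.Parameter(torch.zeros(1))  # starts as identity: output equals the CLIP features

    def forward(self, clip_features):
        # clip_features: (batch, seq_len, dim) output of the frozen CLIP text encoder
        adapted = self.encoder(clip_features)
        # linear combination of the adapter output and the original text features
        return clip_features + self.scale * adapted
```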


3. Image quality restoration

Because the proposed structure inserts a learnable module into the inference path of the pre-trained model, the original image-generation quality of the pre-trained model is degraded to some extent. The generation quality therefore needs to be restored to the level of the original pre-trained model.


The paper uses the triplets in the SURD dataset to introduce a corresponding quality loss during training that restores image-generation quality. Specifically, the semantic features obtained from the simple prompt through the new module are aligned as closely as possible with the semantic features of the complex prompt.
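A rough sketch of these training objectives is given below, assuming a diffusers-style UNet call and mean-pooled features for the distillation term; the loss weights, function names, and exact form of each term are placeholders rather than the paper's precise formulation.

```python
# Rough sketch of the training objectives described above (assumed loss weights,
# mean-pooled features for the distillation term, diffusers-style UNet call).
# The UNet and the CLIP text encoder stay frozen; only the adapter and the
# LLM projection are trained.
import torch.nn.functional as F

def training_losses(adapter, llm_proj, clip_simple, clip_complex, llm_hidden,
                    unet, noisy_latents, timesteps, noise):
    # 1) Semantic alignment: push adapted simple-prompt features toward the
    #    features the frozen encoder produces for the complex "spell" prompt.
    adapted = adapter(clip_simple)
    align_loss = F.mse_loss(adapted, clip_complex)

    # 2) Knowledge distillation: match a projection of a chosen LLaMA hidden
    #    layer (llm_proj maps the LLM hidden size to the CLIP feature size).
    distill_loss = F.mse_loss(adapted.mean(dim=1), llm_proj(llm_hidden).mean(dim=1))

    # 3) Quality loss: the usual noise-prediction objective of the diffusion
    #    model, conditioned on the adapted features, which keeps generation
    #    quality at the level of the pre-trained model.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=adapted).sample
    quality_loss = F.mse_loss(noise_pred, noise)

    # Placeholder weights; the paper's actual weighting may differ.
    return quality_loss + 0.1 * align_loss + 0.1 * distill_loss
```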

The figure below shows the fine-tuning framework of the SUR-adapter for the pre-trained diffusion model. The network structure of the Adapter is on the right.

[Figure: SUR-adapter fine-tuning framework for the pre-trained diffusion model, with the Adapter's network structure on the right]

Experimental results

The paper examines the performance of the SUR-adapter from two perspectives: semantic matching and image quality.

On the one hand, as shown in the table below, the SUR-adapter effectively alleviates the common semantic mismatch problem of text-to-image diffusion models under different experimental settings, with accuracy improvements across the different semantic categories.

On the other hand, the paper uses BRISQUE and other commonly used image-quality metrics to statistically compare the images generated by the original pre-trained diffusion model and by the diffusion model with the SUR-adapter, finding no significant difference between the two.

We also tested this with a human preference questionnaire.
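For reference, a no-reference quality comparison of this kind could be scripted roughly as follows, assuming the `piq` package for BRISQUE; the function and variable names are illustrative.

```python
# Sketch of a no-reference quality comparison with BRISQUE, assuming the
# `piq` package (lower BRISQUE generally means better perceptual quality).
import piq
from PIL import Image
from torchvision import transforms

to_tensor = transforms.ToTensor()

def mean_brisque(image_paths):
    scores = []
    for path in image_paths:
        img = to_tensor(Image.open(path).convert("RGB")).unsqueeze(0)  # (1, 3, H, W), values in [0, 1]
        scores.append(piq.brisque(img, data_range=1.0).item())
    return sum(scores) / len(scores)

# Compare mean scores of images from the original model and the SUR-adapter model:
# mean_brisque(baseline_image_paths) vs. mean_brisque(sur_adapter_image_paths)
```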

This analysis shows that the proposed method alleviates the image-text mismatch problem inherent in pre-trained text-to-image diffusion models while maintaining image-generation quality.

[Tables: semantic matching accuracy and image-quality comparison between the original diffusion model and the SUR-adapter model]

In addition, qualitative image-generation examples are shown in the figure below. For more detailed analysis, please refer to the paper and the open-source repository.

[Figure: qualitative image-generation examples]

Introduction to HCP Laboratory

The Human-Computer-Physical Intelligence Fusion Laboratory (HCP Lab) of Sun Yat-sen University was founded by Professor Liang Lin in 2010. In recent years it has produced rich academic results in multimodal content understanding, causal and cognitive reasoning, and embodied intelligence, has won several domestic and international science-and-technology awards and best-paper awards, and is committed to building product-level AI technologies and platforms.


Source: blog.csdn.net/xixiaoyaoww/article/details/132570379