ImageReward: Human preference learning in text-to-image generation

Figure: Overview of ImageReward and ReFL. (Top) ImageReward annotation and training, covering data collection, annotation, and preference learning. (Bottom) ReFL uses feedback from ImageReward to directly optimize the diffusion model at a randomly chosen denoising step.

The ImageReward solution consists of the following steps:

  • A large-scale, expert-annotated dataset, ImageRewardDB: about 137,000 comparison pairs, fully open-sourced.

  • ImageReward, a general-purpose model of human preferences for text-to-image generation: the first reward model for text-to-image generation, outperforming existing text-image scoring methods such as CLIP, Aesthetic, and BLIP; it also serves as a new automatic evaluation metric for text-to-image generation.

  • ReFL: a method that uses ImageReward to directly optimize diffusion generation models toward human preferences.

The specific method is introduced below.
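As a concrete starting point, the sketch below shows how the released scorer can be used to rate candidate images for a prompt. The `RM.load` and `model.score` calls follow the usage pattern shown in the project's README at the time of writing and may differ in later versions; the file paths are placeholders.

```python
# Hedged sketch: scoring candidate images for a prompt with the released
# ImageReward model. `RM.load` / `model.score` follow the usage shown in the
# project's README at the time of writing; check the repository for the
# current API. File paths below are placeholders.
import ImageReward as RM

model = RM.load("ImageReward-v1.0")

prompt = "a painting of an ocean with clouds and birds, day time, low depth field effect"
images = ["generated_1.png", "generated_2.png", "generated_3.png"]

scores = model.score(prompt, images)  # one scalar per image; higher = more preferred
for path, score in sorted(zip(images, scores), key=lambda p: -p[1]):
    print(f"{score:+.3f}  {path}")
```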

1. Why does text-to-image generation also need RLHF?

What are the pain points of text-to-image generation?

Text-to-image generation models, both autoregressive (Transformer-based models such as DALL-E, CogView, and Parti) and diffusion-based (such as DALL-E 2 and Stable Diffusion), have developed rapidly in recent years. Given an appropriate text description (i.e., a prompt), these models can generate high-quality images on a wide range of topics, attracting widespread public attention for their potential applications and impact.

Despite this progress, existing self-supervised pre-trained image generation models are far from perfect, and their outputs exhibit a series of widely reported problems, including but not limited to:

  • Image-text consistency: The generated image does not accurately reflect all of the counts, attributes, and object relationships described in the text prompt. (For example, a and b in the figure above: "the little girl is blocked by the sunflowers on the ground" and "the sun god is crowned".)

  • Body problems: The generated images contain distorted, incomplete, duplicated, or otherwise abnormal body parts (such as limbs), in both humans and animals. (For example, e and f in the figure above.)

  • Aesthetic issues: The generated images deviate from average or mainstream human aesthetic preferences. (For example, c and d in the figure above.)

  • Harmful and biased content: The generated images contain content that is harmful, violent, sexual, discriminatory, illegal, or psychologically distressing. (For example, e and f in the figure above.)

However, these common challenges are difficult to solve by improving model architecture and pre-training data alone.

2. Review of RLHF in LLM

In natural language processing, researchers use Reinforcement Learning from Human Feedback (RLHF) to align large language models with human preferences and values. This approach relies on a reward model (RM) trained to capture human preferences from a large number of expert annotations comparing different model outputs. Although the approach is very effective, the annotation process is costly and challenging: it takes months of effort to establish annotation standards, recruit and train expert annotators, verify the annotations, and finally train the RM.

2.1 Why can RLHF be expected to solve the current problems of text-to-image generation?

The existing problems of text-to-image generation are largely rooted in the pre-training data: its distribution is noisy and differs from the distribution of real user prompts.

RM (reward model): Modeling human preferences.

RM as a metric: Text-to-image generation currently lacks a satisfactory automatic evaluation metric (FID and CLIP Score each have their own problems). An RM makes it possible to bring human preferences into evaluation, and a better evaluation system drives better development and guides the resolution of the current pain points.

RLHF: uses the RM to align the model with the distribution of human preferences.

3. ImageRewardDB: Human Preference Dataset Construction

  1. Prompt sampling and image collection. The dataset uses real user prompts from the open-source dataset DiffusionDB. To ensure the diversity of the selected prompts, we employ a graph-based algorithm that exploits textual similarity computed with language models (a simplified sketch of this idea appears after this list). Through this method we selected 10,000 candidate prompts, each accompanied by 4 to 9 sampled images from DiffusionDB, yielding 177,304 candidate pairs for labeling.

  2. Manual annotation design: prompt annotation (classifying prompts and pointing out problems in them); image rating (rating images on text-image consistency, image quality, and harmlessness); and image ranking (ranking the images in order of preference).

  3. Manual annotation analysis. After two months of annotation, we collected valid annotations for 8,878 prompts, yielding 136,892 comparison pairs. A detailed analysis of the annotated data is provided in the appendix of the original paper.
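The graph-based prompt selection itself is not spelled out in this post. As a rough illustration of the underlying idea (diversity via text-embedding similarity), one could greedily filter near-duplicate prompts as in the sketch below; the embedding source, threshold, and greedy strategy here are assumptions, not the paper's actual algorithm.

```python
# Illustrative only: greedy near-duplicate filtering of prompts by text-embedding
# similarity. ImageRewardDB uses a graph-based selection algorithm; this is just
# a minimal stand-in for the idea of diversity-aware prompt selection.
import numpy as np

def select_diverse_prompts(prompts, embeddings, k=10_000, max_cosine=0.9):
    """Keep prompts whose embedding is not too similar to any already-kept prompt."""
    kept_prompts, kept_embs = [], []
    for prompt, emb in zip(prompts, embeddings):
        emb = np.asarray(emb, dtype=float)
        emb = emb / np.linalg.norm(emb)
        if all(float(emb @ other) < max_cosine for other in kept_embs):
            kept_prompts.append(prompt)
            kept_embs.append(emb)
        if len(kept_prompts) == k:
            break
    return kept_prompts
```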

4 ImageReward: Human Preference Prediction Model

4.1 Training process

Model structure: BLIP (ViT-L as the image encoder, a 12-layer Transformer as the text encoder) plus an MLP scoring head.

Training method: for the k images generated for the same prompt, comparison pairs are derived from their ranking, each pair consisting of a more-preferred and a less-preferred image. The objective function used to train ImageReward is as follows, where T denotes the prompt and x the generated image.
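For reference, the pairwise ranking loss described in the paper (the same Bradley-Terry style objective used for LLM reward models) can be written as below, where $f_\theta(T, x)$ is the scalar score produced by the model, $k$ is the number of images per prompt, and $x_i$ is ranked above $x_j$ by the annotators:

$$
\mathcal{L}(\theta) = -\frac{1}{\binom{k}{2}}\,\mathbb{E}_{(T,\,x_i,\,x_j)\sim\mathcal{D}}\Big[\log \sigma\big(f_\theta(T, x_i) - f_\theta(T, x_j)\big)\Big]
$$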

Training tips:

  1. Neither freezing all of BLIP's parameters nor fine-tuning all of them yields satisfactory accuracy; in practice, we found that freezing 70% of the Transformer layers works best.

  2. Training is very sensitive to hyperparameters. After a hyperparameter search, we found a learning rate of 1e-5 and a batch size of 64 to be most suitable.
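A minimal PyTorch sketch of these two training ingredients, assuming a hypothetical BLIP-plus-MLP scorer whose text-encoder Transformer layers are exposed as a list; the names here are illustrative, not ImageReward's actual training code.

```python
# Minimal sketch, assuming a hypothetical BLIP+MLP scorer whose Transformer
# layers are accessible as a list; not ImageReward's actual training code.
import torch
import torch.nn.functional as F

def freeze_fraction(transformer_layers, fraction=0.7):
    """Freeze the first `fraction` of Transformer layers (freezing ~70% of the
    layers is reported above to work best)."""
    n_frozen = int(len(transformer_layers) * fraction)
    for layer in transformer_layers[:n_frozen]:
        for p in layer.parameters():
            p.requires_grad_(False)

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Bradley-Terry style loss: push the preferred image's score above the
    rejected image's score for the same prompt."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Hyperparameters from the search above: learning rate 1e-5, batch size 64.
# optimizer = torch.optim.AdamW(
#     (p for p in scorer.parameters() if p.requires_grad), lr=1e-5)
```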

4.2 Model performance

Consistency analysis of human preferences. In most cases, people agree on which image is better. However, since the images generated by a model each have their own strengths and weaknesses, different people may sometimes prefer different images. Before measuring the model's performance, we therefore first measure how often people agree on which image is better. We use a separate set of 40 prompts (778 pairs) to compute preference agreement between different annotators and between annotators and researchers. As shown in the table above, the average agreement between annotators is 65.3%; our model approaches this level of agreement, far exceeding other models.

Preference accuracy. The model's preference-prediction accuracy is the fraction of image pairs on which the model's judgment of which image is better agrees with the human judgment. On the test set, ImageReward reaches a preference accuracy of 65.14% (roughly on par with the agreement between people), which is 15.14 points above the 50% random baseline, about twice the margin of the best existing method, BLIP Score (7.76 points above random).

Human evaluation. To assess ImageReward's ability to select the most human-preferred images from a large pool of generated images, we constructed another dataset, collecting 9/25/64 generated images per prompt from DiffusionDB, and used ImageReward and several other methods to select the top-3 results from these images. Annotators then rank these top-3 images; the figure above shows ImageReward's win rate. ImageReward selects images that are better aligned with the text, have higher fidelity, and avoid toxic content.
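Selecting the top-3 candidates for a prompt then amounts to sorting by reward; a small sketch reusing the hedged `model.score` call from the earlier example (file paths are placeholders):

```python
# Sketch: keep the 3 highest-reward images among 64 candidates for one prompt,
# reusing the hedged `model.score` call shown earlier; file paths are placeholders.
candidates = [f"candidates/img_{i:02d}.png" for i in range(64)]
scores = model.score(prompt, candidates)
top3 = [path for _, path in sorted(zip(scores, candidates), reverse=True)[:3]]
print(top3)
```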

5 ImageReward as an automatic evaluation metric

Training text-to-image generative models is difficult, but evaluating these models rationally is even harder.

Currently, FID and CLIP Score are two popular assessment methods.

FID (fine-tuned or zero-shot): evaluates the text-to-image generation model against real images using the MS-COCO image-caption dataset.

CLIP Score: uses CLIP to measure how well the generated images match their text prompts.
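For context, a CLIP Score is essentially the (rescaled) cosine similarity between CLIP's text and image embeddings. Below is a rough sketch using Hugging Face transformers; the model choice and the 2.5× rescaling (from the original CLIPScore paper) are assumptions, not necessarily the exact protocol referred to here.

```python
# Rough sketch of CLIP Score: rescaled cosine similarity between CLIP text and
# image embeddings. Model choice and the 2.5x rescaling are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(prompt: str, image_path: str) -> float:
    inputs = processor(text=[prompt], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip.get_text_features(input_ids=inputs["input_ids"],
                                         attention_mask=inputs["attention_mask"])
    cosine = torch.cosine_similarity(img_emb, txt_emb).item()
    return 2.5 * max(cosine, 0.0)
```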

FID has the following problems:

  1. Zero-shot usage: Since generative models are now mostly used by the public in a zero-shot manner, without fine-tuning, fine-tuned FID may not reflect a model's performance in actual use. Moreover, although recent work has adopted zero-shot FID, the possible leakage of MS-COCO into some models' pre-training data makes it a potentially unfair setting.

  2. Human preference: FID measures the average distance between generated images and reference real images, and thus fails to incorporate the human preferences that are critical for text-to-image synthesis. Furthermore, FID relies on averaging over an entire dataset to give an accurate assessment, whereas in many cases we need a metric that can act as a selector for individual images.

In light of these challenges, the paper proposes ImageReward as an effective zero-shot automatic evaluation metric for comparing text-to-image models and selecting images.

A model evaluation metric more consistent with human preferences. The researchers evaluated six popular, publicly available high-resolution (approximately 512 × 512) text-to-image models: CogView 2, Versatile Diffusion (VD), Stable Diffusion (SD) 1.4 and 2.1, DALL-E 2 (via the OpenAI API), and Openjourney. The 100 test prompts were sampled from real user prompts, and each model generated 10 candidate images per prompt. To compare the models, we first select the best image from each model's 10 outputs for each prompt. Annotators then rank the images from different models under each prompt according to the ranking rules used for ImageRewardDB. The final win counts of each model against all other models are shown in the table above. As the table shows, ImageReward is highly consistent with the human ranking, whereas zero-shot FID and CLIP are not.

 

Better discrimination. Another highlight is that ImageReward differentiates the quality of individual samples better than CLIP. The figure above shows box plots of the ImageReward and CLIP score distributions over 1,000 images from each model. The distributions are normalized to 0.0–1.0 using each model's minimum and maximum ImageReward and CLIP scores, with outliers discarded. As the figure shows, ImageReward's scores have a larger variance within each model than CLIP's, meaning ImageReward distinguishes image quality well. Furthermore, for cross-model comparison, the median ImageReward scores are roughly consistent with the human rankings, which is particularly striking given that the distributions have been normalized. In contrast, CLIP's medians do not exhibit this property.
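The per-model normalization behind the box plots is a simple min-max rescaling; a trivial sketch (outlier filtering omitted):

```python
# Sketch: per-model min-max normalization of a score distribution to [0, 1],
# as used for the box-plot comparison above (outlier filtering omitted).
import numpy as np

def minmax_normalize(scores):
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min())
```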

6 ReFL: Reward Feedback Learning for Text-to-Image Diffusion Models

6.1 Review of existing algorithms

RLHF for language models cannot be applied directly to diffusion models. In NLP, reinforcement learning algorithms such as PPO are used to align language models with human preferences; representative examples include OpenAI's InstructGPT and ChatGPT, which update the entire model based on the likelihood of a complete generation. However, unlike Transformer-based language models, the multi-step denoising process of latent diffusion models (LDMs) does not yield a likelihood for the complete generation, so the same RLHF recipe cannot be adopted.

Existing human-feedback fine-tuning methods for LDMs are indirect. The available methods for fine-tuning LDMs with human feedback fall roughly into two categories: using a reward model to construct a new dataset, or using a reward model to reweight the coefficients of the loss function. Both are indirect and have limited effect, as reflected in the experimental results below.

6.2 Algorithm design

 

By examining ImageReward scores over the course of the denoising steps, we made an interesting observation (see the left of the figure above). Take a denoising run with 40 steps, and at each intermediate step directly predict the final image from the intermediate denoising result:

  • When t ≤ 15: the ImageReward score of the predicted image is only weakly consistent with that of the final result;

  • When 15 ≤ t ≤ 30: the ImageReward scores of high-quality generations begin to stand out, but the final quality of all generations still cannot be reliably judged from the current scores;

  • When t ≥ 30: the ImageReward scores of different generations become clearly distinguishable.

Based on these observations, we conclude that after about 30 of the 40 denoising steps, without running denoising all the way to the last step, the ImageReward score can already serve as reliable feedback for improving the LDM. We therefore propose an algorithm that fine-tunes the LDM directly; the algorithm flow is shown on the right of the figure above. The RM score is treated as a human-preference loss, and the gradient is back-propagated through a randomly chosen later denoising step t (t ranges from 30 to 40 in our example). We choose t randomly rather than always using the last step because retaining only the gradient of the final denoising step makes training very unstable and yields poor results. In practice, to avoid rapid over-fitting and to stabilize fine-tuning, we re-weight the ReFL loss and regularize it with the pre-training loss.
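A heavily simplified sketch of the ReFL loss against diffusers-style Stable Diffusion components. `reward_model` is a placeholder for the ImageReward scorer, classifier-free guidance and the pre-training regularization term are omitted, and attribute names such as `pred_original_sample` assume a scheduler (e.g. DDIM) and diffusers version that expose them.

```python
# Heavily simplified ReFL sketch: no classifier-free guidance, no pre-training
# regularizer. `reward_model` is a placeholder for the ImageReward scorer;
# `pred_original_sample` assumes a scheduler (e.g. DDIM) that exposes it.
import random
import torch

def refl_loss(unet, vae, scheduler, reward_model, text_emb,
              total_steps=40, t_grad_min=30):
    scheduler.set_timesteps(total_steps)
    latents = torch.randn(1, unet.config.in_channels, 64, 64,
                          device=text_emb.device, dtype=text_emb.dtype)

    # The single step that will carry gradients, chosen randomly in [30, 40).
    grad_step = random.randint(t_grad_min, total_steps - 1)

    # 1) Denoise up to that step without gradients (keeps memory bounded).
    with torch.no_grad():
        for t in scheduler.timesteps[:grad_step]:
            noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
            latents = scheduler.step(noise_pred, t, latents).prev_sample

    # 2) One gradient-carrying step, then jump directly to the predicted x0.
    t = scheduler.timesteps[grad_step]
    noise_pred = unet(latents, t, encoder_hidden_states=text_emb).sample
    x0_latents = scheduler.step(noise_pred, t, latents).pred_original_sample

    # 3) Decode the predicted clean latent and use the negative reward as the
    #    human-preference loss (to be re-weighted and combined with the usual
    #    diffusion pre-training loss, as described above).
    image = vae.decode(x0_latents / vae.config.scaling_factor).sample
    return -reward_model(text_emb, image).mean()
```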

6.3 Experimental results of ReFL

Our experiments are based on Stable Diffusion v1.4, with ImageReward as the reward model, PNDM as the noise scheduler, and the default classifier-free guidance scale of 7.5. Compared with Stable Diffusion v1.4 before fine-tuning, the ReFL fine-tuned model is the most preferred and achieves the highest win rate. When compared with other methods, ReFL's results are also consistently preferred.

Paper link: arxiv.org/abs/2304.05977

Code homepage:

https://github.com/THUDM/ImageReward

Model download: huggingface.co/THUDM/ImageRewar

Dataset: huggingface.co/datasets/THUDM/ImageRewardDB
