Stable Diffusion model evaluation framework

GhostReview: the world's first AI-painting ckpt evaluation framework (Zhihu): "Hello everyone, I am _GhostInShell_, the author of GhostMix, which ranks second in All Time Highest Rated on the global AI painting model site Civitai. In the last article I mainly discussed my views on the development direction of ckpt. In short, checkpoint..." https://zhuanlan.zhihu.com/p/647150677

CUHK and SenseTime propose HPS v2: more reliable evaluation metrics for text-to-image models (Amusi, CVer, CSDN blog): "TL;DR: this paper introduces HPD v2, the dataset with the largest scale and widest coverage of human preferences for generated images, and, built on it, HPS v2, the most generalizable human-preference evaluation model. HPS v2 is comparable to the reward model in ChatGPT and can be used to align image generation models..." https://blog.csdn.net/amusi1994/article/details/131566719

I think the authors' idea is sound and basically matches my own view: generative SD does not need so many specialized models; a few base models with strong generalization capability are enough, and most plug-in capabilities can be delivered through tools such as LoRA and ControlNet. Evaluating SD base models therefore does need a systematic framework, because conventional metrics struggle to measure a model's generation ability. The most common image-generation metrics today are FID, IS, and CLIP score, but these often fail to fully reflect the quality of generated images.

GhostReview evaluation metrics: a model's output is shaped by two parts. One is the systematic effect, i.e. the influence of the model itself; the other is the individual effect, i.e. the influence of the random seed. To evaluate the systematic part, GhostReview measures: 1. the model's compatibility (painting styles, LoRA, prompts, etc.); 2. the quality of the generated images; 3. the model's good-image rate.

1. Model drawing quality and generalizability analysis

1.1 Aesthetic evaluation

GitHub - christophschuhmann/improved-aesthetic-predictor: CLIP+MLP Aesthetic Score Predictor https://github.com/christophschuhmann/improved-aesthetic-predictor

The aesthetic predictor builds on laion-aesthetics: it is trained on LAION-5B data using 176,000 image-rating pairs, 15,000 laion-logos image-rating pairs, and 250,000 AVA images. GhostReview uses the resulting aesthetic scores, and takes the standard deviation of the scores as a numerical measure of the model's good-image rate.
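As a minimal sketch of how these per-image scores might be aggregated (assuming a score_fn such as the rating output of the predict function shown in Part 4), the mean tracks overall aesthetic quality while the standard deviation captures consistency, i.e. the good-image rate:

import statistics

def aesthetic_stats(images, score_fn):
    # score_fn: any per-image aesthetic scorer, e.g. predict() from Part 4.
    scores = [score_fn(img) for img in images]
    return statistics.mean(scores), statistics.stdev(scores)

# A lower standard deviation means the ckpt scores consistently,
# which GhostReview reads as a higher good-image rate.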

1.2 Prompt compatibility

GitHub - openai/CLIP https://github.com/openai/CLIP

This metric measures whether the images generated by the model correctly reflect the input prompt, using CLIPScore computed with CLIP.
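For reference, a minimal CLIPScore sketch using Hugging Face transformers (this is not GhostReview's exact implementation; the openai/clip-vit-large-patch14 checkpoint and the 2.5 scaling factor from the original CLIPScore paper are assumptions here):

import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

def clip_score(image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        img_emb = clip_model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = clip_model.get_text_features(input_ids=inputs["input_ids"],
                                               attention_mask=inputs["attention_mask"])
    # Cosine similarity between normalized image and text embeddings.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum().item()
    # CLIPScore (Hessel et al., 2021): 2.5 * max(cos, 0)
    return 2.5 * max(cos, 0.0)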

For prompts, GhostReview uses the 25 prompts with the most Image Reactions on Civitai that are non-political, non-meme, and non-pornographic (including soft pornography). To make sure the prompts cover realistic, anime, and artistic painting styles, 5 stylized prompts are added, for a total of 30 prompts (none using LoRA). For each prompt, each ckpt generates 32 images (batch 4, 8 iterations), so a single ckpt produces 30 × 32 = 960 highres-fix images in this first part.

2. Style compatibility analysis

The test method: feed the model stylization prompts to generate a large number of stylized images, then compare them with a large set of existing style images to obtain a numerical style-compatibility result. Feature maps of the generated image and the target image are extracted with VGG19, the Gram matrix of each layer's feature map is computed, and the style loss is calculated from those Gram matrices.
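A minimal sketch of this Gram-matrix style loss (the standard neural-style-transfer formulation; the layer choice and 512×512 resize are assumptions, not GhostReview's exact settings):

import torch
import torch.nn.functional as F
from torchvision import models, transforms

vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()

# conv1_1, conv2_1, conv3_1, conv4_1, conv5_1: layers commonly used for style.
STYLE_LAYERS = {0, 5, 10, 19, 28}

to_input = transforms.Compose([
    transforms.Resize((512, 512)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def gram_matrix(feat: torch.Tensor) -> torch.Tensor:
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # normalized Gram matrix

def style_loss(generated, target) -> float:
    x, y = to_input(generated).unsqueeze(0), to_input(target).unsqueeze(0)
    loss = 0.0
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x, y = layer(x), layer(y)
            if i in STYLE_LAYERS:
                # Accumulate the MSE between the two images' Gram matrices.
                loss += F.mse_loss(gram_matrix(x), gram_matrix(y)).item()
    return loss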

For the stylized prompts, the SDXL style presets are used as a reference, removing those that existing ckpts cannot directly implement (such as PaperCut), leaving 9 styles: Anime, Manga, Photographic, Isometric, Low_Poly, Line_Art, 3D_Model, Pixel_Art, Watercolor.

3. LoRA compatibility analysis

As in Part 2, the style loss is calculated between generated images and target images.

Choice of prompts and LoRA: since the character generated by each ckpt with a character LoRA will not match the sample images, the LoRA compatibility test uses stylized LoRA, selected as the Top 16 stylized LoRA in Civitai's All Time Highest Rated ranking. The target images and prompts are those of each LoRA's header image. Some processing details: 1. For header images that use multiple LoRA, the missing LoRA are filled in (for example, MoXin's header image). 2. For prompts without a LoRA field, a LoRA weight of 0.8 is added by default (for example, the 3D-rendering-style header image). 3. For header images that use an outdated version of the LoRA field, it is replaced with the new version (for example, the Gacha splash header image). 4. Because some LoRA header images were themselves generated with base models under test, such as REV and majic realistic, a GhostLoRALoss_NoTM version of the score is also computed: when scoring those models, the corresponding LoRA are excluded.
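A hypothetical sketch of this LoRA compatibility loop (generate_image is a placeholder for the ckpt's text-to-image call, style_loss is the function sketched in Part 2, and the <lora:name:weight> prompt syntax follows the common webui convention; none of these names come from the GhostReview code):

def lora_prompt(prompt: str, lora_name: str, weight: float = 0.8) -> str:
    # Detail 2: if the header-image prompt has no LoRA field, add one at 0.8.
    if "<lora:" not in prompt:
        prompt += f" <lora:{lora_name}:{weight}>"
    return prompt

def ghost_lora_loss(ckpt, loras, exclude=()):
    losses = []
    for lora in loras:
        # Detail 4 (NoTM variant): skip LoRA whose header image was made
        # with the base model currently under test.
        if lora.name in exclude:
            continue
        img = generate_image(ckpt, lora_prompt(lora.prompt, lora.name))
        losses.append(style_loss(img, lora.header_image))
    return sum(losses) / len(losses)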

In other words, there are three parts. The first scores images with the aesthetic evaluation model and uses CLIPScore to measure the correlation between the prompt and the output image. The second computes the style loss between images generated from style prompts and existing style images. The third brings in LoRA: using each LoRA's own images and prompts, it generates images from those prompts and computes the style loss against the LoRA images. The model is evaluated along these three dimensions.

4. Code

# Aesthetic score of an image: CLIP image embedding fed into MLP score heads.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("laion/CLIP-ViT-L-14-laion2B-s32B-b82K")
clip_processor = CLIPProcessor.from_pretrained("laion/CLIP-ViT-L-14-laion2B-s32B-b82K")

# load_model (not shown in the original post) loads the pretrained MLP heads
# that score aesthetic rating and artifact level from the CLIP embedding.
rating_model = load_model("rating")
artifacts_model = load_model("artifacts")

def predict(img):
    inputs = clip_processor(images=img, return_tensors="pt")
    with torch.no_grad():
        vision_output = model.vision_model(pixel_values=inputs["pixel_values"])
    pooled_output = vision_output.pooler_output
    # Normalize the pooled CLIP embedding before scoring.
    embedding = pooled_output / pooled_output.norm(dim=-1, keepdim=True)
    with torch.no_grad():
        rating = rating_model(embedding)
        artifact = artifacts_model(embedding)
    return rating.item(), artifact.item()
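Usage is then a matter of opening an image and calling predict (a sketch; load_model must be supplied by the reader, since the original post does not include it):

from PIL import Image

img = Image.open("sample.png").convert("RGB")
rating, artifact = predict(img)
print(f"aesthetic rating: {rating:.3f}, artifact score: {artifact:.3f}")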


Origin blog.csdn.net/u012193416/article/details/133243419