Practical Application of AI Image Generation Aesthetics at Taobao

This article describes how to formulate and apply aesthetic standards to evaluate and improve the quality of AI-generated images, particularly in e-commerce. The work is divided into four steps: formulating aesthetic standards, training an aesthetic model, applying the aesthetic model, and upgrading the Taobao style model.

Definition and analysis of aesthetics

Image quality standards: Within the modern design framework, standards for image quality are largely unified. They center on skill and technique, and extend to the quality evaluation of pictures, paintings, photographs, and other images. On top of this, each medium adds its own requirements and emphases according to how the image is made.
Image content standards: Requirements for expressive quality depend on ideology and vary widely, and image quality standards may be deliberately broken to serve what the content needs to express. These standards are usually defined and interpreted by authoritative figures in the field, such as critics or judges.

Aesthetics Project Goals

Step 1 - Formulate aesthetic standards: Formulate AI image standards and AI style standards in joint research with the China Academy of Art and professors, emphasizing professionalism, pertinence, objectivity, and authority.
Step 2 - Train the aesthetic model: Train an aesthetic judgment model based on the AI aesthetic standards so that the machine can judge and score images automatically.
Step 3 - Apply the aesthetic model: Use the aesthetic model's capabilities to guide the optimization and upgrade of the Taobao AI image generation model.
Step 4 - Upgrade the Taobao style model: Build a Taobao style model library based on the style standards, giving merchants a rich and diverse set of style models to choose from and creating a distinctive Taobao style model.

Step 1: Formulate the aesthetic standards

The framework of criteria is defined by the components of an image, with additional attention to "AI-generated characteristics", to build the aesthetic standards:

Image components: subject form / environment / composition / light and shadow / texture

AI generation characteristics: element authenticity and scene rationality

AI aesthetic standards: 5 criteria, 19 standards

Step 2: Train the aesthetic model

Aesthetic model goal: Improve the accuracy with which the machine automatically scores and judges images.
Accuracy rate: The same image is scored by both the aesthetic AI and human annotators, and the accuracy rate is the overlap rate between the human and machine scores.
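
The overlap rate can be illustrated with a minimal sketch. The article does not say exactly how "overlap" is counted, so the function name `overlap_rate` and the `tolerance` parameter below are assumptions made for illustration: with `tolerance=0` only identical scores count as agreement, and a larger tolerance accepts near misses.

```python
def overlap_rate(human_scores, machine_scores, tolerance=0):
    """Fraction of images whose machine score agrees with the human score.

    Assumption: agreement means the two scores differ by at most `tolerance`.
    """
    if len(human_scores) != len(machine_scores):
        raise ValueError("score lists must have the same length")
    if not human_scores:
        raise ValueError("need at least one scored image")
    hits = sum(
        1
        for human, machine in zip(human_scores, machine_scores)
        if abs(human - machine) <= tolerance
    )
    return hits / len(human_scores)


# Made-up scores on the 1-5 scale, one entry per image.
human = [5, 4, 2, 3, 1]
machine = [5, 3, 2, 3, 2]

exact = overlap_rate(human, machine)                     # identical scores only
within_one = overlap_rate(human, machine, tolerance=1)   # allow a 1-point gap
print(exact, within_one)
```

Reporting both the strict and the tolerant figure makes it easier to tell systematic disagreement apart from off-by-one noise on a 5-point scale.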

▐Model advantages

Our AI aesthetic evaluation model uses multi-modal aesthetic pre-training followed by multi-task fine-tuning. This approach has the following advantages:

The model has relatively few parameters, so training iterations and inference are both fast. It can quickly screen for high-aesthetic images and can also evaluate the output of different generation models, reducing manual annotation and review costs;
Compared with models that output only an aesthetic score, our model also outputs the abnormal attributes of a generated image, which makes it more interpretable;
The abnormal attributes output by the model can serve as a pre-discriminator for image repair, and can also be used to label abnormally generated images when optimizing the generation model.

▐Training process

Scoring specifications are developed from the aesthetic standards and a 5-point scoring rule is established. Designers then annotate images to accumulate high-quality AI training data:

Formulate scoring rules: a scoring specification for AI-generated images (5 levels) and a screening rule for original images (3 levels).
Aesthetic evaluation of original model photos: Based on image-quality preferences for the human model, environment, composition, light and shadow, texture, and so on, a dedicated aesthetic model for original model photos is trained to stratify them by aesthetic level. Low-aesthetic types that can be filtered out include blurry images, images with white borders or textures, incomplete or cropped faces, heavily occluded bodies, poor backgrounds, and poor overall aesthetics.
Aesthetic evaluation of AIGC-generated images: This evaluation mainly targets generated images that contain people. It looks at the image from two angles, its plausibility and its integration, and derives the scoring rules from the 5 criteria and 19 standards, while also labeling the abnormal attributes of the generated image. The abnormal attributes currently supported are abnormal integration between person and background (people floating in the air, poor background texture, etc.), hand abnormalities, facial abnormalities, limb abnormalities, and other abnormalities. The output aesthetic score ranges from 1 to 5.

Figure: Images at different aesthetic scores as predicted by the AIGC generated-image aesthetic evaluation

Sound training: multiple rounds of cross-checking between humans and the machine to ensure high-quality data.

Round 1 - Human scoring test: Take the average score of 3 people when accumulating data, to keep scoring objective. Where scores differ, re-examine the specific issues behind the difference and verify again, so that different people interpret the standards consistently and stably (5-point scale).
Round 2 - AI scoring verification: Take the average score of 3 people and check it against the machine score. Where the scores differ, re-examine the specific issues behind the difference to determine whether the problem lies with the humans or the machine, so that the two gradually converge and the machine's understanding is accurate. (This round starts once the first version of the AI judgment model is available.)
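
The two verification rounds can be sketched as follows. This is only an illustration under stated assumptions: the function names, the `max_spread` and `max_gap` thresholds, and the sample scores are hypothetical, since the article does not specify how large a difference triggers a re-examination.

```python
from statistics import mean


def consensus(scores_by_image, max_spread=1):
    """Round 1: average three annotators' scores and flag large disagreements.

    Assumption: an image is disputed when the highest and lowest annotator
    scores differ by more than `max_spread` points.
    """
    averaged, disputed = {}, []
    for image_id, scores in scores_by_image.items():
        averaged[image_id] = mean(scores)
        if max(scores) - min(scores) > max_spread:
            disputed.append(image_id)
    return averaged, disputed


def human_machine_gaps(averaged, machine_scores, max_gap=1.0):
    """Round 2: list images where the machine departs from the human average."""
    return sorted(
        image_id
        for image_id, avg in averaged.items()
        if abs(avg - machine_scores[image_id]) > max_gap
    )


# Made-up 5-point scores from three annotators per image.
round_one = {"img_a": [4, 4, 5], "img_b": [2, 4, 5], "img_c": [3, 3, 3]}
averaged, disputed = consensus(round_one)

# Made-up machine scores for the same images.
machine = {"img_a": 4, "img_b": 2, "img_c": 3}
to_review = human_machine_gaps(averaged, machine)
print(disputed, to_review)
```

Images in `disputed` go back to the annotators for a shared reading of the standards, while images in `to_review` are the ones where someone has to decide whether the humans or the machine got it wrong.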

▐Technical framework

Aesthetic evaluation of AIGC-generated images: The 5-point aesthetic criteria defined by the designers are mapped to five quality levels. We also carried out an inductive analysis of the generated data and summarized the attributes as normal plus five abnormal categories: abnormal fusion of person and background, hand abnormality, facial collapse, body abnormality, and other abnormalities. The quality level and the attribute reasons are combined into an aesthetic evaluation prompt, which is used as input to the multi-modal pre-trained model. The loss function combines an aesthetic-score regression loss with a multi-label classification loss over the attribute reasons.
Aesthetic evaluation of original model photos: CLIP has good zero-shot ability to classify images as good or bad in terms of image quality, color, lighting, composition, abstract concepts, and so on. In the pre-training stage we therefore improve the backbone's aesthetic representation by distilling CLIP's image encoder. In the fine-tuning stage the improved backbone predicts a normalized aesthetic score, and the loss is a weighted sum of L1 loss and binary cross-entropy loss to improve the model's performance and robustness. Once training is complete, choosing different thresholds stratifies model photos into different aesthetic levels.
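
The loss terms and the threshold-based layering described above can be written out in a few lines of plain Python. This is a sketch under assumptions, not the production training code: the weight `l1_weight`, the thresholds `(0.4, 0.7)`, the three tier names, and all function names are hypothetical, and a real implementation would use a deep learning framework operating on batches.

```python
import math


def l1_loss(pred, target):
    """Absolute error between predicted and target normalized scores."""
    return abs(pred - target)


def bce_loss(prob, target, eps=1e-7):
    """Binary cross-entropy; `target` may be a soft label in [0, 1]."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def score_loss(pred, target, l1_weight=0.5):
    """Weighted sum of L1 and BCE on a normalized score (weight is assumed)."""
    return l1_weight * l1_loss(pred, target) + (1.0 - l1_weight) * bce_loss(pred, target)


def attribute_loss(probs, labels):
    """Multi-label classification loss: mean BCE over independent attributes."""
    return sum(bce_loss(p, y) for p, y in zip(probs, labels)) / len(labels)


def aesthetic_tier(score, thresholds=(0.4, 0.7)):
    """Layer an image by normalized score; the cut-offs are illustrative."""
    low, high = thresholds
    if score < low:
        return "low"
    if score < high:
        return "medium"
    return "high"


print(score_loss(0.8, 0.75))
print(attribute_loss([0.9, 0.1], [1, 0]))
print([aesthetic_tier(s) for s in (0.1, 0.5, 0.8)])
```

Treating each abnormal attribute as an independent binary label is what lets a single image carry several reasons at once, for example both a hand abnormality and poor fusion with the background.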

▐Testing phase

Based on the test results, we analyze whether current problems come from the machine or from the human annotators, and keep tuning the model's accuracy as the process evolves.

Tuning for generality: On the Qianniu platform we tested Taobao's internal [Qianniu Intelligent Model] and third-party models from outside Taobao. Evaluating the same type of model photos showed that the aesthetic model works with both, but the results differ significantly. When we dug into specific image problems, we found that the quality of the uploaded original image affects accuracy. To keep the comparison fair, standards for the test image set need to be developed.
Authenticity test of machine scoring: The accuracy rate fluctuates somewhat from week to week, so a standard test set is built according to the state of the model. A standard test set of 1,200 images is scored by both AI and humans (because the difficulty of the original image affects the AI's judgment, the test set is split into three levels, easy, medium, and hard, in a 1:1:1 ratio).
Rigorous test of machine scoring: The tuned scoring model automatically scores newly generated images, and the results are compared with human scores.
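
A standard test set with a 1:1:1 difficulty split works out to 400 images per level for 1,200 images. The sketch below shows one way to draw such a set and report accuracy per level. The record layout (`id`, `difficulty`), the function names, and the synthetic pool and scores are assumptions made for illustration only.

```python
import random
from collections import Counter

LEVELS = ("easy", "medium", "hard")


def build_test_set(pool, total=1200, seed=0):
    """Sample an equal number of images from each difficulty level."""
    per_level, remainder = divmod(total, len(LEVELS))
    if remainder:
        raise ValueError("total must divide evenly across difficulty levels")
    rng = random.Random(seed)  # fixed seed keeps the test set reproducible
    selected = []
    for level in LEVELS:
        candidates = [img for img in pool if img["difficulty"] == level]
        if len(candidates) < per_level:
            raise ValueError(f"not enough {level} images in the pool")
        selected.extend(rng.sample(candidates, per_level))
    return selected


def accuracy_by_level(test_set, human, machine):
    """Share of images per level where machine and human scores match."""
    totals, hits = Counter(), Counter()
    for img in test_set:
        totals[img["difficulty"]] += 1
        if human[img["id"]] == machine[img["id"]]:
            hits[img["difficulty"]] += 1
    return {level: hits[level] / totals[level] for level in totals}


# Synthetic pool: 3,000 images with difficulty assigned round-robin.
pool = [{"id": i, "difficulty": LEVELS[i % 3]} for i in range(3000)]
test_set = build_test_set(pool)
level_counts = Counter(img["difficulty"] for img in test_set)

# Synthetic scores: the machine disagrees only on hard images.
human = {img["id"]: 3 for img in test_set}
machine = {img["id"]: 2 if img["difficulty"] == "hard" else 3 for img in test_set}
report = accuracy_by_level(test_set, human, machine)
print(dict(level_counts), report)
```

Keeping the test set fixed from week to week means a change in the reported accuracy reflects the model rather than a change in how hard the sampled images happen to be.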

Step 3: Apply the aesthetic model

Goal: Use the aesthetic model to raise the good-image rate of Taobao's large AI image generation models.

▐Aesthetic model version 1.0 - applying the generated-image evaluation capability

Goal: Use the aesthetic model to evaluate the Taobao generation model, score its images, identify image problems, and repair the problems identified.
Judgment capability: The model scores images (1 to 5 points), separates good images from bad ones, and informs subsequent optimization of the generation model.
Recognition capability: The model can currently report 5 key image attributes: (1) hand abnormalities; (2) the person not blending with the background; (3) facial abnormalities; (4) body abnormalities; (5) other.
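Repair capability: Drawing good hands has long been a difficulty when AIGC generates people. Human hands have many degrees of freedom and complex, variable poses, and they occupy a small part of the image while containing a lot of detail, so the success rate of drawing hands is low. In real business scenarios in particular, the hands in user-uploaded images may lack clear detail or be holding objects, so when the model or background is swapped, the generation model often fails to learn accurate hand features and draws poor hands. We explored a new hand-repair approach. The AI aesthetic evaluation model identifies abnormally generated hands. For each abnormal hand, a 3D hand reconstruction model preserves the correct number of fingers and the hand shape while adapting to the gesture required by the generated image. Then, based on our internal base model and incorporating text embeddings, a normal hand is redrawn from the reconstructed hand pose. After repeated parameter tuning and scene adaptation, our hand-repair approach achieved a repair success rate above 50% on business data, which can substantially raise the overall good-image rate.

Figure: Hand-repair cases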

▐Aesthetic model version 2.0 - applying the original-image evaluation capability

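Goal: Tune the Taobao base model. The current original-image dataset is mixed and of uneven quality, and needs effective screening and optimization.
Background: The original-image dataset currently comes from two main sources: Visual China (视觉中国) and Taobao model photos.
Visual China's photographs are mainly supplied as illustrations for news articles, so many of them portray people and scenes in distinctive ways to create a sense of story. Taobao model photos have already been post-processed by merchants, and some of the retouching, for example of the models, is quite exaggerated.
Screen high-quality original images: Use the original-image judgment model to select high-quality photographs and improve the datasets used to tune the self-developed model, raising the good-image rate of generated images (for example, problems such as cluttered multi-person scenes, cluttered backgrounds, and poor scene fusion can be improved).
Collect professional original photographs: The design team is currently gathering high-quality photographs of models.
The version 1.0 AI aesthetic evaluation model influences the generation model so that it adaptively aligns with human preferences: AI aesthetic evaluation can be used to guide diffusion-based generation models, which must not only be guided toward generating high-aesthetic images but also be made less likely to generate low-aesthetic ones. To address this, we use the AI aesthetic evaluation model to add abnormal-attribute labels to low-aesthetic, abnormally generated images, strengthening the generation model's ability to learn the concept of an abnormally generated image so that such outputs can be avoided at inference time.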
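
One way to picture the labeling step is as a caption-tagging pass over the training data. The sketch below is an assumption-laden illustration: the tag strings, the `low_score` cut-off, and the use of the same tags as a negative prompt at inference are hypothetical choices, since the article only states that abnormal-attribute labels are added and that abnormal outputs can be avoided during inference.

```python
# Tag wording is hypothetical; the categories follow the abnormal
# attributes listed for the version 1.0 aesthetic model.
ABNORMAL_ATTRIBUTES = (
    "abnormal person-background fusion",
    "abnormal hands",
    "abnormal face",
    "abnormal limbs",
    "other abnormality",
)


def tag_caption(caption, score, attributes, low_score=2):
    """Append abnormal-attribute tags to the caption of a low-aesthetic image.

    Assumption: images scored at or below `low_score` (on the 1-5 scale)
    count as low-aesthetic; higher-scoring images keep their caption as is.
    """
    if score > low_score or not attributes:
        return caption
    unknown = [attr for attr in attributes if attr not in ABNORMAL_ATTRIBUTES]
    if unknown:
        raise ValueError(f"unknown attributes: {unknown}")
    return caption + ", " + ", ".join(attributes)


def negative_prompt(attributes=ABNORMAL_ATTRIBUTES):
    """Join the learned abnormal tags so they can be steered away from."""
    return ", ".join(attributes)


tagged = tag_caption("a model standing on a beach", 1, ["abnormal hands"])
untouched = tag_caption("a model standing on a beach", 5, ["abnormal hands"])
print(tagged)
print(untouched)
print(negative_prompt())
```

Because the generation model sees the abnormal tags only alongside low-aesthetic images, it can associate the tag text with the defect, which is what makes it possible to steer away from that concept later.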

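Step 4: Upgrade the Taobao style model

Goal: Create a distinctive Taobao style model.

Summarizing style standards: The style framework has been set up. Because the volume of content is large, we will work with graduate students from our university-enterprise partnership to fill in the style content step by step according to our requirements.

▐Background on styles

Style choices currently lack variety: the scenes and people in generated images cluster around a few specific types. Styles were previously defined by enumeration. For example, the generated background scenes are essentially swimming pool, garden, shopping mall, beach, forest, and snowy mountain.
Because of where the original images come from, the regional character of the scenes is largely Western, for example Southeast Asian beaches, European gardens, American shopping malls, American swimming pools, and Nordic snowy mountains.
Because styles are enumerated, the tool offers too many options and the experience is complicated. Merchants find it hard to choose and resort to repeated trial and error.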