Exploring AIGC for Tmall product poster generation





The Inspiration Artist project uses AIGC's image-generation capabilities, together with merchants, to run a low-barrier, highly engaging promotional poster design contest. This article shares our solution and the directions we optimized, and is recommended for engineers and algorithm practitioners interested in AIGC.



Project background

The Inspiration Artist project uses AIGC's image-generation capabilities, together with merchants, to run a low-barrier, highly engaging promotional poster design contest that promotes new products and builds momentum for their launch. It also gives consumers a channel to take part in new product announcements.

Goal breakdown

The language part uses the Tongyi Qianwen model; see its technical documentation for details. This article focuses on the image-generation side, which covers four poster styles: commercial poster, Pixar, anime, and photorealistic:



The Pixar, anime, and photorealistic styles have relatively clear implementation paths: they are standard text-to-image tasks that can be built on Midjourney (MJ) or Stable Diffusion (SD). Plenty of articles already compare the strengths and weaknesses of MJ and SD, so we won't repeat them here; we ultimately chose SD as our text-to-image solution, chiefly because SD is open source and highly adaptable. On top of diffusers we rewrote an SD implementation supporting VAE swapping, ControlNet, LoRA, and embeddings, and added business-specific capabilities such as warmup and auto_predict. This handled these three styles with relatively little trouble.
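For orientation, here is a minimal sketch of what such a diffusers-based setup can look like; this is not our internal rewrite, and the model IDs, LoRA path, and prompt are placeholders.

```python
# Minimal diffusers text-to-image sketch (not our internal rewrite);
# model IDs and the LoRA path are placeholders.
import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Swap in a fine-tuned VAE for better detail reconstruction.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", vae=vae, torch_dtype=torch.float16
).to("cuda")

# Attach a style LoRA (hypothetical path).
pipe.load_lora_weights("path/to/pixar_style_lora")

# "Warmup": one throwaway call so the first real request is not slowed
# by CUDA kernel compilation and memory allocation.
_ = pipe("warmup", num_inference_steps=1)

image = pipe(
    "a Pixar-style promotional poster, soft lighting, rich detail",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("poster.png")
```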

The hard part is the commercial product poster style. Brands demand that the product be reproduced with high fidelity and that the poster be sharp, richly detailed, and premium-looking. The requirements are ambitious, but reality is unforgiving: product details are complex, text on packaging is especially hard to generate, and because the creative prompt is free-form user input, the output is almost uncontrollable. We therefore did a good deal of research and made several optimization attempts.

Solution research

Taking Chanel No. 5 perfume as an example, we initially tried four approaches.

▐ Option 1: SD + Outpainting


Brief description: Fix the product's position and redraw everything outside it.
Advantages: The perfume's appearance is untouched.
Disadvantages: The spatial relationships between figures, background, and perfume are hard to control, and the results look noticeably off.
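A minimal sketch of this idea using the stock diffusers inpainting pipeline, where the mask is inverted so everything except the product is repainted; file paths are placeholders.

```python
# Outpainting sketch: freeze the product, repaint the rest.
# File paths are placeholders.
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("canvas_with_product.png").convert("RGB")
product_mask = Image.open("product_mask.png").convert("L")  # white = product

# The pipeline repaints white mask areas, so invert the product mask
# to make everything *outside* the product editable.
outpaint_mask = Image.fromarray(255 - np.array(product_mask))

result = pipe(
    prompt="elegant perfume poster background, studio lighting",
    image=init_image,
    mask_image=outpaint_mask,
).images[0]
result.save("outpainted.png")
```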

▐ Option 2: SD Inpainting + Reference Only


Brief description: Inject the product image's features into the attention layers to steer the UNet toward generating similar images.
Advantages: The pre-generated background is fully preserved.
Disadvantages: Low fidelity to the perfume.

▐ Option 3: Reference-based diffusion


Brief description: Generate a visually similar product from a reference product image.
Representatives: Paint by Example (PBE) [2], IP-Adapter [1], AnyDoor [3]...
Advantages: Strong generalization; no per-product training required.
Disadvantages: Product detail fidelity still falls short of a plain copy-and-paste.
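As one concrete example of this family, recent versions of diffusers ship IP-Adapter support; a minimal sketch follows, with the reference image path as a placeholder.

```python
# Reference-based generation with IP-Adapter (requires a diffusers
# version with IP-Adapter support); the reference path is a placeholder.
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
)
pipe.set_ip_adapter_scale(0.8)  # how strongly the reference steers generation

reference = Image.open("perfume_reference.png").convert("RGB")
image = pipe(
    prompt="luxury perfume bottle on a marble table, soft light",
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("ip_adapter_out.png")
```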


▐ Option 4: SD + LoRA/DreamBooth


Brief description: Fine-tune the model to inject the product's appearance.
Advantages: High appearance fidelity and a relatively stable success rate.
Disadvantages: Fine details such as text are still not reproduced faithfully, and the smaller the detail, the worse the distortion.
Option 4 comes closest to the desired effect, but a sizable gap to our requirements remains.
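For reference, inference with such a fine-tune might look like the sketch below; the checkpoint path is a placeholder, and the rare identifier token "sks" follows the common DreamBooth convention rather than anything specific to our setup.

```python
# Inference sketch for a DreamBooth-style LoRA fine-tune; the LoRA path
# is a placeholder and "sks" is the conventional rare identifier token.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/product_lora")

# The fine-tune binds the product's appearance to the identifier token,
# so the prompt can place "sks perfume bottle" into arbitrary scenes.
image = pipe(
    "a sks perfume bottle on a beach at sunset, commercial poster style",
    num_inference_steps=30,
).images[0]
image.save("dreambooth_out.png")
```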

Optimization directions

▐ Exploration 1: VAE enhancement


After analyzing the structure of the LDM model (the main paper behind SD) [4], we initially suspected that the core cause of poor detail fidelity was information lost by the VAE in its round trips between pixel space and latent space.


To verify this conjecture, we ran a test: encode and decode the same image ten times. Details such as text had visibly blurred. We then devised a way to compensate for the information the VAE loses, which noticeably improved fidelity.
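The round-trip test itself is easy to reproduce; a minimal sketch, using the stock SD 1.5 VAE (the input path is a placeholder):

```python
# VAE round-trip test: encode/decode an image ten times and watch
# fine details such as text degrade. Input path is a placeholder.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
).to("cuda").eval()

img = Image.open("product.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0  # scale to [-1, 1]
x = x.permute(2, 0, 1).unsqueeze(0).to("cuda")

with torch.no_grad():
    for _ in range(10):  # ten encode/decode round trips
        latents = vae.encode(x).latent_dist.mean
        x = vae.decode(latents).sample.clamp(-1.0, 1.0)

out = ((x[0].permute(1, 2, 0).cpu().numpy() + 1.0) * 127.5).astype(np.uint8)
Image.fromarray(out).save("roundtrip_x10.png")
```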


But a gap to perfect restoration remained.


▐ Exploration 2: Image super-resolution


Since fine details are hard to restore, could fidelity be improved by enlarging them? To verify this conjecture, we ran the following experiment.
At 256×256 resolution, the text is almost illegible.

At 512×512 there is a marked improvement over 256×256, and the 2.x models are more faithful than 1.x.



After upgrading to SDXL [5], text fidelity improved further.

Higher resolution did indeed bring higher fidelity. The natural next step was to super-resolve details such as text in the generated image, and then train a dedicated ControlNet for the refiner stage to push fidelity further.
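The SDXL base + refiner split this builds on looks roughly like the sketch below; our dedicated ControlNet is not shown, and the split fraction and prompt are placeholders.

```python
# SDXL base + refiner sketch: the base denoises most of the way, the
# refiner finishes high-frequency detail such as label text. Our
# dedicated refiner-stage ControlNet is not shown here.
import torch
from diffusers import DiffusionPipeline

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0",
    text_encoder_2=base.text_encoder_2,
    vae=base.vae,
    torch_dtype=torch.float16,
).to("cuda")

prompt = "luxury perfume poster, crisp label text, high detail"
split = 0.8  # hand off from base to refiner at 80% of the schedule

latents = base(
    prompt=prompt, num_inference_steps=40,
    denoising_end=split, output_type="latent",
).images
image = refiner(
    prompt=prompt, num_inference_steps=40,
    denoising_start=split, image=latents,
).images[0]
image.save("sdxl_refined.png")
```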


After many rounds of tuning, detail fidelity can exceed 90%. But a small gap to perfect restoration still remains.


▐ Exploration 3: Paste-back (stickers)


Since text and other fine details are so hard to restore, could we simply copy and paste the text back?
By extracting the text region from the original product photo and mapping it onto the corresponding region of the generated product, the text details are restored perfectly.
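A minimal sketch of this paste-back, assuming the product occupies the same position and scale in both images; the file paths and mask are placeholders.

```python
# Paste-back sketch: copy the text region from the original photo onto
# the generated poster. Assumes both images are aligned; paths and the
# mask are placeholders.
import cv2
import numpy as np

original = cv2.imread("original_product.png").astype(np.float32)
generated = cv2.imread("generated_poster.png").astype(np.float32)
text_mask = cv2.imread("text_region_mask.png", cv2.IMREAD_GRAYSCALE)

# Feather the mask edge slightly so the paste leaves no hard seam.
mask = cv2.GaussianBlur(text_mask, (5, 5), 0).astype(np.float32) / 255.0
mask = mask[..., None]  # broadcast over the three color channels

composite = original * mask + generated * (1.0 - mask)
cv2.imwrite("poster_with_text.png", composite.astype(np.uint8))
```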

Production solution


After exploring the directions above, we had largely solved poster generation for products like perfume, but products with complex graphics and text were still hard to restore, for example:

We need both perfect restoration and generality, so we simply paste the whole product image back (tongue firmly in cheek). The pipeline is as follows, with a high-level sketch after the list:

  1. An offline module generates a background gallery via text-to-image.
  2. An offline module prepares preset multi-angle product shots, covering the diversity of product angles.
  3. Pick the gallery background most relevant to the current product as the guide image; this resolves product/background mismatch and raises the success rate.
  4. From the product image and background image, generate the wireframe (Canny) image, the product-on-white image, and the corresponding mask.
  5. Generate a preliminary product poster with Stable Diffusion + Canny ControlNet + Reference.
  6. Erase the product with SAM and LaMa, so edges will not be misaligned when the product is pasted back later.
  7. Composite a new image from the erased background, the product-on-white image from step 4, and the corresponding mask.
  8. Extract the light and shadow of the product generated in step 5 and project it onto the product pasted in step 7 to produce the final poster.
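A high-level sketch of how these steps chain together; every helper below is a hypothetical stand-in for an internal module, named after the step it implements.

```python
# Hypothetical orchestration of the pipeline above; each helper stands
# in for an internal module and is named after the step it implements.
def generate_poster(product_id: str, user_prompt: str):
    # Steps 1-2 run offline: a text-to-image background gallery plus
    # preset multi-angle product shots.
    product_img = pick_best_angle(product_id)                  # step 2
    guide_bg = retrieve_guide_background(product_img,          # step 3
                                         user_prompt)
    canny, white_bg, mask = prepare_controls(product_img,      # step 4
                                             guide_bg)
    draft = sd_canny_reference(user_prompt, canny, guide_bg)   # step 5
    clean_bg = erase_product(draft, mask)                      # step 6: SAM + LaMa
    composed = paste_product(clean_bg, white_bg, mask)         # step 7
    shading = extract_light_and_shadow(draft, mask)            # step 8
    return apply_shading(composed, shading)
```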

In summary:
  1. Copy-and-paste guarantees faithful restoration regardless of the product.
  2. Preset guide images remove the complete randomness and raise the image success rate.
  3. Two-stage generation handles effects such as reflections; the images are attractive and premium-looking.
  4. Erase-and-recompose plus image fusion alleviates jagged edges around the product.
  5. Extracting light and shadow from the generated image and mapping it onto the pasted product resolves lighting inconsistency.


Test results



Production results


The image success rate is over 95%: essentially every image is presentable, and most hold up to scrutiny. On an A10 GPU, a single card produces an image in 3-5 seconds.


Next steps

At first glance the results are acceptable, but there is still room for improvement, for example:


How can we further improve generation of complex posters and introduce occlusion relationships?


How can we make the proportions of product and background harmonious? GLIGEN [6] may be the answer.


Paste-backs always feel less "algorithmic". Is there room to keep improving the VAE, or to remove it altogether? Could the Consistency Decoder [7] be worth trying?


Finally: exploration never stops, and AIGC never sleeps.


References


[1] IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
[2] Paint by Example: Exemplar-based Image Editing with Diffusion Models
[3] AnyDoor: Zero-shot Object-level Image Customization
[4] High-Resolution Image Synthesis with Latent Diffusion Models
[5] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
[6] GLIGEN: Open-Set Grounded Text-to-Image Generation
[7] Consistency Decoder: https://github.com/openai/consistencydecoder


Team introduction

We are the Big Taobao FC Intelligent Strategy technology team. We are responsible for Mobile Tmall search, recommendation, Pailixiang (snap-to-shop), and other businesses, as well as the underlying technology platforms. By applying search and recommendation algorithms, machine vision, AIGC, and other frontier technologies, we aim to make our scenarios more efficient and our products more innovative, bringing users a better shopping experience.


This article is shared from the WeChat official account Big Taobao Technology (AlibabaMTT).
