NeurIPS 2023 | Diffusion models get an upgrade! Microsoft proposes TextDiffuser: the text in generated images can finally be done well!


Reprinted from: Heart of the Machine

In recent years, the field of text-to-image generation has made great progress, especially in the era of AIGC (AI-Generated Content). With the rise of DALL-E, more and more text-to-image models have emerged in academia, such as Imagen, Stable Diffusion, and ControlNet. However, despite this rapid development, existing models still struggle to reliably generate images that contain text.

After trying existing state-of-the-art text-to-image models, one finds that the text they generate is largely unreadable and resembles garbled characters, which greatly detracts from the overall aesthetics of the images.

[Figure] Text generated by existing state-of-the-art text-to-image models has poor readability

After investigation, there is little academic research in this area. Yet images containing text are very common in daily life, such as posters, book covers, and street signs. If AI could effectively generate such images, it would help designers in their work, spark design inspiration, and reduce design workload. In addition, users may wish to modify only the text portion of a text-to-image model's output while keeping the other, non-text regions unchanged.

Therefore, the researchers set out to design a comprehensive model that can not only generate images directly from user-provided prompts, but also accept a user-given image and modify the text in it. This work has been accepted at NeurIPS 2023.


  • Paper address: https://arxiv.org/abs/2305.10855

  • Project address: https://jingyechen.github.io/textdiffuser/

  • Code address: https://github.com/microsoft/unilm/tree/master/textdiffuser

  • Demo address: https://huggingface.co/spaces/microsoft/TextDiffuser

[Figure] The three functions of TextDiffuser

This paper proposes the TextDiffuser model, which consists of two stages: the first stage generates the layout, and the second stage generates the image.

[Figure] TextDiffuser framework diagram

The model takes a text prompt and determines the layout (i.e., the bounding box) of each keyword in the prompt. The researchers adopt a Layout Transformer, using an encoder-decoder architecture to autoregressively output the keyword bounding boxes, and use Python's Pillow library to render the text. During this process, Pillow's off-the-shelf APIs can also be used to obtain the bounding box of each character, which amounts to a character-level, box-level segmentation mask. Based on this information, the researchers fine-tune Stable Diffusion.
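To illustrate the rendering step, here is a minimal Pillow sketch for rendering a keyword into its layout box and recovering per-character boxes. The font path, font size, and the simple advance-width accumulation are illustrative assumptions, not the authors' exact implementation:

```python
# Minimal sketch (not the authors' code): render a keyword with Pillow and
# recover one bounding box per character by accumulating glyph advances.
from PIL import Image, ImageDraw, ImageFont

def render_keyword(text, box, font_path="arial.ttf", image_size=(512, 512)):
    """Render `text` at the top-left corner of `box`; return the image and a
    list of (left, top, right, bottom) boxes, one per character."""
    img = Image.new("RGB", image_size, "white")
    draw = ImageDraw.Draw(img)
    x0, y0 = box[0], box[1]                       # top-left corner of the keyword box
    font = ImageFont.truetype(font_path, size=32) # font choice is an assumption
    draw.text((x0, y0), text, font=font, fill="black")

    char_boxes = []
    for i, ch in enumerate(text):
        offset = font.getlength(text[:i])         # advance width of the preceding characters
        char_boxes.append(draw.textbbox((x0 + offset, y0), ch, font=font))
    return img, char_boxes

img, boxes = render_keyword("TextDiffuser", box=(100, 200, 400, 240))
```

Such character boxes could then be rasterized into the character-level segmentation mask mentioned above.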

They consider two scenarios. In the first, the user wants to generate the entire image directly (called whole-image generation). In the second, called part-image generation (also referred to as text inpainting in the paper), the user provides an image and wants to modify certain text regions in it.

To support both purposes, the researchers redesigned the input features, expanding them from the original 4 dimensions to 17: a 4-dimensional noisy image latent, 8-dimensional character information, a 1-dimensional image mask, and 4-dimensional features of the unmasked image content. For whole-image generation, the mask covers the entire image; for part-image generation, only part of the image is masked. The training procedure of the diffusion model is similar to LDM; interested readers can refer to the method section of the original paper.
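As a concrete illustration, the following is a minimal PyTorch sketch of assembling such a 17-channel U-Net input; it is an assumption based on the description above, not the released TextDiffuser code, and the mask convention (1 = region to generate) is illustrative:

```python
# Sketch: concatenate the four feature groups into a 17-channel latent input.
import torch

def build_unet_input(noisy_latent, char_mask, image_mask, unmasked_latent):
    """
    noisy_latent:    (B, 4, h, w)  noisy image latent
    char_mask:       (B, 8, h, w)  character-level segmentation information
    image_mask:      (B, 1, h, w)  1 = region to generate, 0 = region to keep (assumed convention)
    unmasked_latent: (B, 4, h, w)  latent features of the kept image content
    """
    return torch.cat([noisy_latent, char_mask, image_mask, unmasked_latent], dim=1)  # (B, 17, h, w)

B, h, w = 2, 64, 64
noisy = torch.randn(B, 4, h, w)
chars = torch.zeros(B, 8, h, w)
mask = torch.ones(B, 1, h, w)   # whole-image generation: mask the entire image
kept = torch.zeros(B, 4, h, w)
x = build_unet_input(noisy, chars, mask, kept)
assert x.shape == (B, 17, h, w)
```

For part-image generation, the mask would instead be 1 only over the text regions to be rewritten.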

In the Inference stage, TextDiffuser is very flexible and can be used in three ways:

  • Generate an image from the prompt given by the user. Moreover, if the user is unsatisfied with the layout produced by layout generation in the first stage, they can change the box coordinates or the text content, which increases the controllability of the model.

  • Start directly from the second stage and generate the final result from a template image, where the template can be a printed-text, handwritten-text, or scene-text image. The researchers specifically trained a character segmentation network to extract the layout from template images.

  • Also start from the second stage: the user provides an image and specifies the regions and text content to be modified. This operation can be repeated multiple times until the user is satisfied with the result.

[Figure] Construction of the MARIO dataset

To train TextDiffuser, the researchers collected 10 million text images, as shown in the figure above, comprising three subsets: MARIO-LAION, MARIO-TMDB, and MARIO-OpenLibrary.

The researchers considered several criteria when filtering the data. For example, after running OCR on the images, they kept only those whose number of detected text regions lies in [1, 8]; images with more than 8 texts were discarded, because such images (e.g., newspapers or complex design drawings) often contain large amounts of dense text for which OCR results tend to be less accurate. In addition, they required the text area to exceed 10% of the image, a rule intended to prevent the text region from being too small relative to the image.
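The two filtering rules can be sketched as a simple predicate; this is a reconstruction of the rules described above, not the authors' actual preprocessing script:

```python
# Sketch of the MARIO-10M filtering criteria: keep images whose OCR yields
# 1-8 text regions and whose total text area exceeds 10% of the image.
def keep_image(ocr_boxes, image_width, image_height,
               min_texts=1, max_texts=8, min_area_ratio=0.10):
    """ocr_boxes: list of (left, top, right, bottom) detected text boxes."""
    if not (min_texts <= len(ocr_boxes) <= max_texts):
        return False  # no text, or too dense (e.g., newspapers, complex designs)
    text_area = sum((r - l) * (b - t) for l, t, r, b in ocr_boxes)
    return text_area / (image_width * image_height) > min_area_ratio

# Example: two boxes covering roughly 18% of a 640x480 image pass both rules.
boxes = [(50, 50, 350, 150), (50, 200, 350, 280)]
print(keep_image(boxes, 640, 480))  # True
```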

After training on the MARIO-10M dataset, the researchers compared TextDiffuser quantitatively and qualitatively with existing methods. For example, as shown in the figure below, in the whole-image generation task the images generated by this method contain clearer, more readable text, and the text regions blend well with the background.

[Figure] Comparison of text rendering quality with existing work

The researchers also conducted quantitative experiments, as shown in Table 1. The evaluation metrics include FID, CLIPScore, and OCR metrics. On the OCR metrics in particular, the proposed method improves substantially over the compared methods.
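As a rough illustration of what an OCR-based metric measures, the sketch below compares OCR-detected words from a generated image against the prompt keywords; the exact metric definitions used in the paper may differ from this simplified version:

```python
# Sketch: keyword-level precision/recall/F1 between OCR output and prompt keywords.
from collections import Counter

def keyword_ocr_f1(ocr_words, prompt_keywords):
    ocr = Counter(w.lower() for w in ocr_words)
    gt = Counter(w.lower() for w in prompt_keywords)
    matched = sum((ocr & gt).values())            # multiset intersection
    precision = matched / max(sum(ocr.values()), 1)
    recall = matched / max(sum(gt.values()), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1

print(keyword_ocr_f1(["Hello", "World", "!!"], ["hello", "world"]))  # (0.67, 1.0, 0.8)
```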

[Table 1] Quantitative comparison

For the part-image generation task, the researchers tried adding or modifying characters in a given image, and the experimental results show that TextDiffuser's outputs look very natural.

[Figure] Visualization of the text-inpainting function

Overall, the TextDiffuser model proposed in this paper makes significant progress in text rendering and can generate high-quality images containing readable text. In the future, the researchers plan to further improve TextDiffuser.

