Have you tamed the AI painting craze sweeping your screen? A look at the AIGC models behind it

foreword

With the development and maturation of artificial intelligence technology, AIGC (AI-generated content) has brought unprecedented help to people's work and life in content creation. Specifically, it can improve the efficiency of content production, enrich the diversity and expressiveness of the content produced, and provide more dynamic and interactive content. Over the past two years, AIGC has continued to shine in fields such as AI painting, AI composition, and AI poetry.

2022 is the year when AI painting gradually moves to the center of the stage.

Text-to-image generation (AI painting) is a new production method that generates images from text descriptions. Compared with human creators, it offers low creation cost, high speed, and easy mass production.

In the past year, this field has developed rapidly, with international technology giants and start-ups scrambling to enter, and many text-to-image products have also appeared in China. These products are mainly built on diffusion-based generative models such as DALL·E 2 and Stable Diffusion.

Just a few years ago, it was hard to predict whether a computer would ever be able to generate an image from such a textual description. Today, AI has begun to take on genuinely creative work, not just mechanical, repetitive work.

Recently, companies focused on using AI technology to empower their ecosystems and businesses, such as Kunlun Wanwei, Baidu, and Meitu, have also launched Chinese text-to-image generation services. Overall, this field is still in a stage of rapid development.

This article aims to give readers an overview of the current development of several excellent text-guided image generation models, along with some widely followed online APIs for trying out text-to-image generation, including those from OpenAI, Hugging Face, Baidu, and Kunlun Wanwei, a company dedicated to AIGC and gaming businesses with strong investment in AI technology and ecosystem development.

Text-Conditional Image Generation with CLIP Latents (DALL·E 2)

  • Hierarchical Text-Conditional Image Generation with CLIP Latents
  • Contrastive models like CLIP have been shown to be good at learning image representations that capture both semantics and style.
  • The authors propose a two-stage model: a prior generates a CLIP image embedding from the text caption, and a decoder then generates an image conditioned on that embedding.
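The two-stage pipeline above can be sketched with toy stand-ins. In the real model, the prior and decoder are diffusion models operating on CLIP's embedding space; the stub functions, dimension, and image size below are purely illustrative assumptions.

```python
import random

EMBED_DIM = 8  # illustrative; CLIP ViT-L/14 image embeddings are 768-d

def prior(text):
    """Stage 1 stub: map a text caption to a CLIP-like image embedding."""
    rng = random.Random(hash(text) % (2 ** 32))  # deterministic within a run
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]

def decoder(image_embedding, size=4):
    """Stage 2 stub: map an image embedding to a size x size grayscale image."""
    rng = random.Random(int(sum(abs(v) for v in image_embedding) * 1e6))
    return [[rng.random() for _ in range(size)] for _ in range(size)]

def generate(text):
    """Full pipeline: text -> image embedding -> image."""
    return decoder(prior(text))

img = generate("a girl in a red dress reading a book in the sun")
print(len(img), len(img[0]))  # 4 4
```

The key design point the paper makes is exactly this decoupling: the prior decides *what* the image should depict (in CLIP feature space), and the decoder decides *how* it looks in pixels.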

The paper link is as follows:


BLIP


The basic information of the paper is as follows

  • BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
  • Paper address: https://arxiv.org/pdf/2201.12086.pdf
  • Code address: https://github.com/salesforce/BLIP
  • Trial URL: https://huggingface.co/spaces/akhaliq/BLIP

BLIP performs vision-language understanding and generation. The trial workflow is as follows:

  1. Upload an image of your choice
  2. Click the submit button below
  3. Wait a few seconds; a caption describing the image content appears on the right side


The network structure adopts a multi-task mixture of encoder and decoder modules.


Model architecture

BLIP uses a vision transformer as the image encoder: it decomposes an input image into patches, encodes these patches as a sequence of embeddings, and uses an additional [CLS] token to represent the global image feature. Compared to methods that use pre-trained object detectors for visual feature extraction, using a ViT is more computationally friendly and has been adopted by many recent methods.
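The patch decomposition a ViT performs can be illustrated with a small pure-Python sketch. The 224×224 image and 16×16 patch size below are the common ViT defaults; the code is illustrative only and omits the linear projection and position embeddings.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of lists) into non-overlapping
    patch_size x patch_size patches, flattened row-major, which is the
    decomposition a ViT applies before linearly embedding each patch."""
    h, w = len(image), len(image[0])
    assert h % patch_size == 0 and w % patch_size == 0
    patches = []
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = [image[i + di][j + dj]
                     for di in range(patch_size)
                     for dj in range(patch_size)]
            patches.append(patch)
    return patches

# A 224x224 image with 16x16 patches yields (224/16)^2 = 196 patches;
# prepending the [CLS] token gives a sequence of length 197.
image = [[0] * 224 for _ in range(224)]
sequence = [["CLS"]] + patchify(image, 16)
print(len(sequence))  # 197
```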

In order to pre-train a unified model with understanding and generation capabilities, the researchers proposed a multi-task model MED (mixture of encoder-decoder), which can perform any of the following three functions:

  • Unimodal encoder
  • Image-based text encoder
  • Image-based text decoder

Pre-training objectives

The researchers jointly optimized three objectives during pre-training: two understanding-based objectives and one generation-based objective. Each image-text pair requires only one forward pass through the computationally heavier visual transformer, and three forward passes through the text transformer, in which different functionalities are activated to compute the following three losses:

  • Image-text contrastive loss (ITC), which activates a unimodal encoder, aims to align the feature space of the visual and text transformers by encouraging positive image-text pairs (rather than negative pairs) to have similar representations;

  • Image-text matching loss (ITM), which activates image-based text encoders, aims to learn multimodal representations of image-text that capture fine-grained alignment between vision and language;

  • Language modeling loss (LM), which activates an image-based text decoder, aims to generate a text description given an image.
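As an illustration of the first objective, here is a toy InfoNCE-style version of the ITC loss over a batch of matched image-text feature pairs. ITM would be a binary cross-entropy over matched vs. mismatched pairs, and LM a next-token cross-entropy; all feature values and the 2-d dimension below are made up for illustration.

```python
import math

def itc_loss(image_feats, text_feats, temperature=0.07):
    """Toy image-text contrastive loss: for each image, treat its paired
    text as the positive and all other texts in the batch as negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    n = len(image_feats)
    loss = 0.0
    for i in range(n):
        logits = [dot(image_feats[i], t) / temperature for t in text_feats]
        m = max(logits)  # subtract max for numerical stability
        log_softmax_i = logits[i] - (m + math.log(sum(math.exp(l - m) for l in logits)))
        loss += -log_softmax_i
    return loss / n

imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[1.0, 0.0], [0.0, 1.0]]
# Correctly aligned pairs give a lower loss than shuffled (mismatched) pairs.
print(itc_loss(imgs, txts) < itc_loss(imgs, [txts[1], txts[0]]))  # True
```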

To achieve efficient pre-training while leveraging multi-task learning, the text encoder and decoder must share all parameters except the self-attention (SA) layer. Specifically, the encoder uses bidirectional self-attention to construct a representation for the current input token, while the decoder uses causal self-attention to predict the next token.

In addition, the embedding layers, cross-attention (CA) layers, and FFNs function similarly between encoding and decoding tasks, so sharing these layers improves training efficiency and benefits from multi-task learning.
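The sharing scheme can be sketched as follows: one set of shared layers is reused by both paths, while each path keeps its own self-attention. Everything here is a stub for illustration, not BLIP's actual implementation.

```python
class SharedLayers:
    """Stands in for the embedding, cross-attention, and FFN weights that
    the encoder and decoder paths share."""
    def __call__(self, x):
        return f"shared({x})"

class MEDTextTransformer:
    """Encoder and decoder share all parameters except self-attention:
    bidirectional SA for encoding, causal SA for decoding."""
    def __init__(self):
        self.shared = SharedLayers()                      # one set of weights
        self.bidirectional_sa = lambda x: f"bi_sa({x})"   # encoder-only
        self.causal_sa = lambda x: f"causal_sa({x})"      # decoder-only

    def encode(self, tokens):
        return self.shared(self.bidirectional_sa(tokens))

    def decode(self, tokens):
        return self.shared(self.causal_sa(tokens))

med = MEDTextTransformer()
print(med.encode("t"))  # shared(bi_sa(t))
print(med.decode("t"))  # shared(causal_sa(t))
```

Because `self.shared` is a single object used by both paths, updating it during one task's backward pass would update it for the other task too, which is exactly the efficiency argument the paragraph above makes.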

Experimental results

The researchers implemented the model in PyTorch and pre-trained it on two 16-GPU nodes. The image transformer is initialized from a ViT pre-trained on ImageNet, and the text transformer is initialized from BERT_base.

  • Mainstream datasets: COCO, Flickr

Hugging Face

Hugging Face officially provides many interesting online demo APIs for AI applications in natural language processing and computer vision, as well as many high-star open-source projects that receive sustained attention from developers.


Singularity

Singularity (奇点智源) has opened up many interesting AI application APIs; interested readers can go and try them out.


Chinese-CLIP

This project is a Chinese version of the CLIP model, trained on large-scale Chinese data (~200 million image-text pairs). It aims to help users quickly implement image-text feature extraction and similarity computation, cross-modal retrieval, and zero-shot image classification in the Chinese domain. The project code is based on the open_clip project and has been optimized for Chinese-domain data to achieve better results on Chinese data.
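The similarity computation and zero-shot classification that CLIP-style models perform both reduce to cosine similarity between image and text features. Here is a minimal pure-Python sketch; the 3-d feature vectors and class names are toy assumptions, not Chinese-CLIP's real 512-d or larger features.

```python
import math

def cosine_similarity(u, v):
    """Cosine between an image feature and a text feature, the score
    used for CLIP-style retrieval and classification."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_feat, class_text_feats):
    """Zero-shot classification: pick the class whose text feature
    (e.g. an encoded prompt) is most similar to the image feature."""
    sims = {name: cosine_similarity(image_feat, feat)
            for name, feat in class_text_feats.items()}
    return max(sims, key=sims.get)

image = [0.9, 0.1, 0.0]
classes = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
print(zero_shot_classify(image, classes))  # cat
```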

Many domestic technology companies have also opened AIGC (AI-generated content) trial APIs, either as free demos or with commercial support. The following sections introduce and try out the AI painting offerings of two companies with deep AI technology accumulation in the industry: Baidu, which comprehensively showcases strong AI capabilities, and Kunlun Wanwei, whose gaming and AIGC businesses are flourishing.

Baidu

Not long ago, Baidu also launched Wenxin, a text-based stylized image generation technology. Experience link:

After opening the link, simply enter the target keywords and select the desired style to produce a unique painting. Baidu's official online API demo currently supports more than ten styles of text-to-image generation.

Sample images are shown below; it really is impressive.

As can be seen, the generation quality of Baidu's API demo is quite good. Baidu's text input takes the form of multiple keywords, and it is unclear whether the current model can also understand the semantics of long text well; interested readers can try it out. Another obvious impression is that the current model takes a long time to generate an image.

Kunlun Wanwei AI Painting


Kunlun Wanwei AI painting is another AIGC practice following the AI composition of its StarX series products, marking a qualitative breakthrough in Kunlun Wanwei's work in the field of AI painting.

As a broad concept, virtual technology spans a huge industrial chain across multiple fields such as AI, games, and social networking. In the medium and long term, it is expected to bring innovation to the virtual world, promote the shared prosperity of every link in the industrial chain, and in turn drive new growth.

The development of this technology requires not only technological progress but also support on the content side. Kunlun Wanwei completed its layout of AI applications in social entertainment, information distribution, and other fields in 2021. As AIGC gradually matures and advances, this technology is also expected to become one of the breakthroughs through which Kunlun Wanwei achieves "power growth".

Kunlun Wanwei's AI painting model developers built a high-quality dataset of hundreds of billions of samples for the Chinese domain. Training on a high-performance A100 GPU cluster (200 cards, 4 weeks of training, plus a further 2 weeks of optimization) produced a large generative model with hundreds of millions of parameters.

The Kunlun Wanwei AI painting model adopted the following strategies during training:

  1. While adding support for Chinese prompt input, the model remains compatible with the original stable_diffusion English prompt model, so the English prompt handbooks users accumulated previously can still be used;
  2. A parallel corpus of 150 million pairs was used to optimize the prompt model for Chinese-English alignment, covering not only translation-task corpora but also frequently used Chinese and English prompt phrases, classical poetry in both languages, subtitle corpora, encyclopedia corpora, and image-description corpora, forming a massive multi-scenario, multi-task corpus collection;
  3. Model distillation and bilingual alignment were used jointly for training: a teacher model distills the student model, supplemented by a decoder language-alignment task to assist training.
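Teacher-student distillation, as in strategy 3 above, is commonly implemented as a KL divergence between temperature-softened output distributions. The following is a toy sketch of that general technique, not Kunlun Wanwei's actual training code; the logits and temperature are made-up values.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature that softens the
    distribution when temperature > 1."""
    m = max(logits)
    exps = [math.exp((l - m) / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions: the student is
    penalized for deviating from the teacher's output distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
# A student close to the teacher incurs a smaller loss than a divergent one.
print(distillation_loss(teacher, [2.1, 0.9, 0.0])
      < distillation_loss(teacher, [0.0, 0.0, 5.0]))  # True
```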

Trying out Kunlun Wanwei's AI painting mini-program truly amazed me.

  • Input content: A girl in a red dress reads a book in the sun
  • The output image is as follows; see whether it amazes you too


The official Kunlun Wanwei website link is as follows; interested readers can download their products to experience the powerful productivity of AIGC.

Origin blog.csdn.net/sinat_28442665/article/details/128294889