AIGC image resolution too low? Try this pixel-aware diffusion super-resolution model: all the details you want are here

FaceChain portrait open-source project, a quick plug:

The latest FaceChain supports group photos and hundreds of single-photo styles. Project information: the ModelScope community.
GitHub repository (give it a star if you find it interesting): GitHub - modelscope/facechain: FaceChain is a deep-learning toolchain for generating your Digital-Twin.

Summary

Alibaba's latest self-developed pixel-aware diffusion super-resolution model has been open-sourced. It combines the powerful generative capability of diffusion models with pixel-level control, and can be adapted to a wide range of image-enhancement tasks, from old-photo restoration to AIGC image super-resolution, across a variety of image styles, with control over both generation strength and enhancement style. One direct application of this technology is post-processing enhancement and regeneration of AIGC images, which yields considerable quality improvements.

Paper & Code

Paper: [arxiv]

Code: [ModelScope] [github]

Background introduction

With the rapid development of large models, especially AIGC models represented by text-to-image generation and ChatGPT, artificial intelligence has entered a new era and a fast lane of development. Taking text-to-image as an example, models trained on big data at large scale show astonishing generative capability and can output realistic natural images from text prompts, to the point of being hard to distinguish from real photographs. The startup Stability AI trained and open-sourced the Stable Diffusion (SD) text-to-image pre-trained model based on the latent diffusion framework, giving the general public the opportunity to access and use large models. Its superior performance has also sparked a surge of activity in academic research and open-source communities, and it has been widely used in, and has had a profound impact on, downstream tasks including controllable generation, personalized customization, and image editing.

This article focuses on super-resolution and restoration algorithms in low-level vision. Such tasks rely heavily on a model's generative ability to restore lifelike, realistic texture details, which is exactly what generative models like SD are good at. Applying SD to super-resolution has therefore become a research hotspot, with works such as LDM and StableSR emerging. This article introduces a new image super-resolution and restoration algorithm based on SD generative priors that can be applied to multiple tasks, achieving SOTA performance in all of them.

A first look at the results

Research foundation

Before introducing image super-resolution and restoration, let us first review the SD-based controllable image translation task (Image-to-Image Translation): given a control image such as a canny edge map, pose, or depth map, generate a result that is structurally consistent with it. We can think of a large text-to-image model such as SD as having the ability, in principle, to generate any natural image; controllable image translation then essentially amounts to finding, in SD's latent space, a result consistent with the control image. Representative works such as ControlNet and T2I-Adapter therefore introduce the control condition into the SD backbone through an additional branch network to unlock this potential. The super-resolution task is essentially the same kind of Image-to-Image mapping, except that its control condition is a low-resolution image and the output is expected to correspond to that image at the pixel level, making it a more tightly constrained form of image translation. With this in mind, we can take inspiration from prior work such as ControlNet. An initial idea is to use ControlNet directly for super-resolution, but unfortunately, experiments show that it often fails to achieve precise pixel-level control: semantic and structural differences appear between the output high-resolution image and the input low-resolution image, as shown in the figure below:

This is mainly because ControlNet passes in the control information only through simple addition, which is a relatively weak form of control and cannot achieve pixel-level awareness.
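To make the contrast concrete, here is a minimal sketch of this additive injection (ours, for illustration only, not ControlNet's actual code): branch features pass through a zero-initialized convolution and are simply added onto the U-Net features, so any per-pixel correspondence is only implicit.

```python
import torch.nn as nn


class AdditiveInjection(nn.Module):
    """Illustrative sketch of ControlNet-style additive conditioning."""

    def __init__(self, channels: int):
        super().__init__()
        # "Zero convolution": weights and bias start at zero, so the branch
        # initially contributes nothing and is learned gradually.
        self.zero_conv = nn.Conv2d(channels, channels, kernel_size=1)
        nn.init.zeros_(self.zero_conv.weight)
        nn.init.zeros_(self.zero_conv.bias)

    def forward(self, unet_feat, branch_feat):
        # Control information enters only as an additive offset.
        return unet_feat + self.zero_conv(branch_feat)
```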

Our method

PACA module

Therefore, the core problem we need to solve is how to enhance SD's perception of pixel-level control information. The overall framework we designed is shown below:

Different from the simple addition used by ControlNet, we introduce a dedicated Pixel-Aware Cross Attention (PACA) module to strengthen the transmission of pixel-level information. Its form is similar to classic cross attention:
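In the standard scaled dot-product form,

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.$$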

Here, Q is derived from the SD feature x, while K and V are derived from the feature y produced by the ControlNet-like branch network. y has exactly the same size as x, and we map y into an embedding sequence of length h*w. This length-h*w sequence carries pixel-level information: because the ControlNet-like branch does not use the VAE encoder, we believe y still retains the raw pixel information of the control image. For this reason, we argue that PACA enhances the perception of pixel-level information.
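For illustration, a minimal PyTorch sketch of such a pixel-aware cross-attention layer follows. The names and structure are ours, not the released implementation; it assumes the SD feature map x and the branch feature map y share the same shape, and that the channel count is divisible by the number of heads.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PixelAwareCrossAttention(nn.Module):
    """Illustrative PACA-style layer: Q from the SD U-Net feature x;
    K and V from the branch feature y, flattened to h*w tokens so that
    every pixel of the control feature can be attended to."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.to_q = nn.Linear(channels, channels, bias=False)
        self.to_k = nn.Linear(channels, channels, bias=False)
        self.to_v = nn.Linear(channels, channels, bias=False)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x, y: (b, c, h, w) with identical shapes
        b, c, h, w = x.shape
        xt = x.flatten(2).transpose(1, 2)  # (b, h*w, c)
        yt = y.flatten(2).transpose(1, 2)  # (b, h*w, c): one token per pixel
        q, k, v = self.to_q(xt), self.to_k(yt), self.to_v(yt)

        def split_heads(t):
            return t.view(b, -1, self.num_heads, c // self.num_heads).transpose(1, 2)

        # Scaled dot-product attention over all h*w pixel tokens (PyTorch 2.x)
        out = F.scaled_dot_product_attention(split_heads(q), split_heads(k), split_heads(v))
        out = out.transpose(1, 2).reshape(b, h * w, c)
        out = self.proj(out)
        # Residual connection back onto the SD feature map
        return x + out.transpose(1, 2).view(b, c, h, w)
```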

Degradation removal module

For real-world super-resolution in particular, input low-quality images often carry various degradation factors, and we want the SD-based module to focus on generation. We therefore introduce a pre-positioned degradation-removal module that performs a simple degradation-removal operation on the real degraded image. Our experiments also found that this structure helps improve real-world super-resolution results.
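A toy sketch of the idea (illustrative only, not the released network): a lightweight convolutional module predicts a residual correction that cleans the degraded input before it reaches the diffusion branches.

```python
import torch.nn as nn


class DegradationRemoval(nn.Module):
    """Toy pre-cleaning module: maps a degraded low-res image toward a
    'clean' low-res image before it is handed to the diffusion model."""

    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 3, 3, padding=1),
        )

    def forward(self, lq):
        # Predict a residual correction rather than the clean image directly.
        return lq + self.body(lq)
```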

High-level information

To further improve super-resolution and restoration quality, our experiments found that high-level semantic information often has a positive impact on results. We therefore introduce classification, detection, image-tagging and other networks to provide additional semantic information, organize it into text prompts, and feed it into SD. At the same time, following the classifier-free guidance theory, we introduce negative prompts including noise, blurry, lowres, etc. Experiments show that these signals are also helpful to the results.
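As a simple illustration of the idea (the function below is hypothetical, not part of the released code), high-level tags from a recognition model can be folded into the positive prompt, while a fixed negative prompt supplies the classifier-free guidance signal:

```python
def build_prompts(tags):
    """Fold high-level semantic tags into SD text prompts (illustrative).

    'tags' would come from a classification/detection/tagging model,
    e.g. ["dog", "grass", "outdoor"].
    """
    # "high quality, detailed" is an illustrative quality suffix.
    positive = ", ".join(tags) + ", high quality, detailed"
    # Negative terms named in the text above, used for classifier-free guidance.
    negative = "noise, blurry, lowres"
    return positive, negative


pos, neg = build_prompts(["dog", "grass", "outdoor"])
```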

Experimental results

Image super-resolution

We validated our algorithm on multiple synthetic and real-world benchmarks, and it achieves SOTA performance on multiple metrics:

Visual comparison experiments show similar findings:

Custom stylization

In addition to super-resolution and restoration tasks, we found that by switching the base model, our algorithm can easily achieve arbitrary style transformation:

In essence, this decouples pixel-wise Image-to-Image mapping from stylized generation: the branch network we introduce handles the pixel-wise image-to-image mapping, while the base model handles the stylized generation. This opens up a whole new way of thinking about image stylization.
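To illustrate the idea of swapping the generative prior while keeping the control branch fixed, here is an analogous sketch using the diffusers ControlNet API (not the PASD pipeline itself; the stylized checkpoint id is hypothetical):

```python
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# The control branch stays fixed across both pipelines.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny")

# Photorealistic base model supplies a realistic generative prior.
pipe_real = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

# Same branch, different base model: the style comes from the swapped prior.
pipe_style = StableDiffusionControlNetPipeline.from_pretrained(
    "some/stylized-sd-checkpoint",  # hypothetical stylized checkpoint id
    controlnet=controlnet,
)
```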

Image colorization

Because the algorithm we propose is essentially pixel-aware image translation, it is applicable to any related task, including image colorization. We also trained it on the colorization task, and preliminary experiments showed results better than SOTA:

References

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv, 2021.

Lvmin Zhang, Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. arXiv, 2023.

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, Xiaohu Qie. T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models. arXiv, 2023.

Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C.K. Chan, Chen Change Loy. Exploiting Diffusion Prior for Real-World Image Super-Resolution. arXiv, 2023.

Vision Algorithm Recruitment

We have long-term openings for vision algorithm interns and full-time engineers. Feel free to add us on WeChat (309107918) to get in touch!

About us: Tongyi Open Vision Intelligence is Alibaba's R&D and open-platform center for applied vision capabilities. It has developed and released hundreds of vision capabilities, models, and practical toolkits across the two major technical directions of visual perception and understanding, and visual generation and editing. Through Tongyi Wanxiang, ModelScope, and other platforms, it opens large-model services for various vision domains and application scenarios to developers and industries, continuously driving application innovation and product development in vision technology, and building a large-scale visual AI user ecosystem and cloud intelligence service value.
