A Unified Conditional Framework for Diffusion-based Image Restoration (Paper reading)

Yi Zhang, CUHK, China, arXiv 2023, Cited: 0, Code, Paper

1. Introduction

Recently, Diffusion Probabilistic Models (DPMs) have shown remarkable performance in image generation and can produce highly realistic images. When applying DPMs to image restoration, a key question is how to integrate the conditional information so that the DPM generates accurate and natural outputs; this is often neglected in existing studies. In this paper, we propose a unified conditional framework based on diffusion models for image restoration. We use a lightweight UNet to predict an initial guidance and a diffusion model to learn the residual relative to that guidance. By carefully designing the basic blocks and the condition-integration modules of the diffusion model, we inject the guidance and other auxiliary conditional information into every block, achieving spatially adaptive conditioning of the generation. To handle high-resolution images, we propose a simple yet effective inter-step patch-splitting strategy that produces images of arbitrary resolution without grid artifacts. We evaluate the framework on three challenging tasks: extreme low-light denoising, deblurring, and JPEG restoration, and demonstrate significant improvements in perceptual quality and generalization to restoration tasks.

2. Overall thoughts

Refining an initial estimate with a diffusion model is not a new idea. The innovation of this paper lies in the network design of the diffusion model: the authors design their own network, which uses dynamic convolution.

3. Method

[Figure 1: overview of the unified conditional framework.]
Our goal is to design a unified conditional framework for image restoration tasks. The conditional input of this framework consists of two components: the degraded image and auxiliary scalar information. The degraded image is the image to be restored, while the auxiliary scalar information can include the degradation type, intensity, or other details relevant to each restoration task.

To enhance the integration of the conditional information, we first adopt a lightweight U-Net to predict an initial output, as shown in Figure 1 (left). This initial output captures the low-frequency, deterministic aspects of the final restored image, which are easier to recover and contain key structural information. We use this initial output as spatial guidance for the diffusion model. Combined with the auxiliary scalar information (e.g., degradation type, diffusion time step), it is injected into each block of the diffusion model, enabling better control and guidance. This injection not only provides comprehensive context but also enhances the flexibility of the framework. The diffusion model itself is used to capture the distribution of the residual with respect to the initial output.
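As a way to make this setup concrete, here is a minimal PyTorch sketch of one training step under the interpretation above; `lightweight_unet`, `diffusion_net`, and `scheduler` are hypothetical stand-ins (not the authors' interfaces), and the noise-prediction loss is the standard DDPM choice.

```python
import torch
import torch.nn.functional as F

def training_step(lightweight_unet, diffusion_net, scheduler,
                  degraded, clean, aux_scalar):
    """Hypothetical training step: the diffusion model learns the residual
    between the clean image and the lightweight U-Net's initial prediction."""
    # Initial output: low-frequency, deterministic estimate of the clean image.
    init_pred = lightweight_unet(degraded)

    # The diffusion model models the residual part only.
    residual = clean - init_pred

    # Standard DDPM-style noising of the residual (scheduler is a stand-in).
    t = torch.randint(0, scheduler.num_steps, (residual.size(0),),
                      device=residual.device)
    noise = torch.randn_like(residual)
    noisy_residual = scheduler.add_noise(residual, noise, t)

    # Conditioning: the initial prediction acts as spatial guidance, and the
    # auxiliary scalars (degradation type, time step, ...) are injected per block.
    pred_noise = diffusion_net(noisy_residual, guidance=init_pred,
                               scalars=aux_scalar, timestep=t)
    return F.mse_loss(pred_noise, noise)
```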

Basic Module: We design a basic module for the diffusion model used in image restoration tasks, keeping it as simple as possible by building on existing image restoration backbones and avoiding complex operators. Each block uses two convolutional layers, with LayerNorm before each convolution to stabilize training and Swish as the activation function, and a shortcut for residual learning. To allow conditional information to be injected, the kernel of the second convolution is generated dynamically from the conditions.
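A rough PyTorch sketch of such a block, assuming a pre-activation ordering (LayerNorm → Swish → conv) and treating the condition-dependent second convolution as a callable supplied by the conditional branch; the names and the GroupNorm stand-in for LayerNorm are illustrative, not the paper's code.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """Sketch of the basic module: two convolutions with LayerNorm and Swish,
    plus a residual shortcut; the second convolution's kernels are dynamic."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        # GroupNorm with one group is a common stand-in for LayerNorm on NCHW tensors.
        self.norm1 = nn.GroupNorm(1, channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size,
                               padding=kernel_size // 2)
        self.norm2 = nn.GroupNorm(1, channels)

    def forward(self, x, dynamic_conv):
        # dynamic_conv applies the second convolution with kernels generated
        # from the conditions (see the AKGM sketch below).
        h = self.conv1(F.silu(self.norm1(x)))   # Swish == SiLU in PyTorch
        h = dynamic_conv(F.silu(self.norm2(h)))
        return x + h                            # shortcut / residual learning
```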

Conditional Injection Module: To better integrate the conditional information into the blocks, we propose a Conditional Integration Module (CIM). In CIM, the guidance is first rescaled to match the resolution of the feature map within the block. The rescaled guidance is then passed through two convolutional layers with a SimpleGate activation, which adjusts the number of channels and produces the feature map $G$. SimpleGate is defined (as in NAFNet) as:

$\mathrm{SimpleGate}(X) = X_1 \odot X_2$, where $X$ is split along the channel dimension into two halves $X_1$ and $X_2$.
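In PyTorch this can be written in a couple of lines:

```python
import torch

def simple_gate(x: torch.Tensor) -> torch.Tensor:
    # Split the feature map into two halves along the channel dimension
    # and multiply them element-wise (NAFNet's SimpleGate).
    x1, x2 = x.chunk(2, dim=1)
    return x1 * x2
```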

At the same time, the auxiliary scalar information goes through a separate branch of two linear layers with the Swish activation function, producing the feature map $S$. The feature maps $G$ and $S$ are then passed to the Adaptive Kernel Guidance Module (AKGM), which generates the dynamic convolution kernels for the second convolutional layer in the basic module, as shown in Figure 1. The key idea of AKGM is to adaptively fuse convolution kernel bases, so that each spatial location processes the feature map according to the fused multi-source conditional information.
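A hedged sketch of the two CIM branches in PyTorch; the hidden width, the exact placement of SimpleGate between the two convolutions, and the bilinear rescaling are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalIntegrationModule(nn.Module):
    """Sketch of CIM: the spatial guidance is rescaled to the block's resolution
    and mapped to G, the auxiliary scalars are mapped to S; both feed the AKGM."""
    def __init__(self, guide_channels, scalar_dim, hidden, num_bases):
        super().__init__()
        # Spatial branch: two convolutions with SimpleGate in between.
        self.conv1 = nn.Conv2d(guide_channels, 2 * hidden, 3, padding=1)
        self.conv2 = nn.Conv2d(hidden, num_bases, 3, padding=1)
        # Scalar branch: two linear layers with Swish in between.
        self.fc1 = nn.Linear(scalar_dim, hidden)
        self.fc2 = nn.Linear(hidden, num_bases)

    def forward(self, guidance, scalars, feat_hw):
        # Rescale the guidance to the feature-map resolution of the current block.
        g = F.interpolate(guidance, size=feat_hw, mode='bilinear',
                          align_corners=False)
        g1, g2 = self.conv1(g).chunk(2, dim=1)        # SimpleGate
        G = self.conv2(g1 * g2)                       # (B, N, H, W)
        S = self.fc2(F.silu(self.fc1(scalars)))       # (B, N)
        return G, S
```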
[Figure 2: the Adaptive Kernel Guidance Module (AKGM).]
As shown in Figure 2 (left), each AKGM has $N$ learnable convolution kernel bases, denoted $W_b \in \mathbb{R}^{C \times C \times k \times k}$, where $C$ is the number of channels and $k$ is the kernel size. These kernel bases are trained to handle different situations and scenarios. The feature maps $G \in \mathbb{R}^{H \times W \times N}$ and $S \in \mathbb{R}^{1 \times 1 \times N}$ generate the multi-source fusion weights $M \in \mathbb{R}^{H \times W \times N}$, where $H$ and $W$ are the height and width of the feature map and $N$ is the number of kernel bases. For a specific position $(i, j)$, the fused convolution kernel $F_{i,j}$ is obtained by linearly combining the kernel bases with the fusion weights at that position:

$$F_{i,j} = \sum_{b=0}^{N-1} M_{i,j}[b]\, W_b$$
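Because the fusion is linear, convolving with the fused kernel $F_{i,j}$ is equivalent to convolving with each base $W_b$ and mixing the responses with $M$, which avoids materialising a separate kernel at every pixel. Below is a hedged PyTorch sketch along these lines; how $M$ is actually formed from $G$ and $S$ (here an element-wise product followed by a softmax) is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveKernelGuidanceModule(nn.Module):
    """Sketch of AKGM: per-pixel kernels are a weighted sum of N learnable
    kernel bases W_b, with spatially varying weights M derived from G and S."""
    def __init__(self, channels, num_bases=4, kernel_size=3):
        super().__init__()
        self.N, self.C, self.k = num_bases, channels, kernel_size
        # N kernel bases, each of shape C_out x C_in x k x k.
        self.bases = nn.Parameter(
            0.02 * torch.randn(num_bases, channels, channels,
                               kernel_size, kernel_size))

    def forward(self, x, G, S):
        B, C, H, W = x.shape
        # Multi-source fusion weights M (assumed combination of G and S).
        M = torch.softmax(G * S[:, :, None, None], dim=1)       # (B, N, H, W)

        # Extract k x k neighbourhoods: (B, C*k*k, H*W).
        patches = F.unfold(x, self.k, padding=self.k // 2)

        # Apply every kernel base at every location: (B, N, C, H*W).
        W_b = self.bases.view(self.N, self.C, self.C * self.k * self.k)
        per_base = torch.einsum('ncl,bls->bncs', W_b, patches)

        # Fuse the per-base responses with the spatial weights and reshape.
        out = (per_base * M.view(B, self.N, 1, H * W)).sum(dim=1)
        return out.view(B, C, H, W)
```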

4. Experiment



Origin blog.csdn.net/qq_43800752/article/details/130996114